A Survey on the Fairness of Recommender Systems

Recommender systems are an essential tool to relieve the information overload challenge and play an important role in people's daily lives. Since recommendations involve allocations of social resources (e.g., job recommendation), an important issue is whether recommendations are fair. Unfair recommendations are not only unethical but also harm the long-term interests of the recommender system itself. As a result, fairness issues in recommender systems have recently attracted increasing attention. However, due to multiple complex resource allocation processes and various fairness definitions, the research on fairness in recommendation is scattered. To fill this gap, we review over 60 papers published in top conferences/journals, including TOIS, SIGIR, and WWW. First, we summarize fairness definitions in recommendation and provide several views to classify fairness issues. Then, we review recommendation datasets and measurements in fairness studies and provide an elaborate taxonomy of fairness methods in recommendation. Finally, we conclude this survey by outlining some promising future directions.


INTRODUCTION
Nowadays, the amount of information available on the Internet has far exceeded individuals' information needs and processing capacity, which is known as information overload [43]. As a tool to alleviate information overload, recommender systems are widely used in people's daily lives (e.g., news recommendation, career recommendation, and even medical recommendation) and play a crucial role. Utility (such as click-through rate, dwell time, etc.) has been the most vital metric for recommender systems. However, considering only utility may lead to problems like the Matthew effect [117] and the filter bubble [74]. Hence, more views of recommender system performance have been proposed, such as diversity, efficiency, and privacy. Fairness is one of these critical issues. Recommender systems serve a resource allocation role in society by allocating information to users and exposure to items. Whether the allocation is fair can affect personal experience and social good [77].
Fairness problems have received increasing attention from academia, industry, and society. Unfairness exists in different recommendation scenarios and concerns various resources for both users and items. For users, there are significant differences in recommendation accuracy between users of different ages and genders in movie and music recommendation, with female users and older users getting worse recommendation results [24]. In addition to accuracy, existing studies have also found considerable differences in other recommendation measurements such as diversity and novelty [97]. For items, existing research has found that minority items can get worse ranking performance and fewer exposure opportunities [6,122]. Besides, in premium business scenarios, paid items may receive worse services from the platform than non-paid items [60]. Moreover, there are potential unfairness issues with various recommendation methods: both traditional recommendation methods [24] and deep learning models [42] can suffer from unfairness.
Mitigating these unfairness phenomena is of great importance for recommender systems, for several reasons. (1) From an ethical perspective, as early as ancient Greece, fairness was listed by Aristotle as one of the crucial virtues needed to live well [3]. Fairness is an important virtue and a fundamental requirement for a just society [78]. (2) From a legal perspective, anti-discrimination laws [41] require that employment, admissions, housing, and public services do not discriminate against different groups of people based on gender, age, race, etc. For example, minority-owned companies should be recommended at a similar rate to white-owned companies in a job recommendation scenario [63]. (3) From a user perspective, a fair recommender system facilitates the exposure of different kinds of information in the recommendations, including niche information, which may help break the information cocoon, alleviate societal polarization, broaden users' horizons, and enhance the value of recommendations. (4) From an item perspective, a fair recommender system can allocate more exposure to long-tail items, alleviating the Matthew effect [58]. It may also motivate the providers of niche items and thereby improve the diversity and creativity of items. (5) From a system perspective, a fair recommender system is conducive to its own long-term interest. For example, an unfair recommender system may recommend popular content to users with niche interests, resulting in a bad experience. Similarly, it may provide little exposure to niche providers. The lack of positive feedback may lead niche groups to leave the platform, which will reduce the diversity of content and users on the platform in the long run and hurt the platform's growth [69]. Therefore, addressing unfairness is a critical issue for recommender systems.
A concept closely related to fairness is bias, which has also attracted extensive attention in recent years. Some biases in recommender systems can lead to unfairness problems, such as popularity bias [119] and mainstream bias [57]. There are also biases that have little to do with fairness, such as position bias. Generally speaking, fairness reflects normative ideas about how a recommender system should be, while bias is more concerned with statistical issues, such as the difference between what the model learns and the real world.
Although fairness has been studied in computer science for decades [29], and there is much related work in machine learning [106,107], fairness in recommendation has its own unique problems. First, recommender systems are two-sided platforms that serve both users and items, where two-sided fairness needs to be guaranteed. Second, fairness in recommendation is dynamic in nature, as there exists a feedback loop between users and the system. Third, on most platforms, recommendations need to be personalized by considering the unique needs of each user, so fairness in recommendation should also take personalization into account. Furthermore, apart from accuracy, fairness needs to be considered jointly with other measurements in recommendation, such as diversity, explainability, and novelty.
Manuscript submitted to ACM

Therefore, current fairness work in machine learning, which mainly focuses on classification, can hardly be leveraged in recommender systems directly.
For the above reasons, fairness in recommendation has become an important topic in the research community.
This growing attention is illustrated in Fig. 1. As shown in Table 1, more than sixty fairness-related papers about recommendation have been published in top IR-related conferences and journals (e.g., TOIS, SIGIR, WWW, and KDD) in the past five years. In the table, research on fairness is summarized by definition, target, subject, granularity, and optimization object (details on these definitions are given in Table 3 and Section 3). From the table we can see the focus of current studies: for example, consistent fairness (CO) is the most common definition of fairness, and current studies mainly focus on the group level. These trends are further discussed in the corresponding sections below.
Table 1. A lookup table for the reviewed papers about fairness in recommendation (here "CO" means consistent fairness, "CA" calibrated fairness, "CF" counterfactual fairness, "EF" envy-free fairness, "RMF" Rawlsian maximin fairness, "PR" process fairness, and "MSF" maximin-shared fairness; details on these definitions are given in Table 3).
Research on fairness in recommendation is blossoming. However, due to various scenarios, diverse stakeholders, and different measurements, research on fairness in the recommendation field is scattered. To fill this gap, this survey systematically reviews the existing formally published research on fairness in recommendation from several perspectives. The corresponding summary and discussion can guide and inspire future work. In summary, the contributions of this survey are as follows.
• We summarize existing definitions of fairness in recommendation and provide several views for classifying fairness issues in recommendation.
• We introduce some widely used measurements for fairness in recommendation and review fairness-related recommendation datasets in previous studies.
• We review current methods for fair recommendations and provide an elaborate taxonomy of methods.
• We outline several promising future research directions from the perspective of definition, evaluation, algorithm design, and explanation.
Several surveys are related to the topic of this survey. As far as we know, Castillo [13] first briefly reviewed fairness and transparency in information retrieval. However, it only covers related work before 2018, and fairness in recommendation has developed greatly in recent years. [14,67] concentrate on fairness in machine learning, but fairness in recommendation, especially its unique characteristics, is not covered. Chen et al. [16] recently reviewed bias in recommender systems and introduced fairness issues, but fairness is not their main focus, and fairness measurements and datasets are not covered. To the best of our knowledge, there is no survey dedicated to systematically reviewing and detailing fairness in recommendation from a complete view. This survey is structured as follows. In Section 2, we introduce existing definitions of fairness in recommendation and discuss some related concepts. In Section 3, we present several perspectives to classify fairness issues in recommendation. In Section 4, we introduce representative measurements for measuring fairness in recommendation.
In Section 5, we provide a taxonomy of methods to address unfairness in recommendation. In Section 6, we introduce fairness-related datasets in recommender systems. In Section 7, we present possible future research directions. Finally, we conclude this survey in Section 8.

DEFINITIONS OF FAIRNESS IN RECOMMENDATION
In this section, we first provide definitions of fairness and then discuss the relationship between fairness and some related concepts in recommender systems. It is worth noting that discussions about fairness have existed since ancient times, but there is still no consensus on fairness. Due to the multitude of discussions related to fairness, it is impossible to list all relevant definitions. Therefore, we introduce the definitions of fairness appearing in research on recommendation, which can also be applied to other domains. The taxonomy of the reviewed fairness definitions is illustrated in Fig. 2. To our knowledge, the definitions listed here are sufficient to cover the research on fairness in recommendation. Besides, the notations used in the definitions are shown in Table 2. As we mentioned in the introduction, recommender systems play a resource allocation role in society, allocating information to users and exposure to items. For allocation, there are two aspects worthy of attention. One is the allocation process, such as the fairness of the recommendation model. The other is the allocation outcome, such as the fairness of the information received by users. Depending on whether the focus is on the process or the outcome, fairness can be divided into process fairness and outcome fairness.

Process Fairness.
• Process Fairness. Process fairness holds that a fair allocation should be fair in its process [54,73], which is also called procedural justice [54].

Existing studies [37,38,54] generally focus on whether the information utilized in the allocation process is fair. In the case of job recommendation, process fairness concerns whether the recommendation model is fair, such as whether unfair features (e.g., race) are used and whether the learned representations are fair.
• Outcome Fairness. Outcome fairness holds that a fair allocation should lead to fair outcomes [26,54], which is also called distributive justice [54].
For example, in the case of job recommendation, outcome fairness concerns the recommendation outcome, such as whether whites would be more likely to be recommended than blacks even if they have the same ability.
The difference between these two kinds of fairness is similar to the difference between teleology and deontology in ethics. Teleology holds that whether a behavior is good or bad depends on its outcomes, while deontology holds that it depends only on the process [4].
As the majority of existing research in recommendation focuses on the fairness of outcomes, we concentrate on definitions related to outcome fairness in the following. Outcome fairness can be further sub-grouped according to target and concept.
Grouped by Target. Based on whether the target is to ensure group-level or individual-level fairness, outcome fairness can be further categorized into group fairness and individual fairness.
• Group Fairness.Group fairness holds that outcomes should be fair among different groups.
There are various ways to divide groups, and the most common is based on explicit fairness-related attributes, such as gender, age, and race. When there are multiple fairness-related attributes, the whole population may be divided into numerous subgroups. Fairness should also be considered across these subgroups, as even if the groups under each single-attribute division are fair, subgroups may be unfair to each other [49,50].
• Individual Fairness. Individual fairness holds that outcomes should be fair at the individual level.
In some work, individual fairness refers to the idea that similar individuals should be treated similarly [7,23]. However, there are also other definitions of individual-level fairness. For the sake of clarity, we use individual fairness to refer to the more general definition, i.e., fairness at the individual level.
Group fairness is more complex than individual fairness, as different divisions may exist and the divisions may be dynamic, i.e., one individual may belong to different groups at different times [32]. Moreover, individual fairness can theoretically be regarded as a special case of group fairness in which each individual belongs to a unique group. Compared to targets, fairness concepts carry more information about fairness, and we can give more concrete formal definitions. We present these fairness concepts in the following.

Grouped by Concept.
• Consistent Fairness. Many papers [7,32,58] define fairness based on the similarity of the input (i.e., the individuals or groups receiving the allocation) and the output (i.e., the outcome of the allocation), which we call consistent fairness.
This concept of fairness first appeared in Aristotle's dictum that "like cases should be treated alike" [3], which is thought to describe the consistency of fairness [45]. Dwork et al. [23] first formalized this definition at the individual level using the Lipschitz condition in a classification task. As a proper distance function over individuals is difficult to define, existing studies [30,77] in recommendation usually use a trivial case as an alternative to consistent fairness, in which all individuals (or groups) are assigned similar outcomes. For the distance function over outcomes, current work often uses the difference between specific metrics (e.g., NDCG for users [24]) to measure distance.
• Calibrated Fairness. Calibrated fairness [90] requires that the value of the outcome of an individual (or group) should be proportional to its merit, which is also called merit-based fairness [71]. Formally, at the individual level, a fair model $h$ should satisfy: for any two individuals $i$ and $j$, $\frac{v(M_h(i))}{v(M_h(j))} = \frac{m(i)}{m(j)}$, where $M_h(\cdot)$ denotes the allocated outcome, $v(\cdot)$ the value function, and $m(\cdot)$ the merit function. The group-level formalization is similar and only requires replacing $i, j$ with $G_i, G_j$.
This concept of fairness is closely related to Adams' equity theory [2]. Calibrated fairness requires two functions to measure the merit of individuals (or groups) and the value of the allocation outcome. The measure of merit often depends on the scenario, while the measure of value is usually a commonly used metric (e.g., CTR for items [71]).
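As a concrete illustration, calibrated fairness can be checked by comparing each subject's share of the total allocated value with its share of the total merit. The sketch below is minimal and hypothetical: merit is taken as historical interaction counts and value as allocated exposure, which is only one possible instantiation of the two functions mentioned above.

```python
# Calibrated fairness sketch: outcome value should be proportional to merit.
# Hypothetical instantiation: merit = historical interaction count,
# value = allocated exposure.

def calibration_gap(merits, values):
    """Return the largest deviation of value shares from merit shares.

    A perfectly calibrated allocation gives each subject a share of the
    total value equal to its share of the total merit (gap = 0).
    """
    total_m, total_v = sum(merits), sum(values)
    merit_shares = [m / total_m for m in merits]
    value_shares = [v / total_v for v in values]
    return max(abs(ms - vs) for ms, vs in zip(merit_shares, value_shares))

merits = [10, 30, 60]               # three items' interaction counts
fair_exposure = [100, 300, 600]     # exposure proportional to merit
skewed_exposure = [600, 300, 100]   # exposure inverted against merit

print(calibration_gap(merits, fair_exposure))    # 0.0
print(calibration_gap(merits, skewed_exposure))  # ≈ 0.5
```

A gap of zero indicates an exactly merit-proportional allocation; larger gaps indicate stronger deviation from calibration.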
Most research focuses on the above two concepts of fairness, while a small number of papers explore other concepts.
• Envy-free Fairness. Envy-free fairness requires that individuals should be free of envy, i.e., they should not be jealous of others' outcomes [21,35]. Formally, a fair model $h$ should satisfy: for every individual $i$ with outcome $M_h(i)$, $v_i(M_h(i)) \ge v_i(M_h(j))$ for any other individual $j$.
• Counterfactual Fairness. Counterfactual fairness requires that individuals have the same outcome in the real world as they would in a counterfactual world [76]. This means that if an individual belonged to a different group from its current one, its outcome would not change. Formally, a fair model $h$ should satisfy: for every individual $i$, $M_h(i) = M_h(i)_{i \in G'}$ for any other group $G'$, where the counterfactual outcome $M_h(i)_{i \in G'}$ can be calculated according to Pearl's three steps [76].
• Rawlsian Maximin Fairness. Rawlsian maximin fairness requires maximizing the value of the outcome of the worst-off individual or group [78]. Formally, at the individual level, a fair model $h$ should satisfy $h = \arg\max_{h'} \min_i v_i(M_{h'}(i))$. The group-level formalization is similar and only requires replacing $i$ with $G_i$.
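The maximin selection rule can be sketched in a few lines. The candidate policies and their per-group utilities below are hypothetical; the point is only that the rule prefers the policy whose worst-off group fares best, not the one with the highest average.

```python
# Rawlsian maximin fairness sketch: among candidate recommendation
# policies, prefer the one whose worst-off group has the highest utility.

def maximin_choice(policies):
    """policies: dict mapping policy name -> list of group utilities."""
    return max(policies, key=lambda name: min(policies[name]))

# Hypothetical NDCG per user group under three candidate policies.
policies = {
    "accuracy_only": [0.80, 0.75, 0.30],  # high average, poor worst group
    "uniform":       [0.55, 0.55, 0.55],
    "reranked":      [0.70, 0.65, 0.50],
}
print(maximin_choice(policies))  # "uniform": its worst group (0.55) is best
```

Note that "accuracy_only" has the highest average utility yet loses under the maximin criterion, which is exactly the trade-off Rawlsian fairness encodes.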
Table 3 divides the reviewed papers according to their definitions. It can be seen that existing research pays more attention to outcome fairness. Within outcome fairness, previous studies mainly concentrate on group fairness in terms of target, and on consistent fairness and calibrated fairness in terms of fairness concepts. Meanwhile, a few researchers have recently explored other concepts of fairness, such as Rawlsian maximin fairness.
Although many efforts on fairness definitions have been made, some issues remain. First, the relationships among these fairness definitions, especially in recommender systems, lack adequate exploration. If these fairness definitions conflict, which definition is more important is also a problem. Consensus on what kind of fairness should be achieved in recommender systems is necessary. Note that people may have different fairness needs [59], and the consensus may not be the same in different scenarios. Besides, most studies concentrate on a single concept and target of fairness.
Only a few recent studies [31,75] attempt to achieve multiple fairness definitions simultaneously. If it is necessary to satisfy multiple fairness definitions, how to ensure different kinds of fairness at the same time is also a question worth exploring.

Relationships between Fairness and Other Concepts
In this subsection, we discuss the relationship between fairness and some related concepts in recommender systems.
Bias. Bias is ubiquitous in recommender systems and can exist in data, models, and outcomes [16]. Bias may increase both outcome unfairness and process unfairness. For example, Zhu et al. [119] demonstrate theoretically that matrix factorization models suffer from popularity bias in the learning process, which causes popular items to be preferred even when the true preferences are the same. Besides, the inductive bias of representation learning may tend to encode sensitive information in order to increase the information contained in the representation, which may increase process unfairness. Thus, removing fairness-related biases in data and models helps alleviate unfairness [42].
Besides, there are also biases that are not related to fairness, such as position bias. In general, bias is more concerned with statistical issues, while fairness reflects normative ideas about how a recommender system should be.
Diversity. Diversity in recommendation refers to the diversity of items in the recommendation list, which is closely related to user satisfaction [28]. For items, improving item fairness is likely to increase diversity: when optimizing item fairness, the recommendation list tends to contain more cold items as well as items from more categories [47,64], which means higher recommendation diversity. However, increasing diversity does not necessarily improve item fairness; the recommender system may recommend more popular items in each category, and cold items will still be treated unfairly. For users, some studies find that existing methods to optimize recommendation diversity may exacerbate user unfairness [55]. Generally speaking, fairness is an evaluation criterion beyond diversity: besides the fairness of accuracy, we can also consider the fairness of diversity [97].
Privacy. Privacy requires that external attackers cannot obtain sensitive information about users through the recommendation results or the parameters of the recommendation model [85]. Compared with privacy, fairness takes an internal perspective of the recommender system, with no consideration of external attackers. Nevertheless, some fairness definitions may imply privacy, such as process fairness and counterfactual fairness. Process fairness requires that the recommendation process should be as fair as possible, such as by using fair representations. If we consider that fair representations should be independent of fairness-related attributes, then a fair representation will also satisfy privacy for these attributes. Moreover, from the counterfactual perspective, Li et al. [59] demonstrate that counterfactual fairness for users can be guaranteed by making user representations independent of fairness-related attributes. This implies that user representations satisfying privacy can guarantee counterfactual fairness.

VIEWS OF FAIRNESS IN RECOMMENDATION
The definitions of fairness introduced in Section 2 can be applied to any allocation process and are not limited to recommendation. However, in recommender systems, there exist multiple allocation processes corresponding to different fairness issues. In this section, to deepen the understanding of fairness, we present several views to classify fairness issues in recommendation. These views and corresponding work are summarized in Table 4.

Subjects

Item fairness concerns whether the recommendation treats items fairly, such as giving similar prediction errors for ratings of different types of items [77] or allocating exposure to each item proportional to its relevance [7]. If the recommendation treats items unfairly, the providers of the discriminated items may lack positive feedback and leave the platform. Calibrated fairness is frequently applied to item fairness, while there is little work on calibrated fairness for users, probably because items are easily associated with concepts such as value and quality. The value of an item is often measured by its relevance to users [71] or the number of interactions in its history [32]. Note that some researchers [1,10] divide the subjects into consumer fairness and provider fairness. In contrast, we divide the subjects into user fairness and item fairness here, as provider fairness can be considered a kind of item fairness at the group level, where groups are divided according to providers.
User fairness concerns whether the recommendation is fair to different users, such as achieving similar accuracy for different groups of users [24] or similar recommendation explainability across different users [30]. If the recommendation cannot be fair to users, it may lose users with specific interests. The most commonly used fairness definition in user fairness is consistent fairness, as it is often believed that different people are similar and should not be treated differently. However, there are some particular scenarios where fairness means treating people differently: for example, premium members should get better recommendations than standard members [19]. Moreover, there are some differences between user fairness in group recommendation and in general recommendation. User fairness in general recommendation concerns all users [58], while group recommendation only cares about the users in the group receiving the recommendation [83].
Joint fairness concerns whether both users and items are treated fairly [103]. In most recommendation scenarios, it is necessary to consider joint fairness, as user fairness and item fairness are both vital to most recommender systems. It is worth noting that user fairness and item fairness can conflict with each other: when item fairness is improved, user fairness must worsen or at best remain the same [103], making joint fairness a challenging problem.
In addition to users and items, a few other stakeholders may exist in recommender systems. Their fairness issues have recently received attention from some researchers [1].

Granularity
We refer to the granularity of the allocation process as fairness granularity. Fairness in recommendation can be further divided into single fairness and amortized fairness.
A single recommendation list can be considered the minimum allocation process in recommendation, which corresponds to single fairness. Single fairness requires that the recommender system meets the fairness requirements each time it generates a single recommendation list. In other words, the outcomes $M_h(\cdot)$ are related only to a single recommendation, and each recommendation should satisfy the specific fairness definition. For example, for item fairness, the different types of items in a single recommendation list should follow the fair distribution [90]. For user fairness, a single recommendation list should be similarly relevant to the different users in group recommendation [83].
However, requiring every single recommendation list to be fair may be difficult and damaging to performance. An alternative is to require the recommendations to be fair at the cumulative level, which is called amortized fairness [7]. Amortized fairness requires that the cumulative effect of multiple recommendation lists is fair, while a single recommendation list among them may be unfair. In other words, the outcomes $M_h(\cdot)$ are related to multiple recommendations.
For example, suppose we expect the exposure of books by male authors and books by female authors to be close in book recommendation. Single fairness requires that each recommendation list has approximately the same number of books by male authors as by female authors. In contrast, amortized fairness only requires that the system recommend approximately the same number of books by male authors as by female authors over all recommendations in a period (e.g., within a day).
As demonstrated in Table 4, previous studies concentrate on amortized fairness, probably because single fairness is not achievable in some scenarios [7]. Existing work [71,109] often uses the average value as the cumulative effect, such as the average exposure of a group across multiple recommendation lists. However, even if the average values are the same, the variances may differ, which may also be unfair: a high variance may mean that the recommendation performance is not stable and may bring more negative experiences to users. Nevertheless, no previous work has taken variance into consideration.
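The book example above, together with the variance concern, can be sketched as follows. The recommendation lists are hypothetical; the signed exposure gap per list measures single fairness, its mean measures amortized fairness, and its variance captures the instability that averages alone miss.

```python
# Single vs. amortized fairness sketch for the book example:
# exposure parity between male- ("M") and female- ("F") authored books,
# checked per list and accumulated over many lists.

from statistics import mean, pvariance

# Hypothetical recommendation lists; each slot is the author's gender.
lists = [
    ["M", "M", "M", "F"],
    ["F", "F", "F", "M"],
    ["M", "M", "F", "F"],
    ["F", "F", "M", "M"],
]

def exposure_gap(rec_list):
    """Signed exposure gap (#male-authored - #female-authored) of one list."""
    return rec_list.count("M") - rec_list.count("F")

per_list_gaps = [exposure_gap(l) for l in lists]
print(per_list_gaps)             # [2, -2, 0, 0]: lists 1-2 violate single fairness
print(mean(per_list_gaps))       # 0: amortized fairness holds on average
print(pvariance(per_list_gaps))  # 2.0: the gap is unstable across lists
```

The first two lists are individually unfair yet cancel out on average; the nonzero variance records exactly the instability discussed above.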

Optimization Object
We refer to the aspect in which we are concerned about the allocation for subjects as the optimization object, which is consistent with how the value function $v(\cdot)$ is defined in Section 2. There are many kinds of optimization objects, including the exposure and hit ratio of items [71] and the accuracy of recommendations for users [97]. According to whether the impact of the allocation is considered [5,111], they can be divided into two main types, i.e., treatment-based fairness and impact-based fairness. Treatment-based fairness only considers whether the treatments of the recommender system are fair, such as the predicted scores for different users [120] and the exposure allocated to different items [75]. In contrast, impact-based fairness takes the impact caused by recommendations (i.e., user feedback) into account.
Taking item fairness as an example, in the top-N ranking task, treatment-based fairness may require that the exposure of different items conforms to a fair distribution [34]. In contrast, impact-based fairness may require that the CTR of different items conforms to a fair distribution [71].
As shown in Table 4, most previous studies have focused on treatment-based fairness, possibly because impact-based fairness is more difficult to address, as user feedback cannot be controlled directly. While most work focuses on either treatment-based or impact-based fairness alone, it is also necessary to consider them together. Using item fairness as an example, on the one hand, if we only consider exposure without concerning ourselves with recommendation accuracy, the recommender system risks recommending discriminated items to inactive users: although exposure increases, the drop in click-through rate may instead cause the provider to lose confidence. On the other hand, if we only consider accuracy without considering exposure, the recommender system may reduce the exposure of discriminated items to limit the decrease in accuracy, which is also unfavorable for those items.
Therefore, it is necessary to consider both impact-based fairness and treatment-based fairness.
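The contrast between the two optimization objects can be sketched with a single distance function: both compare an observed distribution over item groups against a fair target, differing only in whether the distribution is built from exposure (treatment) or from clicks (impact). All numbers below are hypothetical.

```python
# Treatment- vs. impact-based item fairness sketch: compare the exposure
# distribution (treatment) and the click distribution (impact) of two
# item groups against a fair target distribution, using the L1 norm.

def l1_unfairness(observed, target):
    """L1 distance between observed group shares and the target shares."""
    total = sum(observed)
    shares = [x / total for x in observed]
    return sum(abs(s - t) for s, t in zip(shares, target))

target = [0.5, 0.5]    # fair split between two item groups
exposure = [900, 100]  # impressions per group (treatment)
clicks = [45, 20]      # clicks per group (impact)

print(round(l1_unfairness(exposure, target), 2))  # 0.8: treatment heavily skewed
print(round(l1_unfairness(clicks, target), 2))    # 0.38: impact less skewed
```

Here the same system looks very unfair by treatment but noticeably less unfair by impact, illustrating why the two objects can diverge and should be examined together.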

Overview of Fairness Metrics
We introduce some widely used metrics for fairness in recommendation, as shown in Table 5. Since there are different fairness definitions, the measurements of unfairness are not the same. Moreover, as the characteristics of fairness issues mentioned in Section 3 also affect the design and choice of fairness metrics, different metrics have different scopes of application, which are also marked in Table 5.
As demonstrated in Table 5, most fairness metrics are proposed for outcome fairness, as it is the focus of most work, with more metrics for consistent fairness and calibrated fairness than for the other definitions. Thus, we mainly present the corresponding metrics for these two fairness definitions in Sections 4.2 and 4.3, respectively, and present all the others in Section 4.4.
When selecting fairness metrics based on definitions, it is important to note that different metrics do not have the same scope of application. For consistent fairness, Absolute Difference, Variance, and the Gini coefficient are commonly used measurements at the two-group, multi-group, and individual levels, respectively. These three metrics have a wide range of applicability across different subjects, granularity, and optimization objects. For calibrated fairness, KL-divergence and the L1-norm are common measurements for multi-group and individual fairness; these two metrics also have broad applicability. As group-level calibrated fairness studies usually involve many groups, there are no metrics specifically designed for the two-group situation. These common metrics are generic and can be used for both users and items but are relatively coarse-grained. They have two main drawbacks: (1) they typically use a first-order moment such as the average to describe groups, ignoring higher-order information; (2) they do not consider the characteristics of user fairness and item fairness.
To address the first point, some researchers [96,120] use statistical tests such as the KS statistic or ANOVA that consider the population distribution. For the second point, for users, some researchers [110] consider user fairness on each item and then aggregate the results; for items, some researchers [34,108] consider unfairness across different positions and then aggregate the results. Although limited in application, these metrics can be more appropriate for specific fairness issues. Specific details of these metrics are described below.
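The generic metrics mentioned above can be sketched as follows: the Gini coefficient measures the inconsistency of individual utilities (consistent fairness), and KL-divergence measures the deviation of an outcome distribution from a merit-based target (calibrated fairness). All inputs are hypothetical, and the utilities are assumed to be non-negative with a positive total.

```python
import math

def gini(utilities):
    """Gini coefficient of a non-negative utility distribution
    (0 = perfectly consistent; assumes a positive total)."""
    xs = sorted(utilities)
    n = len(xs)
    total = sum(xs)
    # Standard rank-weighted formula for the Gini coefficient.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)

def kl_divergence(p, q):
    """KL(p || q); for calibrated fairness, p is the observed outcome
    distribution and q the merit-proportional target (0 = calibrated)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(gini([1.0, 1.0, 1.0, 1.0]))  # 0.0: perfectly equal utilities
print(gini([0.0, 0.0, 0.0, 4.0]))  # 0.75: utility fully concentrated
print(kl_divergence([0.25, 0.75], [0.25, 0.75]))  # 0.0: exactly calibrated
```

Both metrics are scalar summaries: they flag inequality or miscalibration but, as noted above, ignore higher-order information and the specific characteristics of user versus item fairness.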
Since the metrics for different fairness definitions are not the same, we next present the corresponding metrics organized by fairness definition. The meanings of the commonly used symbols are shown in Table 6.

Metrics for Consistent Fairness (CO)
As mentioned in Section 2, current work on consistent fairness in recommendation requires that all individuals or groups be treated similarly. Therefore, the corresponding measurements mainly quantify the inconsistency of the utility distribution. Most metrics apply to both user fairness and item fairness: they treat the utility of each individual or group as a number and then measure the inconsistency of these numbers. Because there are many metrics for consistent fairness and early studies concentrate on situations with only two groups, we present these metrics in the order of metrics for two groups, for multiple groups, and for individuals.
Absolute Difference. Absolute Difference (AD) is the absolute difference of the utility between the protected group $g_0$ and the unprotected group $g_1$, i.e., $AD = |M(g_0) - M(g_1)|$, where $M(g)$ denotes the utility of group $g$. For users, the group utility $M(g)$ is often defined as the average predicted rating [120] or the average recommendation performance in the group $g$ [30,58]. For items, the group utility $M(g)$ can be defined as the whole exposure in the recommendation lists for the group $g$ [99]. The lower the value, the fairer the recommendations.
KS statistic. The Kolmogorov-Smirnov (KS) statistic is a nonparametric test used to determine the equality of two distributions. It measures the area difference between the two empirical cumulative distributions of the utilities for the groups, where the utilities are often defined as the predicted ratings in the group [46,120]. Compared to AD, which uses only the average utility, the KS statistic can measure higher-order inconsistency. The lower the value, the fairer the recommendations.
Here $n$ is the number of intervals in the empirical cumulative distribution, $\Delta$ is the size of each interval, and $G(g_0, i)$ is the number of utilities of the group $g_0$ that fall inside the $i$-th interval.
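The two-group measurements above can be sketched in code. This is a minimal numpy sketch under our own assumptions: the function names are ours, and the KS statistic is approximated as the area between the two empirical CDFs on a shared equal-width grid, as described above.

```python
import numpy as np

def absolute_difference(util_g0, util_g1):
    """Absolute Difference (AD): gap between two groups' average utilities."""
    return abs(float(np.mean(util_g0)) - float(np.mean(util_g1)))

def ks_statistic(util_g0, util_g1, n_intervals=100):
    """Approximate area between the two empirical cumulative distributions
    of group utilities, evaluated on a shared equal-width grid."""
    u0 = np.sort(np.asarray(util_g0, dtype=float))
    u1 = np.sort(np.asarray(util_g1, dtype=float))
    lo, hi = min(u0[0], u1[0]), max(u0[-1], u1[-1])
    grid = np.linspace(lo, hi, n_intervals)
    delta = (hi - lo) / n_intervals
    # empirical CDF of each group evaluated at every grid point
    ecdf0 = np.searchsorted(u0, grid, side="right") / len(u0)
    ecdf1 = np.searchsorted(u1, grid, side="right") / len(u1)
    return delta * float(np.sum(np.abs(ecdf0 - ecdf1)))
```

For both functions, identical group utility distributions yield zero, the perfectly fair case.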
rND, rKL and rRD. rND, rKL and rRD measure item exposure fairness for a ranking $\tau$ [108]. Unlike the previous metrics, these metrics take the exposure position into account, calculating a normalized discounted cumulative unfairness similar to NDCG. Experiments show that rKL is smoother and more robust than rRD, and that rRD has a limited application scope. For all three metrics, the lower the value, the fairer the recommendations.

Table 5. A lookup table for the reviewed fairness measurements, ordered by Def. and Target. "✓" denotes the presence of existing work using the metric under the corresponding conditions. "-" means that no work uses the metric in the corresponding condition, but the metric could theoretically be used there as well. "×" indicates that the metric is not theoretically applicable to the corresponding condition. The abbreviations of the definitions are shown in Table 3. We use "1" to denote measurements without a name in the original paper and "2" to denote measurements with an original name.
Here the normalizer $Z$ is the highest possible value of the corresponding measurement, $|S^+_{1...i}|$ is the number of protected-group members in the top-$i$ of the ranking $\tau$, and $|S^+|$ is the number of protected-group members in the whole ranking.
Here $Acc$ represents the ranking accuracy for a pair of items $i_1$, $i_2$ from different groups $g_1$, $g_2$; $f(i_1)$ and $f(i_2)$ are the predicted scores for the recommendation query $q$; and $y_1$ and $y_2$ are the true feedback, which are collected through randomized experiments.
Value Unfairness and its variants. Value Unfairness is proposed to measure the inconsistency in signed prediction error between two user groups [110]. There are three variants of Value Unfairness: Absolute Unfairness measures the inconsistency of the absolute prediction error, while Underestimation Unfairness and Overestimation Unfairness measure the inconsistency in how much the predictions underestimate and overestimate the true ratings, respectively. The lower the value, the fairer the recommendations.
Here $E_{g_0}[\hat{r}]_j$ is the average predicted score for the $j$-th item from group $g_0$, and $E_{g_0}[r]_j$ is the average rating for the $j$-th item from group $g_0$.
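As a hedged sketch of the four variants described above, assuming the per-item group-average predicted and true ratings are available as arrays (the function and argument names are ours):

```python
import numpy as np

def unfairness_metrics(pred_g0, true_g0, pred_g1, true_g1):
    """Value Unfairness and its variants between two user groups.
    Each argument is an array of per-item group averages: predicted
    scores (pred) or observed ratings (true)."""
    err0 = np.asarray(pred_g0, float) - np.asarray(true_g0, float)  # signed error, group 0
    err1 = np.asarray(pred_g1, float) - np.asarray(true_g1, float)  # signed error, group 1
    value = float(np.mean(np.abs(err0 - err1)))                      # signed-error gap
    absolute = float(np.mean(np.abs(np.abs(err0) - np.abs(err1))))   # absolute-error gap
    under = float(np.mean(np.abs(np.maximum(-err0, 0) - np.maximum(-err1, 0))))
    over = float(np.mean(np.abs(np.maximum(err0, 0) - np.maximum(err1, 0))))
    return value, absolute, under, over
```

For example, if one group's ratings are overestimated and the other's underestimated by the same amount, Value Unfairness is high while Absolute Unfairness is zero.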
Manuscript submitted to ACM Wang, et al.
The above metrics are only applicable to measuring inconsistency between two groups. In the following, we present metrics that measure unfairness for three or more groups. It is worth noting that since individual fairness can be considered a special case of group fairness (i.e., each individual belongs to a unique group), these group fairness metrics can theoretically also apply to individual fairness. In practice, however, the common metrics for individual and group fairness differ.
Variance. Variance is a commonly used metric for dispersion, which is applied at both the group level [77,103] and the individual level [77,103,105]. The utility can be the rating prediction error [77], the predicted recommendation satisfaction for a single user [103,105], or the average exposure for an item group [103]. The lower the value, the fairer the recommendations.
Min-Max Difference. Min-Max Difference (MMD) is the difference between the maximum and the minimum of all allocated utilities. This metric is used to measure the inconsistency of the average exposure for multiple item groups [39] and the disagreement among users in group recommendation at the individual level [92]. The lower the value, the fairer the recommendations.
F-statistic of ANOVA. The one-way analysis of variance (ANOVA) is used to determine whether there are statistically significant differences between the mean values of three or more independent groups. Its F-statistic can be considered a fairness measurement. The utility can be the rating prediction error for a single rating [96]. The lower the value, the fairer the recommendations.
Here $u(x_i)$ is the utility of an individual $x_i$ belonging to group $g_j$, $\bar{u}_j$ is the mean utility of group $g_j$, and $\bar{u}$ is the mean utility of all individuals.
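The three multi-group measurements above can be sketched as follows. This is a minimal numpy sketch under our own assumptions (function names are ours; utilities are given as one array per group):

```python
import numpy as np

def group_variance(groups):
    """Variance of the per-group mean utilities (lower = fairer)."""
    return float(np.var([np.mean(g) for g in groups]))

def min_max_difference(groups):
    """MMD: gap between the best- and worst-off groups' mean utilities."""
    means = [np.mean(g) for g in groups]
    return float(max(means) - min(means))

def f_statistic(groups):
    """One-way ANOVA F-statistic over individual utilities (lower = fairer)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_u = np.concatenate(groups)
    grand = all_u.mean()
    k, n = len(groups), len(all_u)
    # between-group mean square over within-group mean square
    between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (k - 1)
    within = sum(float(((g - g.mean()) ** 2).sum()) for g in groups) / (n - k)
    return between / within
```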
In the following, we present some metrics commonly used for individual fairness. Note that, in addition to the metrics below, the Variance metric above is also often used to measure individual fairness.
Gini coefficient. The Gini coefficient is widely used in sociology and economics to measure the degree of social unfairness [30,32,55,64,65]. To our knowledge, it is also the most commonly used metric for consistent individual fairness. The utility can be the predicted relevance for a user [30,55] or the exposure for an item [32,64,65]. The lower the value, the fairer the recommendations.
Jain's index. Jain's index [44] is commonly used to measure unfairness in network engineering. Some studies use it to measure the inconsistency of predicted user satisfaction in group recommendations [105] and the inconsistency of item exposure [118]. The higher the value, the fairer the recommendations.
Entropy. Entropy is often used to measure the uncertainty of a system. In recommendation, it is used to measure the inconsistency of item exposure [64,65,75]. Since a uniform (i.e., perfectly even) exposure distribution maximizes entropy, the higher the value, the fairer the recommendations.
Min-Max Ratio. Min-Max Ratio is the ratio of the minimum to the maximum of all allocated utilities. Some studies [48,105] use it to measure the inconsistency of predicted user satisfaction in group recommendation. The higher the value, the fairer the recommendations.
Least Misery. Least Misery is the minimum of all allocated utilities. It is also a commonly used fairness metric in group recommendation [48,79,105]. The higher the value, the fairer the recommendations.
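The individual-level measurements above can be sketched together. This is a minimal numpy sketch with our own function names; note that a uniform utility distribution gives Gini 0, Jain's index 1, maximal entropy, and Min-Max Ratio 1, i.e., the fairest value of each metric.

```python
import numpy as np

def gini(utilities):
    """Gini coefficient of individual utilities (0 = perfectly fair)."""
    u = np.sort(np.asarray(utilities, dtype=float))
    n = len(u)
    index = np.arange(1, n + 1)
    return float(np.sum((2 * index - n - 1) * u) / (n * np.sum(u)))

def jains_index(utilities):
    """Jain's fairness index (1 = perfectly fair)."""
    u = np.asarray(utilities, dtype=float)
    return float(u.sum() ** 2 / (len(u) * np.sum(u ** 2)))

def exposure_entropy(utilities):
    """Entropy of the normalized utility distribution (higher = more even)."""
    p = np.asarray(utilities, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def min_max_ratio(utilities):
    return min(utilities) / max(utilities)

def least_misery(utilities):
    return min(utilities)
```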

Metrics for Calibrated Fairness (CA)
Calibrated fairness requires defining the merit of an individual or group. We denote $M(\cdot)$ as a merit function that measures the merit of an individual or group. Based on $M(\cdot)$, we can calculate the fair distribution of the allocation, i.e., the proportion of the individual's or group's allocation to the total allocation in the fair case: $p_f(s_i) = \frac{M(s_i)}{\sum_j M(s_j)}$. We can also calculate the proportion of the total allocation that an individual or group receives in the current situation: $p(s_i) = \frac{u(s_i)}{\sum_j u(s_j)}$, where $u(\cdot)$ denotes the allocated utility. Most measurements of calibrated fairness measure the difference between the distribution of utilities $p$ and the distribution of merits $p_f$. Since all the group fairness metrics in calibrated fairness can be applied to multiple groups, we present them in the order of group fairness and then individual fairness.
MinSkew and MaxSkew. The deviation (Skew) for a certain group $g$ can be defined as $\text{Skew}(g) = \log \frac{p(g)}{p_f(g)}$. MinSkew and MaxSkew are then the minimum and maximum Skew over all groups, respectively. Here the utility can be the exposure of the item group, while $p_f$ is a predefined distribution [34]. For MinSkew, the higher the value, the fairer the recommendations; for MaxSkew, the lower the value, the fairer the recommendations.
KL-divergence. KL-divergence measures how one probability distribution differs from another. It can be used to measure the difference between $p_f$ and $p$. Here the utility can be the exposure of the item group, while $p_f$ can be calculated from the group's historical exposure [33,60,90,96]. The lower the value, the fairer the recommendations.
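The Skew-based and KL-based measurements above can be sketched as follows (a minimal numpy sketch; function names are ours, and both distributions are given as arrays of group shares):

```python
import numpy as np

def skew(p, p_fair, g):
    """Skew(g) = log of actual allocation share over fair share for group g."""
    return float(np.log(p[g] / p_fair[g]))

def min_skew(p, p_fair):
    """MinSkew over all groups (higher = fairer)."""
    return min(skew(p, p_fair, g) for g in range(len(p)))

def max_skew(p, p_fair):
    """MaxSkew over all groups (lower = fairer)."""
    return max(skew(p, p_fair, g) for g in range(len(p)))

def kl_divergence(p, p_fair):
    """KL(p || p_fair); 0 when the allocation matches the merit distribution."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(p_fair, dtype=float)
    mask = p > 0  # 0 * log(0) is treated as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```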

NDKL. NDKL is an item unfairness measure based on KL-divergence [34]. It computes the KL-divergence at each position and then obtains a normalized discounted cumulative value. The lower the value, the fairer the recommendations.
Here the normalizer $Z$ is computed as the highest possible value, and $KL_i$ is the KL-divergence of the top-$i$ ranking.

JS. Like KL-divergence, JS-divergence also measures how one probability distribution differs from another. Some work [70] uses JS-divergence as a metric instead of KL-divergence because it is symmetric, while KL-divergence is asymmetric. The lower the value, the fairer the recommendations.
Overall Disparity. Overall Disparity measures the average disparity between the proportions of utility and merit among different groups. The utility can be exposure-based or click-based [71,109]. The lower the value, the fairer the recommendations.
Generalized Cross Entropy. Generalized cross entropy [19,60] also measures how one probability distribution differs from another. The higher the value, the fairer the recommendations. Here $\beta$ is a hyperparameter.
In the following, we present calibrated fairness measures frequently used at the individual level.
L1-norm. The L1-norm is the sum of the absolute values of a vector's components. Some researchers [7,8,51] treat the merit and utility distributions as vectors and use the L1-norm to calculate the distance between them. This metric is often used at the individual level [7,8], and there is also work [51] that uses it to measure group-level unfairness. The lower the value, the fairer the recommendations.
It is worth noting that some measures of calibrated fairness and consistent fairness are interconvertible. Theoretically, for a calibrated fairness measurement, if we set $p_f$ to a uniform distribution, it becomes a measurement of consistent fairness. Conversely, for a consistent fairness measurement that contains $u(\cdot)$, we can replace $u(\cdot)$ with $\frac{u(\cdot)}{M(\cdot)}$, turning it into a calibrated fairness measurement.
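The L1-norm measure, together with the uniform-merit conversion just described, can be sketched as follows (a minimal numpy sketch; function names are ours):

```python
import numpy as np

def shares(values):
    """Normalize raw utilities (or merits) into a share distribution."""
    v = np.asarray(values, dtype=float)
    return v / v.sum()

def l1_unfairness(p, p_fair):
    """L1 distance between utility shares and merit shares (lower = fairer)."""
    return float(np.sum(np.abs(np.asarray(p, float) - np.asarray(p_fair, float))))

def consistent_l1(utilities):
    """Setting the merit distribution p_f to uniform converts the calibrated
    L1 measure into a consistent-fairness measure: it then penalizes any
    deviation from equal allocation."""
    p = shares(utilities)
    uniform = np.full(len(p), 1.0 / len(p))
    return l1_unfairness(p, uniform)
```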

Metrics for Envy-free Fairness (EF).
Envy-free fairness requires a definition of envy, which can differ across scenarios. In group recommendation, all users in a group receive the same recommendations. Serbos et al. [83] define envy as follows: Envy-freeness (in group recommendation). Given a group $G$, a group recommendation package $P$, and a parameter $\delta$, we say that a user $u \in G$ is envy-free for an item $i \in P$ if $r_{u,i}$ is in the top-$\delta\%$ of the preferences in the set $\{r_{v,i} : v \in G\}$.

This envy definition applies to a single item: a user $u$ feels envy for an item if at least $\delta\%$ of the users in the group like this item more than $u$ does. It is impossible for all users in a group to be fully envy-free (i.e., envy-free for all items in the package). In practice, m-envy-freeness is often used, which means that a user in the group is envy-free for at least $m$ items.
A measurement for envy-free fairness is the proportion of m-envy-free users, i.e., $\frac{|U_m|}{|G|}$, where $|U_m|$ is the number of m-envy-free users. The higher the value, the fairer the recommendations.
In general recommendation, different users receive different recommendations. Patro et al. [75] define envy-freeness as follows: Envy-freeness (in general recommendation). Given a utility metric $\phi$ and all the recommendation lists $\mathcal{L}$, we say that a user $u$ is envy-free of a user $v$ if and only if $\phi(u, L_u) \ge \phi(u, L_v)$, and the degree of envy can be defined as $\max(\phi(u, L_v) - \phi(u, L_u), 0)$. Here $\phi(u, L)$ is the sum of predicted relevance for user $u$ over the recommendation list $L$. This envy definition applies to each pair of users. Unlike envy in group recommendation, this definition does not involve a third user. Moreover, with the utility metric properly chosen, it is feasible to make all users envy-free.
The average envy among users can serve as a measurement of envy-free fairness, where $\text{envy}(u, v) = \max(\phi(u, L_v) - \phi(u, L_u), 0)$. The lower the value, the fairer the recommendations.
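The average-envy measurement for general recommendation can be sketched as follows (a minimal sketch under our own assumptions: relevance is a dense user-by-item matrix of predicted scores, and function names are ours):

```python
def average_envy(relevance, rec_lists):
    """Average pairwise envy in general recommendation (lower = fairer).
    relevance[u][i] is the predicted relevance of item i for user u;
    rec_lists[u] is the list of items recommended to user u."""
    def phi(u, items):  # predicted relevance sum of a list for user u
        return sum(relevance[u][i] for i in items)
    n = len(rec_lists)
    total = 0.0
    for u in range(n):
        for v in range(n):
            if u != v:
                # u envies v if u would prefer v's list to its own
                total += max(phi(u, rec_lists[v]) - phi(u, rec_lists[u]), 0.0)
    return total / (n * (n - 1))
```

When every user receives their own most-relevant items, the measure is zero; swapping two users' lists makes it positive.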
The two metrics above utilize predicted preferences that are often not consistent with users' true preferences.
Envy-free fairness based on true preferences requires answering counterfactual questions and is difficult to measure in the offline setting.As a complement, some researchers [21] propose a multi-armed bandit-based algorithm to audit envy-free fairness based on true preferences.

Metrics for Counterfactual Fairness (CF).
Li et al. [59] demonstrate that counterfactual user fairness can be guaranteed when user embeddings are independent of fairness-related attributes. Therefore, they use a classifier to predict fairness-related attributes from user embeddings and use classification measurements, e.g., Precision, Recall, AUC, and F1, to measure counterfactual fairness. The worse the classifier performs (i.e., the closer to random guessing), the fairer the model.

Metrics for Rawlsian Maximin Fairness (RMF).
Rawlsian maximin fairness holds that fairness depends on the worst-off individual or group. A simple measurement is the utility of the worst case, but it is vulnerable to noise. To make the metric robust, some work [22,121] uses the average utility of the bottom n% as a measurement. The higher the value, the fairer the recommendations.
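The noise-robust bottom-n% measurement above can be sketched as follows (a minimal numpy sketch; the function name and the ceiling-based cutoff are ours):

```python
import numpy as np

def bottom_fraction_utility(utilities, frac=0.1):
    """Average utility of the worst-off bottom fraction of individuals,
    a noise-robust Rawlsian maximin measurement (higher = fairer)."""
    u = np.sort(np.asarray(utilities, dtype=float))
    k = max(1, int(np.ceil(frac * len(u))))  # at least one individual
    return float(u[:k].mean())
```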

Metrics for Maximin-shared Fairness (MSF).
Maximin-shared fairness requires the outcome of each individual to exceed its maximin share. A measurement for item maximin-shared fairness is the proportion of individuals satisfying this condition, where the maximin share for every item is a constant value, i.e., the average exposure [75]. The higher the value, the fairer the recommendations.

A fair representation should be independent of fairness-related attributes, so some work [9,102] trains a classifier to predict the fairness-related attributes of users and items from their representations. They then use classification measurements (e.g., precision) to measure the fairness of the representations, similar to the counterfactual fairness measurements [59].

Since there are more methods in the last two categories, we further group the methods in these two categories; the specific sub-groups are illustrated in Fig. 3. The reviewed methods and corresponding brief descriptions are summarized in Table 7. It can be observed that there are only a few data-oriented methods. For ranking methods, regularization and adversarial learning are the dominant approaches, while reinforcement learning has also gained attention in recent years because it is better suited to modeling dynamics and long-term effects. For re-ranking methods, slot-wise re-ranking methods are dominant, but an increasing amount of recent work has focused on global-wise re-ranking.

As shown in Table 8, we also summarize the types of fairness issues solved by each method type. Each method type can solve several different types of fairness issues, and most fairness issues, in turn, can be addressed by multiple method types. However, some fairness issues are more specific: process fairness and counterfactual fairness issues are addressed only with adversarial learning, while Rawlsian maximin and maximin-shared fairness tend to be addressed with global-wise re-ranking. This may simply be because there is less work on these fairness issues, and designing other methods to solve them is worth exploring.

Data-oriented Methods
The data-oriented methods improve fairness by modifying the training data. Compared with the other types of methods, data-oriented methods are relatively rare.

Table 7.

Paper | Type | Brief Description | Publication
[24] | data-oriented | adjust the proportion of the protected group by resampling | FAT*
[77] | data-oriented | add antidote data to the training data | WSDM
[110] | regularization | use fairness metrics (e.g., value fairness) as fair regularization | NIPS
[46] | regularization | use distribution matching and mutual information terms as regularization | FAT*
[12] | regularization | add fairness regularization to SLIM | FAT*
[120] | regularization | induce orthogonality between insensitive latent factors and sensitive factors | CIKM
[6] | regularization | add pairwise fairness regularization based on randomized experiments | KDD
[96] | regularization | use F-statistic of ANOVA as regularization | WSDM
[9] | adversarial learning | fairness constraints for graph embeddings | ICML
[122] | adversarial learning | learn fair predicted scores by enhancing score distribution similarity | SIGIR
[102] | adversarial learning | learn fair representations in graph-based recommendation | WWW
[57] | adversarial learning | add text-based reconstruction loss to learn fair representations | WSDM
[59] | adversarial learning | learn personalized counterfactual fair user representations | SIGIR
[101] | adversarial learning | learn fair user representation in news recommendation | AAAI
[56] | adversarial learning | a GAN-based fair learning algorithm | WWW
[63] | reinforcement learning | add fairness-related rewards to improve long term fairness | PAKDD
[32] | reinforcement learning | add fairness-related constraints to improve long term fairness | WSDM
[33] | reinforcement learning | achieve Pareto efficient fairness-utility tradeoff by multi-objective RL | WSDM
[27] | other ranking method | a hybrid fair model with probabilistic soft logic | RECSYS
[8] | other ranking method | add a noise component to VAE | MEDES
[42] | other ranking method | use a pre-training and fine-tuning approach with bias correction techniques | WWW
[60] | other ranking method | adjust the gradient based on the predefined fair distribution | BIGDATARES.
[112] | slot-wise re-ranking | maximize ranking utility with group fairness constraint by two queues | CIKM
[83] | slot-wise re-ranking | use greedy algorithm to maximize fairness in group recommendation | WWW
[47] | slot-wise re-ranking | fairness-aware variation of the maximal marginal relevance | UMAP
[90] | slot-wise re-ranking | calibrated recommendation through maximal marginal relevance | RECSYS
[34] | slot-wise re-ranking | improve multiple group fairness by interval constrained sorting | KDD
[79] | slot-wise re-ranking | find pareto optimal items in group recommendation | SAC
[62] | slot-wise re-ranking | personalized fairness-aware re-ranking | RECSYS
[89] | slot-wise re-ranking | personalized fairness-aware re-ranking with different user tolerance | UMAP
[48] | slot-wise re-ranking | ensure fairness in group recommendation in a ranking sensitive way | RECSYS
[71] | slot-wise re-ranking | ensure fairness in dynamic learning to rank through p-controller | SIGIR
[109] | slot-wise re-ranking | ensure fairness in dynamic learning to rank by maximal marginal relevance | WWW
[82] | slot-wise re-ranking | enumerate fair packages for group recommendations | WSDM
[105] | user-wise re-ranking | fair group recommendation from the perspective of Pareto Efficiency | RECSYS
[68] | user-wise re-ranking | a series of recommendation policies to combine fairness and relevance | CIKM
[7] | user-wise re-ranking | ensure amortized fairness through integer linear programming | SIGIR
[86] | user-wise re-ranking | linear programming from the perspective of probabilistic rankings | KDD
[81] | user-wise re-ranking | mitigate outlierness in fair rankings through linear programming | WSDM
[93] | global-wise re-ranking | 0-1 integer programming with providers constraint | RECSYS
[75] | global-wise re-ranking | a re-ranking method for both user fairness and item fairness | WWW
[30] | global-wise re-ranking | fairness-aware explainable recommendation through 0-1 integer programming | SIGIR
[65] | global-wise re-ranking | a re-ranking method based on maximum flow | TOIS
[58] | global-wise re-ranking | ensure user group fairness through 0-1 integer programming | WWW
[22] | global-wise re-ranking | a re-ranking method for joint fairness via Lorenz dominance | NIPS
[103] | global-wise re-ranking | a re-ranking method for both user fairness and provider fairness | SIGIR
[121] | global-wise re-ranking | a learnable re-ranking method for fairness among new items | SIGIR

Considering that user unfairness might result from data imbalance between different user groups, Ekstrand et al. [24] use re-sampling to adjust the proportion of different user groups in the training data. Experiments on the Movielens 1M dataset show that this approach can alleviate unfairness, but not significantly.
Rastegarpanah et al. [77] design a relatively more complex but effective method. Drawing on data poisoning attacks, they address the unfairness problem by adding additional antidote data (e.g., fake user data) to the training data. Adding antidote data during training affects the predicted rating matrix, which in turn affects the fairness of recommendations. The antidote data can be updated by optimizing the fairness objective function through gradient descent. Compared to the re-sampling method, this approach can better mitigate unfairness, but it is also relatively more time-consuming.
In summary, we can adjust the training data to improve the fairness of recommendations. The advantage of these methods is their low coupling with the recommender system, since they do not require modification of the original recommendation model. Besides, as these methods work at the early stage of the recommendation pipeline, there are fewer constraints on the candidate set, so they have the potential to improve the fairness of the recommendation results significantly. However, since multiple stages exist between the data and the final presentation, their performance might be degraded by subsequent stages such as re-ranking for diversity. It is therefore challenging to design effective data-oriented methods.

Ranking Methods
Ranking methods mainly modify recommendation models or optimization targets to learn fair representations or prediction scores. Ranking is the main focus of research in recommendation techniques, and it is natural to use advanced techniques to solve problems such as fair representation learning and long-term fairness, which are difficult for the other two types of methods. Compared to data-oriented methods, the results of ranking methods are closer to the final presentation, and the improvement in fairness is more direct. Nevertheless, since a re-ranking stage may exist after the ranking stage, similar to data-oriented methods, their performance may be damaged by downstream re-ranking stages.
Depending on the technique used, current fairness methods for the ranking phase can be divided into regularization-based methods, adversarial learning-based methods, reinforcement learning-based methods, and others.

5.3.1 Regularization. One common approach is adding a fairness-related regularization term to the loss function.
Formally, denote $L_{rec}$ as the traditional recommendation loss function and $L_{fair}$ as the fairness-related regularization term; then the loss function considering fairness is formalized as $L = L_{rec} + \lambda \cdot L_{fair}$, where $\lambda$ is a trade-off hyperparameter.
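This combined loss can be sketched as follows. This is an illustrative example, not any specific paper's method: we use, as an assumed $L_{fair}$, the squared gap between two groups' mean prediction errors (a simple differentiable regularizer in the spirit of the metric-based terms discussed below), and all names are ours.

```python
import numpy as np

def loss_with_fairness(rec_loss, errors, group_ids, lam=0.5):
    """L = L_rec + lam * L_fair, where L_fair is an illustrative
    differentiable regularizer: the squared gap between the mean
    prediction errors of group 0 and group 1."""
    errors = np.asarray(errors, dtype=float)
    group_ids = np.asarray(group_ids)
    gap = errors[group_ids == 0].mean() - errors[group_ids == 1].mean()
    return float(rec_loss + lam * gap ** 2)
```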
One direct approach is to add fairness evaluation metrics [46,96,110,113] to the loss function as a regularization term, which requires that the metric be differentiable. It is difficult to use this approach to address unfairness in exposure or ranking, as the corresponding metrics are not differentiable, so existing work focuses more on unfairness in rating prediction. The advantage of this approach is its simplicity and effectiveness; the disadvantage is that it is limited in application and often results in a loss of recommendation performance.
In contrast, some approaches [6,12,120] impose indirect regularization on the model. Compared to direct methods, indirect methods can achieve better fairness and recommendation performance. We introduce some representative methods below.
To reduce the correlation between predicted scores and fairness-related attributes, Zhu et al. [120] propose a fairness-aware tensor-based recommendation framework (FATR), which induces orthogonality between the representations of users (or items) and the corresponding vector of fairness-related attributes by adding a regularization term to the tensor-based recommendation model. The loss function is given in Eq. (32) of [120].
Here $\mathcal{T}$ is a tensor denoting the complete preferences of users, $[\![\cdot]\!]$ is the Kruskal operator, $[\,\cdot\,]$ is the matrix concatenation operator, $\odot$ is the Khatri-Rao product, and $\circledast$ is the Hadamard product. $\mathcal{Y}$ denotes the observations, and $\Omega$ is the non-negative indicator tensor indicating whether an entry of $\mathcal{Y}$ is observed. $A_1, ..., A_N$ denote the latent factor matrices of all the modes of the tensor. Here $A_n \in \mathbb{R}^{m_n \times d}$ is the latent factor matrix of the fairness-related mode-$n$, where $d$ is the dimension of the latent factors and $m_n$ is the number of entities of mode-$n$; it can be split into two parts $A'_n$ and $A''_n$. Experiments on real datasets show that FATR achieves better recommendation performance and fairness than directly using the fairness metric as a regularization term, reflecting the advantage of the indirect approach.
While the above work focuses on the fairness of point-wise predicted scores, Beutel et al. [6] investigate fairness from the perspective of pair-wise ranking. They demonstrate that fairness in point-wise ranking tasks does not guarantee fairness in pair-wise ranking. To improve pair-wise ranking fairness, they add the residual correlation between fairness-related attributes and predicted preferences as a regularization term to motivate the model to have similar prediction accuracy across item groups. The loss function is given in Eq. (33) of [6], where the second term is the fairness regularization term. Here $s_j$ is the binary fairness-related attribute for item $j$, $q$ is the query consisting of user and context features, $y$ is the user click feedback, $z$ is the post-click engagement, $f_\theta(q, j)$ denotes the predictions $(\hat{y}, \hat{z})$ for item $j$, and $g(\hat{y}, \hat{z})$ is a monotonic ranking function of the predictions. $\mathcal{D}$ is the experimental data, and both $j$ and $j'$ are random variables over pairs from $\mathcal{D}$.
In summary, we can add a fairness-related regularization term to the loss to improve fairness. Compared to other ranking methods, regularization-based methods are more flexible and easily extensible. However, simply adding regularization terms may make it difficult for the model to learn fairness-related information, which can lead to suboptimal performance.

5.3.2 Adversarial Learning. Several studies use adversarial learning to address the fairness problem [9,57,59,101,102,122]. As mentioned earlier, process fairness requires that recommender systems use fair representations. Even though sensitive information is not directly used as input, it may still be indirectly learned by the model into the representation.
Adversarial learning is an effective method to reduce the sensitive information in representations, and it can also be applied to learn fair predicted scores. The basic frameworks of adversarial learning are illustrated in Fig. 4. A series of studies [9,59,101,102] aim to learn fair representations through adversarial learning; the basic framework is shown in Fig. 4(a). Apart from the recommendation model, they often introduce a discriminator for each fairness-related attribute. These discriminators predict the corresponding attribute value based on the representations output by a filter module, which is designed to remove unfair information from the original representations. If the discriminators cannot determine the values of the fairness-related attributes from the filtered representations, the filtered representations are considered fair. The learning process can be formalized as the following two-player minimax game.
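A common form of this two-player game is the following (a reconstruction, not necessarily the exact equation of any single surveyed paper; $L_{rec}$ is the recommendation loss, $L_{attr}$ the discriminators' attribute prediction loss, and $\lambda$ a trade-off hyperparameter):

```latex
\min_{\theta} \; \max_{\phi} \;\; L_{rec}(\theta) \;-\; \lambda \, L_{attr}(\theta, \phi)
```

The inner maximization over $\phi$ drives the discriminators to minimize $L_{attr}$, while the outer minimization over $\theta$ drives the recommendation model to minimize $L_{rec}$ while maximizing $L_{attr}$, i.e., making the filtered representations uninformative about the sensitive attributes.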

Here $L_{rec}$ is the recommendation loss, and $L_{attr}$ is the attribute prediction loss of the discriminators. $\theta$ denotes the parameters of the recommendation model, and $\phi$ denotes the parameters of the discriminators. $\lambda$ is a hyperparameter. We briefly introduce these methods below.
Bose and Hamilton [9] propose a method to reduce the sensitive information contained in node representations in graph neural networks, which can be applied to multiple fairness-related attributes simultaneously. The method introduces multiple filter modules into the model, each corresponding to one fairness-related attribute, to remove the corresponding attribute information from the node representation. After the sensitive information is filtered, all filtered representations of a node are averaged to obtain a representation without sensitive information. The discriminator of each fairness-related attribute then predicts the corresponding attribute of the node based on this representation. For recommendation, only the fair representations are used.
In addition to node representations, the network structure around nodes is also important information, which is ignored in the above approach. Wu et al. [102] add discriminators to a graph-based recommendation model that predict the fairness-related attributes of nodes based on both their embeddings and the embeddings of the surrounding network structure. Experimental results on real datasets validate that this achieves better results than methods considering only node information.
The discriminator proposed by Li et al. [59] also predicts the fairness-related attributes of users based on their embeddings. The main difference from previous work is that users can personalize their fairness-related attribute settings.
Unlike the above methods, Wu et al. [101] focus on the fairness of user representations in news recommendation, where the user representation is constructed from the user's reading history. They add a discriminator to the news recommendation model to learn fair user embeddings; it predicts the fairness-related attributes of users based on their embeddings. Besides, they add an attribute prediction task to learn unfair user embeddings, and they add a regularization term to the loss function to enhance the orthogonality between the fair and unfair embeddings.
Apart from learning fair representations, adversarial learning can also be used to learn fair predicted scores. Zhu et al. [122] add a discriminator to the recommendation model that predicts the fairness-related attributes of items based on the predicted scores of the recommendation model. They then ensure item fairness through the adversarial game between the recommendation model and the discriminator. The training process is formalized in Eq. (35) of [122].

Here $L_{adv}$ is the log-likelihood loss for an MLP adversary to classify items, and $L_{KL}$ is the KL-loss between the score distribution of each user and a standard normal distribution, which makes the score distribution of each user conform to a normal distribution. $L_{rec}$ is the recommendation loss. $\Theta$ and $\Psi$ are the learnable parameters of the recommendation model and the discriminator, respectively. $I_u^+$ is the set of positive items for user $u$. $\alpha$ and $\beta$ are hyperparameters.
Unlike the above methods that use discriminators to predict fairness-related attributes, some methods [56,57] utilize discriminators in other ways. Li et al. [57] apply discriminators to reconstruct user/item information. They leverage textual information to improve user fairness. The key point is to encourage user and item representations to restore the original textual information as much as possible and thus reduce the mainstream bias in minority representations.
In addition, Li et al. [56] propose a GAN-based algorithm consisting of a ranker and a controller. Each component contains both a discriminator and a generator. The ranker learns user preferences, and its discriminator distinguishes real interactions from model-generated interactions. The controller provides fairness signals to make the ranker fair, and its discriminator distinguishes the generated exposure distribution from the exposure distribution calculated based on the ranker's predictions.
Manuscript submitted to ACM Wang, et al.
In summary, current work often leverages adversarial learning to learn fair representations or predicted scores to improve recommendation fairness. Adversarial learning is well-suited to fair representation learning and is the dominant approach to this problem. However, since its optimization objective is a minimax optimization problem, it is more difficult to train than a traditional minimization problem.

An intuitive way to improve fairness is to design fairness-related rewards. To improve long-term fairness, Liu et al. [63] first propose a reinforcement learning-based method. They introduce fairness-related rewards to make recommendations fair. The reward is defined as the following Eq. (36) [63].
$$r_t = y_t + \lambda\big(\mathrm{Exp}_t^*(a_{i_t}) - \mathrm{Exp}_t(a_{i_t})\big) \qquad (36)$$

Here $\mathrm{Exp}_t^*(a)$ is the optimal allocation for group $a$ and $\mathrm{Exp}_t(a)$ is the allocation for group $a$ at time step $t$. $I$ is the item set and $i_t \in I$ is the item with attribute value $a_{i_t}$. $y_t$ is the user feedback on item $i_t$. $\lambda$ is a hyperparameter.
They also propose a reinforcement learning model based on the actor-critic architecture. The actor network learns a dynamic fairness-aware ranking strategy vector, which encodes user preferences and the system's fairness status. The ranking score is then calculated from this strategy vector and the item ID embedding. The critic network estimates the value based on the strategy vector and a fairness allocation vector, which provides information about the current allocation distribution across different groups.
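The idea of augmenting the utility reward with an allocation-gap term can be sketched as follows. This is a simplified stand-in for the fairness-related reward idea, with hypothetical names, not the exact reward of [63].

```python
def fairness_reward(feedback, group, target_share, exposure_counts, lam=0.5):
    """Utility feedback plus a bonus (or penalty) proportional to how
    under- (over-) exposed the recommended item's group currently is."""
    exposure_counts[group] = exposure_counts.get(group, 0) + 1
    actual_share = exposure_counts[group] / sum(exposure_counts.values())
    return feedback + lam * (target_share[group] - actual_share)

counts = {}
r1 = fairness_reward(1.0, "A", {"A": 0.5, "B": 0.5}, counts)  # 0.75: "A" now over-exposed
r2 = fairness_reward(1.0, "B", {"A": 0.5, "B": 0.5}, counts)  # 1.0: balance restored
```

The reward shrinks when recommending an already over-exposed group and grows when recommending an under-exposed one, nudging the learned policy toward the target allocation.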
Similarly, Ge et al. [33] also improve fairness by introducing fairness-related rewards. Still, the difference is that they formalize the problem as a multi-objective Markov decision process and solve it using conditioned networks.
Their approach can seek the Pareto frontier of fairness and utility, thus facilitating decision-makers to control the fairness-utility trade-off.

The above work adds fairness signals by changing the rewards, while some work promotes fairness by adding fairness-related constraints. Ge et al. [32] consider the dynamics in long-term fairness, in other words, the changes of group labels or item attributes caused by user feedback during the whole recommendation process. The dynamic fairness problem is modeled as a constrained Markov decision process, which has been well studied. Specifically, they add constraints to ensure the fairness of recommendations. The constraint is defined as the following Eq. (37).
$$\frac{\mathrm{Exp}_t(G_0)}{\mathrm{Exp}_t(G_1)} \le \alpha \qquad (37)$$

Here $\mathrm{Exp}_t(G_0)$ and $\mathrm{Exp}_t(G_1)$ are the numbers of exposures of group $G_0$ and group $G_1$ at iteration $t$, and $\alpha$ is a hyperparameter.
They additionally define the cost function as the number of sensitive-group items in the recommendation list and find that the fairness constraint can be transformed into a constraint on the cost function. Thus the fairness problem can be formalized as a constrained Markov decision problem and then solved. They also apply the actor-critic architecture, but the main difference is that their model contains two critics, which approximate the reward and the cost, respectively. Compared to the above method with explicit input about fairness status [63], this model has no fairness-related explicit input.
In summary, existing work on reinforcement learning achieves fair recommendations via modifications to the state, reward, or additional constraints. Compared to other methods, reinforcement learning can optimize long-term and dynamic fairness. Nevertheless, reinforcement learning is difficult to evaluate with offline data and has poor stability [94].

Other Methods.
There are also several other fair ranking methods. Islam et al. [42] use transfer learning to learn fair user representations for career recommendations, and they propose a fair neural model based on neural collaborative filtering (NCF) [40]. They first learn a pre-trained model on insensitive items, then transform the pre-trained user embeddings to mitigate fairness-related attribute biases, and finally fine-tune the model on sensitive items.
Li et al. [60] propose a contextual framework for fairness-aware recommendation, which supports different predefined fair distributions of performance. Specifically, the framework infers a coefficient for each user/item from the predefined fair distribution. Then the framework adjusts the gradient during the optimization process based on the coefficient.
Borges et al. [8] improve recommendation fairness by adding a stochastic component to a trained VAE model. They find that introducing normally distributed noise with high variance in the sampling phase can promote fairness despite a slight loss in recommendation performance.
Farnadi et al. [27] propose a rule-based fairness method. They use probabilistic soft logic to implement a fairness-aware hybrid recommender system.

Re-ranking Methods
Re-ranking methods mainly adjust the outputs of recommendation models to promote fairness. Re-ranking methods have the advantage that their results are nearly identical to the final presentation, making their improvement in outcome fairness the most straightforward. Besides, similar to data-oriented methods, they have low coupling with the recommender system, as they do not require changing recommendation models. However, because the candidate set in the re-ranking stage is typically small, the performance of re-ranking methods may be hampered. Moreover, they cannot resolve the fair representation issue in the ranking stage.
We divide current re-ranking methods into the following three types: slot-wise, user-wise, and global-wise. Fig. 6 illustrates the differences between these three types. The slot-wise re-ranking method re-ranks a recommendation list slot by slot, the user-wise method re-ranks the whole list of a single user at once, and the global-wise method re-ranks the lists of multiple users jointly.

Slot-wise.
A few studies [48,83,105] propose slot-wise re-ranking methods to improve user fairness in group recommendations. Serbos et al. [83] use a greedy approach to guarantee user fairness in group recommendations.
They define the set of satisfied users for a recommendation package $P$ as $F(P)$. Then the gain of adding a new item $i$ to the current recommendation package $P$ can be defined as $Gain(P, i) = |F(P \cup \{i\}) \setminus F(P)|$. The recommendation package can be constructed greedily, i.e., starting with the empty set and gradually adding the item that maximizes $Gain(P, i)$. Lin et al. [105] also design a greedy algorithm to ensure user fairness in the group recommendation scenario. The difference is that they consider the Pareto efficiency between fairness and recommendation performance.
They define the overall recommendation performance as $SW(g, R)$ and the fairness utility as $F(g, R)$. Then the Pareto frontier can be obtained by maximizing $\lambda \times SW(g, R) + (1 - \lambda) \times F(g, R)$. Further, recommendations can be obtained by adding items to the recommendation list one by one through a greedy strategy. Similarly, Sacharidis [79] finds Pareto optimal items to promote fairness. After obtaining the candidate Pareto optimal items, they generate ranking scores by linear aggregation strategies and estimate the probability of an item being ranked in the top-K under any strategy. Items are finally ranked based on the estimated probability. While previous studies consider only fixed-length recommendation lists, Kaya et al. [48] consider the fairness of different positions simultaneously. The method greedily selects each item to optimize an objective defined over the candidate set $I$, where $S$ is the recommendation list recommended to the group and $p(rel \mid u, i)$ is the probability that item $i$ is relevant to user $u$.
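The greedy package construction of Serbos et al. [83] described above can be sketched in a few lines of Python. The `satisfied` helper and the 0.8 satisfaction threshold are illustrative assumptions, not details from the paper.

```python
def satisfied(package, user_prefs, threshold=0.8):
    """Users with at least one item in the package rated >= threshold."""
    return {u for u, prefs in user_prefs.items()
            if any(prefs.get(i, 0.0) >= threshold for i in package)}

def greedy_package(candidates, user_prefs, size):
    """Repeatedly add the item whose inclusion satisfies the most new users."""
    package = set()
    for _ in range(size):
        gain = lambda i: len(satisfied(package | {i}, user_prefs)
                             - satisfied(package, user_prefs))
        package.add(max(candidates - package, key=gain))
    return package

prefs = {"u1": {"a": 0.9}, "u2": {"b": 0.9}, "u3": {"b": 0.85}}
package = greedy_package({"a", "b", "c"}, prefs, size=2)  # {'a', 'b'}
```

The first pick is "b" (it satisfies two users), then "a" is added for its marginal gain of one user, illustrating how the marginal-gain criterion spreads satisfaction across the group.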
In the general recommendation scenario, existing methods introduce fairness from two main perspectives: one is to maximize utility while satisfying fairness constraints, and the other is to jointly optimize fairness and utility.
The former class of methods ensures that the final result is compliant with the fairness constraint, which may entail a relatively large performance loss, while the latter class of methods makes a trade-off between performance and fairness.
Some studies [34,112] propose algorithms to satisfy the fairness constraint as much as possible at each position.
Zehlike et al. [112] propose a priority-queue-based approach, FA*IR, for item fairness scenarios where only two groups exist. FA*IR maintains two priority queues sorted by relevance, one per group. At each position, FA*IR determines whether the current representation of the protected group satisfies the fairness constraint. If not, it selects the item with the highest relevance in the protected queue; otherwise, it compares the two queues and selects the item with the highest relevance overall. Based on FA*IR, Geyik et al. [34] propose three slot-wise methods to re-rank the results when more than two groups exist. The first two methods can be considered extensions of FA*IR. The third algorithm treats fairness-constrained re-ranking as an interval constrained sorting problem.
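A simplified FA*IR-style selection can be sketched as below. Note that the real FA*IR derives the per-position minimum number of protected items from binomial quantiles with a significance correction, whereas this sketch uses a plain proportion floor.

```python
import math

def fair_topk(protected, unprotected, k, min_prop):
    """protected/unprotected: lists of (relevance, item), sorted descending."""
    ranking, n_prot, pi, ui = [], 0, 0, 0
    for pos in range(1, k + 1):
        # Minimum protected items required among the first `pos` slots
        # (a fixed-proportion simplification of FA*IR's binomial test).
        must_protect = n_prot < math.floor(min_prop * pos)
        prot_left = pi < len(protected)
        unprot_left = ui < len(unprotected)
        if prot_left and (must_protect or not unprot_left
                          or protected[pi][0] >= unprotected[ui][0]):
            ranking.append(protected[pi][1])
            pi += 1
            n_prot += 1
        else:
            ranking.append(unprotected[ui][1])
            ui += 1
    return ranking

protected = [(0.6, "p1"), (0.5, "p2")]
unprotected = [(0.9, "u1"), (0.8, "u2"), (0.7, "u3")]
ranking = fair_topk(protected, unprotected, k=4, min_prop=0.5)
# ['u1', 'p1', 'u2', 'p2']: protected items are pulled up at positions 2 and 4
```

Even though the protected items have strictly lower relevance, the floor forces them into every other slot, which is exactly the representation guarantee FA*IR formalizes.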
To make a trade-off between fairness and recommendation performance in the re-ranking phase, several studies [47,62,89,90] jointly optimize fairness and utility and provide hyperparameters to control the loss of recommendation performance. The process of these algorithms can be formalized as the following equation.
$$i^* = \arg\max_{i \in C}\ \lambda\, s(u, i) + (1 - \lambda)\, f(u, i, R)$$

Here $C$ is the set of candidate items. $R$ is the recommendation list for user $u$, which is empty initially. $s(u, i)$ is the predicted preference of user $u$ for item $i$. $f(u, i, R)$ is the fairness score. $\lambda$ is a hyperparameter to control the trade-off between fairness and utility. Each time, these algorithms select an item $i^*$ from all the available candidate items and then put it into the recommendation list $R$.
Different studies define the fairness score $f(u, i, R)$ differently, which is the main difference between them.
Steck [90] defines the fairness score from the user perspective and argues that recommendations should be calibrated to the interests of users, as measured through their interaction history. The fairness score is defined as the KL-divergence between the distribution over item groups in the history of user $u$ and the distribution over item groups in the recommendation list $R \cup \{i\}$. From the item perspective, Karako and Manggala [47] draw on the ideas of the Maximal Marginal Relevance (MMR) re-ranking algorithm and define the fairness score based on item embeddings, measuring how the new item $i$ contributes to the embedding difference between two groups. In addition, considering users' different tolerance for diversity, Liu et al. [62] propose a personalized re-ranking method for item fairness. The method of Liu et al. [62] is only available for a single attribute. Building on it, Sonboli et al. [89] find that user tolerance differs across item attributes. They define the personalized fairness score based on multiple item attributes and achieve a better trade-off between fairness and utility.
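A minimal sketch of this greedy trade-off, using Steck's calibration idea as the fairness score: each step picks the item maximizing λ·relevance minus (1−λ)·KL between the user's historical genre distribution and that of the extended list. The smoothing constant and the toy data are assumptions.

```python
import math

def kl(p, q, eps=1e-6):
    """KL(p || q) over genre distributions, with smoothing."""
    return sum(p[g] * math.log((p[g] + eps) / (q.get(g, 0.0) + eps))
               for g in p if p[g] > 0)

def genre_dist(items, genre_of):
    d = {}
    for i in items:
        d[genre_of[i]] = d.get(genre_of[i], 0) + 1
    return {g: c / len(items) for g, c in d.items()}

def rerank(candidates, scores, history_dist, genre_of, k, lam=0.7):
    """Greedy calibrated re-ranking: relevance minus miscalibration."""
    chosen, pool = [], set(candidates)
    for _ in range(k):
        best = max(pool, key=lambda i: lam * scores[i] - (1 - lam) *
                   kl(history_dist, genre_dist(chosen + [i], genre_of)))
        chosen.append(best)
        pool.remove(best)
    return chosen

genre_of = {"a": "drama", "b": "drama", "c": "comedy"}
scores = {"a": 0.9, "b": 0.8, "c": 0.7}
history = {"drama": 0.5, "comedy": 0.5}
result = rerank(["a", "b", "c"], scores, history, genre_of, k=2, lam=0.5)
# ['a', 'c']: the less relevant comedy is picked to match the 50/50 history
```

Without the calibration term the list would be ["a", "b"]; the KL penalty overrides pure relevance to mirror the user's historical genre mix.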
In the dynamic ranking scenario, Morik et al. [71] propose a re-ranking method based on a proportional controller.
This method also uses a linear strategy to combine recommendation performance and fairness, ranking items by the following score. They theoretically prove that fairness can be guaranteed when the number of rankings is large enough.
$$\hat{R}(d \mid u) + \lambda\, \mathrm{err}_t(d)$$
Here $\hat{R}(d \mid u)$ is the estimated relevance of item $d$ for user $u$, and $\mathrm{err}_t(d)$ is the error term measuring how much fairness would be violated if $d$ were recommended.
Further, in the dynamic ranking scenario, Yang and Ai [109] take marginal fairness into account, i.e., the gain in fairness each time a new item is added to the recommendation list. They find that the group that maximizes marginal fairness has the lowest current utility-merit ratio. Based on this finding, they propose a probabilistic re-ranking method that jointly optimizes utility and fairness. Specifically, the method recommends the most relevant item $d^r$ with probability $\lambda$ and the fairness-aware item $d^f$ with probability $1 - \lambda$:
$$d_k^t = \begin{cases} d^r & \text{with probability } \lambda \\ d^f & \text{with probability } 1 - \lambda \end{cases}$$
Here $d_k^t$ is the item selected for the $k$-th position of the presented list at time step $t$, and $\lambda$ is a hyperparameter.
In summary, the slot-wise methods re-rank independently for each user and add items to the re-ranked list one by one.
Compared to other re-ranking methods, slot-wise re-ranking tends to be more intuitive and efficient but short-sighted, which may lead to suboptimal performance.

User-wise.
Apart from picking items slot by slot, we can also directly find the recommendation list for a user based on the optimization goal of the whole list. A popular paradigm is integer programming. The basic idea is to treat some decisions of the re-ranking as decision variables and impose some constraints so that the re-ranking problem can be transformed into an integer programming problem.
In the group recommendation scenario, Lin et al. [105] propose an integer programming-based algorithm to ensure user fairness. The integer programming problem can be formalized as the following Eq. (42) [105]. The binary decision variable $x_i$ indicates whether the recommendation set contains item $i$. The optimization objective is a linear combination of the overall recommendation performance $SW(g, R)$ and the fairness utility $F(g, R)$.
$$\max_{x}\ \lambda \times SW(g, R) + (1 - \lambda) \times F(g, R) \quad \text{s.t.} \quad \sum_i x_i = k,\; x_i \in \{0, 1\} \qquad (42)$$
Here $k$ is the length of the recommendation list, and $\lambda$ is a hyperparameter. This problem is NP-hard. They relax $x_i$ to fractional values between zero and one to turn it into a convex optimization problem and then select the items with the greatest values of $x_i$.
In the general recommendation scenario, Biega et al. [7] also formalize the fairness problem as an Integer Linear Programming (ILP) problem. The binary decision variable $X_{i,j}$ indicates whether the $i$-th item is placed at the $j$-th position, and the optimization objective is an amortized unfairness metric calculated over the previous ranking results.
Unlike the previous work, they prevent large losses in recommendation performance by adding performance-related constraints. The ILP problem is formalized as the following Eq. (43) [7] and then solved with the Gurobi optimization solver.
$$\min_{X}\ \sum_i \Big|\Big(A_i^{t-1} + \sum_j a_j X_{i,j}\Big) - \big(R_i^{t-1} + r_i\big)\Big| \quad \text{s.t.} \quad \sum_j X_{i,j} = 1\ \forall i,\ \ \sum_i X_{i,j} = 1\ \forall j,\ \ \mathrm{NDCG}(X) \ge \theta \qquad (43)$$
Here $A_i^{t-1}$ denotes the cumulative attention value of the $i$-th item over the previous $t-1$ ranking results, and $a_j$ denotes the attention value assigned to the $j$-th position. $R_i^{t-1}$ denotes the cumulative relevance value, and $r_i$ denotes the relevance of the $i$-th item in the current ranking. $\theta$ is a threshold, meaning the NDCG of the changed ranking is required not to fall below a certain value.
Different from the previous work, Singh and Joachims [86] formalize the problem as a linear program and solve it from a probabilistic ranking perspective. The problem is formalized as the following Eq. (44) [86].
$$\max_{P}\ \sum_{i,j} P_{i,j}\, u_i\, v_j \quad \text{s.t.} \quad \sum_i P_{i,j} = 1\ \forall j,\ \ \sum_j P_{i,j} = 1\ \forall i,\ \ 0 \le P_{i,j} \le 1 \qquad (44)$$
Here the decision variable $P_{i,j}$ is fractional and denotes the probability of item $i$ being placed at position $j$. The optimization objective is the expected recommendation performance, where $u_i$ is the predicted relevance score for item $i$ and $v_j$ is the position coefficient for position $j$. After the linear program is solved, the final ranking can be sampled through Birkhoff-von Neumann decomposition. Noting that outliers in rankings may influence the exposure of items, [81] further extends this approach to mitigate outlierness for fair rankings.
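The objective of this probabilistic formulation is easy to compute for a given ranking distribution. A small sketch, with made-up relevance and position-bias values, shows how spreading probability mass across positions trades utility for exposure:

```python
def expected_utility(P, u, v):
    """P[i][j]: probability item i is shown at position j (doubly stochastic);
    u[i]: predicted relevance; v[j]: position bias (examination probability)."""
    return sum(P[i][j] * u[i] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def expected_exposure(P, v, i):
    """Expected exposure of item i under the ranking distribution P."""
    return sum(P[i][j] * v[j] for j in range(len(v)))

u = [1.0, 0.5]                            # item relevances
v = [1.0, 0.5]                            # position biases
P_det = [[1.0, 0.0], [0.0, 1.0]]          # always rank item 0 first
P_mix = [[0.5, 0.5], [0.5, 0.5]]          # coin-flip between the two orders
util_det = expected_utility(P_det, u, v)  # 1.25
util_mix = expected_utility(P_mix, u, v)  # 1.125: some utility given up
expo_mix = expected_exposure(P_mix, v, 1) # 0.75, now equal for both items
```

The fairness constraints of [86] would be added as linear conditions on these expected exposures, keeping the whole problem a linear program.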
In addition to programming-based methods, Mehrotra et al. [68] propose several fairness-aware recommendation strategies. The traditional recommendation strategy maximizes relevance: assuming $A$ is the set of candidate recommendation lists, it can be formalized as $a^* = \arg\max_{a \in A} \phi(a, r)$, where $\phi(a, r)$ is the relevance estimation function, while a strategy that considers only fairness is $a^* = \arg\max_{a \in A} \psi(a)$, where $\psi(a)$ is the fairness estimation function. To combine fairness with relevance, they propose an interpolation strategy, $a^* = \arg\max_{a \in A} (1 - \beta)\,\phi(a, r) + \beta\,\psi(a)$, and a probabilistic strategy as the following Eq. (45) [68].
Here $\beta$ is a hyperparameter.
In summary, the user-wise methods also re-rank independently for each user, and they try to find the optimal list based on the optimization goal of the whole list. Integer programming based on heuristic algorithms is the mainstream approach. Compared to slot-wise methods, it considers the information of the whole list to achieve better performance but is more time-consuming. Besides, compared with global-wise methods, user-wise methods re-rank independently for each user, which sometimes results in suboptimal performance.

Global-wise.
Mathematical programming is still a common paradigm. Unlike user-wise methods, the decision variable in global-wise re-ranking methods is usually a binary variable indicating whether an item is recommended to a user. We introduce some representative methods below.
Similar to the user-wise approach, Li et al. [58] propose an integer programming-based approach to solve the user unfairness problem in the general recommendation scenario, which formalizes the problem as the following Eq. (46).
$$\max_{W}\ \sum_{u} \sum_{i} W_{u,i}\, S_{u,i} \quad \text{s.t.} \quad UGF(Z_1, Z_2, W) \le \varepsilon,\ \ \sum_i W_{u,i} = k\ \forall u,\ \ W_{u,i} \in \{0, 1\} \qquad (46)$$
Here the decision variable $W_{u,i}$ is a binary variable indicating whether item $i$ is recommended to user $u$. $S_{u,i}$ is the preference of user $u$ for item $i$. $Z_1$ and $Z_2$ are two groups of users. $UGF$ is the measurement of user unfairness, so that user group fairness can be guaranteed. $k$ is the length of the recommendation list. $\varepsilon$ is a hyperparameter.
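For intuition, this kind of constrained selection can be solved by brute force on tiny instances. The sketch below enumerates all per-user item subsets and keeps the most useful feasible solution, using a simple average-utility gap between the two user groups as a stand-in for UGF; real instances require an ILP solver.

```python
from itertools import combinations, product

def fair_recs(scores, groups, k, eps):
    """Maximize total preference subject to |avg utility gap| <= eps."""
    users = list(scores)
    options = {u: list(combinations(scores[u], k)) for u in users}
    g1 = [u for u in users if groups[u] == 1]
    g2 = [u for u in users if groups[u] == 2]
    best, best_val = None, float("-inf")
    for choice in product(*(options[u] for u in users)):
        cand = dict(zip(users, choice))
        util = {u: sum(scores[u][i] for i in cand[u]) for u in users}
        gap = abs(sum(util[u] for u in g1) / len(g1) -
                  sum(util[u] for u in g2) / len(g2))
        total = sum(util.values())
        if gap <= eps and total > best_val:
            best, best_val = cand, total
    return best

scores = {"u1": {"a": 1.0, "b": 0.2}, "u2": {"a": 0.4, "b": 0.3}}
groups = {"u1": 1, "u2": 2}
rec = fair_recs(scores, groups, k=1, eps=0.3)
# {'u1': ('b',), 'u2': ('a',)}: the unconstrained optimum (both take 'a' or
# u1 takes 'a') violates the gap constraint, so u1 is handed its weaker item
```

The example makes the trade-off concrete: the fairness constraint rules out the utility-optimal assignment and the solver settles for the best feasible one.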
While previous work focuses on the fairness of recommendation performance across different users, Fu et al. [30] use integer programming to solve fairness problems in knowledge-based explainable recommendation. The integer programming problem is similar to that of Li et al. [58], and they add a fairness constraint to the optimization problem, which controls the unfairness of explanation diversity in the knowledge graphs.
The above studies focus on user fairness. For item fairness, Sürer et al. [93] also propose an integer programming-based method. They first formalize the fairness problem as a 0-1 integer programming problem with provider fairness constraints, then relax the conditions using the Lagrangian method, and finally optimize the problem using the subgradient method.
In addition to programming-based methods, there are also some other re-ranking methods. Mansoury et al. [64,65] propose a post-processing method for item fairness based on maximum flow matching. The algorithm builds a bipartite graph where the weight between user $u$ and item $i$ is calculated based on the preference of $u$ for $i$ and the degree of $i$, and then iteratively solves the maximum flow matching problem on the graph. Finally, recommendation lists are constructed based on the candidates identified by the algorithm. Besides, Zhu et al. [121] propose a parametric post-processing framework for solving the item fairness problem in cold-start scenarios. The method applies an autoencoder to transform the predicted user preference vector. The transformation needs to satisfy two requirements: the predicted score distribution of under-served items should be as close to the distribution of best-served items as possible, and the predicted scores for every user should conform to the same distribution. They propose a generative method and a score scaling method to achieve these requirements.
The above work only considers one-sided fairness. To improve joint fairness, Patro et al. [75] propose a re-ranking method consisting of two phases. The first phase greedily assigns the most relevant feasible item to each user's recommendation list in a round-robin manner with limited exposure for each item, which ensures that the exposure of each item is greater than a certain value. The second phase does not limit item exposure and recommends the most relevant items to users who have not yet received enough recommendations. They theoretically prove that this method can guarantee both envy-free fairness for users and maximin-shared fairness for items. Also at the individual level, Do et al. [22] consider Rawlsian maximin fairness for both users and items. They propose to re-rank by maximizing concave welfare functions of users and items and provide a tractable inference method based on the Frank-Wolfe algorithm.
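The two-phase idea of Patro et al. [75] can be sketched as follows. This is a simplification: the actual method derives its exposure guarantees from relevance, while this version simply caps each item's exposure during phase 1.

```python
def fairrec_sketch(prefs, k, max_exposure):
    """prefs[u]: items sorted by decreasing relevance for user u."""
    exposure = {}
    rec = {u: [] for u in prefs}
    # Phase 1: round-robin; each user takes their best item whose
    # exposure is still below the cap.
    for _ in range(k):
        for u in prefs:
            for item in prefs[u]:
                if item not in rec[u] and exposure.get(item, 0) < max_exposure:
                    rec[u].append(item)
                    exposure[item] = exposure.get(item, 0) + 1
                    break
    # Phase 2: fill any remaining slots by pure relevance, ignoring the cap.
    for u in prefs:
        for item in prefs[u]:
            if len(rec[u]) >= k:
                break
            if item not in rec[u]:
                rec[u].append(item)
    return rec

prefs = {"u1": ["a", "b", "c"], "u2": ["a", "b", "c"]}
rec = fairrec_sketch(prefs, k=2, max_exposure=1)
# {'u1': ['a', 'c'], 'u2': ['b', 'a']}: the cap spreads exposure over all
# three items before phase 2 tops up u2's list with the popular item 'a'
```

Without the cap, both users would receive ["a", "b"] and item "c" would get no exposure at all; the round-robin phase is what produces the item-side guarantee.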
While the above methods guarantee joint fairness at the individual level, Wu et al. [103] propose an offline re-ranking method and an online method that improve item fairness at the group level and user fairness at the individual level. We introduce the offline version here, as the online version is similar. The algorithm recommends items for all users from position 1 to position k, i.e., it does not fill a position until all positions before it have been filled. The users are sorted by current recommendation quality (randomly for the first position). Then the algorithm greedily assigns the most relevant feasible item to each user's recommendation list with limited exposure for each provider. If no item is available, the position is skipped. After the items at position k are selected, the skipped positions are filled with items from the providers with the lowest exposure to further reduce unfairness. Experiments show that this algorithm achieves better fairness than the above algorithm of Patro et al. [75].
In summary, global-wise methods take global effects into account and re-rank multiple lists at a time. Since they re-rank different users simultaneously, they are more suitable for user fairness than other re-ranking methods and tend to achieve better performance in amortized fairness. However, the dependency between different lists makes the re-ranking process difficult to parallelize and more time-consuming.
6 DATASETS FOR FAIRNESS RECOMMENDATION STUDY

Overview of Fairness Recommendation Datasets
As mentioned in Section 2, most work aims at improving group fairness in recommendation. Group fairness requires certain criteria to divide groups, usually the attributes contained in the dataset, such as the gender of users.
However, not all recommendation datasets have such attribute information, and existing researchers have not paid equal attention to different attributes. To help researchers easily find fairness-related attributes and the relevant datasets, we survey the recommendation datasets used in previous fairness studies and list the attributes that researchers have considered. The reviewed datasets are summarized in Table 9.
It is worth mentioning that fairness can also be studied on datasets without attributes. If researchers want to study fairness on attribute-free datasets, there are two options to our knowledge. One is to research fairness issues that do not require additional attributes to divide groups, such as Rawlsian maximin fairness at the individual level. The other is to manually construct attributes based on interaction information, such as item popularity and user activity.
These attribute-free datasets generally only need to contain user and item ID information and ID-aligned user feedback (e.g., rating, click, and purchase) and may not be limited to the datasets summarized below.
The existing datasets for fairness recommendation studies are relatively rich. As seen in Table 9, there are a relatively large number of recommendation datasets containing attribute information. The scenarios of these datasets are diverse, including movie recommendation (e.g., Movielens, Flixter, and Netflix), e-commerce recommendation (e.g., Amazon, ModCloth), and job recommendation (e.g., Xing). These datasets include both large-scale datasets (e.g., Amazon) and small-scale datasets (e.g., Movielens 100K). The types of interactions in existing datasets are diverse, including impressions, clicks, and ratings. Moreover, some datasets contain multi-modal information, such as Amazon and Yelp.
As different information is available in different scenarios, the attributes considered by researchers are often dataset-specific and vary significantly from one dataset to another, especially for item attributes. For user attributes, gender and age are frequently considered, since anti-discrimination laws demand that these attributes be treated fairly. In contrast, the item attributes researchers are concerned about are more diverse, including categories, publishing years, providers, etc.
Apart from these dataset-specific attributes, there are also some generic attributes to divide groups, such as user activity and item popularity, which only depend on interactions and can be obtained in all datasets. Researchers who cannot use sensitive attributes for some reason (e.g., privacy) could consider using interaction information to construct these generic attributes. However, it should be noted that such generic attributes are often dynamic, i.e., an individual may belong to different groups at different times. For example, a currently popular item may have been cold previously, which means it belonged to the protected group before but is in the unprotected group now [32].
While the existing datasets for fairness studies are diverse, some scenarios and attributes are still worth exploring.
For one thing, fairness research can be conducted on emerging scenarios, such as short video recommendation, which involves multiple modalities such as video and text. For another, existing datasets lack information on some attributes that receive considerable attention, such as race, which is emphasized in anti-discrimination laws [41].
New data may need to be collected to facilitate relevant research, but privacy concerns must also be considered.
Since most work is attribute-based, we present the datasets with fairness-related attributes and the datasets without fairness-related attributes in Sections 6.2 and 6.3, respectively.

Datasets with Fairness-related Attributes
Amazon. This dataset contains product reviews of various categories from Amazon with user and item profiles, including 142.8 million reviews. For user fairness, previous studies divided user groups based on gender [96] or user activity [30,58]. Gender information is not directly accessible, so some researchers use interactions with Clothing products to infer gender identities [96]. Active and inactive users can be grouped based on their number of interactions, total consumption capacity, or maximum consumption level [58]. For item fairness, previous studies usually divided item groups according to their categories [122]. A few studies use the gender of the model appearing in product images as a grouping criterion, detected through an industrial face detection API [96].
Ciao. This dataset is collected from a popular Web review site of products and contains user trust networks and ratings. The whole dataset contains 484K ratings from 12.3K users on 106K items. Some researchers use item popularity to divide item groups [33].
Ctrip Flight. This dataset contains ticket orders on an international flight route from Ctrip with basic customer information and some ticket information. The entire dataset includes 3.8K customers, 6K kinds of air tickets, and 25K orders. Some researchers treat the airline that a ticket belongs to as the provider, dividing item groups by providers [103].
Flixter. This dataset is a classical movie recommendation dataset and contains 9.1 million movie ratings from Flixter.
Some researchers use item popularity to divide item groups [46]. The movies are first sorted by interaction number in descending order; the protected and unprotected groups are then divided according to whether a movie is in the top 1% of the sorted list.
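Constructing such a popularity split from raw interactions takes only a few lines. In the sketch below the fraction is enlarged from 1% so the toy example is non-trivial; the function and data names are illustrative.

```python
from collections import Counter

def popularity_groups(interactions, top_frac=0.01):
    """interactions: (user, item) pairs; returns (unprotected, protected)
    item sets, where the unprotected group is the popular head."""
    counts = Counter(item for _, item in interactions)
    ranked = [item for item, _ in counts.most_common()]
    n_top = max(1, int(len(ranked) * top_frac))
    return set(ranked[:n_top]), set(ranked[n_top:])

interactions = [("u1", "a"), ("u2", "a"), ("u3", "a"),
                ("u1", "b"), ("u2", "b"), ("u1", "c")]
popular, long_tail = popularity_groups(interactions, top_frac=0.34)
# popular == {'a'}, long_tail == {'b', 'c'}
```

Because the split depends only on interaction counts, it can be recomputed on any dataset and at any point in time, which is also why such groups are dynamic.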
Google Local. This dataset is a location recommendation dataset containing 11.4 million reviews of 3.1 million local businesses from Google Maps. Some researchers divide item groups based on the business of the reviewed item [75,103].
Insurance. This dataset is an insurance recommendation dataset from Kaggle, which contains user information such as gender and occupation. Some researchers divide user groups according to gender, marital status, and occupation [59].
Last.FM 1K. This dataset is a music recommendation dataset containing the play records of 992 users from Last.FM. It contains user demographic information such as gender and age, which some researchers use to divide user groups [24].
Last.FM 360K. This dataset is similar to Last.FM 1K but larger, including 17 million play records of 360K users. It also contains the gender and age of users, and some researchers divide user groups based on these attributes [24,102].
ModCloth. This dataset is an e-commerce recommendation dataset where many products include two human models with different body shapes. The entire dataset contains 100K reviews of 1K clothing products from 44K users.
Additionally, there are records of the product sizes that users purchase. For users, some researchers divide users into different body shape groups according to the average size of their purchases [96]. For items, they divide items into different groups according to the body shape of their models [96].
Movielens 100K. This dataset is a classical movie recommendation dataset containing 100K movie ratings with user and item profiles. Some researchers divide items into two groups by item popularity, i.e., the number of exposures for each item [32]. Besides, some studies divide items into old and new movies according to the year of the movie [120], and some researchers randomly assign movies among some providers [63,93].
Movielens 1M. This dataset is similar to Movielens 100K but larger, including 1 million ratings from 6K users on 4K movies. For users, previous studies divide user groups by gender, age, and occupation [42,59,102].
For items, the movie genres are seen as a fairness-related attribute [122], and item popularity is also considered [32].
Movielens 20M. This dataset is also collected from Movielens, containing 20 million ratings from 138K users on 27K movies. Some studies consider genres [90] and production companies of movies [71] as fairness-related attributes.
Sushi. This dataset includes 5K responses to a questionnaire survey on sushi preference, containing preference data and demographic data. Some researchers consider three types of fairness-related attributes: age, gender, and whether or not a type of sushi is seafood [46].
Xing. This dataset is a user-view-job dataset containing 320M interactions with user and item profiles such as career level. Some researchers [19] consider membership, education degree, and working country as fairness-related attributes for items. In addition, whether the user is a premium member is also regarded as a fairness-related attribute for users [60].
Yelp. This dataset is a business review dataset. Some studies focus only on the restaurant business and divide item groups based on the food genres of restaurants [122].

Datasets without Fairness-related Attributes
We also survey the recommendation datasets without fairness-related attributes in current fairness studies, which are usually used in research on individual fairness.
BeerAdvocate. This dataset [57] contains 1.5 million beer reviews from BeerAdvocate, including products, user information, and ratings. Some researchers [57] leverage the reviews in this dataset to enhance the representations of non-mainstream users by adding a textual information reconstruction task.
CiteULike.This dataset [121] includes about 200K records of user preferences toward scientific articles from 5K users.Some researchers [121] utilize this dataset to explore Rawlsian maximin fairness issues in the cold start scenario.
Epinions.This dataset [64,80] is collected from a Web review site of products, which contains user bidirectional connections and ratings.The whole dataset contains 512K ratings from 16K users on 129K items.Some researchers [64,80] use this dataset to explore consistent fairness issues at the individual level.
KGRec-music.This dataset [48] is a music recommendation dataset that contains knowledge graphs.The dataset includes about 750K interactions from about 5K users on 8K songs.Some researchers use this dataset [48] to investigate the individual fairness of users in group recommendations.
Million Song.This dataset [8] contains audio attributes and metadata for a million tracks from contemporary popular music, including 1 million songs with 515K dated tracks.Some researchers use this dataset [8] to explore the individual fairness of items.
Netflix.This dataset [8] is a movie recommendation dataset from Netflix, containing 100 million ratings from 480K users over 17K movies.Some researchers [8] leverage this dataset to study the individual fairness of items.

FUTURE DIRECTIONS
Fairness is essential for recommender systems and needs to be further explored. In this section, we discuss some promising future directions for fairness in recommendation from the perspectives of definition, evaluation, algorithm design, and explanation.
Manuscript submitted to ACM

Definition
A general definition of fairness. As mentioned earlier, many different definitions of fairness have been applied in recommender systems. These fairness definitions may conflict with each other. For example, calibrated fairness may be damaged when ensuring Rawlsian maximin fairness, and vice versa. Therefore, it is important to determine the priority among different definitions of fairness, but to our knowledge, there is no work on this yet. In addition, a general definition of fairness may not exist: the appropriate fairness definition may vary across scenarios, and a consensus within each scenario would be helpful.

Evaluation
Fair comparison between different fairness methods. The lack of effective benchmarks may result in non-reproducible evaluations and unfair comparisons, damaging the development of the research community. Existing fairness research suffers from this problem since many different fairness measurements and data-processing strategies exist. Hence, it is necessary to propose a standard experimental setting including, but not limited to, data preprocessing methods, hyper-parameter tuning strategies, and evaluation metrics.
Dataset for new emerging scenarios. Existing datasets for fair recommendation studies are diverse, but some emerging recommendation scenarios remain uninvestigated. For example, short video recommendation plays an important role today and involves multi-modal information, which is quite different from traditional recommendation scenarios. However, there is a lack of fairness-related work on short-video recommendation datasets.
Whether there are serious unfairness issues in these emerging scenarios and how to address them deserve to be explored.

Algorithm Design
A win-win for fairness and accuracy. Existing methods often improve fairness in recommendation at a loss in recommendation performance, and many papers have revealed such a tradeoff between fairness and performance. In the optimal case, however, fairness is not necessarily in conflict with recommendation performance; for example, for user fairness, both recommendation performance and fairness are optimal if all users receive the most accurate recommendation list.
In practice, some work on classification tasks has also found that improving fairness may improve overall accuracy [53].
For industrial recommender systems, the degradation of recommendation performance may result in an unacceptably large loss of revenue. To successfully apply fairness methods to recommender systems, it is necessary to investigate methods that improve fairness while ensuring recommendation accuracy.
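A common starting point for balancing the two goals is to add a fairness term to the training objective. The toy sketch below combines a mean accuracy loss with a squared gap between group mean scores; the regularizer form and the weight `lam` are illustrative assumptions, not a formulation from any surveyed paper:

```python
def fairness_regularized_loss(errors, scores, groups, lam=0.1):
    """Toy combined objective: accuracy term + lam * squared group-score gap.

    errors: per-example accuracy losses (e.g., squared errors).
    scores: per-example predicted scores.
    groups: 0/1 group label per example (both groups assumed non-empty).
    lam: illustrative trade-off weight between accuracy and fairness.
    """
    acc = sum(errors) / len(errors)
    g0 = [s for s, g in zip(scores, groups) if g == 0]
    g1 = [s for s, g in zip(scores, groups) if g == 1]
    gap = (sum(g0) / len(g0) - sum(g1) / len(g1)) ** 2
    return acc + lam * gap
```

Sweeping `lam` traces out the accuracy-fairness trade-off curve discussed above; the open question is how to push that curve outward rather than merely move along it.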
Fairness for both user and item. Many approaches to improving fairness have been proposed, but most focus on only one side: user fairness or item fairness. However, both user and item fairness are essential and should be guaranteed in most recommender systems. Hence, it is worthwhile to propose adequate methods for joint fairness.
Note that there may be a natural conflict between user fairness and item fairness [103], which also makes joint fairness a challenging topic.
Joint fairness issues can be addressed with different types of methods. First, current data-oriented methods only consider one-sided fairness, and it is worth exploring how to adapt them for joint fairness. Second, the joint fairness problem can be regarded as multi-objective learning; the trade-off between multiple fairness objectives and recommendation accuracy might be improved by drawing on Pareto optimization [61] and work on the "seesaw phenomenon" such as [95]. Third, most existing re-ranking methods for joint fairness are non-parametric re-ranking algorithms.
It is worth investigating how to design learnable re-ranking algorithms, as they have shown better performance on one-sided fairness [121].
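To make the contrast concrete, a typical non-parametric re-ranking baseline can be sketched as a greedy slot-filler that caps any one provider group's share of the top-k list; the cap rule, parameter names, and data layout are illustrative assumptions rather than a specific published algorithm:

```python
def fairness_aware_rerank(candidates, k, max_share=0.7):
    """Greedy re-ranking sketch: pick the highest-scored items while
    capping each provider group at max_share of the top-k slots.

    candidates: list of (item_id, group, score) triples.
    Note: if every remaining candidate belongs to a capped group,
    the returned list may be shorter than k.
    """
    ranked = sorted(candidates, key=lambda x: -x[2])
    cap = int(k * max_share)
    picked, per_group = [], {}
    for item, group, score in ranked:
        if len(picked) == k:
            break
        if per_group.get(group, 0) < cap:
            picked.append(item)
            per_group[group] = per_group.get(group, 0) + 1
    return picked
```

Such rules have no trainable parameters; a learnable re-ranker would instead optimize the slot assignments against fairness and accuracy objectives jointly.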
Fairness beyond accuracy. Most user fairness work focuses on one evaluation criterion, i.e., accuracy. However, there are many other measurements beyond accuracy, such as diversity, unexpectedness, and serendipity, which are also closely related to user satisfaction. Research has found that unfairness also exists in these measurements [97].
Therefore, we also need to consider more measurements beyond accuracy when ensuring user fairness.
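As an illustration of such a beyond-accuracy check, one could compare a simple intra-list diversity score between two user groups; the genre-coverage definition of diversity used here is an illustrative simplification (real studies use richer measures):

```python
def diversity(rec_list, item_genres):
    """Intra-list diversity as the fraction of distinct genres covered."""
    genres = {item_genres[i] for i in rec_list}
    return len(genres) / len(rec_list)

def group_diversity_gap(rec_lists, user_groups, item_genres):
    """Absolute gap in mean list diversity between user groups 0 and 1.

    rec_lists: {user_id: [item_id, ...]} recommendations per user.
    user_groups: {user_id: 0 or 1} group label per user.
    A larger gap signals beyond-accuracy unfairness between the groups.
    """
    by_group = {0: [], 1: []}
    for user, recs in rec_lists.items():
        by_group[user_groups[user]].append(diversity(recs, item_genres))
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    return abs(means[0] - means[1])
```

The same gap template applies to unexpectedness or serendipity by swapping in the corresponding per-list score.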
Causal inference for fairness. Eliminating unfairness at the causal level is considered essential and has received growing interest in machine learning [52,104]. Similarly, causal inference in recommendation has attracted increasing attention and has become a popular recommendation debiasing technique [114,116]. However, as we mentioned in Section 2, fairness and bias are different, and only little work [59] in recommendation focuses on fairness at the causal level. In our opinion, two problems need to be solved. The first is how to construct causal graphs for fair recommendation. Most current work focuses on models based only on ID information, for which causal graphs can be designed manually. However, for models using additional features, causal relationships also exist among these features, so constructing causal graphs becomes challenging. The second problem is how to eliminate the influence of unfair factors based on the constructed causal graphs, especially the complex causal graphs mentioned above. There is still much room for using causal inference to achieve fair recommendations.
Fairness with missing data. Existing studies usually assume that all fairness-related attributes are available in the dataset. However, in many real-world scenarios there are users or items whose fairness-related attributes are missing.
For example, some users may mark their gender as confidential or even provide false information. In this case, we cannot identify whether a sensitive group is treated unfairly, and existing fairness methods become ineffective. Therefore, it is necessary to investigate methods that improve fairness when fairness-related information is missing. Solving this problem also helps reduce the risk of sensitive information leakage, since we depend only on partial information.
There has been some related work on classification tasks [17,18,51,98,106], while in recommender systems it remains an open problem.
Fairness in a real system. Industrial recommender systems usually consist of three phases: recall, ranking, and re-ranking. Objectives other than accuracy, such as diversity, are often involved in the re-ranking phase. Existing studies have found that some post-processing techniques, such as diversity re-ranking, may increase user unfairness [90], implying that some fairness methods applied before re-ranking will be ineffective. Therefore, it is necessary to investigate how to improve fairness more effectively in real-world systems; for example, we can consider adding fairness-oriented recall in the recall phase. On the other hand, industrial recommender systems usually require a short response time, so more efficient re-ranking algorithms need to be proposed. Moreover, industrial recommender systems often have multiple objectives, such as reading time and purchase. However, the fairness of the corresponding model for each objective cannot guarantee the fairness of the final recommendations [99], which also poses a challenge for applying fairness in industrial systems.

Explanation
What are the causes of unfairness? Explainability is crucial for recommender systems as it can improve the persuasiveness of recommendations, increase user satisfaction, and enhance the transparency of the whole system [115]. Explaining why unfairness occurs can deepen the understanding of unfairness and facilitate the design of more effective fairness methods. Although many approaches have been developed to improve fairness in recommendation, little work has explained the causes of unfairness.

Fig. 1. The statistics of publications related to fairness in recommendation. We omit the work in 2022 in the figure, as most of the work in 2022 had not yet been published.
Concept. Although fairness can be classified according to its target, this classification alone does not tell us what kinds of outcomes are fair. Different researchers hold different opinions on this question, which we call fairness concepts. These concepts reflect researchers' understanding of what requirements should be met for fair outcomes.

4.4.5 Metrics for Process Fairness (PR). One criterion of process fairness is that the model should use fair representations.

5 METHODS FOR FAIR RECOMMENDATION
5.1 Overview of Fairness Methods. To our knowledge, existing methods for improving fairness can be categorized into three classes according to their position in the recommendation pipeline, i.e., data-oriented methods, ranking methods, and re-ranking methods, as shown in Fig. 3. Data-oriented methods alleviate the unfairness problem by changing the training data. Ranking methods mainly design fairness-aware recommendation models or optimization targets for learning fair recommendations. Re-ranking methods mainly adjust the outputs of recommendation models to improve fairness.

Fig. 3. Taxonomy of fairness methods in recommendation and their position in the recommendation pipeline.

5.3.3 Reinforcement Learning. Several studies use reinforcement learning (RL) to address the fairness problem [32,63]. Compared to other methods, which mainly consider the immediate fairness impact, RL-based fairness methods can optimize fairness in the long run. Fig. 5 shows all the differences between existing fair RL methods and general RL methods for recommendation. Current work mitigates unfairness in recommendations by introducing fairness information into states, rewards, or additional constraints.

Fig. 5. Illustrations of RL for recommendation and RL for fair recommendation. Note that (b) shows all the differences between existing fair RL methods and general RL methods for recommendation; a particular fair RL method may not contain all of these differences.

5.4.3 Global-wise. Unlike slot-wise and user-wise methods, which re-rank a single recommendation list at a time, global-wise methods consider global effects and re-rank multiple lists at a time, making them more suitable for solving user fairness problems than user-wise methods.

Table 2. Notations used in fairness definitions and their explanations.
- ℎ(•): the outcome (e.g., predicted scores or recommendation lists) of model ℎ given individuals or groups
- (•, •): the distance function between individuals or groups
- (•, •): the distance function between outcomes
- (•): the value function of outcomes
- (•): the personalized value function of outcomes for a certain individual
- (•): the merit function of individuals or groups

Table 3. A lookup table for the reviewed fairness definitions in recommendation.

Table 4. A lookup table for the reviewed fairness work from several views.

Table 6. Notations and Explanations of Common Variables.
- the whole set of users
- I: the whole set of items
- L: the whole set of recommendation lists
- R: the whole set of feedback
- V: the whole set of individuals or groups, which can be either users or items
- (•): the utility function for individuals or groups

Pairwise Ranking Accuracy Gap. Pairwise Ranking Accuracy Gap (PRAG) measures item unfairness in a pairwise manner [6,99]. Unlike previous metrics focusing on exposure or click-through rate, PRAG measures the unfairness of pairwise ranking accuracy, and it is calculated on data from randomized experiments. The lower the value, the fairer the recommendations.

Table 7. A lookup table for the reviewed fairness methods in recommendation.

Table 9. A lookup table for datasets used in existing fairness research in recommendation. We only list the attributes considered as fairness-related in previous fairness work; the datasets may contain other attributes. "-" represents empty. Datasets are arranged in dictionary order.
* User activity and item popularity are not attributes in the common sense, but researchers also use them to divide groups as attributes; thus we add them to the table.