FARA: Future-aware Ranking Algorithm for Fairness Optimization

Ranking systems are the key components of modern Information Retrieval (IR) applications, such as search engines and recommender systems. Besides the ranking relevance to users, the exposure fairness to item providers has also been considered an important factor in ranking optimization. Many fair ranking algorithms have been proposed to jointly optimize both ranking relevance and fairness. However, we find that most existing fair ranking methods adopt greedy algorithms that only optimize rankings for the next immediate session or request. As shown in this paper, such a myopic paradigm could limit the upper bound of ranking optimization and lead to suboptimal performance in the long term. To this end, we propose \textbf{FARA}, a novel \textbf{F}uture-\textbf{A}ware \textbf{R}anking \textbf{A}lgorithm for ranking relevance and fairness optimization. Instead of greedily optimizing rankings for the next immediate session, FARA plans ahead by jointly optimizing multiple ranklists together and saving them for future sessions. Specifically, FARA first uses the Taylor expansion to investigate how future ranklists will influence the overall fairness of the system. Then, based on this analysis, FARA adopts a two-phase optimization algorithm that first solves an optimal future exposure planning problem and then constructs the optimal ranklists according to the resulting exposure plan. Theoretically, we show that FARA is optimal for ranking relevance and fairness joint optimization. Empirically, our extensive experiments on three semi-synthesized datasets show that FARA is efficient, effective, and can deliver significantly better ranking performance compared to state-of-the-art fair ranking methods. We make our implementation public at \href{https://github.com/Taosheng-ty/QP_fairness/}{https://github.com/Taosheng-ty/QP\_fairness/}.


INTRODUCTION
Ranking systems are one of the important cornerstones of information retrieval (IR). Existing ranking systems are usually constructed to optimize ranking relevance with the Probability Ranking Principle (PRP) [34], where items of greater likely relevance should be ranked higher. The PRP is a user-centered ranking strategy that helps save users energy and time, since users can satisfy their needs with the top-ranked items [18]. However, recent research has shown that, besides users, item providers also draw utility from ranking systems, and the PRP can result in severe unfairness for item providers [4,37]. In particular, the PRP always assigns the top ranking positions to a few top items. Those top items usually get the majority of exposure while other items rarely get any, although the other items might still be relevant [19,29,40,49]. The unbalanced exposure leads to unfair opportunities and unfair economic gains for item providers. Such unfairness will eventually force unfairly treated providers to leave the system, leaving fewer options for users [51]. Therefore, IR researchers have argued that ranking relevance and fairness are both important for modern ranking systems [37,38], and many fair ranking algorithms have been proposed to optimize both of them jointly [29,55].
However, existing fair algorithms are mostly greedy and could only deliver suboptimal ranking performance in the long run. In particular, existing fair ranking algorithms [4,24,26,37,47] usually behave greedily, sequentially producing the locally optimal ranklist for the next immediate session without being aware of the influence of future sessions (see more discussion in §2). This unawareness can lead to unmitigated conflict between relevance and fairness optimization. For example, imagine a case with 3 items in consideration, item A, item B, and item C, where item A is the most relevant and item C is the least relevant. Ranklist [A, B, C] maximizes ranking relevance. Now consider a scenario where item C has been severely unfairly treated in history. To optimize exposure fairness, we need to allocate item C more exposure by boosting it to a higher position. However, if we greedily boost item C within the next immediate session, it is highly likely that item C will be boosted to the first rank to get the maximum exposure, and the resulting ranklist is [C, A, B]. However, ranklist [C, A, B] has poor ranking relevance due to the ranking conflict that the least relevant item (item C) occupies the most important rank (the first rank).
Intuitively, the ranking conflict can be smoothed if we plan ahead and jointly optimize multiple future sessions' ranklists together instead of greedily optimizing the next immediate session. For example, the multiple ranklists after joint optimization can be [[A, C, B], [A, C, B], ...], where item C is smoothly boosted across multiple ranklists and the most relevant item, i.e., item A, is still ranked the highest. Based on the above idea, we propose FARA, a novel Future-Aware Ranking Algorithm for relevance and fairness optimization. Briefly, FARA precomputes and jointly optimizes multiple ranklists together and saves them for future use. In particular, to be able to plan for the future, FARA first uses the Taylor expansion to investigate how future ranklists will influence fairness. Then, based on this influence, FARA uses a two-phase optimization to jointly optimize multiple ranklists together for future use. In Phase 1, we solve an exposure planning problem and obtain the optimal future planning of item exposure. In Phase 2, we construct the optimal ranklists according to the optimal exposure plan. We prove FARA's optimality in terms of ranking relevance and fairness joint optimization in §5. Extensive experiments on three semi-synthesized datasets also demonstrate FARA's effectiveness and efficiency compared to existing fair ranking algorithms (§6).
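The smoothing effect above can be checked with a toy calculation (our own illustrative numbers, not from the paper): position weights v, hypothetical relevances for A, B, and C, and a target of 0.5 extra exposure for item C over the next two sessions.

```python
v = [0.5, 0.25, 0.125]                     # examination probability per rank
rel = {"A": 1.0, "B": 0.5, "C": 0.2}       # hypothetical relevance scores

def dcg(ranklist):
    # position-weighted relevance of a single ranklist
    return sum(v_k * rel[d] for v_k, d in zip(v, ranklist))

def exposure(ranklists, item):
    # total exposure an item accumulates over several ranklists
    return sum(v[r.index(item)] for r in ranklists)

def mean_dcg(ranklists):
    return sum(dcg(r) for r in ranklists) / len(ranklists)

greedy = [["C", "A", "B"], ["A", "B", "C"]]    # boost C fully in session 1
planned = [["A", "C", "B"], ["A", "C", "B"]]   # smooth C over both sessions

# Both plans give C at least 0.5 extra exposure...
assert exposure(greedy, "C") >= 0.5 and exposure(planned, "C") >= 0.5
# ...but planning ahead keeps A on top and yields higher mean relevance:
assert mean_dcg(planned) > mean_dcg(greedy)    # 0.6125 vs 0.53125
```

Both schedules satisfy the same fairness target for C, yet the jointly planned one never demotes the most relevant item to the top rank's cost, which is exactly the intuition FARA builds on.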

RELATED WORK
Ranking Fairness: Due to the importance of rankings for providers (sellers, job seekers, content creators, etc.) [9,17,37,53,54], there has been growing interest in ranking fairness for providers [5,12,15,20,25,32,41,47]. However, the definitions of ranking fairness vary a lot in the existing literature, and there exists no universal definition. At a high level, existing fairness definitions can be grouped into probability-based fairness and exposure-based fairness [29,55]. Probability-based fairness [3,6,13] usually requires a minimum number or proportion of protected (e.g., race, gender) items to be distributed evenly across a ranklist. However, only considering the number or proportion of items in a ranklist neglects the fact that different ranks usually have different importance. To address this, exposure-based fairness [4,8,11,37,48] assigns values to each ranking position based on the expected user attention or click probability. Exposure-based fairness argues that total exposure is a limited resource for a ranking system and advocates for a fair distribution of exposure among items to ensure fairness for item providers [29]. In this paper, we limit our discussion of fair ranking algorithms to the scope of exposure-based fairness. Fair Ranking Algorithms: Recently, a number of ranking algorithms [10,15,23,29,35,41,55] have been proposed to achieve exposure-based fairness. In this work, we classify them as open-loop algorithms or feedback-loop algorithms depending on whether historically generated ranklists are used to correct ranking scores. For open-loop fair algorithms [15, 26, 36-38, 42, 46], each item usually has a static and fixed ranking score once the ranking model is optimized. Ranklists are then stochastically sampled for each session according to the static ranking scores. Various techniques have been used to optimize the static ranking model, such as linear programming [15,37], policy gradient [38], and differentiable PL model optimization [26]. However, given the fact that ranking scores are static, open-loop algorithms are usually not robust. To improve ranking robustness, feedback-loop algorithms [4,24,48,52] dynamically take historical ranklists as input to correct items' scores. For example, Morik et al. [24] propose to use a proportional controller to boost the ranking scores of historically unfairly treated items.

BACKGROUND AND PRIOR KNOWLEDGE
In this section, we provide readers with the background knowledge of the paper. A summary of the notations we use throughout the paper can be found in Table 1.
Ranking Services Workflow: We take web search as an example to detail the ranking service workflow. At time step t, a ranking session starts when a user issues a query q. For query q, there exist candidate items provided by item providers. With the query and candidate items, the ranking system first estimates each item's relevance and then constructs a ranklist of candidate items by optimizing certain ranking objectives. Then the ranking system presents the ranklist to users and collects users' feedback (e.g., clicks), which can be utilized to update the relevance estimator. Partial and Biased Feedback: Relevance estimation is usually updated using users' feedback. However, such feedback is usually a noisy and biased indicator of relevance, since users only provide meaningful feedback for items they have examined. If we consider user clicks as the main signal for user feedback, then a click happens only when an item is both examined and perceived as relevant, where E, R, and C are binary random variables indicating whether an item is examined, perceived as relevant, and clicked, respectively. With the User Examination Hypothesis [33], we can model users' click probability as P(C = 1) = P(E = 1) · P(R = 1). For the rest of the paper, we use ρ = P(R = 1) to simplify the notation. Although there exist several types of biases in the examination probability P(E = 1), we focus on the two most important ones: positional bias and selection bias.
Positional Bias [7]: The examination probability is decided by the rank (also called position) and drops along ranks. In particular, the examination probability is denoted as v_rank(d|L), where rank(d|L) is item d's rank in ranklist L.
Selection Bias [27,28]: This bias exists when not all items are selected to be shown to users, or when some lists are so long that no user will examine every item in them. Assuming that items ranked lower than rank K will not be examined [27], we model this as v_k = 0 for k > K. Ranking Utility Measurement: Here we introduce the evaluation of ranking performance from both the user side and the provider side, which in later sections will guide the ranking optimization.
The User-side Utility: User-side utility measures a ranking system's ability to put relevant items at higher ranks. A popular user-side utility measurement is DCG [16]. Specifically, given a query q and a ranklist L_t, DCG@K is defined as DCG@K(L_t) = Σ_{k=1}^{K} ρ(L_t[k]) · v_k, where L_t[k] denotes the k-th item in ranklist L_t and v_k is the position weight [16]. In this paper, following previous work [37], we set v_k to the examination probability at the k-th rank when computing DCG. Furthermore, based on DCG, we can measure multiple ranklists with the cumulative NDCG, eff.@K = cum-NDCG@K(t) = Σ_{τ=1}^{t} γ^{t−τ} · DCG@K(L_τ) / DCG@K(L*), where 0 ≤ γ ≤ 1 is a constant discount factor and t is the current time step. L* is the ideal ranklist constructed by ranking items according to true relevance, and we use DCG@K(L*), referred to as IDCG, to normalize DCG@K(L_τ). By ignoring γ, we get the average NDCG, aver-NDCG@K, which can be expressed in terms of the cumulative exposure E_t@K(d) = Σ_{τ=1}^{t} Σ_{k=1}^{K} v_k · 1[L_τ[k] = d], i.e., the exposure item d has accumulated at the top K ranks, where 1[·] is an indicator function that ensures we only accumulate item d's exposure. To simplify notations, we use E_t(d) to denote E_t@K(d) and eff. for eff.@K, where K is the ranklist length introduced in Eq. 3.
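In code, these utility definitions read roughly as follows. This is our own paraphrase of Eqs. 4-6 (function names and the exact discount indexing are our choices, not the paper's); `v` holds the top-K examination probabilities.

```python
def dcg_at_k(ranklist, rel, v):
    # exposure-weighted DCG: relevance times examination probability v_k
    return sum(v_k * rel[d] for v_k, d in zip(v, ranklist))

def cum_ndcg(ranklists, rel, v, gamma=0.995):
    # cum-NDCG at time t: geometrically discounted sum of per-session NDCG
    ideal = sorted(rel, key=rel.get, reverse=True)   # rank by true relevance
    idcg = dcg_at_k(ideal, rel, v)                   # IDCG normalizer
    t = len(ranklists)
    return sum(gamma ** (t - tau) * dcg_at_k(L, rel, v) / idcg
               for tau, L in enumerate(ranklists, start=1))

def aver_ndcg(ranklists, rel, v):
    # average NDCG: cum-NDCG with the discount gamma ignored
    return cum_ndcg(ranklists, rel, v, gamma=1.0) / len(ranklists)
```

For a single session that equals the ideal ranklist, `cum_ndcg` returns 1.0, which matches the normalization by IDCG.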
The Provider-side Utility (Fairness): As items' rankings can have significant effects on their providers' profit, it is important to create a fair ranking environment. To evaluate whether exposure is fairly allocated, we use the negative exposure disparity between item pairs as the fairness measurement [26], computed over D(q), the set of candidate items that construct the ranklist of query q, with n = |D(q)|. The intuition behind this measurement is that optimal fairness is achieved when items' exposure is proportional to their relevance, i.e., ∀ d_i, d_j ∈ D(q), E(d_i) · ρ(d_j) = E(d_j) · ρ(d_i). In other words, exposure fairness means that items of similar relevance should get similar exposure. In this paper, we choose the exposure fairness evaluation proposed by Oosterhuis [26] instead of the original evaluation proposed in [37]. The reason for this choice is that the fairness evaluation in [37] needs to divide the exposure of an item by its relevance, i.e., E(d)/ρ(d), which has a zero-denominator problem when item d is irrelevant and ρ(d) is near zero. This paper uses the average unfairness across different queries to evaluate a ranking algorithm. We also refer to the average unfairness as the unfairness tolerance.
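A minimal sketch of such a pairwise disparity measure, using the cross-multiplied form E_i · ρ_j − E_j · ρ_i, which is zero exactly when exposure is proportional to relevance and avoids dividing by ρ. This is one plausible reading of the measure in [26]; the exact normalization in the paper's Eq. 8 may differ.

```python
def unfairness(exposure, rho):
    """Sum of squared pairwise exposure disparities (lower is fairer).

    exposure and rho map item ids to accumulated exposure and relevance.
    The cross-multiplied disparity sidesteps the zero-denominator problem
    that dividing exposure by relevance would cause for irrelevant items."""
    items = list(exposure)
    total = 0.0
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            i, j = items[a], items[b]
            total += (exposure[i] * rho[j] - exposure[j] * rho[i]) ** 2
    return total
```

When exposure is exactly proportional to relevance the measure is 0; any deviation makes it strictly positive.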

PROPOSED METHOD
Most existing fair algorithms are greedy, i.e., they sequentially construct the locally optimal ranklist for the next immediate session. Therefore, they usually fail to optimize the construction procedure when we expect multiple sessions to come for the same query in the future. To mitigate this gap and reach a global optimum for a query, we propose to (i) plan ahead and precompute multiple ranklists for future use (§4.1) and (ii) jointly optimize those ranklists together to maximize both fairness and ranking relevance (§4.2 & §4.3). We hypothesize that jointly optimizing multiple ranklists constructs better ranklists than sequentially and greedily optimizing one single ranklist at each time step. This hypothesis is verified by both the theoretical analysis in §5 and the empirical results in §6.2.

Future-aware Ranking Objective
We first propose a ranking fairness objective to plan and optimize the future Δt ranklists for a query q. Specifically, when we are at time step t + 1, the objective is to pre-compute the optimal ranklists B* = [L_{t+1}, ..., L_{t+Δt}] that maximize the marginal fairness Δfair. Here, the fairness evaluation in Eq. 9 is a direct objective in our method.
To the best of our knowledge, there is no trivial algorithm to get the optimal ranklists B* due to the discontinuity of the ranking problem [21]. One example of this discontinuity is that increasing an item's ranking score may not change the output ranklist, so fairness stays the same, until the increased score surpasses another item's score, at which point fairness experiences a sudden change. To alleviate the discontinuity of B* ← argmax_B Δfair., we propose a novel two-phase solution path by introducing a continuous variable, ΔE, where ΔE(d), also referred to as the planning exposure, is the marginal (or incremental) exposure we plan to assign to item d within the next Δt time steps. ΔE*(d) is the optimal marginal exposure. We find that introducing ΔE helps to effectively maximize Δfair. in Phase 1 of our solution. In Phase 2, we construct the optimal ranklists B* by allocating the optimal exposure ΔE* to each item with a vertical allocation method (more details in §4.3).
To get ΔE*, we carry out a Taylor series expansion to investigate how future exposure will influence the fairness objective, where ΔE(d) = E_{t+Δt}(d) − E_t(d) is the exposure increment, and g and H are the first- and second-order derivatives, i.e., the gradient vector and the Hessian matrix, respectively. By observing Eq. 8, we can derive two facts about the second-order expansion in Eq. 12. (i) The expansion is not an approximation but an equality, since (un)fairness in Eq. 8 is defined as a polynomial of E with degree two, and its derivatives of order higher than two are zero. (ii) Since (un)fairness in Eq. 8 is defined as a sum of squares, the second-order derivative H is positive semi-definite. Being an equality, Eq. 12 allows us to correctly estimate future fairness given the marginal exposure ΔE, even when we consider a long-term future (large Δt and ΔE). Based on this exact future fairness estimation, it is possible to find the optimal marginal exposure planning, denoted as ΔE*, that maximizes future fairness. Since H is positive semi-definite, Quadratic Programming (QP) based optimization is valid for finding ΔE*. We give the specific QP problem formulation for finding ΔE* in §4.2 and leave constructing the optimal ranklists B* from ΔE* to §4.3.
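Because the unfairness is quadratic in exposure, the second-order expansion is exact for any step size. A quick numerical check, using the cross-multiplied pairwise disparity as a stand-in for Eq. 8 (our own derivation of its gradient and Hessian; all numbers are illustrative):

```python
import numpy as np

rho = np.array([1.0, 0.5, 0.2])            # hypothetical relevance scores
rng = np.random.default_rng(0)
E = rng.uniform(0.0, 1.0, size=3)          # current accumulated exposure
dE = rng.uniform(0.0, 0.5, size=3)         # planned marginal exposure

def unfair(E):
    # sum of squared pairwise disparities; a degree-2 polynomial in E
    n = len(E)
    return sum((E[i] * rho[j] - E[j] * rho[i]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Hand-derived gradient and Hessian of the quadratic above:
# H[i,i] = 2 * sum_{j != i} rho[j]^2,  H[i,j] = -2 * rho[i] * rho[j]
H = -2.0 * np.outer(rho, rho)
np.fill_diagonal(H, 2.0 * ((rho ** 2).sum() - rho ** 2))
g = H @ E                                   # gradient at the current exposure

# For a quadratic objective the second-order expansion is an equality,
# no matter how large the step dE is:
lhs = unfair(E + dE)
rhs = unfair(E) + g @ dE + 0.5 * dE @ H @ dE
assert np.isclose(lhs, rhs)

# ...and the Hessian of the unfairness is positive semi-definite:
assert np.linalg.eigvalsh(H).min() >= -1e-9
```

The two asserts mirror facts (i) and (ii): exact expansion and a PSD Hessian, which is what licenses the QP formulation in Phase 1.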

Phase 1: Future Exposure Planning
When giving the QP problem formulation, we notice that existing ranking fairness optimization usually considers two settings: (i) the post-processing setting [4,37,38], where relevance is assumed to be known or well estimated in advance; and (ii) the online setting [24,48], where fairness is optimized while relevance is still being learned. We formulate the QP for the post-processing setting in Eq. 13,
where K is the length of the ranklists and v_k is the examining probability at rank k. Eq. 13b indicates that the sum of items' marginal exposure should equal the total exposure of the Δt ranklists. In Eq. 13c, we introduce the NDCG constraint, where ρ(d^order_k) denotes the k-th largest estimated relevance. According to Eq. (4 & 6), Σ_{d∈D(q)} ΔE(d)ρ(d) indicates the DCG and Δt · Σ_{k=1}^{K} v_k ρ(d^order_k) indicates the IDCG. It is then straightforward that (1 − λ) indicates the minimum NDCG requirement we want to guarantee. In the post-processing setting, relevance is assumed to be given or already well estimated prior to ranking optimization. Eq. 13d indicates that an item's marginal exposure ΔE(d) should be no less than 0. In Eq. 13e, ΔE(d) should be no more than the accumulation of the first rank's exposure over the Δt ranklists, because an item must be unique within a ranklist and there are Δt ranklists under consideration.
Here we assume that the first rank's exposure is the largest and that exposure drops from the top to the lower ranks.

4.2.2 The online setting. In the online setting, ranklists are optimized while relevance is still being learned, so how to actively explore items and obtain more accurate relevance estimates is critical. Yang et al. [50] show that a more accurate relevance estimation for an item can be achieved by exposing the item more, because more exposure leads to more interaction with users. Based on this, we encourage exploration by setting a minimum exposure requirement, where α indicates the importance of exploration and ξ(d), ∀ d ∈ D(q), are slack variables that encourage exploration. In other words, both ξ(d) and ΔE(d), ∀ d ∈ D(q), are decision variables in this setting. In Eq. 14d, E_min is the minimum exposure requirement, E_t(d) is item d's exposure accumulated till time step t, and ΔE(d) is the marginal exposure we plan to allocate to item d within the next Δt steps. ξ(d) can be interpreted as the additional exposure still needed to satisfy the minimum exposure requirement after the next Δt steps. When E_t(d) ≥ E_min for item d, i.e., the minimum exposure requirement is already satisfied, ξ(d) will simply be 0 and will not contribute to the ranking objective in Eq. 14a. When E_t(d) < E_min, i.e., item d does not meet the minimum exposure requirement, the objective in Eq. 14a will try to minimize ξ(d); in other words, ΔE(d) will be boosted to satisfy the constraint in Eq. 14d. With more exposure, item d will be explored more. In this paper, we refer to the introduction of ξ as Exploration, and we treat α in Eq. 14a and E_min as hyper-parameters to control the degree of exploration.
As quadratic programming has been well studied, many off-the-shelf solvers are available. In this paper, we use the quadratic programming library qpsolvers in Python to solve Eq. 13 and Eq. 14 and obtain the optimal exposure planning ΔE*.
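As a dependency-light illustration of the Phase-1 idea, the sketch below keeps only the exposure-budget equality constraint (Eq. 13b) and solves the resulting KKT linear system directly. All names are ours; the paper's actual problem also carries the NDCG, nonnegativity, and uniqueness constraints (Eqs. 13c-13e), for which a full QP solver such as qpsolvers' `solve_qp` is needed.

```python
import numpy as np

def plan_exposure(H, g, budget):
    """Minimize 0.5 * x^T H x + g^T x  subject to  sum(x) == budget.

    Equality-constrained QP solved via its KKT system; a simplified
    stand-in for the paper's Phase-1 problem without the inequality
    constraints."""
    n = len(g)
    ones = np.ones((n, 1))
    # KKT system: [[H, 1], [1^T, 0]] @ [x; mu] = [-g; budget]
    K = np.block([[H, ones], [ones.T, np.zeros((1, 1))]])
    rhs = np.concatenate([-np.asarray(g, dtype=float), [budget]])
    sol = np.linalg.solve(K, rhs)
    return sol[:n]                      # planned marginal exposure per item

# Sanity check: with H = I and g = 0 the exposure budget is split evenly.
x = plan_exposure(np.eye(3), np.zeros(3), 3.0)
```

In practice one would pass the full constraint set of Eq. 13 (or Eq. 14 with the slack variables) to a QP solver instead of solving the KKT system by hand.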

Phase 2: Ranklists Construction
Following the solution path in Eq. 11, the next step is to construct the optimal ranklists B* according to ΔE*, which has been solved in Phase 1. Here, we should allocate to each item exactly its optimal exposure ΔE* within B*. However, we find that the allocation solution B is not unique, and all solutions share the same aver-NDCG@K (see Theorem 5.3). Therefore, we additionally aim to find the optimal B* that optimizes effectiveness at all top ranks, i.e., aver-NDCG@k, ∀ k ≤ K. Optimizing top ranks' effectiveness is important since users usually pay more attention to the top ranks.
Inspired by [51], we propose a vertical exposure allocation method in Algorithm 1 to construct the optimal B* based on ΔE*. The difference between a vertical allocation and a horizontal allocation is the ranklist construction order. As shown in Fig. 1, a horizontal allocation prioritizes earlier ranklists and fills out all ranks of the i-th ranklist L_i before filling out L_{i+1}. In contrast, a vertical allocation prioritizes top ranks and fills out the k-th ranked items of all ranklists before filling out any (k + 1)-th ranked item. Since top ranks are usually more important, our proposed Algorithm 1 adopts a vertical allocation to fill out B*. In Algorithm 1, B* is the generated set of Δt ranklists and B*(k, i) denotes the k-th rank of the i-th ranklist. To fill out B*(k, i), the vertical allocation first generates a feasible candidate set S, which contains items that have not yet been selected for this session (d ∉ L_i) but still have planned exposure left (ΔE(d) − Ẽ(d) ≥ v_k), where the examination probability v_k serves as the margin and Ẽ(d) stores the actual exposure item d has received. Within S, our algorithm selects the most relevant item to fill out B*(k, i). Algorithm 1 is theoretically justified to accurately allocate the exposure ΔE* (see Theorem 5.1) and constructs the optimal B* for aver-NDCG@k, ∀ k ≤ K (see Theorem 5.2). Although inspired by the vertical method in [51], the proposed vertical allocation differs from it. Yang et al. [51] focus on guaranteeing a certain share of exposure and use a complicated 3-step procedure, i.e., allocation, appending, and resorting, which cannot be used to allocate ΔE* for our problem. The allocation algorithm in this paper uses up all of ΔE*, and the allocation procedure is less complicated and more straightforward than that of Yang et al. [51].
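The core loop can be sketched as follows. This is our simplified reading of Algorithm 1 (variable names are ours, and it assumes the planned exposure is feasible for the given exposure weights; the paper's algorithm handles the corner cases we ignore here):

```python
def vertical_allocate(delta_E, rel, v, n_sessions):
    """Fill n_sessions ranklists rank-by-rank (vertically).

    delta_E: planned marginal exposure per item, rel: estimated relevance,
    v: examination probability per rank. Assumes delta_E is feasible for v
    (otherwise the candidate set can become empty)."""
    K = len(v)
    B = [[None] * K for _ in range(n_sessions)]
    remaining = dict(delta_E)                 # planned exposure left per item
    for k in range(K):                        # outer loop over ranks: vertical
        for t in range(n_sessions):
            # feasible candidates: not yet in session t, and enough planned
            # exposure left to "pay" for rank k's examination probability
            cand = [d for d in remaining
                    if d not in B[t] and remaining[d] >= v[k]]
            best = max(cand, key=lambda d: rel[d])   # most relevant candidate
            B[t][k] = best
            remaining[best] -= v[k]
    return B
```

Because the rank loop is the outer loop, the most relevant items spend their planned exposure at the highest possible ranks across all sessions before any lower rank is filled, which is exactly the property Theorem 5.2 relies on.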

FARA: Future-Aware Ranking Algorithm
Combining Phase 1 and Phase 2, we propose FARA, a future-aware ranking algorithm for fairness optimization, detailed in Algorithm 2. FARA serves users in an online manner: we pre-compute B*, the next Δt ranklists, and randomly pop one ranklist from B* when needed. When B* is used up and empty, we re-compute it. Besides, FARA does not depend on any specific relevance estimation model and can therefore be seamlessly integrated into most existing ranking applications. In this paper, we follow Yang et al. [50] and use the unbiased relevance estimator in Eq. 15, which divides cumC_t(d), the clicks item d has accumulated till time step t, by the exposure it has received. Moreover, it is worth noting that this relevance estimator can be replaced with other relevance estimators as well.

Theoretical Analysis
Proof. Here we provide a theoretical proof that the vertical allocation, i.e., Phase 2, optimizes effectiveness (aver-NDCG) when the exposure planning ΔE is given. Specifically, maximizing aver-NDCG@K in Eq. 6 is equivalent to Eq. 16, where the normalization is ignored in Eq. 16a, the sum of top-rank exposure should be a constant (Eq. 16b), and the top-rank exposure should be no more than the total exposure planning (Eq. 16c). According to the Rearrangement Inequality [14], it is straightforward that aver-NDCG@K, i.e., Eq. 16a, is optimized by letting items of greater relevance ρ get more exposure at the top ranks, i.e., greater E@K. In other words, we should prioritize letting items of greater relevance ρ fulfill their exposure planning ΔE at the top K ranks, since E@K is bounded in [0, ΔE]. Assuming that aver-NDCG at higher ranks is more important [34], we should maximize aver-NDCG@k before maximizing aver-NDCG@(k + 1), ∀ 1 ≤ k < K. As we maximize aver-NDCG from the top ranks downward, it is straightforward that the optimal strategy is a greedy selection that lets items of greater relevance ρ fulfill their exposure planning ΔE at their highest possible ranks. The proposed vertical allocation in Algorithm 1 follows exactly this greedy selection strategy: it lets items of greater relevance ρ (line 11 in Algorithm 1) fulfill their exposure planning ΔE at the highest possible ranks (setting the rank loop as the outer loop in lines 3 and 4 of Algorithm 1). Hence it reaches optimal effectiveness at the top ranks. □

Theorem 5.3. Effectiveness and fairness are fixed when ΔE is fixed.
Proof. Given the same exposure planning ΔE, the effectiveness (aver-NDCG@K) and fairness are fixed, since we can substitute the exposure planning ΔE(d) for E_t@K(d) in Eq. 6 and substitute ΔE for E in Eq. 8, respectively. In other words, for any ranklists B*, as long as the exposure planning ΔE is accurately allocated in B*, the effectiveness and fairness are fixed. □

EXPERIMENTS

Experimental Setup
Datasets: In this work, we use three public Learning-to-Rank (LTR) datasets: MQ2008 [30], MSLR10k (https://www.microsoft.com/en-us/research/project/mslr/), and Istella-S [22]. Datasets' statistics are shown in Table 2. MQ2008 has three-level relevance judgments (from 0 to 2). MSLR10k and Istella-S have five-level relevance judgments (from 0 to 4). Queries in each dataset are already divided into training, validation, and test partitions according to a 60%-20%-20% scheme. In this work, we mainly focus on comparisons within LTR tasks. However, the proposed method can be adapted to recommendation tasks, which we leave for future studies.
Baselines: In this paper, we compare the following methods:
• TopK: sort items according to their (estimated) relevance ρ(d).
• RandomK: rank items randomly.
• FairCo [24]: correct ranking scores with a proportional controller that boosts historically unfairly treated items.
Among the above ranking algorithms, TopK and RandomK are unfair algorithms, while the others are fair algorithms. While all the fair ranking algorithms aim to maximize effectiveness and fairness, FARA and FARA-Horiz. differ from the others by jointly optimizing across multiple ranklists rather than following the traditional greedy optimization approach. For each fair ranking algorithm, there exists a tradeoff parameter λ, similar to λ in Eq. 13, to balance effectiveness and fairness: the greater λ is, the more we care about fairness while potentially sacrificing more effectiveness. For example, when increasing λ in Eq. 13c, FARA can maximize fairness under a weaker effectiveness constraint. For different fair algorithms, λ lies in different ranges. For FairCo, MCFair, and LP, λ originally lies within [0, +∞), and we adopt λ ∈ [0.0, 1000.0], which is sufficient according to our experiments. For ILP, MMF, PLFair, FARA-Horiz., and FARA, λ ∈ [0.0, 1.0]. Although the vertical allocation in Algorithm 1 was inspired by [51], that method cannot serve as a baseline because it works with offline ranking services where all user queries are known in advance, whereas this paper considers the online services depicted in Algorithm 2.
Ranking Service Simulation: Following the workflow in Algorithm 2, at each time step, a simulated user issues a query q, which is randomly sampled from the training, validation, or test partition. For the query q, a ranking algorithm constructs a ranklist L of candidate items and presents it to the simulated user. To collect the user's feedback on the ranklist L, we need to simulate relevance and examination (see Eq. 2). Same as [2], the relevance probability of each document-query pair (d, q) is simulated from its relevance judgment y, where y_max is the maximum value of the relevance judgment y, i.e., 2 or 4 depending on the dataset. Besides relevance, following [24,28], we simulate users' examination probability with a position-based model that decays with rank and is zero below the cutoff. For simplicity, we only simulate users' examination behavior on the top ranks, and we set K to 5 throughout the experiments (refer to Eq. 3 for more details on K). With P(R = 1|d, q) and P(E = 1|k, L), we sample clicks with Eq. 2.
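The click sampling for one session can be sketched as below. The (1/k)^η position-based examination model is a common choice in this literature [24,28] and is an assumption of this sketch, not necessarily the paper's exact form; `rel_prob` stands for the simulated P(R = 1|d, q).

```python
import random

def simulate_session(ranklist, rel_prob, K=5, eta=1.0, seed=0):
    """Sample simulated clicks for one session (illustrative sketch)."""
    rng = random.Random(seed)
    clicks = []
    for k, d in enumerate(ranklist, start=1):
        # positional bias within the top K, selection bias below it (Eq. 3)
        v_k = (1.0 / k) ** eta if k <= K else 0.0
        examined = rng.random() < v_k           # E ~ Bernoulli(v_k)
        relevant = rng.random() < rel_prob[d]   # R ~ Bernoulli(P(R=1|d,q))
        clicks.append(int(examined and relevant))   # C = E AND R (Eq. 2)
    return clicks
```

Setting η = 0 removes the positional decay within the top K, which makes the sketch easy to sanity-check: with relevance probability 1.0, every shown item above the cutoff is clicked and every item below it is not.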
The advantage of the simulation is that it allows us to conduct online experiments at a large scale while remaining easy to reproduce for researchers without access to live ranking systems [28]. For simplicity, same as existing works [24,28,48,50], we assume that the users' examination probability P(E = 1|k, L) is known in the experiments, since many existing works [1,2,31,45] have been proposed to estimate it. Due to the different data sizes, we simulate 200k steps for MQ2008 and 4M steps for MSLR10k and Istella-S. Experiment Settings: We notice that the LP and ILP methods are proposed for the post-processing setting, where relevance is already known or well estimated in advance. However, in most real-world settings, ranking optimization and relevance learning are carried out at the same time, which we refer to as the online setting. To give a comprehensive comparison, we evaluate ranking methods in both settings. In the post-processing setting, all the ranking methods in Section 6.1 are based on the true relevance ρ, and FARA sets α to 0. In the online setting, all the ranking methods in Section 6.1 are based on the relevance estimation in Eq. 15 to perform ranking optimization. FARA sets α to 1 and E_min = 10 unless otherwise explicitly specified, as these values work well across all our experiments. Evaluation: We use the cum-NDCG (cNDCG) in Eq. 5 with γ = 0.995 (the same γ adopted in [43,44]) to evaluate effectiveness at different cutoffs, 1 ≤ K ≤ 5. Aside from effectiveness, the unfairness defined in Eq. 8 is used for unfairness measurement. We run each experiment five times and report the average evaluation performance on the test partition. We use the Fisher randomization test [39] with p < 0.05 for significance tests. Due to the time cost (see Table 3), we do not run ILP and LP on the larger datasets, MSLR10k and Istella-S, and their performances are marked as not available (NA).

Results and Analysis
In this section, we first compare the ranking relevance performance given different degrees of fairness requirements. Then we dive deeper into our method to offer more insights into FARA's superiority.

6.2.1 Can FARA reach a better balance between fairness and effectiveness? In Figure 2, we compare the ranking methods' effectiveness-fairness balance given different fairness requirements. To generate the balance curves in Figure 2, we incrementally sample λ from the minimum to the maximum value within the ranges indicated in Section 6.1. For each method, twenty values of λ are sampled, and the corresponding unfairness tolerance (Eq. 8) is measured in both the post-processing setting and the online setting. Given the same unfairness, the higher a curve or point lies, the better its performance. Our methods FARA and FARA-Horiz. lie higher than all fair baselines in all figures. ILP and LP are unavailable for MSLR10k and Istella-S due to time costs (refer to Table 4).

Table 3: Comparison of cNDCG@(1,3,5) and unfairness tolerance in the post-processing setting. Significant improvements or degradations with respect to FairCo are indicated with +/-. Within fair algorithms, the best performance with statistical significance is bolded and underlined. Here, λ is set to the maximum value (see Sec. 6.1 for λ's range) for each fair algorithm respectively, which means that all algorithms are trying their best to optimize ranking fairness (Eq. 9), and the numbers in the table represent their unfairness lower bound. Results are rounded to one decimal place.

Then we connect each method's (cNDCG, unfairness) pairs across different λ to form a curve in Figure 2. All the curves run from the top right to the bottom left as λ increases, which means there exists a tradeoff between fairness and effectiveness (cNDCG). The reason behind this tradeoff is that requiring more fairness places more constraints on optimizing effectiveness. Since TopK and RandomK have no trade-off parameter, each of them has only a single (cNDCG, unfairness) pair, and their performances are shown as single points in Figure 2.
In Figure 2, our methods FARA and FARA-Horiz. outperform all other fair methods, reaching the best cNDCG given the same unfairness tolerance. FARA's superiority is consistent in both the post-processing and online settings. All fair ranking algorithms are effective, since they all show the tradeoff, i.e., higher cNDCG when the unfairness tolerance increases. For the unfair algorithms, TopK performs differently in the post-processing and online settings. In the post-processing setting, TopK reaches the highest cNDCG since relevance is known and ranking relevance is the only consideration. However, in the online setting, TopK cannot reach the highest cNDCG. We attribute this drop in cNDCG to TopK naively trusting the relevance estimation without any exploration when optimizing ranking relevance. In contrast, fair algorithms are shown to be robust in the online setting, as they mostly reach better cNDCG than TopK when the unfairness tolerance increases. We think the reason for this robustness is that fair algorithms usually rerank items across sessions to optimize fairness, and such reranking brings exploration.

6.2.2
What is the fairness upper bound that FARA can reach? In Table 3, lower unfairness means higher fairness capacity and a higher fairness upper bound, i.e., the maximum possible fairness an algorithm can reach. However, the feature representation is initially designed for relevance, which makes PLFair suboptimal for fairness optimization. In Table 3, ILP and LP are marked NA for MSLR-10k and Istella-S due to time costs (refer to Table 4). Due to the page limit, we show the ranking performance of the online setting in Figure 2 instead of Table 3.

6.2.3
How is FARA's effectiveness at different cutoffs? In Table 3, we show cNDCG at different cutoffs. Although FairCo, LP, FARA-Horiz., and FARA have similar fairness capacities, FARA significantly outperforms the other fair algorithms on cNDCG@1 and cNDCG@3 on all three datasets. In particular, as shown in Table 3, FARA still significantly outperforms FARA-Horiz. at top ranks, which shows the necessity of vertical allocation. FARA also reaches time efficiency comparable to non-programming-based algorithms like TopK, RandomK, and FairCo; compared with those algorithms, FARA's slightly higher time cost is acceptable given its superior ranking performance in Table 3 and Figure 2.

6.2.5 How does Δ influence FARA? In Figure 3, we show cNDCG and unfairness as the value of Δ varies. With greater Δ, we see a clear boost in cNDCG for FARA, while no such boost happens for FARA-Horiz. The proposed vertical allocation is the key reason for the better ranking relevance once we obtain the optimal exposure planning ΔE*, and we analyze the reason theoretically in § 5.2. Besides, as we increase Δ, unfairness does not vary much and stays close to the minimum unfairness achievable in Table 3. We think the reason for this steady unfairness is that FARA already reaches the upper limit of fairness when Δ is small, so there is little room for improvement as Δ grows.

6.2.6 How does exploration influence FARA in the online setting? To study how the exploration part (the slack variables in Eq. 14a) influences FARA, we conduct an ablation study of FARA with and without exploration. Due to the page limit, we only show the ablation results on the larger datasets, MSLR-10k and Istella-S, in Figure 4. Based on Figure 4, the advantage of exploration is twofold.
First, FARA lies higher than FARA-w/o-Exp. in the figures, which suggests that exploration leads to a better effectiveness-fairness balance. Second, FARA has a smaller lower bound of unfairness tolerance, which implies that exploration gives FARA a higher fairness capacity and lets it meet stricter fairness requirements.

Figure 1 :
Figure 1: The ranklist construction order of the horizontal allocation and the vertical allocation.
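As a minimal sketch of the two construction orders in Figure 1 (function names are ours, not from the FARA implementation), the following enumerates the (ranklist, rank) slots in the order each scheme fills them, for Δ planned ranklists of depth k:

```python
def horizontal_order(delta: int, k: int):
    # Horizontal: finish one ranklist completely before starting the next.
    return [(lst, rank) for lst in range(delta) for rank in range(k)]

def vertical_order(delta: int, k: int):
    # Vertical: fill rank 1 across all Delta ranklists, then rank 2, and
    # so on, so the most relevant items can be committed to the top ranks
    # of every planned list first.
    return [(lst, rank) for rank in range(k) for lst in range(delta)]

print(horizontal_order(2, 2))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(vertical_order(2, 2))    # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The vertical order is what lets the allocation reserve the globally most relevant items for rank 1 of every planned list before any lower rank is filled.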

Figure 2 :
Figure 2: cNDCG vs. unfairness tolerance (Eq. 8) in the post-processing setting and the online setting. Given the same unfairness, the higher a curve or point lies, the better its performance. Our methods FARA and FARA-Horiz. lie higher than all fair baselines in all figures. ILP and LP are unavailable for MSLR-10k and Istella-S due to time costs (refer to Table 4).

Table 3: Comparison of cNDCG@(1,3,5) and unfairness tolerance in the post-processing setting. Significant improvements or degradations with respect to FairCo are indicated with +/-. Within fair algorithms, the best performance with statistical significance is bolded and underlined. Here, λ is set to the maximum value (see Sec. 6.1 for λ's range) for each fair algorithm, which means all algorithms try their best to optimize ranking fairness (Eq. 9), and the numbers in the table represent their unfairness lower bounds. Results are rounded to one decimal place.

Figure 3 :
Figure 3: The influence of the number of planning sessions Δ on FARA in the post-processing setting on MQ2008. λ is set to 1.

Figure 4 :
Figure 4: Ablation study of exploration in the online setting. The higher a curve lies, the better its performance.

Table 1 :
A summary of notations in this paper.
q, D(q), d: For a query q, D(q) is the set of candidate items; d ∈ D(q) is an item.
E, R, C: Binary random variables indicating whether an item d is examined (E = 1), perceived as relevant (R = 1), and clicked (C = 1) by a user, respectively.
R_d: R_d = P(R = 1 | d) is the probability that item d is perceived as relevant.
ϵ_k: ϵ_k = P(E = 1 | rank(d | L) = k) is the examination probability of item d when it is placed at the k-th rank of a ranklist L.
E_d: Item d's accumulated examination probability (see Eq. 7).
k_s, k_c: Users stop examining items ranked lower than k_s due to selection bias (see Eq. 3); k_c is the cutoff prefix used to evaluate Cum-NDCG, and k_c ≤ k_s.
L[k] denotes the k-th ranked item in ranklist L, R(L[k], q) denotes L[k]'s relevance to query q, the cutoff k_c denotes the prefix we evaluate, and w_k denotes the weight placed on the k-th rank.
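A minimal sketch of cutoff-weighted NDCG as described by this notation, assuming the common logarithmic discount w_k = 1/log2(k+1) (an assumption on our part; the paper's exact rank weights may differ):

```python
import math

def dcg(rels, k_c):
    # rels[rank] is the relevance of the item at that rank (0-indexed);
    # w_k = 1 / log2(k + 1) for 1-indexed rank k, i.e. log2(rank + 2) here.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(rels[:k_c]))

def ndcg(rels, k_c):
    # Normalize by the DCG of the ideal (relevance-sorted) ordering.
    ideal = dcg(sorted(rels, reverse=True), k_c)
    return dcg(rels, k_c) / ideal if ideal > 0 else 0.0

rels = [3, 2, 0, 1]            # relevance of the ranked items L[1..4]
print(round(ndcg(rels, 3), 3))
```

The cutoff k_c truncates both the evaluated list and the ideal list, matching the prefix-based evaluation above.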
To consider both settings, we first illustrate the QP problem formulation in the post-processing setting in § 4.2.1 and then extend it to the online setting in § 4.2.2.

4.2.1 The post-processing setting. To get the optimal exposure planning ΔE*, we propose the following QP formulation with ΔE(d), ∀d ∈ D(q), as the decision variables:

max_{ΔE} fair.(q, R, E + ΔE)    (13a)
s.t. Σ_{d∈D(q)} ΔE(d) = Δ · Σ_{k=1}^{k_s} ϵ_k    (13b)

Algorithm 2 (overview). Input: the number of planning sessions Δ and the fairness-effectiveness tradeoff parameter λ; in the online setting, we additionally need the exploration parameters (see Eq. 14). Initialize the time step t ← 0, an empty dictionary B = {} to store ranked lists, and items' exposure E ← 0. At each step: get B* with Algorithm 1; randomly shuffle B*; set B[q] ← B*; pop a ranking list from B[q] and present it; update the cumulative clicks C and items' exposure E; update the relevance estimation via Eq. 15. Here B* holds the ranklists for the future Δ timesteps, and B[q] stores B* for query q.
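To make the exposure-planning step concrete, here is a toy version under assumptions of ours: we take minimizing the squared gap between accumulated exposure and a merit-proportional target as the fairness objective, and keep only a budget equality constraint (nonnegativity bounds on ΔE are dropped for simplicity). With a quadratic objective and one equality constraint, this toy QP has a closed form via a single Lagrange multiplier; it is an illustrative stand-in, not the paper's exact objective (13a):

```python
import numpy as np

rel = np.array([0.9, 0.5, 0.3])   # relevance R(d) of the candidate items
E = np.array([5.0, 1.0, 0.0])     # exposure accumulated so far
budget = 6.0                      # total examination available over the
                                  # next Delta sessions (budget constraint)

# Merit-proportional exposure target after the planned sessions.
target = rel / rel.sum() * (E.sum() + budget)

# Closed form of: min ||E + dE - target||^2  s.t.  sum(dE) = budget.
# Move each item toward its target, then shift uniformly to meet the budget.
gap = target - E
dE = gap + (budget - gap.sum()) / len(gap)

print(np.round(dE, 2))  # planned extra exposure DeltaE(d) per item
```

The under-exposed items receive most of the planned budget. The real formulation additionally trades fairness off against relevance and, in the online setting, adds exploration slack variables.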

Table 2 :
Dataset statistics. For each dataset, the table shows the number of queries, the average number of docs per query, and the range of the relevance annotations.

(Proof of Theorem 5.1, continued) ... and the identity Σ_{d∈D(q)} ΔE(d) ≡ Σ_{d∈D(q)} Ẽ(d), we know that ΔE(d) = Ẽ(d) ∀d ∈ D(q), which means exposure is perfectly allocated according to ΔE(d). Combining the two scenarios, the vertical allocation in Algorithm 1 theoretically guarantees ΔE(d) − Ẽ(d) ≤ ϵ_1 for at least |D(q)| − k_s items. Since ΔE(d) ≫ ϵ_1 and |D(q)| ≫ k_s when using FARA, we can claim that Algorithm 1 correctly allocates exposure. □

Theorem 5.2. FARA can reach the optimal NDCG with the given exposure planning.

Table 4 :
The average time cost (seconds per 1k ranklists), with standard deviations in parentheses. Since ILP and LP are time-consuming on large datasets, their time costs on MSLR-10k and Istella-S are estimated by running only 1k steps instead of the total simulation steps indicated in Sec. 6.1.

... reach. Fair and effective ranking algorithms, including FairCo, LP, FARA-Horiz., and FARA, have similar fairness capacities and outperform the unfair ranking algorithms in terms of unfairness. The success of FARA-Horiz. and FARA validates that the proposed quadratic programming formulation can optimize fairness. Similar cNDCG@5 and unfairness among these effective algorithms are expected according to Theorem 5.3. Among the other fair ranking algorithms, ILP, MMF, and PLFair show inferior fairness capacity. A possible reason is that ILP uses integer linear programming, which may be less effective at optimizing fairness. MMF follows a slightly different definition of fairness that requires the ranking to be fair at every cutoff, which is stricter than the definition we use in this paper. PLFair tries to learn ranking scores that optimize fairness based on the feature representation (the exact setting of the original paper [26]).

6.2.4 How is FARA's time efficiency? Besides fairness and effectiveness optimization, we also empirically compare time efficiency. In Table 4, ILP and LP are very time-consuming, especially on the large datasets MSLR-10k and Istella-S. Compared to ILP and LP, FARA is more than 1000× more time-efficient on MSLR-10k and Istella-S, although all three are programming-based methods. There are two reasons behind FARA's efficiency. First, FARA has far fewer decision variables: O(n) versus O(n²) for ILP and LP. Second, FARA does not need to solve a quadratic program at every time step; by solving the quadratic program once, we obtain Δ ranklists used for future Δ sessions.