Doubly-Robust Estimation for Correcting Position-Bias in Click Feedback for Unbiased Learning to Rank

Clicks on rankings suffer from position-bias: generally items on lower ranks are less likely to be examined – and thus clicked – by users, in spite of their actual preferences between items. The prevalent approach to unbiased click-based learning-to-rank (LTR) is based on counterfactual inverse-propensity-scoring (IPS) estimation. In contrast with general reinforcement learning, counterfactual doubly-robust (DR) estimation has not been applied to click-based LTR in previous literature. In this paper, we introduce a novel DR estimator that is the first DR approach specifically designed for position-bias. The difficulty with position-bias is that the treatment – user examination – is not directly observable in click data. As a solution, our estimator uses the expected treatment per rank, instead of the actual treatment that existing DR estimators use. Our novel DR estimator has more robust unbiasedness conditions than the existing IPS approach, and in addition, provides enormous decreases in variance: our experimental results indicate it requires several orders of magnitude fewer datapoints to converge at optimal performance. For the unbiased LTR field, our DR estimator contributes both increases in state-of-the-art performance and the most robust theoretical guarantees of all known LTR estimators.


INTRODUCTION
Ranking models are the basis of recommender systems and search engines; they aim to provide users with rankings that meet their preferences or help in their search task [24]. The performance of a ranking model is vitally important to the quality of the user experience with a search or recommendation system. Accordingly, the field of learning-to-rank (LTR) concerns methods that optimize ranking models [24]; click-based LTR uses logged user interactions to supervise its optimization [16]. However, clicks are biased indicators of user preference [17,33] because there are many factors besides user preference that influence click behavior. Most importantly, the rank at which an item is displayed is known to have an enormous effect on whether it will be clicked or not [8]. Generally, users do not consider all the items that are presented in a ranking, and instead, are more likely to examine items at the top of the ranking. Consequently, lower-ranked items are less likely to be clicked by users, regardless of whether users actually prefer these items [18]. Therefore, clicks can be more reflective of where an item was displayed during the gathering of data than whether users prefer it. This form of bias is referred to as position-bias [3,8,50]; it is extremely prevalent in user clicks on rankings. Correspondingly, this has led to the introduction of unbiased LTR: methods for click-based optimization that mitigate the effects of position-bias. Wang et al. [49] and Joachims et al. [18] proposed using inverse-propensity-scoring (IPS) estimators [13] to correct for position-bias. By treating the examination probabilities as propensities, IPS can estimate ranking metrics unbiasedly w.r.t. position-bias. This has led to the inception of the unbiased LTR field, in which IPS estimation has remained the basis for most state-of-the-art methods [1,2,28,29,47]. However, variance is a large issue with IPS-based LTR and remains an obstacle for its adoption in real-world applications [29].
Outside of LTR, doubly-robust (DR) estimators are a widely used alternative to IPS estimation [19,35], for instance, for optimization in contextual bandit problems [10]. The DR estimator combines an IPS estimate with the predictions of a regression model, such that it is unbiased when, per treatment, either the estimated propensity or the regression model is accurate [19]. Additionally, the DR estimator can also bring large decreases in variance if the regression model is adequately accurate [10]. Unfortunately, existing DR estimators are not directly applicable to the unbiased LTR problem, since the treatment variable (which indicates whether an item was examined or not) cannot be observed in the data. This is the characteristic problem of position-biased clicks: when an item is not clicked, we cannot determine whether the user chose not to interact or the user did not examine it in the first place. Consequently, the unbiased LTR field has not progressed beyond the usage of IPS estimation.
Our main contribution is the first DR estimator that is specifically designed to perform unbiased LTR from position-biased click data. Instead of using the actual treatment (user examination), which is unobservable in click data, our novel estimator uses the expectation of treatment per rank to construct a covariate. Similar to DR estimators for other tasks, it combines the preference predictions of a regression model with IPS estimation. Unlike IPS estimators, which are only unbiased with accurate knowledge of the logging policy, our DR estimator requires either the correct logging policy propensity or an accurate regression estimate per item. As a result, our DR estimator has less strict requirements for unbiasedness than IPS estimation. Moreover, it can also provide enormous decreases in variance compared to IPS: our experimental results indicate that the DR estimator requires several orders of magnitude fewer datapoints to converge at optimal performance. In all tested top-5 ranking scenarios, it needs less than $10^6$ logged interactions to reach the performance that IPS reaches at $10^9$ logged interactions. Additionally, when compared to other state-of-the-art methods, DR also provides better performance across all tested scenarios. Therefore, the introduction of DR estimation for unbiased LTR contributes the first unbiased LTR estimator that is provably more robust than IPS, while also improving state-of-the-art performance on benchmark unbiased LTR datasets.

Structure of the Paper
The remainder of this work is structured as follows: Section 2 discusses relevant existing work on click-based and unbiased LTR and earlier methods that have applied DR estimation to clicks. Then Section 3 explains our LTR problem setting by describing our assumptions about user behavior, how click data is logged and our LTR goal. Our background section is divided into two parts: Section 4 provides background on counterfactual estimation methods in general, not specific to LTR. These generic methods are introduced so that we can later contrast them with LTR-specific methods and illustrate the adaptations that are required to deal with position-bias specifically. Subsequently, Section 5 describes the existing IPS method that is specifically designed for LTR and position-bias. Furthermore, it discusses existing regression loss estimation and earlier DR estimation methods that have been applied to clicks.
Our novel methods, the novel DR estimator designed to correct position-bias and a novel cross-entropy loss estimator, are introduced in Section 6. Then Section 7 details the experiments that were performed to evaluate our novel method; the results of these experiments are presented and discussed in Section 8. Finally, Section 9 provides a conclusion of this work, followed by the appendices that provide extended proofs for the theoretical claims of the work.

RELATED WORK
This section provides a brief overview of the existing literature in the unbiased LTR field. In addition, it discusses relevant work on dealing with position-biased clicks and existing methods that apply DR estimation to click data outside of the LTR field.
Optimizing ranking models based on click data is a well-established concept [16]. Early methods took an online dueling-bandit approach [39,55] and later an online pairwise approach [26]. The first LTR method with theoretical guarantees of unbiasedness was introduced by Wang et al. [49] and then generalized by Joachims et al. [18]. They assume the probability that a user examines an item only depends on the rank at which it is displayed and that clicks only occur on examined items [8,50]. Then, using counterfactual IPS estimation, they correct for the selection bias imposed by the examination probabilities. The introduction of this approach launched the unbiased LTR field: Agarwal et al. [1] expanded the approach for optimizing neural networks. Oosterhuis and de Rijke [28] generalized the approach to also correct for the item-selection-bias in top-$K$ ranking settings by basing the propensities on a stochastic logging policy. Agarwal et al. [2] showed that user behavior shows an additional trust-bias: increased incorrect clicks at higher ranks [17]. Vardasbi et al. [47] extended the IPS estimator with affine corrections to correct for this trust-bias. Singh and Joachims [41] use IPS to optimize for a fair distribution of exposure over items. Jagerman et al. [14] consider safe model deployments by bounding model performance. Oosterhuis and de Rijke [29] introduced a generalization of the top-$K$ and affine estimators that considers the possibility that the logging policy is updated during the gathering of data. Wang et al. [48] proposed a ratio-propensity-scoring (RPS) estimator that weights pairs of clicked and non-clicked items by the ratio between their propensities. RPS is an extension of IPS that introduces bias but also reduces variance.
In contrast with the rest of the field, recent work has proposed some methods that do not rely on IPS. Zhuang et al. [56] and Yan et al. [53] fit predictive models to observed click data that explicitly factorize relevance and bias factors; while they report promising real-world results, their methods do not provide strong theoretical guarantees w.r.t. unbiasedness. Ovaisi et al. [31] propose an adaptation of Heckman's two-stage method. Besides these exceptions and to the best of our knowledge, all methods in the unbiased LTR field are based on IPS.
Interestingly, methods for dealing with position-biased clicks have also been developed outside of the unbiased LTR field. Komiyama et al. [21] and Lagrée et al. [22] propose bandit algorithms that use similar IPS estimators for serving ads in multiple on-screen positions at once. Furthermore, Li et al. [23] also propose IPS estimators for the unbiased click-based evaluation of ranking models. This further evidences the widespread usage of IPS estimation for correcting position-biased clicks.
Nevertheless, there is previous work that has applied DR estimators to position-biased clicks: Saito [36] proposed a DR estimator for post-click conversions that estimates how users treat an item after clicking it. Kiyohara et al. [20] designed a DR estimator for policy evaluation under cascading click behavior [46]. Lastly, Yuan et al. [54] introduced a DR estimator for click-through-rate (CTR) prediction on advertisement placements. Section 5.3 will discuss these methods in more depth, and how they differ from the prevalent IPS approach to unbiased LTR and our proposed DR estimator. Important for our current discussion is that each of these methods tackles a different problem setting than the LTR problem setting in this work. Moreover, the latter two DR estimators use corrections based on action propensities, similar to generic counterfactual estimation, and in stark contrast with the examination propensities of unbiased LTR. Finally, these works focus on policy evaluation instead of LTR; thus it appears that the effectiveness of DR for ranking model optimization is currently still unknown to the field.

PROBLEM DEFINITION
This section describes the assumptions underlying the theory of this paper and details our exact problem setting. Specifically, it introduces the assumed mathematical model by which clicks are generated, the metric we aim to optimize, and notation to describe logged data.

User Behavior Assumptions
This paper assumes that the probability of a click depends on the user preference w.r.t. item $d$ and the position (also called rank) $k \in \{1, 2, \ldots, K\}$ at which $d$ is displayed. Let $R_d = P(R = 1 \mid d)$ be the probability that a user prefers item $d$, and for each $k$ let $\alpha_k \in [0, 1]$ and $\beta_k \in [0, 1]$ such that $\alpha_k + \beta_k \in [0, 1]$. The probability of $d$ receiving a click $C \in \{0, 1\}$ when displayed at position $k$ is:
$$P(C = 1 \mid d, k) = \alpha_k P(R = 1 \mid d) + \beta_k = \alpha_k R_d + \beta_k. \tag{1}$$
This assumption has been derived [47] from a more interpretable user model proposed by Agarwal et al. [2]. Their model is based on the examination assumption [34]: users first examine an item before they interact with it, i.e., let $E \in \{0, 1\}$ indicate examination, then $E = 0 \rightarrow C = 0$. Additionally, they also incorporate the concept of trust-bias: users are more likely to click against their preferences on higher ranks because of their trust in the ranking model. This can be modelled by having the probability of a click conditioned on examination vary over $k$:
$$P(C = 1 \mid R = 0, E = 1, k) = \epsilon^-_k, \qquad P(C = 1 \mid R = 1, E = 1, k) = \epsilon^+_k. \tag{2}$$
The proposed user model results in the following click probability:
$$P(C = 1 \mid d, k) = P(E = 1 \mid k) \big( \epsilon^+_k R_d + \epsilon^-_k (1 - R_d) \big); \tag{3}$$
by comparing Eq. 1 and 3 we see that:
$$\alpha_k = P(E = 1 \mid k) (\epsilon^+_k - \epsilon^-_k), \qquad \beta_k = P(E = 1 \mid k)\, \epsilon^-_k. \tag{4}$$
Agarwal et al. [2] provide empirical results that suggest this user model is more accurate than the previous model that ignores the trust-bias effect: $\forall k,\ \beta_k = 0$ [18,49,50]. Since the assumption in Eq. 1 is true in both models, our work is applicable to most settings in earlier unbiased LTR work [2,14,18,28,29,31,41,47-49].
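The relation between the affine click model and the examination/trust-bias model can be checked numerically. The following is a minimal sketch, not the authors' code: `eps_plus` and `eps_minus` are names chosen here for the probability of a click on an examined item that is preferred or not preferred, and all parameter values are hypothetical.

```python
def click_prob_affine(alpha_k, beta_k, relevance):
    """Affine click model: P(C=1 | d, k) = alpha_k * R_d + beta_k."""
    return alpha_k * relevance + beta_k


def affine_from_trust_model(p_exam, eps_plus, eps_minus):
    """Convert examination/trust-bias parameters at one rank to (alpha_k, beta_k)."""
    alpha_k = p_exam * (eps_plus - eps_minus)
    beta_k = p_exam * eps_minus
    return alpha_k, beta_k


def click_prob_trust_model(p_exam, eps_plus, eps_minus, relevance):
    """Click probability computed directly from the examination/trust-bias model."""
    return p_exam * (eps_plus * relevance + eps_minus * (1.0 - relevance))


# Hypothetical per-rank parameters for a top-3 ranking (illustrative values only).
P_EXAM = [1.0, 0.6, 0.3]
EPS_PLUS = [0.98, 0.90, 0.80]
EPS_MINUS = [0.30, 0.20, 0.10]

for k in range(3):
    alpha_k, beta_k = affine_from_trust_model(P_EXAM[k], EPS_PLUS[k], EPS_MINUS[k])
    for relevance in (0.0, 0.5, 1.0):
        affine = click_prob_affine(alpha_k, beta_k, relevance)
        direct = click_prob_trust_model(P_EXAM[k], EPS_PLUS[k], EPS_MINUS[k], relevance)
        print(f"rank={k + 1} R_d={relevance:.1f} affine={affine:.4f} direct={direct:.4f}")
```

Both columns should agree for every rank and relevance value, and setting `eps_minus` to zero yields `beta_k = 0`, which corresponds to the earlier model without trust-bias.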

Definition of the LTR Goal
The goal of our ranking task is to maximize the probability that a user will click on something they prefer. Let $\pi$ be the ranking policy to optimize, with $\pi(k \mid d)$ indicating the probability that $\pi$ places $d$ at position $k$, and let $y$ indicate a ranking of size $K$: $y = [y_1, y_2, \ldots, y_K]$; lastly, let $D = \{d_1, d_2, \ldots, d_{|D|}\}$ be the collection of items to be ranked. Most ranking metrics are a weighted sum of item relevances, where the weights $\omega_k$ depend on the item positions:
$$\mathcal{R}(\pi) = \sum_{d \in D} P(R = 1 \mid d) \sum_{k=1}^{K} \pi(k \mid d)\, \omega_k. \tag{5}$$
Discounted cumulative gain (DCG) [15] is a very traditional ranking metric: $\omega^{\text{DCG}}_k = \log_2(k + 1)^{-1}$; however, DCG has no clear interpretation in our assumed user model. In contrast, we argue that our metric should actually be motivated by our user behavior assumptions; accordingly, this work will use the weights: $\omega_k = (\alpha_k + \beta_k)$. This choice results in an easily interpreted metric; for brevity of notation, we first introduce the expected position weight per item:
$$\omega_d = \sum_{k=1}^{K} \pi(k \mid d)\, \omega_k = \sum_{k=1}^{K} \pi(k \mid d) (\alpha_k + \beta_k); \tag{6}$$
using Eq. 1, our ranking metric can then be formulated as:
$$\mathcal{R}(\pi) = \sum_{d \in D} \omega_d R_d = \sum_{d \in D} \sum_{k=1}^{K} \pi(k \mid d)\, P(C = 1, R = 1 \mid d, k). \tag{7}$$
This formulation clearly reveals that our chosen metric directly corresponds to the expected number of items that are both clicked and preferred in our assumed user behavior model. Hence, we will call this metric the number of expected clicks on preferred items, abbreviated to ECP. Given a ranking metric, the LTR field provides several optimization methods to train ranking models. A popular approach for deterministic ranking models is to optimize a differentiable lower bound on the ranking metric [6,51]. For probabilistic ranking models, a simple sampled approximation of the policy gradient can be applied [52]. Alternatively, recent methods provide more efficient approximation methods specifically designed for the LTR problem [25,45].
Finally, we note that our main contributions work with any choice of weights $\omega_k$ and are therefore equally applicable to most traditional LTR metrics. Furthermore, for the sake of simplicity and brevity and without loss of generality, our notation and our defined goal are limited to a single query or ranking context. We refer to previous work by Joachims et al. [18] and Oosterhuis and de Rijke [29] as examples of how straightforward it is to expand these to expectations over multiple queries or contexts.
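As an illustration of how the ECP metric follows from the placement probabilities of a policy, the sketch below computes the expected position weight per item and the resulting metric for a toy problem. The item names, bias parameters and relevances are hypothetical values chosen for this example.

```python
def expected_position_weights(placement_probs, alpha, beta):
    """omega_d = sum_k pi(k | d) * (alpha_k + beta_k) for every item d."""
    return {
        d: sum(p * (a + b) for p, a, b in zip(probs, alpha, beta))
        for d, probs in placement_probs.items()
    }


def ecp(placement_probs, alpha, beta, relevance):
    """Expected clicks on preferred items: sum_d omega_d * R_d."""
    omega = expected_position_weights(placement_probs, alpha, beta)
    return sum(omega[d] * relevance[d] for d in relevance)


# Hypothetical top-2 setting with three items.
ALPHA = [0.8, 0.4]
BETA = [0.1, 0.05]
RELEVANCE = {"a": 0.9, "b": 0.5, "c": 0.7}

# pi(k | d): each list holds the probability of item d being placed at rank 1 and rank 2.
POLICY = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
SWAPPED = {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [0.0, 0.0]}

print("ECP(policy)  =", ecp(POLICY, ALPHA, BETA, RELEVANCE))
print("ECP(swapped) =", ecp(SWAPPED, ALPHA, BETA, RELEVANCE))
```

Placing the more relevant item at the top rank gives the higher ECP here, since the first rank has the larger weight. Item `c` is never displayed, so it has a zero position weight and does not contribute.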

Historically Logged Click Data
Lastly, in the unbiased LTR setting, optimization is performed on historically logged click data. This means a logging policy $\pi_0$ was used to show rankings to users in order to collect the resulting clicks. We will assume the data contains $N$ rankings that were sampled from $\pi_0$ and displayed to users, where $y_i$ is the $i$th ranking and $c_i(d) \in \{0, 1\}$ indicates whether $d$ was clicked when $y_i$ was displayed. The bias parameters $\alpha_k$ and $\beta_k$ have to be estimated from user behavior; we will use $\hat{\alpha}$ and $\hat{\beta}$ to denote the estimated values. To keep our notation brief, we will use the following to denote that all the estimated bias parameters are accurate:
$$(\hat{\alpha} = \alpha \land \hat{\beta} = \beta) \longleftrightarrow (\forall k,\ \hat{\alpha}_k = \alpha_k \land \hat{\beta}_k = \beta_k). \tag{8}$$
We will investigate both the scenarios where the $\alpha$ and $\beta$ bias parameters are known from previous experiments [2,11,50] and where they still have to be estimated. Similarly, the exact distribution of the logging policy $\pi_0$ may also have to be estimated; the following denotes that the estimated distribution $\hat{\pi}_0$ for item $d$ is accurate:
$$\hat{\pi}_0(d) = \pi_0(d) \longleftrightarrow (\forall k,\ \hat{\pi}_0(k \mid d) = \pi_0(k \mid d)). \tag{9}$$
To summarize, our goal is to maximize $\mathcal{R}(\pi)$ based on click data gathered using $\pi_0$ from the position-biased click model in Eq. 1.

BACKGROUND: APPLYING GENERIC COUNTERFACTUAL ESTIMATION TO CLICK-THROUGH-RATES
This section gives an overview of counterfactual estimation for generic reinforcement learning [10,19,44]. Its purpose is two-fold: (i) to illustrate why it is not effective for the LTR problem; and (ii) to contrast the generic estimators with the existing estimators for LTR and our novel estimators. While it appears that the field is aware that generic counterfactual estimation is not a practical solution to LTR [18,20,37], to the best of our knowledge, previously published research has not gone into depth on the reasons for this ineffectiveness.

The Generic Estimation Goal
The first issue between the LTR problem and generic counterfactual estimation is that the latter assumes that the rewards for the actions performed by the logging policy are directly observed, as is the case for standard reinforcement learning tasks [10,44]. However, in the LTR problem, clicks are observed instead of the relevances $R_d$ on which metrics are based. Consequently, for this section, we will restrict ourselves to estimating CTR instead; luckily, this task is similar enough to the LTR problem for our discussion. Let $y[k]$ indicate the $k$th item in ranking $y$; under our assumed click model, the CTR of a policy $\pi$ is:
$$\text{CTR}(\pi) = \sum_{y} \pi(y) \sum_{k=1}^{K} \big( \alpha_k R_{y[k]} + \beta_k \big). \tag{10}$$
The goal in this section is thus to estimate $\text{CTR}(\pi)$ for any given policy $\pi$ using data collected by a logging policy $\pi_0$. Importantly, to be effective at optimization, the estimation process should work for any possible $\pi$ in the model space.

Generic Inverse-Propensity-Scoring Estimation
Standard inverse-propensity-scoring (IPS) estimation corrects for the mismatch between $\pi$ and $\pi_0$ by reweighting the observed actions inversely w.r.t. their estimated propensity: the probability that $\pi_0$ takes this action [10]. In our case, an action is a ranking $y$ and its estimated propensity is $\hat{\pi}_0(y)$. Applying standard IPS estimation to the LTR task results in the following generic IPS estimator:
$$\widehat{\text{CTR}}_{\text{IPS}}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(y_i)}{\hat{\pi}_0(y_i)} \sum_{k=1}^{K} c_i(y_i[k]). \tag{11}$$
The IPS weight aims to correct for the difference in action probability between $\pi_0$ and $\pi$, e.g., if an action is underrepresented in the data because $\pi_0(y) < \pi(y)$ then IPS will compensate by giving more weight to this action as: $\pi(y) / \hat{\pi}_0(y) > 1$. To understand when this approach can provide unbiased estimation, we first look at its expected value:
$$\mathbb{E}\big[\widehat{\text{CTR}}_{\text{IPS}}(\pi)\big] = \sum_{y} \pi(y) \underbrace{\frac{\pi_0(y)}{\hat{\pi}_0(y)}}_{\text{ratio between estimated and real propensity}} \sum_{k=1}^{K} \big( \alpha_k R_{y[k]} + \beta_k \big). \tag{12}$$
The result is very similar to the formula for CTR (Eq. 10) except for the ratio between the estimated and real propensity. Clearly, for unbiasedness this ratio must be equal to one, i.e., the estimated propensity needs to be correct; additionally, each action that $\pi$ may take must have a positive propensity from $\pi_0$:
$$\Big( \forall y,\ \pi(y) > 0 \rightarrow \underbrace{\big( \hat{\pi}_0(y) = \pi_0(y) \land \pi_0(y) > 0 \big)}_{\text{estimated propensity is correct and positive}} \Big) \rightarrow \mathbb{E}\big[\widehat{\text{CTR}}_{\text{IPS}}(\pi)\big] = \text{CTR}(\pi). \tag{13}$$
Thus unbiased estimation via generic IPS is possible; however, the requirements are practically infeasible for any large ranking problem due to its enormous action space. Importantly, for unbiasedness, $\pi_0$ has to give a non-zero probability to each action $\pi$ may take, but data is often collected without knowledge of what $\pi$ will be evaluated (or is reused for many different policies). Moreover, when using IPS for unbiased optimization it should be unbiased for any possible $\pi$ in the policy space that optimization is performed over. In practice, this means that the number of possible rankings is enormous and each should get a non-zero probability from $\pi_0$, which translates to $\pi_0$ giving a positive probability to $\frac{|D|!}{(|D| - K)!}$ rankings. This is quite undesirable since the user experience is likely to suffer under such a random policy; moreover, it also brings serious variance problems. In particular, when we consider the variance of the generic IPS estimator, we see that small $\hat{\pi}_0(y)$ propensities increase variance by a massive degree:
$$\mathbb{V}\big[\widehat{\text{CTR}}_{\text{IPS}}(\pi)\big] = \frac{1}{N} \Bigg( \sum_{y} \pi_0(y) \frac{\pi(y)^2}{\hat{\pi}_0(y)^2}\, \mathbb{E}\Bigg[ \Big( \sum_{k=1}^{K} c(y[k]) \Big)^2 \,\Bigg|\, y \Bigg] - \mathbb{E}\big[\widehat{\text{CTR}}_{\text{IPS}}(\pi)\big]^2 \Bigg). \tag{14}$$
Thus, to summarize, to apply generic IPS to LTR unbiasedly, all rankings require a positive propensity, but due to the enormously large number of rankings this leads to extremely small propensity values, which in turn lead to enormous variance. As a result, generic IPS is not a practical solution for unbiased low-variance LTR and has been widely avoided in practice [20]. Section 5.1 will describe the IPS approach specifically designed for LTR that has much more feasible unbiasedness requirements.
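To give a sense of scale for the action space, the sketch below counts the number of distinct top-K rankings and the propensity each would receive under a uniformly random logging policy. The uniform policy and the collection sizes are assumptions made purely for illustration; `math.perm` requires Python 3.8 or later.

```python
import math


def num_rankings(num_items, k):
    """Number of distinct top-k rankings of num_items items: |D|! / (|D| - K)!."""
    return math.perm(num_items, k)


def uniform_propensity(num_items, k):
    """Propensity of every ranking under a (hypothetical) uniform logging policy."""
    return 1.0 / num_rankings(num_items, k)


def generic_ips_weight(pi_y, pi0_hat_y):
    """Generic IPS weight pi(y) / pi0_hat(y) for one logged ranking."""
    return pi_y / pi0_hat_y


for num_items in (10, 100, 1000):
    count = num_rankings(num_items, 5)
    propensity = uniform_propensity(num_items, 5)
    # Weight for a deterministic evaluated policy (pi(y) = 1) on its single ranking.
    weight = generic_ips_weight(1.0, propensity)
    print(f"|D|={num_items:5d} rankings={count:.3e} propensity={propensity:.3e} weight={weight:.3e}")
```

Even for a modest collection of 100 items and top-5 rankings, a logging policy that covers every ranking uniformly assigns each one a propensity on the order of 1e-10, so a single matching logged ranking receives a weight on the order of 1e10.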

The Generic Direct Method
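Before we continue to LTR-specific methods, we will describe the direct method (DM) and generic doubly-robust (DR) estimation, in order to compare them with our novel estimator later. First, the direct method uses regression estimates from a regression model to estimate policy performance [10].
Let $\hat{R}_d$ indicate a regression estimate of $R_d$; the generic DM estimator is then:
$$\widehat{\text{CTR}}_{\text{DM}}(\pi) = \sum_{y} \pi(y) \sum_{k=1}^{K} \big( \hat{\alpha}_k \hat{R}_{y[k]} + \hat{\beta}_k \big). \tag{15}$$
Clearly, DM is unbiased when the regression estimates are correct for all rankings that $\pi$ may show:
$$\Big( \forall y,\ \pi(y) > 0 \rightarrow \underbrace{\sum_{k=1}^{K} \big( \hat{\alpha}_k \hat{R}_{y[k]} + \hat{\beta}_k \big) = \sum_{k=1}^{K} \big( \alpha_k R_{y[k]} + \beta_k \big)}_{\text{predicted click probabilities are correct}} \Big) \rightarrow \widehat{\text{CTR}}_{\text{DM}}(\pi) = \text{CTR}(\pi). \tag{16}$$
However, this is not a very useful requirement in practice, since solving the regression problem is arguably just as difficult as the subsequent LTR problem. Nonetheless, its unbiasedness requirements are very different from those for IPS; a benefit of DR is that it combines both requirements advantageously.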

Generic Doubly Robust Estimation
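Besides DM and IPS, DR estimation makes use of a covariate (CV) that aims to have a large covariance with IPS estimation but the same expected value as DM. We will call this the generic CV:
$$\widehat{\text{CTR}}_{\text{CV}}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(y_i)}{\hat{\pi}_0(y_i)} \sum_{k=1}^{K} \big( \hat{\alpha}_k \hat{R}_{y_i[k]} + \hat{\beta}_k \big). \tag{17}$$
CV makes use of the actions sampled by $\pi_0$, allowing it to have a high covariance with IPS, but uses predicted click probabilities instead of the actual clicks, enabling it to have the same expected value as DM. Importantly, when IPS is unbiased, i.e., when the estimated propensities are correct, CV is also an unbiased estimate of DM:
$$\Big( \forall y,\ \pi(y) > 0 \rightarrow \underbrace{\big( \hat{\pi}_0(y) = \pi_0(y) \land \pi_0(y) > 0 \big)}_{\text{estimated propensity is correct and positive}} \Big) \rightarrow \mathbb{E}\big[\widehat{\text{CTR}}_{\text{CV}}(\pi)\big] = \widehat{\text{CTR}}_{\text{DM}}(\pi). \tag{18}$$
The DR estimator is a combination of the above three estimators [10,19,35]:
$$\widehat{\text{CTR}}_{\text{DR}}(\pi) = \widehat{\text{CTR}}_{\text{DM}}(\pi) + \widehat{\text{CTR}}_{\text{IPS}}(\pi) - \widehat{\text{CTR}}_{\text{CV}}(\pi) = \widehat{\text{CTR}}_{\text{DM}}(\pi) + \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(y_i)}{\hat{\pi}_0(y_i)} \sum_{k=1}^{K} \underbrace{\big( c_i(y_i[k]) - \hat{\alpha}_k \hat{R}_{y_i[k]} - \hat{\beta}_k \big)}_{\text{diff. between observed click and predicted click prob.}}. \tag{19}$$
DR starts with the regression-based estimate of DM; then, for each logged ranking, it adds the IPS-weighted difference between the observed clicks and the predicted clicks. In other words, DR uses DM as a baseline and adds an IPS estimate of the difference between the regression model and the actual clicks.
The first advantage of DR estimation is its unbiasedness requirements. It has the following bias w.r.t. CTR:
$$\mathbb{E}\big[\widehat{\text{CTR}}_{\text{DR}}(\pi)\big] - \text{CTR}(\pi) = \sum_{y} \pi(y) \underbrace{\Big( \frac{\pi_0(y)}{\hat{\pi}_0(y)} - 1 \Big)}_{\text{error in propensity}} \underbrace{\sum_{k=1}^{K} \big( \alpha_k R_{y[k]} + \beta_k - \hat{\alpha}_k \hat{R}_{y[k]} - \hat{\beta}_k \big)}_{\text{error in predicted click prob.}}. \tag{20}$$
As we can see, the bias of DR is a summation over rankings, where per ranking the error in propensity is multiplied with the error in predicted click probability. Due to this product, only one of the two errors has to be zero per ranking for the total bias to be zero. In other words, DR is unbiased when for each ranking either the propensity or the predicted click probabilities are correct:
$$\Big( \forall y,\ \pi(y) > 0 \rightarrow \Big( \underbrace{\big( \hat{\pi}_0(y) = \pi_0(y) \land \pi_0(y) > 0 \big)}_{\text{propensity is correct and positive}} \lor \underbrace{\sum_{k=1}^{K} \big( \hat{\alpha}_k \hat{R}_{y[k]} + \hat{\beta}_k \big) = \sum_{k=1}^{K} \big( \alpha_k R_{y[k]} + \beta_k \big)}_{\text{predicted click probabilities are correct}} \Big) \Big) \rightarrow \mathbb{E}\big[\widehat{\text{CTR}}_{\text{DR}}(\pi)\big] = \text{CTR}(\pi). \tag{21}$$
As a result, if either DM or IPS is unbiased then DR is unbiased; furthermore, it can potentially be unbiased when neither DM nor IPS is. In addition to the beneficial unbiasedness requirements, DR can also allow for reduced variance if there is a positive covariance between IPS and CV:
$$\mathbb{V}\big[\widehat{\text{CTR}}_{\text{DR}}(\pi)\big] = \mathbb{V}\big[\widehat{\text{CTR}}_{\text{IPS}}(\pi)\big] + \mathbb{V}\big[\widehat{\text{CTR}}_{\text{CV}}(\pi)\big] - 2\, \mathbb{C}\text{ov}\big( \widehat{\text{CTR}}_{\text{IPS}}(\pi), \widehat{\text{CTR}}_{\text{CV}}(\pi) \big). \tag{22}$$
In practice, this means that somewhat accurate regression estimates can provide a decrease in variance over IPS. Nevertheless, this decrease is not enough to overcome the variance problems that stem from the small propensities that are involved in LTR. Consequently, state-of-the-art unbiased LTR does not apply standard IPS and DR [4,29].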
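The double-robustness property of the generic DR estimator can be verified exactly on a toy problem by enumerating all rankings instead of sampling. The sketch below is a hedged illustration under stated assumptions: all items, policies and parameter values are hypothetical, and for simplicity the predicted click probabilities use the true bias parameters, so only the relevance estimates can be wrong.

```python
from itertools import permutations

ALPHA = [0.7, 0.3]    # hypothetical alpha_k for ranks 1 and 2
BETA = [0.1, 0.05]    # hypothetical beta_k for ranks 1 and 2
REL = {"a": 0.9, "b": 0.4, "c": 0.1}     # true relevances R_d
RANKINGS = list(permutations(REL, 2))     # all 6 top-2 rankings


def expected_clicks(y, rel):
    """Expected number of clicks on ranking y under the affine click model."""
    return sum(ALPHA[k] * rel[d] + BETA[k] for k, d in enumerate(y))


def true_ctr(pi):
    return sum(pi[y] * expected_clicks(y, REL) for y in RANKINGS)


def expected_dr(pi, pi0, pi0_hat, rel_hat):
    """E[DR] = DM + E[IPS] - E[CV], with expectations taken exactly over y ~ pi0."""
    dm = sum(pi[y] * expected_clicks(y, rel_hat) for y in RANKINGS)
    ips = sum(pi0[y] * pi[y] / pi0_hat[y] * expected_clicks(y, REL) for y in RANKINGS)
    cv = sum(pi0[y] * pi[y] / pi0_hat[y] * expected_clicks(y, rel_hat) for y in RANKINGS)
    return dm + ips - cv


def product_bias(pi, pi0, pi0_hat, rel_hat):
    """Sum over rankings of (propensity error) * (click-prediction error)."""
    return sum(
        pi[y] * (pi0[y] / pi0_hat[y] - 1.0)
        * (expected_clicks(y, REL) - expected_clicks(y, rel_hat))
        for y in RANKINGS
    )


PI = dict(zip(RANKINGS, [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]))        # evaluated policy
PI0 = {y: 1.0 / len(RANKINGS) for y in RANKINGS}                   # true logging policy
PI0_WRONG = {y: (i + 1) / 21.0 for i, y in enumerate(RANKINGS)}    # misestimated propensities
REL_WRONG = {d: 0.5 for d in REL}                                  # inaccurate regression model

ctr = true_ctr(PI)
print("bias, correct propensities only:", expected_dr(PI, PI0, PI0, REL_WRONG) - ctr)
print("bias, correct regression only:  ", expected_dr(PI, PI0, PI0_WRONG, REL) - ctr)
print("bias, both wrong:               ", expected_dr(PI, PI0, PI0_WRONG, REL_WRONG) - ctr)
print("product-form bias, both wrong:  ", product_bias(PI, PI0, PI0_WRONG, REL_WRONG))
```

The first two printed biases should be zero up to floating-point error, while the third is clearly non-zero and matches the product form of the bias.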

BACKGROUND: COUNTERFACTUAL ESTIMATION FOR UNBIASED LEARNING-TO-RANK
While Section 4 covers the standard counterfactual estimation techniques and why they are ineffective for the LTR problem, this section describes the existing IPS estimator that is specifically designed for LTR and discusses previous work that has applied DR estimation to click data.

Inverse-Propensity-Scoring in Unbiased LTR
As discussed in Section 2, the main approach in state-of-the-art unbiased LTR work is based on inverse-propensity-scoring (IPS) estimation [13]. Under the affine click model in Eq. 1, the propensities are not the probability of observation, as in the earliest unbiased LTR work [18,49], but the expected correlation between the click probability and the user preference for an item $d$ under $\pi_0$ [29,47]:
$$\rho_d = \sum_{k=1}^{K} \pi_0(k \mid d)\, \alpha_k, \tag{23}$$
where $\pi_0(k \mid d)$ indicates the probability that $\pi_0$ places $d$ at position $k$. Because the true bias parameters and logging policy may be unknown, $\rho_d$ and $\omega_d$ are replaced by the following estimated values:
$$\hat{\rho}_d = \max\Big( \sum_{k=1}^{K} \hat{\pi}_0(k \mid d)\, \hat{\alpha}_k,\ \tau \Big), \qquad \hat{\omega}_d = \sum_{k=1}^{K} \pi(k \mid d) (\hat{\alpha}_k + \hat{\beta}_k), \tag{24}$$
where the clipping parameter $\tau \in (0, 1]$ prevents small $\hat{\rho}_d$ values and is applied for variance reduction [18,42]. The state-of-the-art IPS estimator introduced by Oosterhuis and de Rijke [29] first corrects each (non-)click with $\hat{\beta}_{k_i(d)}$ (the $\hat{\beta}$ value for the position where the click took place) to correct for clicks in spite of preference (trust-bias), and then inversely weights the result by $\hat{\rho}_d$ to correct for the correlation between the user preference and the click probability (position-bias and item-selection-bias [28]):
$$\hat{\mathcal{R}}_{\text{IPS}}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \sum_{d \in D} \frac{\hat{\omega}_d}{\hat{\rho}_d} \big( c_i(d) - \hat{\beta}_{k_i(d)} \big), \tag{25}$$
where $k_i(d)$ indicates the position of $d$ in the $i$th ranking.
The main difference between the generic IPS estimation (Eq. 11) and the LTR IPS estimator (Eq. 25) is that the former bases its corrections on the differences between the action probabilities of $\pi$ and $\pi_0$, while the latter uses the correlation between clicks and relevance under $\pi_0$. Thus while both use the behavior of the logging policy $\pi_0$, the LTR IPS estimator uses the assumed click model of Eq. 1 as well. While the reliance on the click model makes the estimator more effective, it also makes the estimator specific to this click behavior. The remainder of this section discusses how, due to this specific design, the theoretical properties of the LTR IPS estimator are clearly preferable over those of generic IPS estimation applied to the LTR problem.
To start, the IPS estimator has the following bias:
$$\mathbb{E}\big[\hat{\mathcal{R}}_{\text{IPS}}(\pi)\big] - \mathcal{R}(\pi) = \sum_{d \in D} \Bigg( \underbrace{\Big( \frac{\hat{\omega}_d\, \rho_d}{\hat{\rho}_d} - \omega_d \Big) R_d}_{\text{error from } \hat{\alpha},\ \hat{\beta} \text{ and } \hat{\pi}_0} + \underbrace{\frac{\hat{\omega}_d}{\hat{\rho}_d} \sum_{k=1}^{K} \pi_0(k \mid d) (\beta_k - \hat{\beta}_k)}_{\text{error from } \hat{\beta}} \Bigg). \tag{26}$$
Appendix A provides a derivation of its bias; it also proves that the IPS estimator is unbiased when both the bias parameters and the logging policy distribution are correctly estimated and clipping has no effect [18,47]:
$$\Big( \underbrace{\hat{\alpha} = \alpha \land \hat{\beta} = \beta}_{\text{bias parameters are correct}} \land \underbrace{\big( \forall d,\ \hat{\pi}_0(d) = \pi_0(d) \land \rho_d \geq \tau \big)}_{\hat{\pi}_0 \text{ is correctly estimated and clipping has no effect}} \Big) \rightarrow \mathbb{E}\big[\hat{\mathcal{R}}_{\text{IPS}}(\pi)\big] = \mathcal{R}(\pi). \tag{27}$$
Conversely, IPS is biased when clipping does have an effect, even if the bias parameters and logging policy distribution are correctly estimated. Importantly, these unbiasedness requirements are much more feasible than those for the generic IPS estimator (Eq. 13): where the generic IPS estimator requires each ranking to have a positive probability ($\forall y,\ \pi_0(y) > 0$) of being displayed during logging, the LTR IPS estimator requires each item to have a propensity of at least $\tau$ ($\forall d,\ \rho_d \geq \tau$). In other words, the correlation between clicks and relevances should be at least $\tau$; this is even feasible under a deterministic logging policy that always displays the same single ranking if all items are displayed at once [18,28]. However, the IPS estimator does need an accurate estimate of the bias parameters $\hat{\alpha}$ and $\hat{\beta}$ in addition to an accurate estimate of the logging policy $\hat{\pi}_0$, but previous work indicates that this is actually doable in practice [3,18,49,50]. In summary, compared to generic IPS estimation, the IPS estimator for LTR has replaced infeasible requirements on the logging policy with attainable requirements on bias estimation.
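The variance that each item $d$ contributes to the IPS estimator can be decomposed into the following parts:
$$\mathbb{V}\Big[ \frac{\hat{\omega}_d}{\hat{\rho}_d} \big( c(d) - \hat{\beta}_{k(d)} \big) \Big] = \underbrace{\frac{\hat{\omega}_d^2}{\hat{\rho}_d^2}}_{\text{multiplier from IPS weight}} \Big( \underbrace{\mathbb{V}[c(d)]}_{\text{var. from click}} + \underbrace{\mathbb{V}\big[\hat{\beta}_{k(d)}\big]}_{\text{var. from correction}} - \underbrace{2\, \mathbb{C}\text{ov}\big( c(d), \hat{\beta}_{k(d)} \big)}_{\text{cov. between click and correction}} \Big). \tag{28}$$
We see how clipping prevents extremely large values for the variance multiplier from the IPS weight by preventing small $\hat{\rho}_d$ values and can thereby greatly reduce variance [18,42]. Importantly, the reduction in variance is often much greater than the increase in bias, making this an attractive trade-off that has been widely adopted by the unbiased LTR field [1,28]. There is currently no known method for variance reduction in IPS-based position-bias correction that does not introduce bias, and thus, in practice unbiased LTR methods are actually often applied in a biased manner.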
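The effect of clipping on the LTR IPS estimator can be illustrated with a small sketch. It assumes a deterministic logging policy, known bias parameters, and a synthetic click log whose click frequencies exactly match the click probabilities of the affine model; all names and values are hypothetical, and items outside the displayed ranking are simply skipped.

```python
def clipped_propensity(pi0_hat_d, alpha_hat, tau):
    """rho_hat_d = max(sum_k pi0_hat(k | d) * alpha_hat_k, tau)."""
    return max(sum(p * a for p, a in zip(pi0_hat_d, alpha_hat)), tau)


def ips_estimate(log, omega_hat, rho_hat, beta_hat):
    """Average over logged rankings of omega_hat_d / rho_hat_d * (click - beta_hat at the rank of d).

    `log` is a list of (ranking, clicked_items) pairs. Items that are not in the
    displayed ranking contribute zero (an assumption of this sketch).
    """
    total = 0.0
    for ranking, clicked in log:
        for k, d in enumerate(ranking):
            click = 1.0 if d in clicked else 0.0
            total += omega_hat[d] / rho_hat[d] * (click - beta_hat[k])
    return total / len(log)


ALPHA = [0.8, 0.4]
BETA = [0.1, 0.05]
# Deterministic logging policy pi0(k | d): always shows the ranking ["a", "b"].
PI0 = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
# Evaluated policy swaps the items: omega_a = 0.4 + 0.05 and omega_b = 0.8 + 0.1.
OMEGA = {"a": 0.45, "b": 0.9}
# With R_a = R_b = 0.5: P(click a) = 0.8 * 0.5 + 0.1 = 0.5, P(click b) = 0.4 * 0.5 + 0.05 = 0.25.
LOG = [(["a", "b"], {d for d, n in (("a", 50), ("b", 25)) if i < n}) for i in range(100)]
TRUE_VALUE = OMEGA["a"] * 0.5 + OMEGA["b"] * 0.5

for tau in (0.01, 0.5):
    rho_hat = {d: clipped_propensity(PI0[d], ALPHA, tau) for d in PI0}
    estimate = ips_estimate(LOG, OMEGA, rho_hat, BETA)
    print(f"tau={tau}: estimate={estimate:.4f} true={TRUE_VALUE:.4f}")
```

With the small threshold no propensity is clipped and the estimate matches the true metric value on this idealized log. With `tau = 0.5`, the propensity of item `b` is raised from 0.4 to 0.5, which lowers its weight and introduces bias.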


Existing Cross-Entropy Loss Estimation
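As discussed, DM requires an accurate regression model to be unbiased or effective. In the ideal situation, one may optimize a regression model for estimating relevance using the cross-entropy loss:
$$\mathcal{L}(\hat{R}) = - \sum_{d \in D} \Big( R_d \log(\hat{R}_d) + (1 - R_d) \log(1 - \hat{R}_d) \Big). \tag{29}$$
However, this loss cannot be computed from the click data since $R_d$ cannot be observed. Luckily, Bekker et al. [5] have introduced an estimator that can be applied to position-biased clicks:
$$\hat{\mathcal{L}}'(\hat{R}) = - \frac{1}{N} \sum_{i=1}^{N} \sum_{d \in D} \Big( \frac{c_i(d)}{\hat{\rho}_d} \log(\hat{R}_d) + \Big( 1 - \frac{c_i(d)}{\hat{\rho}_d} \Big) \log(1 - \hat{R}_d) \Big). \tag{30}$$
Saito et al. [38] showed that this estimator is effective for recommendation tasks on position-biased click data. $\hat{\mathcal{L}}'$ is unbiased [5,38] when there is no trust-bias, propensities are accurate and clipping has no effect:
$$\Big( \underbrace{(\forall k,\ \beta_k = 0)}_{\text{no trust-bias}} \land \hat{\alpha} = \alpha \land \underbrace{\big( \forall d,\ \hat{\pi}_0(d) = \pi_0(d) \land \rho_d \geq \tau \big)}_{\hat{\pi}_0 \text{ is correctly estimated and clipping has no effect}} \Big) \rightarrow \mathbb{E}\big[\hat{\mathcal{L}}'(\hat{R})\big] = \mathcal{L}(\hat{R}). \tag{31}$$
In Section 6.5, we propose a novel variation on this estimator that can also correct for trust-bias and that treats non-displayed items in a more intuitive way.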
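The click-based cross-entropy estimator can be checked on an idealized log in the same way. The sketch below weights each click by the inverse propensity of the item and compares the result with the true cross-entropy loss. It assumes a fixed displayed ranking, accurate propensities, and synthetic click frequencies that match the click probabilities exactly; all values are hypothetical.

```python
import math


def estimated_cross_entropy(log, items, rel_hat, rho_hat):
    """-(1/N) sum_i sum_d [w * log(R_hat_d) + (1 - w) * log(1 - R_hat_d)] with w = c_i(d) / rho_hat_d."""
    total = 0.0
    for clicked in log:
        for d in items:
            weight = (1.0 if d in clicked else 0.0) / rho_hat[d]
            total += weight * math.log(rel_hat[d]) + (1.0 - weight) * math.log(1.0 - rel_hat[d])
    return -total / len(log)


def true_cross_entropy(items, rel, rel_hat):
    """Cross-entropy between true relevances and relevance estimates."""
    return -sum(
        rel[d] * math.log(rel_hat[d]) + (1.0 - rel[d]) * math.log(1.0 - rel_hat[d])
        for d in items
    )


ITEMS = ["a", "b"]
REL = {"a": 0.5, "b": 0.5}        # true relevances (unobservable in practice)
REL_HAT = {"a": 0.7, "b": 0.3}    # regression estimates being evaluated
RHO = {"a": 0.8, "b": 0.4}        # propensities: item a always at rank 1, item b at rank 2

# No trust-bias: P(click a) = 0.8 * 0.5 = 0.4 and P(click b) = 0.4 * 0.5 = 0.2.
LOG_NO_TRUST = [{d for d, n in (("a", 4), ("b", 2)) if i < n} for i in range(10)]
# Trust-bias with beta = 0.1 at both ranks: P(click a) = 0.5 and P(click b) = 0.3.
LOG_TRUST = [{d for d, n in (("a", 5), ("b", 3)) if i < n} for i in range(10)]

print("true loss:                ", true_cross_entropy(ITEMS, REL, REL_HAT))
print("estimate, no trust-bias:  ", estimated_cross_entropy(LOG_NO_TRUST, ITEMS, REL_HAT, RHO))
print("estimate, with trust-bias:", estimated_cross_entropy(LOG_TRUST, ITEMS, REL_HAT, RHO))
```

Without trust-bias the estimate coincides with the true loss on this log. With trust-bias the extra clicks are not corrected for, and the estimate deviates from the true loss.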

Existing Doubly-Robust Estimation for Logged Click Data
Our discussion of generic counterfactual estimation in Section 4 concluded with DR estimation and the advantageous properties it can have over IPS and DM [19,35]. Given that the current state-of-the-art in LTR is based on IPS, it seems very promising to apply DR to unbiased LTR. Unfortunately, at first glance it seems DR is inapplicable to the LTR problem, since the treatment is examination by the user and we cannot directly observe whether a user has examined a non-clicked item or not. Because DR estimation balances IPS and regression estimates unbiasedly using the knowledge of treatment in the data, e.g., which actions were taken [10] (cf. Eq. 19), it appears this characteristic problem of position-biased clicks makes existing DR estimation inapplicable. However, as discussed in Section 4, this is not a problem for generic counterfactual estimators for CTR estimation from logged click data. Accordingly, previous work that has applied DR estimation to clicks has taken the generic approach, with corrections based purely on action propensities. For instance, Yuan et al. [54] use IPS and DR estimators for CTR prediction on advertisements that are presented in different display positions. Their IPS weights are based on the difference in action probabilities between the logging policy and the evaluated policy (cf. Eq. 11 & 19). In a similar vein, Kiyohara et al. [20] propose a DR estimator for predicting a CTR-based slate metric under cascading user behavior; they also use IPS weights based solely on action probabilities. These methods are very different from the IPS approach for unbiased LTR (Section 5.1) because their corrections are not based on the mismatch between clicks and relevance, but on the mismatch in action probabilities between policies. As a result, they cannot handle situations where $\pi_0$ is deterministic and position-bias occurs, in contrast with the LTR IPS estimator. We thus argue that the approaches of Yuan et al. [54] and Kiyohara et al. [20] are better understood as methods for correcting policy differences, instead of methods designed for correcting position-bias in clicks directly.
Another DR estimator applied to click data was proposed by Saito [36], who realized that when estimating post-click conversions, the click signal can be seen as the treatment variable. This avoids the unobservable examination problem as clicks are always directly observable in the data. The propensities of Saito are thus based on click probabilities, instead of action or examination probabilities. While being very useful for post-click conversions, their method cannot be applied to predicting click probabilities or to our LTR problem setting.
In summary, existing DR estimation does not seem directly applicable to position-bias since the treatment variable, item examination, is unobservable in click logs. To the best of our knowledge, DR estimators that have been applied to clicks correct for the mismatch between the logging policy and the evaluated policy. Currently, there does not appear to be a DR estimator that uses the correlations between clicks and relevances, as state-of-the-art IPS estimation for LTR does.

METHOD: THE DIRECT METHOD AND DOUBLY ROBUST ESTIMATION FOR LEARNING TO RANK
In Section 4 the generic IPS, DM and DR estimators were introduced; subsequently, Section 5 showed how IPS has been successfully adapted for the LTR problem specifically. This naturally raises the question whether adaptations of DM and DR estimation for LTR could bring additional success to the field. To answer this question, this section introduces novel DM and DR estimators for LTR and also a novel estimator for the cross-entropy loss.

The Direct Method for Learning to Rank
As discussed in Section 4, the direct method (DM) relies solely on regression to estimate performance, in contrast with IPS, which uses click frequencies and propensities [10]. The generic DM estimator in Eq. 15 estimates CTR with the estimated bias parameters $\hat{\alpha}$ and $\hat{\beta}$ and the relevance estimates $\hat{R}_d$. However, to estimate $R(\pi)$ (Eq. 7), we only require the weight estimate $\hat{\omega}_d$ and $\hat{R}_d$ per item $d$. The DM estimate of $R(\pi)$ then is:
$$\hat{R}_{DM}(\pi) = \sum_{d \in D} \hat{\omega}_d \hat{R}_d. \tag{32}$$
While to the best of our knowledge it is novel, our DM estimator is extremely straightforward: for each item we multiply its estimated expected position weight $\hat{\omega}_d$ with its relevance estimate $\hat{R}_d$. The biggest difference with the generic DM is that instead of using the policy probabilities of $\pi$ or $\hat{\alpha}$ and $\hat{\beta}$ directly, it uses $\hat{\omega}_d$, which is based on their values (Eq. 24). By considering Eq. 7, 24 and 32, we can clearly see that $\hat{R}_{DM}$ has the following condition for unbiasedness:
$$\big(\hat{\alpha} = \alpha \land \hat{\beta} = \beta \land (\forall d \in D,\ \hat{R}_d = R_d)\big) \Rightarrow \hat{R}_{DM}(\pi) = R(\pi). \tag{33}$$
In other words, both the bias parameters and the regression model have to be accurate for $\hat{R}_{DM}(\pi)$ to be unbiased. The first part of the condition is required because accurate $\hat{\alpha}$ and $\hat{\beta}$ are needed for an accurate estimate of the $\omega$ weights. The second part of the condition, that all regression estimates $\hat{R}_d$ need to be correct, shows that it is practically infeasible for DM to be unbiased, since finding accurate $\hat{R}_d$ values appears to be as difficult as the ranking task itself. This reasoning could explain why, to the best of our knowledge, no existing work has applied DM to unbiased LTR. However, the experimental findings in this paper cast doubt on this reasoning, since they show that DM can be more effective than IPS, especially when the number of displayed rankings $N$ is not very large.
An advantage of DM over IPS is how non-clicked items are treated: the IPS estimator (Eq. 25) treats items that are not clicked in the logged data as completely irrelevant items that should be placed at the bottom of a ranking. As pointed out in previous work [48], this seems very unfair to items that were never displayed during logging, and this winner-takes-all behavior could potentially explain the high variance of IPS. In contrast, because DM relies on regression estimates, it can provide non-zero values to all items, even those never displayed. Additionally, DM does not require any estimate of the logging policy $\hat{\pi}_0$, whereas IPS heavily relies on $\hat{\pi}_0$. But DM does not utilize any of the click data, and thus DM cannot correct for inaccuracies in the regression estimates. Furthermore, the unbiasedness criteria for DM are much less feasible than those of IPS. Ideally, the advantageous properties of both IPS and DM should be combined in a single estimator, while avoiding the downsides of each approach. The following subsection considers whether DR estimation could result in such a combination.
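As an illustration of how little machinery the DM estimate needs, the following minimal sketch computes expected position weights and the resulting DM value. It assumes that the weight of an item is the policy's expected value of $\hat{\alpha}_k + \hat{\beta}_k$ over the ranks $k$ at which the item can be placed; the function names, the `rank_probs` representation and all numbers are hypothetical and chosen only for this example.

```python
def expected_rank_weights(rank_probs, alpha_hat, beta_hat):
    """Estimated expected position weight per item under the evaluated policy.

    rank_probs[d][k] is the (assumed known) probability that the policy
    places item d at rank k; ranks beyond the cutoff contribute nothing.
    """
    return {
        d: sum(p * (a + b) for p, a, b in zip(probs, alpha_hat, beta_hat))
        for d, probs in rank_probs.items()
    }


def dm_estimate(omega_hat, r_hat):
    """Direct-method estimate: sum over items of omega_hat[d] * r_hat[d]."""
    return sum(omega_hat[d] * r_hat[d] for d in omega_hat)


# Toy top-2 example with three items (all numbers are made up).
alpha_hat = [0.9, 0.5]
beta_hat = [0.05, 0.02]
rank_probs = {"a": [0.7, 0.3], "b": [0.3, 0.6], "c": [0.0, 0.1]}
r_hat = {"a": 0.8, "b": 0.4, "c": 0.5}

omega_hat = expected_rank_weights(rank_probs, alpha_hat, beta_hat)
dm_value = dm_estimate(omega_hat, r_hat)
```

Note that no click data and no logging-policy estimate enter this computation, which is exactly why DM cannot correct its own regression mistakes.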

A Novel Doubly-Robust Estimator for Relevance Estimation under Position-Bias
Now that we have IPS and DM estimators for LTR, we only require a control variate (CV) to construct a doubly-robust (DR) estimator [10]. As discussed in Section 4, CV should have the same expected value as DM when IPS is unbiased, while simultaneously having a high covariance with IPS. With these requirements in mind, we propose the following CV:
$$\hat{R}_{CV}(\pi) = \frac{1}{N}\sum_{i=1}^{N}\sum_{d \in D}\frac{\hat{\omega}_d}{\rho_d}\,\hat{\alpha}_{k_i(d)}\hat{R}_d.$$
Interestingly, CV does not use the observed clicks but utilizes the ranks at which an item was displayed in the logged data. The last part of the estimator, $\hat{\alpha}_{k_i(d)}\hat{R}_d$, represents the increase in click probability an item receives by being displayed at a position; the hope is that this correlates with the actual observed clicks. Importantly, CV has the same expected value as DM when $\hat{\pi}_0$ is correct and clipping has no effect (see Theorem B.2 for proof):
$$\big(\forall d \in D,\ \rho_d = \mathbb{E}_{y \sim \pi_0}[\hat{\alpha}_{k(d)}]\big) \Rightarrow \mathbb{E}_{y \sim \pi_0}\big[\hat{R}_{CV}(\pi)\big] = \hat{R}_{DM}(\pi).$$
If we compare the above condition with the unbiasedness condition of IPS in Eq. 27, we see that the latter encapsulates the former. In other words, CV is an unbiased estimate of DM when IPS is an unbiased estimate of $R$.
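Since our CV has the required properties, we can straightforwardly propose our novel DR estimator (cf. Eq. 19):
$$\hat{R}_{DR}(\pi) = \hat{R}_{DM}(\pi) + \hat{R}_{IPS}(\pi) - \hat{R}_{CV}(\pi) = \hat{R}_{DM}(\pi) + \frac{1}{N}\sum_{i=1}^{N}\sum_{d \in D}\frac{\hat{\omega}_d}{\rho_d}\underbrace{\big(c_i(d) - \hat{\alpha}_{k_i(d)}\hat{R}_d - \hat{\beta}_{k_i(d)}\big)}_{\text{diff. between observed click and predicted click prob.}}. \tag{36}$$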
We see that our DR estimator follows a similar structure as generic DR estimation (Section 4): it starts with DM as a baseline, then adds an IPS estimate of the difference between DM and the true reward $R$. Concretely, the difference between each observed click signal $c_i(d)$ and the predicted click probability $\hat{\alpha}_{k_i(d)}\hat{R}_d + \hat{\beta}_{k_i(d)}$ is taken and reweighted with an IPS estimate. Effectively, the observed clicks are thus used to estimate and correct the error of DM. An intuitive advantage of DR is that for items that were never displayed during logging (i.e. $\forall\, 0 < i \leq N,\ \hat{\alpha}_{k_i(d)} = 0$), DR relies solely on regression to estimate their relevance, similar to DM. Yet for items that have been displayed many times, DR will estimate relevance more similarly to IPS, and is thereby able to correct for regression mistakes with clicks. The combination of these properties means that DR can avoid the winner-takes-all behavior of IPS, where all non-displayed or non-clicked items are seen as completely non-relevant and pushed to the bottom of the ranking. We expect this to mean that DR does not have the same variance problems as IPS. At the same time, DR still relies on clicks and thus does not require perfectly accurate regression estimates. Our theoretical analysis shows that this enables DR to have more reasonable unbiasedness requirements than DM.
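The structure described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: it assumes the DR estimate equals the DM baseline plus the average, over logged rankings, of $(\hat{\omega}_d / \rho_d)$ times the difference between the observed click and the predicted click probability. The log format (a dict from displayed item to rank index, plus a set of clicked items), the function names and all numbers are hypothetical.

```python
def ips_estimate(logs, omega_hat, rho, beta_hat):
    """LTR IPS estimate: trust-bias-corrected clicks reweighted by omega/rho."""
    total = 0.0
    for ranks, clicked in logs:
        for d, k in ranks.items():  # non-displayed items contribute zero
            click = 1.0 if d in clicked else 0.0
            total += omega_hat[d] / rho[d] * (click - beta_hat[k])
    return total / len(logs)


def dr_estimate(logs, omega_hat, rho, alpha_hat, beta_hat, r_hat):
    """DM baseline plus an IPS-weighted correction of the regression error."""
    dm = sum(omega_hat[d] * r_hat[d] for d in omega_hat)
    correction = 0.0
    for ranks, clicked in logs:
        for d, k in ranks.items():
            click = 1.0 if d in clicked else 0.0
            predicted = alpha_hat[k] * r_hat[d] + beta_hat[k]
            correction += omega_hat[d] / rho[d] * (click - predicted)
    return dm + correction / len(logs)


# Toy data: item "e" is never displayed during logging.
alpha_hat, beta_hat = [0.9, 0.5], [0.05, 0.02]
omega_hat = {"a": 0.8, "b": 0.6, "c": 0.1, "e": 0.05}
rho = {"a": 0.7, "b": 0.5, "c": 0.2, "e": 0.1}
r_hat = {"a": 0.8, "b": 0.4, "c": 0.5, "e": 0.3}
logs = [
    ({"a": 0, "b": 1}, {"a"}),
    ({"b": 0, "a": 1}, set()),
    ({"a": 0, "c": 1}, {"c"}),
]

dr_value = dr_estimate(logs, omega_hat, rho, alpha_hat, beta_hat, r_hat)
```

Two sanity checks follow directly from the sketch: with all regression estimates set to zero it returns the IPS estimate, and changing the regression estimate of the never-displayed item "e" changes the result by exactly its weight times the change, as it would for DM.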
The main difference between standard DR estimation for contextual bandits, i.e. as described by Dudík et al. [10], and our DR estimator is that our CV uses a soft expected-treatment variable $\hat{\alpha}_{k_i(d)}$. This difference is necessary because relevances $R_d$ cannot be observed directly and have to be inferred from click signals $c_i(d)$. Thus, while a standard CV would use the observed reward signal, our CV infers the relevance from the observed click. We call $\hat{\alpha}_{k_i(d)}$ a soft expected-treatment variable because it can be seen as the expected effect that relevance had on the click probability. To the best of our knowledge, our DR estimator is the first to use such a soft-treatment variable.
Moreover, in contrast with the existing methods described in Sections 4 and 5.3 that use propensities based on the mismatch between $\pi_0$ and $\pi$ [20,36,54], our DR estimator uses the correlation between clicks and relevance to correct for position-bias. It is thus also applicable when the logging policy is deterministic, and inherits the advantages that the LTR IPS estimator has over generic IPS estimation. We thus argue that our DR estimator is the first that is designed to directly correct for position-bias, and therefore provides a very significant contribution to the unbiased LTR field.

Theoretical Properties of the Novel Doubly-Robust Estimator
Sections 7 and 8 experimentally investigate the performance improvements our contribution brings, whereas Appendix C proves several theoretical advantages DR has over both IPS and DM in terms of bias and variance. We summarize our main theoretical findings in the remainder of this section.
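Theorem C.1 shows that our DR has the following bias:
$$\mathbb{E}_{c, y \sim \pi_0}\big[\hat{R}_{DR}(\pi)\big] - R(\pi) = \sum_{d \in D}\Big[\underbrace{\Big(\frac{\hat{\omega}_d}{\rho_d}\mathbb{E}_{y \sim \pi_0}[\alpha_{k(d)}] - \omega_d\Big)R_d + \frac{\hat{\omega}_d}{\rho_d}\mathbb{E}_{y \sim \pi_0}\big[\beta_{k(d)} - \hat{\beta}_{k(d)}\big]}_{\text{error from } \hat{\pi}_0,\ \hat{\alpha},\ \hat{\beta} \text{ and clipping}} + \underbrace{\hat{\omega}_d\Big(1 - \frac{\mathbb{E}_{y \sim \pi_0}[\hat{\alpha}_{k(d)}]}{\rho_d}\Big)}_{\text{error from } \hat{\pi}_0 \text{ and clipping}}\hat{R}_d\Big].$$
Furthermore, Corollary C.3 shows that if the bias parameters are correctly estimated, this can be simplified to:
$$\mathbb{E}_{c, y \sim \pi_0}\big[\hat{R}_{DR}(\pi)\big] - R(\pi) = \sum_{d \in D}\omega_d\underbrace{\Big(1 - \frac{\mathbb{E}_{y \sim \pi_0}[\alpha_{k(d)}]}{\rho_d}\Big)}_{\text{error from } \hat{\pi}_0 \text{ and clipping}}\underbrace{\big(\hat{R}_d - R_d\big)}_{\text{error from regression}}.$$
The multiplication of errors in the bias can be beneficial for more robustness (cf. Eq. 20); however, it only occurs when the bias parameters are correct. From the simplified bias, Theorem C.4 derives the following unbiasedness conditions:
$$\Big(\hat{\alpha} = \alpha \land \hat{\beta} = \beta \land \big(\forall d \in D,\ \underbrace{\rho_d = \mathbb{E}_{y \sim \pi_0}[\alpha_{k(d)}]}_{\hat{\pi}_0 \text{ is correct and clipping has no effect}} \lor\ \hat{R}_d = R_d\big)\Big) \Rightarrow \mathbb{E}_{c, y \sim \pi_0}\big[\hat{R}_{DR}(\pi)\big] = R(\pi). \tag{39}$$
In other words, DR is unbiased when the bias parameters are correctly estimated and, per item, either the logging policy distribution is correctly estimated and clipping has no effect, or the regression estimate is correct. In contrast, remember that IPS needs an accurate $\hat{\pi}_0$ and clipping to have no effect for all items, and DM needs accurate regression estimates for all items. Therefore, DR is unbiased when either IPS or DM is, but can also be unbiased in situations where neither is. Clearly, our DR is more robust than IPS and DM, yet all of the LTR estimators still require accurate bias parameters. This seems inescapable since our reward $R$ is also based on user behavior, i.e. due to its $\omega$ weights, accurate $\hat{\alpha}$ and $\hat{\beta}$ estimates are needed.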
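The product structure of the simplified bias can be checked numerically. The sketch below is an illustration under stated assumptions rather than a restatement of the proofs in Appendix C: it assumes the affine click model $P(C = 1 \mid d, k) = \alpha_k R_d + \beta_k$, correct bias parameters, and a DR estimate of the form "DM plus $(\hat{\omega}_d / \rho_d)$-weighted click residuals". It computes the exact expectation of that estimate and compares it with $\sum_d \omega_d (1 - \mathbb{E}[\alpha_{k(d)}] / \rho_d)(\hat{R}_d - R_d)$. All names and numbers are hypothetical.

```python
def expected_alpha(log_rank_probs, alpha):
    """E[alpha_k(d)] under the logging policy, per item."""
    return {
        d: sum(p * alpha[k] for k, p in enumerate(probs))
        for d, probs in log_rank_probs.items()
    }


def expected_dr(omega, rho, alpha, beta, r_true, r_hat, log_rank_probs):
    """Exact expectation of the DR estimate when alpha and beta are correct."""
    total = 0.0
    for d, probs in log_rank_probs.items():
        # Expected (observed click prob. - predicted click prob.) over ranks.
        residual = sum(
            p * ((alpha[k] * r_true[d] + beta[k]) - (alpha[k] * r_hat[d] + beta[k]))
            for k, p in enumerate(probs)
        )
        total += omega[d] * r_hat[d] + omega[d] / rho[d] * residual
    return total


def simplified_bias(omega, rho, exp_alpha, r_true, r_hat):
    """Product of the propensity/clipping error and the regression error."""
    return sum(
        omega[d] * (1.0 - exp_alpha[d] / rho[d]) * (r_hat[d] - r_true[d])
        for d in omega
    )


alpha, beta = [0.9, 0.5], [0.05, 0.02]
log_rank_probs = {"a": [0.8, 0.2], "b": [0.2, 0.5], "c": [0.0, 0.3]}
omega = {"a": 0.8, "b": 0.6, "c": 0.1}
r_true = {"a": 0.75, "b": 0.25, "c": 0.5}
r_hat = {"a": 0.6, "b": 0.5, "c": 0.1}

exp_alpha = expected_alpha(log_rank_probs, alpha)
tau = 0.3  # toy clipping threshold; only item "c" is clipped
rho_clipped = {d: max(a, tau) for d, a in exp_alpha.items()}
true_reward = sum(omega[d] * r_true[d] for d in omega)
bias = expected_dr(omega, rho_clipped, alpha, beta, r_true, r_hat, log_rank_probs) - true_reward
```

In this toy setting only the clipped item contributes to the bias, and the bias vanishes either when clipping is removed or when the regression estimates equal the true relevances, mirroring the per-item "either/or" condition stated above.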
In addition to the better unbiasedness conditions, Theorem C.5 proves that the bias of our DR estimator is less than or equal to that of IPS when $\hat{\alpha}$ and $\hat{\beta}$ are accurate and each $\hat{R}_d$ estimate is less than twice the true $R_d$ value (Eq. 40). We see that our DR estimator is able to reduce bias with somewhat accurate regression estimates. In particular, it appears that it mitigates some of the bias introduced to IPS by clipping. Overall, it appears that our DR estimator has better unbiasedness criteria than IPS or DM and has lower bias than IPS given adequate regression estimates.
Besides bias, we should also consider the variance of our DR estimator. From Eq. 36 it follows that (cf. Eq. 22):
$$\mathbb{V}\big[\hat{R}_{DR}(\pi)\big] = \mathbb{V}\big[\hat{R}_{IPS}(\pi)\big] + \mathbb{V}\big[\hat{R}_{CV}(\pi)\big] - 2\,\mathbb{C}\text{ov}\big(\hat{R}_{IPS}(\pi), \hat{R}_{CV}(\pi)\big).$$
Thus, a large covariance between IPS and CV allows for a reduction in the variance of our DR estimator. To better understand when this may be the case, Theorem C.6 proves a condition under which the variance of DR is less than or equal to that of IPS.
We note that this is the same condition as in Eq. 40: correct $\hat{\alpha}$ and $\hat{\beta}$ estimates and regression estimates that are somewhat correct. Interestingly, this shows that under this condition DR can improve over IPS both in terms of bias and variance. In contrast, while the practice of clipping reduces variance but introduces bias [18,42], it appears that under certain conditions DR can avoid this tradeoff altogether.
Finally, we note that there is also an important exception: our DR estimator is equivalent to IPS when all $\hat{\alpha}_{k_i(d)}$ are equal to their corresponding $\rho_d$. There are only two non-trivial situations where this equivalence can occur: (i) when the logging policy $\pi_0$ is deterministic and clipping has no effect: $\forall d \in D,\ \rho_d \geq \tau$; and (ii) when all regression estimates are zero: $\forall d \in D,\ \hat{R}_d = 0$. In all other scenarios, our DR estimator does not reduce to IPS estimation. This means that even when $\pi_0$ is deterministic, DR can have benefits over IPS when clipping is applied.
To summarize, we have introduced a novel DR estimator that is specifically designed for the LTR problem. In terms of bias and variance, our DR estimator is more robust than both the IPS and the DM estimators: when either IPS or DM is unbiased, the DR estimator is also unbiased, and in addition, there exist cases where DR is unbiased and neither IPS nor DM is. Moreover, when the bias parameters are accurate and all regression estimates are between zero and twice the true preferences, we can prove that both the bias and variance of DR are less than or equal to those of IPS.
In terms of theory, our novel DR estimator is a breakthrough for the unbiased LTR field: it is the first unbiased LTR method that uses DR estimation to directly correct for position-bias. Importantly, this makes it provably more robust than IPS estimation in terms of both bias and variance. Our DR estimator is applicable in any unbiased LTR setting where IPS can be applied and with any regression estimates, allowing for widespread adoption across the entire field.

Applying LTR to Doubly-Robust Ranking Metric Estimates
It might not be directly obvious how LTR can be performed with the DR estimator, yet it is actually very straightforward. To begin, we consider the common approaches for LTR when relevances are known: bounding [6,51] and sample-based approximation [25,45,52]. Bounding has a long tradition in LTR for optimizing deterministic models [6]; Wang et al. [51] introduced the LambdaLoss method and proved that it can bound ranking metrics. Let $\mathbf{R}$ be a vector of all true item relevances, then: $\text{LambdaLoss}(\pi, \mathbf{R}) \leq R(\pi)$. For probabilistic policies, the policy gradient can be approximated based on sampled rankings [52]; recently Oosterhuis [25] proposed the PL-Rank method: $\text{PL-Rank}(\pi, \mathbf{R}) \approx \nabla_{\pi} R(\pi)$. To apply LTR methods like these to estimated metrics, we follow Oosterhuis and de Rijke [28] and reformulate $\hat{R}_{DR}(\pi)$ as a sum over items with expected-rank weights $\hat{\omega}_d$ and relevance estimates $\hat{\mu}_d$:
$$\hat{R}_{DR}(\pi) = \sum_{d \in D}\hat{\omega}_d\hat{\mu}_d, \qquad \hat{\mu}_d = \hat{R}_d + \frac{1}{N\rho_d}\sum_{i=1}^{N}\big(c_i(d) - \hat{\alpha}_{k_i(d)}\hat{R}_d - \hat{\beta}_{k_i(d)}\big).$$
Let $\hat{\boldsymbol{\mu}}$ indicate a vector of all relevance estimates. Oosterhuis and de Rijke [28] prove that LambdaLoss can be used as a bound on the estimated metric: $\text{LambdaLoss}(\pi, \hat{\boldsymbol{\mu}}) \leq \hat{R}_{DR}(\pi)$. Similarly, the derivation of Oosterhuis [25] is equally applicable to $\hat{R}_{DR}(\pi)$ and can thus approximate the policy gradient: $\text{PL-Rank}(\pi, \hat{\boldsymbol{\mu}}) \approx \nabla_{\pi}\hat{R}_{DR}(\pi)$. As such, existing LTR methods are straightforwardly applied to our DR estimator in order to optimize ranking models w.r.t. unbiased click-based estimates of performance.
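The per-item reformulation can be made concrete with a short sketch. It assumes that each item's relevance estimate is its regression estimate plus its $1/(N\rho_d)$-weighted click residuals, which is what pulling $\hat{\omega}_d$ out of a "DM plus weighted residuals" DR estimate gives. Because these per-item values do not depend on the policy being optimized, they can be computed once from the log and then passed to an existing LTR method in place of relevance labels. The function name, the log format and the numbers are hypothetical.

```python
def dr_relevance_estimates(logs, rho, alpha_hat, beta_hat, r_hat):
    """Per-item DR relevance estimates: regression estimate plus click residuals."""
    n = len(logs)
    mu_hat = dict(r_hat)  # never-displayed items keep their regression estimate
    for ranks, clicked in logs:
        for d, k in ranks.items():
            click = 1.0 if d in clicked else 0.0
            residual = click - alpha_hat[k] * r_hat[d] - beta_hat[k]
            mu_hat[d] += residual / (n * rho[d])
    return mu_hat


alpha_hat, beta_hat = [0.9, 0.5], [0.05, 0.02]
omega_hat = {"a": 0.8, "b": 0.6, "c": 0.1, "e": 0.05}
rho = {"a": 0.7, "b": 0.5, "c": 0.2, "e": 0.1}
r_hat = {"a": 0.8, "b": 0.4, "c": 0.5, "e": 0.3}
logs = [
    ({"a": 0, "b": 1}, {"a"}),
    ({"b": 0, "a": 1}, set()),
    ({"a": 0, "c": 1}, {"c"}),
]

mu_hat = dr_relevance_estimates(logs, rho, alpha_hat, beta_hat, r_hat)
dr_value = sum(omega_hat[d] * mu_hat[d] for d in omega_hat)
```

The resulting values are not probabilities: a rarely displayed but clicked item (here "c") can receive an estimate above one, which is harmless for metric estimation but worth keeping in mind when reusing code that expects labels in a fixed range.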

Novel Cross-Entropy Loss Estimation
While our DR estimator can be applied with any regression model, we propose a novel estimator for the cross-entropy loss to optimize an accurate regression model. There are two issues with the existing $\hat{\mathcal{L}}'$ estimator (Eq. 30) we wish to avoid: (i) $\hat{\mathcal{L}}'$ does not correct for trust-bias, and (ii) for any never-displayed item $d$ the $\hat{\mathcal{L}}'$ estimate contains the $\log(1 - \hat{R}_d)$ loss that pushes $\hat{R}_d$ towards zero. In other words, $\hat{\mathcal{L}}'$ penalizes positive $\hat{R}_d$ values for items that were never displayed during logging, while it seems more intuitive that a loss estimate should be indifferent to the $\hat{R}_d$ values of never-displayed items. We propose the following estimator:
$$\hat{\mathcal{L}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{d \in D}\Big(\frac{c_i(d) - \hat{\beta}_{k_i(d)}}{\rho_d}\log(\hat{R}_d) + \frac{\hat{\alpha}_{k_i(d)} + \hat{\beta}_{k_i(d)} - c_i(d)}{\rho_d}\log(1 - \hat{R}_d)\Big). \tag{45}$$
Our novel estimator has $\hat{\beta}$ corrections to deal with trust-bias and utilizes the $\hat{\alpha}_{k_i(d)}$ to weight the negative part of the loss: $\log(1 - \hat{R}_d)$. One possible interpretation is that $\hat{\alpha}_{k_i(d)}$ replaces the 1 in Eq. 30 using the fact that $\mathbb{E}_{y \sim \pi_0}[\hat{\alpha}_{k(d)} / \rho_d] = 1$. Another interpretation is that the second weight looks at the difference between the expected click probability if the item was maximally relevant ($R_d = 1$) and the observed click frequency; the expected difference reveals how much relevance the item lacks.
Regardless of interpretation, the important property is that $\mathbb{E}_{c, y \sim \pi_0}\big[\frac{1}{\rho_d}\big(\hat{\alpha}_{k_i(d)} + \hat{\beta}_{k_i(d)} - c_i(d)\big)\big] = (1 - R_d)$. Furthermore, when an item $d$ is never displayed, the corresponding $\hat{R}_d$ does not affect the estimate, since in that case its clicks and its $\hat{\alpha}_{k_i(d)}$ and $\hat{\beta}_{k_i(d)}$ values are zero for every logged ranking, so both of its loss weights vanish. The estimator is unbiased when the bias parameters are accurate, $\hat{\pi}_0$ is correctly estimated and clipping has no effect. These are the same conditions as those we proved for the IPS estimator (Eq. 27): the bias parameters and the logging policy need to be accurately estimated and clipping should have no effect. Thus our novel cross-entropy loss estimator can correct for position-bias, even when trust-bias is present, and is indifferent to predictions on never-displayed items.
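A minimal sketch of such a loss estimate is given below. It is reconstructed from the description above and should be read as an assumption-laden illustration, not as the reference implementation: the positive part of the loss is weighted by $(c_i(d) - \hat{\beta}_{k_i(d)}) / \rho_d$ and the negative part by $(\hat{\alpha}_{k_i(d)} + \hat{\beta}_{k_i(d)} - c_i(d)) / \rho_d$. The log format, function name and numbers are hypothetical.

```python
import math


def ce_loss_estimate(logs, rho, alpha_hat, beta_hat, r_hat):
    """Click-based cross-entropy estimate with position- and trust-bias weights."""
    pos, neg = {}, {}
    for ranks, clicked in logs:
        for d, k in ranks.items():  # only displayed items accumulate weight
            click = 1.0 if d in clicked else 0.0
            pos[d] = pos.get(d, 0.0) + (click - beta_hat[k]) / rho[d]
            neg[d] = neg.get(d, 0.0) + (alpha_hat[k] + beta_hat[k] - click) / rho[d]
    total = sum(
        pos[d] * math.log(r_hat[d]) + neg[d] * math.log(1.0 - r_hat[d])
        for d in pos
    )
    return -total / len(logs)


alpha_hat, beta_hat = [0.9, 0.5], [0.05, 0.02]
rho = {"a": 0.7, "b": 0.5, "c": 0.2, "e": 0.1}
r_hat = {"a": 0.8, "b": 0.4, "c": 0.5, "e": 0.3}
logs = [
    ({"a": 0, "b": 1}, {"a"}),
    ({"b": 0, "a": 1}, set()),
    ({"a": 0, "c": 1}, {"c"}),
]

loss = ce_loss_estimate(logs, rho, alpha_hat, beta_hat, r_hat)
```

Since the never-displayed item "e" accumulates no weight, its regression estimate can be changed freely without affecting the loss, which is the indifference property motivated above. The sketch requires every $\hat{R}_d$ of a displayed item to lie strictly between zero and one.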

EXPERIMENTAL SETUP
In order to evaluate our novel DR estimator, we apply the semi-synthetic setup that is common in unbiased LTR [12,18,26,27,31,47,57]. This simulates a web-search scenario by sampling queries and documents from commercial search datasets, while user interactions and rankings are simulated using probabilistic click models. We use the three largest publicly available LTR industry datasets: Yahoo! Webscope [7], MSLR-WEB30k [32] and Istella [9]. Each dataset contains queries, preselected documents per query and, for the query-document pairs, feature representations and labels indicating expert-judged relevance. With $\text{label}(d) \in \{0, 1, 2, 3, 4\}$, we use $P(R = 1 \mid d) = 0.25 \cdot \text{label}(d)$. The queries in the datasets are divided into training, validation and test partitions. Our logging policy is obtained by supervised training on 1% of the training partition [18]. At each interaction $i$, a query is sampled uniformly over the training and validation partitions and a corresponding ranking is sampled from the logging policy. Clicks are simulated using the click model in Eq. 1.
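The click simulation can be sketched as follows. The label-to-relevance mapping is taken from the text; the click model itself is an assumption here, namely the affine form $P(C = 1 \mid d, k) = \alpha_k P(R = 1 \mid d) + \beta_k$ that matches the predicted click probability used by the estimators in Section 6. The function names, the toy labels and the bias parameters are hypothetical.

```python
import random


def relevance_probability(label):
    """P(R = 1 | d) for an expert label in {0, 1, 2, 3, 4}."""
    return 0.25 * label


def simulate_clicks(ranking, labels, alpha, beta, rng):
    """Sample clicks on the displayed prefix of a ranking.

    Assumes the affine click model P(C=1 | d, k) = alpha[k] * P(R=1 | d) + beta[k];
    items beyond the display cutoff len(alpha) cannot be clicked.
    """
    clicked = set()
    for k, d in enumerate(ranking[: len(alpha)]):
        p_click = alpha[k] * relevance_probability(labels[d]) + beta[k]
        if rng.random() < p_click:
            clicked.add(d)
    return clicked


rng = random.Random(0)
labels = {"x": 4, "y": 0, "z": 2}
# Degenerate toy parameters: full examination, no trust-bias, top-2 cutoff.
clicks = simulate_clicks(["x", "y", "z"], labels, alpha=[1.0, 1.0], beta=[0.0, 0.0], rng=rng)
```

With these degenerate parameters the outcome is deterministic: the maximally relevant item at rank one is always clicked, the irrelevant item at rank two never is, and the third item is never displayed.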
We simulate both a top-5 setting, where only five items can be displayed at once, and a full-ranking setting, where all items are displayed simultaneously. The bias parameters for the full-ranking setting were chosen because they closely match the top-5 parameters while being applicable to longer rankings. We simulate top-5 settings both where $\alpha$ and $\beta$ are known and where they are estimated with expectation-maximization (EM) [2,47]. All models are neural networks with two 32-unit hidden layers, applied in Plackett-Luce ranking models optimized using policy gradients estimated with PL-Rank-2 [25]. The only exception is the logging policy in the full-ranking setting, which is a deterministic ranker to better match earlier work [1,18,47]. Propensities $\rho_d$ use frequentist estimates $\hat{\pi}_0$ of the logging policy; we clip with $\tau_{\text{top-5}} = 10/\sqrt{N}$ in the top-5 setting and $\tau_{\text{full}} = 100/\sqrt{N}$ in the full-ranking setting. Early stopping is applied using counterfactual estimates based on clicks on the validation set.
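To make the propensity computation concrete, the sketch below estimates each item's expected $\hat{\alpha}$ from the logged rankings and clips it from below by a threshold that shrinks as $1/\sqrt{N}$. It assumes that a propensity is the maximum of the threshold and the item's average $\hat{\alpha}_{k_i(d)}$ over the log, which is one natural reading of "frequentist estimates" and not necessarily the exact procedure used in the experiments. Function names and toy numbers are hypothetical.

```python
import math


def clipping_threshold(num_rankings, scale):
    """tau = scale / sqrt(N); the text uses scale 10 (top-5) and 100 (full ranking)."""
    return scale / math.sqrt(num_rankings)


def frequentist_expected_alpha(logs, alpha_hat, items):
    """Average alpha_hat at the ranks where each item was displayed in the log."""
    totals = {d: 0.0 for d in items}
    for ranks in logs:
        for d, k in ranks.items():
            totals[d] += alpha_hat[k]
    return {d: t / len(logs) for d, t in totals.items()}


def clipped_propensities(expected_alpha, tau):
    """Clip propensities from below so that the IPS weights stay bounded."""
    return {d: max(a, tau) for d, a in expected_alpha.items()}


alpha_hat = [0.9, 0.5]
logs = [{"a": 0, "b": 1}, {"b": 0, "a": 1}, {"a": 0, "c": 1}, {"a": 0, "b": 1}]
exp_alpha = frequentist_expected_alpha(logs, alpha_hat, items=["a", "b", "c", "e"])
tau = clipping_threshold(len(logs), scale=0.2)  # toy scale so that tau = 0.1
rho = clipped_propensities(exp_alpha, tau)
```

Because the threshold decreases with $N$, clipping eventually stops affecting any item that has a non-zero display probability; for instance, with the top-5 scale and $N = 10^8$ the threshold is $10^{-3}$.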
Our main performance metric is the expected number of clicks on preferred items (ECP), as introduced in Section 3.2. In addition, Appendix E also provides our main results measured with the NDCG metric.
Our results evaluate our DM estimator (Eq. 32) and our DR estimator (Eq. 36), both using the estimates of a regression model optimized by our $\hat{\mathcal{L}}$ loss (Eq. 45). Their performance is compared with the following baselines: (i) a naive estimator that ignores bias (Eq. 25 with $\rho = 1$); (ii) IPS (Eq. 25); (iii) ratio-propensity-scoring (RPS) [48]; and (iv) DM optimized with $\hat{\mathcal{L}}'$ (Eq. 30 & 32) from previous work [5,38]. None of the estimators receive any information about queries that were not sampled in the training data. To compare the differences with the optimal performance possible, we also optimize a model based on the true labels (full-information).
As an example for clarity, the following procedure is used to evaluate the performance of our DR estimator at $N$ displayed rankings in the top-5 setting with estimated bias parameters:
(1) $N$ queries are sampled with replacement from the training and validation partitions; a displayed ranking is generated for each sampled query using the stochastic logging policy.
(2) Clicks on each ranking are simulated using the click model in Eq. 1 and the true $\alpha$ and $\beta$ parameters.
(3) EM is applied to the simulated click data to obtain estimated $\hat{\alpha}$ and $\hat{\beta}$ parameters.
(4) A regression model is optimized using $\hat{\mathcal{L}}$, $\hat{\alpha}$, $\hat{\beta}$ and the click data simulated on the training set; $\hat{R}_d$ is computed for each item.
(5) A ranking model is optimized to maximize the DR estimate of its ECP, using $\hat{\alpha}$, $\hat{\beta}$, $\hat{R}_d$ and the training click data; early stopping criteria are estimated with the validation click data.
(6) Finally, the true ECP ($R(\pi)$, Eq. 7) of the resulting ranking model is computed on the test set and added to our results.
We repeat each procedure twenty times independently and report the mean results in addition to standard deviations and 90% confidence intervals. Statistical differences with the performance of our DR estimator were measured via a two-sided Student's t-test [43].

RESULTS
Our main experimental results are displayed in Figure 1 and Table 1. Both display the performance reached, in terms of the ECP metric (Eq. 7), for different estimators on varying amounts of simulated interaction data; and both are split in three rows, each indicating the results of one of the three simulated settings. The displayed results are means over twenty independent runs; 90% confidence intervals are visualized in Figure 1 so that meaningful differences can be recognized. Furthermore, Table 1 displays standard deviations and statistically significant performance differences with DR using a two-sided Student's t-test [43].

Performance of Inverse Propensity Scoring
To begin our analysis, we consider the performance of IPS. In Figure 1, we see that in both top-5 settings IPS is unable to reach optimal ECP when $N \leq 10^9$ on any of the datasets, an observation also made in previous work [29]. In the top-5 setting with known bias parameters, IPS is theoretically proven to be unbiased and will converge at optimal ECP as $N \to \infty$. Consequently, we can conclude that it is high variance which prevents us from observing IPS's convergence in the top row of Figure 1. This observation illustrates the importance of variance reduction: it is not bias but high variance that prevents IPS from reaching optimal performance with feasible amounts of interaction data. In contrast with the two top rows of Figure 1, the bottom row shows that IPS can reach optimal ECP on all datasets in the full-ranking setting. A plausible explanation is that interactions on a complete ranking provide much more information than when only the top-5 can be interacted with. Possibly, the item-selection-bias in the top-5 settings greatly increases the variance of IPS due to the winner-takes-all behavior described in Section 6.1. Overall, we see that while IPS can approximate optimal ECP in the full-ranking setting with reasonable amounts of data, its variance prevents it from reaching good ECP in top-5 settings even when given an enormous number of interactions.

Performance of Novel Direct Method and Doubly-Robust Estimators
Next we consider whether our DR estimator provides an improvement over the performance of IPS. Figure 1 reveals that this is clearly the case: in all settings it outperforms IPS when $N \approx 10^4$, and only in the full-ranking setting does IPS catch up around $N \approx 10^7$. Moreover, DR always has a higher or comparable mean ECP than IPS when $N \geq 10^4$, across all datasets and settings. Table 1 does not report a single instance of IPS significantly outperforming DR, but many instances where DR significantly outperforms IPS. In all of the top-5 settings, the ECP of DR when $N = 10^6$ is not reached by IPS when $N = 10^9$, regardless of whether bias parameters are known or estimated. Therefore, DR appears to provide an increase in data-efficiency over IPS of a factor greater than 1,000 in all of the top-5 settings. We thus confidently conclude that DR provides significantly and considerably higher performance than state-of-the-art IPS, given that $N \geq 10^4$, which in top-5 settings leads to an enormous increase in data-efficiency.
Subsequently, we compare the performance of our DM estimator with IPS. In Figure 1, we see that in some cases DM has substantially better ECP than IPS, particularly in the top-5 settings on Yahoo! and MSLR. Yet, we also see that, in all settings on the Istella dataset, the ECP differences between DM and IPS are much smaller; and in the full-ranking setting on Yahoo!, DM appears to converge at noticeably worse ECP than IPS. These results are quite surprising, as they show that this simple but previously unconsidered approach is actually a competitive baseline to IPS. We conclude that DM appears preferable over IPS in top-5 settings where not all items can be displayed at once, but not necessarily in full-ranking settings.
Lastly, our comparison considers both our novel DM and DR estimators. In Figure 1, we see that overall DM has lower ECP than DR, but in some cases it has comparable or not significantly different ECP. Table 1 reveals that in both the top-5 settings and the full-ranking setting on Yahoo!, DR has a significantly higher ECP than DM, even though these differences are smaller than those with IPS. This result confirms that the DR approach can indeed effectively use click data to correct for the regression mistakes of DM. It appears that this is especially the case on the Istella dataset, where in both top-5 settings there is a considerable performance difference between our DR and DM estimators; here Figure 1 shows that the ECP reached by our DM when $N = 10^9$ is reached by our DR when $N \approx 10^7$. Notably, DM appears to converge on suboptimal ECP in the full-ranking setting on Yahoo! and in both top-5 settings on MSLR, indicating that its unbiasedness criteria (Eq. 33) are not met in these situations. In stark contrast, our DR estimator reaches near-optimal ECP in all tested scenarios, which corresponds with its much more robust unbiasedness criteria. Overall, our results indicate that DR in the majority of cases significantly outperforms DM.
Our observations seem to confirm several of our expectations from our theoretical analysis: the inability of IPS to reach optimal ECP when $N = 10^9$ in top-5 settings confirms that variance is its biggest obstacle. The large increase of DM over IPS seems to indicate that the usage of regression estimates provides a large reduction in variance. The cases of suboptimal convergence of DM show that its impractical unbiasedness criteria are infeasible in some of our experimental settings. Finally, in all settings and datasets, our DR estimator has significantly better or comparable ECP to DM and IPS, while always converging near optimal ECP. This observation shows that DR effectively combines the variance reduction of DM with the more feasible unbiasedness criteria of IPS, and clearly provides the most robust and highest performance across our tested settings and datasets.
Despite the large aforementioned advantages, we note that a downside of our DM and DR estimators is that they provide low ECP in some settings when $N \leq 10^4$. It appears that our early-stopping strategy, which is very effective for IPS, is not able to handle incorrect regression estimates very well. Future work could investigate whether this could be remedied with safe deployment strategies [14,30] that prevent deploying models with uncertain performance. However, it seems doubtful to us that in a real-world setting so little data is available that $N \leq 10^4$. Nonetheless, our results show that DM and DR are less resilient to tiny amounts of training data than IPS.

Comparison with other Baselines
Finally, our comparison also includes other baseline methods: the naive estimator, RPS and DM from previous work. We note that all of these methods are biased in our setting: the naive estimator explicitly ignores position-bias, RPS trades bias for less variance and the DM from previous work ignores trust-bias. Unsurprisingly, Figure 1 and Table 1 show that they are all unable to converge at optimal ECP in any of the settings. The effect of trust-bias appears particularly large in the full-ranking setting, where none of these baselines are able to substantially improve ECP over the logging policy. The ECP of RPS appears very sensitive to propensity clipping: when $N \approx 10^9$ and clipping no longer has an effect, its performance completely drops. Nevertheless, these baselines show that often a decrease in variance can be favourable over unbiasedness, as some of them provide higher ECP than the unbiased IPS estimator in the top-5 settings on Yahoo! and MSLR, especially when $N$ is small. Regardless, due to their bias, they are unable to combine optimal convergence with low variance. It appears that our DR estimator is the only method that effectively combines these properties.

Correcting for Bias Introduced by the Clipping Strategy
As discussed in Section 6.3, one of the advantages of DR estimation is that, in contrast with IPS, it can potentially correct for some of the bias introduced by clipping. To experimentally verify whether these corrections can lead to observable advantages in practice, we ran additional experiments with varying clipping strategies applied to the IPS, DM and DR estimators in the top-5 setting with known bias parameters.
Figure 2 shows the learning curves of these estimators with our standard clipping strategy: $\tau_{1x} = 10/\sqrt{N}$ (cf. Eq. 24), a strategy with 1000 times less clipping: $\tau_{0.001x} = 10^{-2}/\sqrt{N}$, and a heavy clipping strategy: $\tau_{1000x} = 10^{4}/\sqrt{N}$. In addition, the effect of individual threshold values is visualized in Figure 3, where ECP with $N = 10^8$ is displayed for values of the clipping threshold $\tau$ ranging from $10^{-6}$ to 1.
Clearly, IPS is the most sensitive to the clipping threshold, as its ECP drops dramatically when heavy clipping is applied. In contrast, while there is a noticeable effect from varying $\tau$ on the DM and DR estimators, the differences between light, standard and heavy clipping are relatively small on the Yahoo! and MSLR datasets. On the Istella dataset, there is a larger decrease in ECP for DM and DR with heavy clipping, but it is still much smaller than that of IPS. Importantly, we see that, regardless of what clipping is applied, DR always has a higher ECP than DM and IPS. This indicates that the performance advantage of DR over DM remains stable w.r.t. the clipping strategy, whereas the differences with IPS become especially large under heavy clipping. Therefore, we conclude that DM and DR are less sensitive to propensity clipping than IPS and can better correct for the bias introduced by clipping strategies. Where the performance of IPS varies considerably for different clipping strategies, DM and DR are only affected by very heavy clipping. We can thus infer that the use of regression by DM and DR makes them more robust to propensity clipping. Moreover, the advantage of DR over both DM and IPS is consistent across all our tested clipping strategies, indicating it is the optimal choice regardless of what clipping strategy is applied.
Fig. 3. The effect of the clipping parameter $\tau$ (Eq. 24) on the ECP (Eq. 7) of three estimators in the top-5 known-bias setting when the number of impressions $N = 10^8$. Results are means over 20 independent runs, shaded areas indicate the 90% confidence intervals; y-axis: policy performance in terms of ECP (Eq. 7) on the held-out test set; x-axis: the clipping threshold $\tau$.

Robustness to Incorrect Bias Specification
The main results presented in Figure 1 and Table 1 reveal that there is very little difference in performance between the top-5 setting where the bias parameters are known and the one where they have to be estimated. While this shows that good performance is maintained when bias has to be estimated, it is unclear whether this also means that the estimators are robust to misspecified bias, since it is possible that the estimated bias parameters are actually quite accurate. Furthermore, most of our theoretical results assume that bias is correctly estimated; it is thus valuable to empirically verify whether the advantages of DR remain when its bias parameters are incorrect.
To better understand how robust the DR estimator is to bias misspecification, the ECP of the DR, DM and IPS estimators was measured in the top-5 setting with intentionally misspecified bias parameters. For the incorrect bias parameters, we choose the mean values across positions: $\hat{\alpha} = \sum_{k=1}^{5} \alpha_k / 5$ and $\hat{\beta} = \sum_{k=1}^{5} \beta_k / 5$. These mean values represent a naive approach that ignores the effect of the position on the examination and trust of users, i.e. it assumes that any document that is displayed in the top-5 is treated equally by the user, regardless of its exact position.
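As a small worked example of this misspecification, the sketch below replaces per-rank bias parameters by their means across the five positions. The per-rank values are made up for illustration and are not the parameters used in the experiments; the function name is likewise hypothetical.

```python
def position_agnostic_bias(alpha, beta):
    """Replace per-rank bias parameters by their mean over all displayed ranks."""
    k = len(alpha)
    mean_alpha = sum(alpha) / k
    mean_beta = sum(beta) / k
    return [mean_alpha] * k, [mean_beta] * k


# Hypothetical top-5 parameters that decay with rank.
alpha = [1.0, 0.8, 0.6, 0.4, 0.2]
beta = [0.10, 0.08, 0.06, 0.04, 0.02]
alpha_mis, beta_mis = position_agnostic_bias(alpha, beta)
```

Every rank now receives the same examination and trust parameters, so an estimator using them can no longer distinguish a click at rank one from a click at rank five.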
Figure 4 displays the learning curves with these incorrect bias parameters.Clearly, the ECP reached with all three estimators drops dramatically when the bias is heavily misspecified.While IPS and DM converge on similar performance, DR provides noticeably higher ECP when  ≥ 10 5 on all three datasets.This strongly indicates that DR is more robust to heavily misspecified bias than DM and IPS.
In line with our previous observations, Figure 5 reveals IPS to have the lowest ECP, regardless of bias misspecification.Interestingly, the differences between DR and DM vary: on Istella, DR considerably outperforms DM, but on Yahoo! and MSLR, the difference is only clear when  < 0.1 and  > 0.9.When the interpolation is more in between the extreme values, DM and DR have comparable ECP where sometimes DM has slightly higher performance.As a result, we cannot conclude whether DR better deals with bias misspecification than DM.Nevertheless, the differences between DR and DM are relatively small, thus the choice does not seem very consequential.Conversely, our results clearly indicate that IPS provides worse ECP than DR and DM whether bias is misspecified or not.
In summary, our results show that DR estimation is much more robust to bias misspecification than IPS. Moreover, it appears to outperform DM under heavy or light misspecification, but results are mixed when the misspecification is moderate. Overall, our results indicate that the advantages of DR over IPS and DM are mostly still applicable when bias is incorrectly estimated or misspecified.

CONCLUSION
This paper has introduced the first unbiased DR estimator that is specifically designed to correct for position-bias in click feedback. Our estimator differs from existing DR estimators by using the expected correlation between clicks and preference per rank, instead of the unobservable examination variable or corrections solely based on action probabilities. Additionally, we proposed a novel DM estimator and a novel cross-entropy loss estimator. In terms of theory, this work has contributed the most robust estimator for LTR yet: our DR estimator is the only method that corrects for position-bias, trust-bias and item-selection bias and has less strict unbiasedness criteria than the prevalent IPS approach. Moreover, our experimental results show that it can provide enormous increases in data-efficiency compared to IPS and better overall performance w.r.t. other existing state-of-the-art approaches. Therefore, both our theoretical and empirical results indicate that our DR estimator is the most reliable and effective way to correct for position-bias. Consequently, we think there is large potential in replacing IPS with DR as the new basis for the unbiased LTR field.
We hope that future work finds similar gains in related tasks, e.g., exposure-based ranking fairness [40] or ranking display advertisements [22]. Overall, we expect the improvements in efficiency and robustness to make unbiased LTR even more attractive for real-world applications.

Code, Resources and Data
To facilitate the reproducibility of the reported results, this work only made use of publicly available data and our experimental implementation is publicly available at https://github.com/HarrieO/2022doubly-robust-LTR. Additionally, a video presentation with accompanying slides is available at https://harrieo.github.io//publication/2023-doubly-robust.
All content represents the opinion of the author, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

APPENDICES

A BIAS AND VARIANCE OF IPS
Theorem A.1. The IPS estimator (Eq. 25) has the following bias:

Proof. Using Eq. 1, 7, 23 and 25 we get the following derivation: = R () + ∑︁  ∈ ω ρ

By the definitions of $\omega$ (Eq. 6) and $\hat{\omega}$ (Eq. 24):

Lemma A.3. By the definitions of $\rho$ (Eq. 23) and $\hat{\rho}$ (Eq. 24):

Lemma A.4. Trivially, if the $\hat{\beta}$ bias parameters are correct then:

Corollary A.5. When $\hat{\alpha}$ and $\hat{\beta}$ are correct IPS has the bias:

Proof. Follows from Theorem A.1 and Lemmas A.2 and A.4. □

Theorem A.6. The IPS estimator (Eq. 25) is unbiased when $\hat{\alpha}$, $\hat{\beta}$ and $\hat{\pi}_0$ are correctly estimated and clipping has no effect:

Proof. Follows from applying Lemma A.3 to Corollary A.5. □

Theorem A.7. The IPS estimator (Eq. 25) has the variance:

Proof. Follows from Eq. 1 and 25.

Proof. Follows directly from Eq. 34. □

Theorem B.2. The CV estimator (Eq. 34) is an unbiased estimate of DM (Eq. 32) if the $\hat{\alpha}$ and $\hat{\beta}$ bias parameters are correctly estimated and, per item $d$, $\hat{\pi}_0(d)$ is correct and clipping has no effect:

Proof. From Lemma B.1 it clearly follows that the expected value of CV is equal to DM (Eq. 32) when $\hat{\rho}_d = \mathbb{E}_{y \sim \pi_0}[\hat{\alpha}_{k(d)}]$. The definition of $\hat{\rho}$ (Eq. 24) shows that this is the case when $\hat{\pi}_0(d)$ is correct and clipping has no effect: Applying Eq. 57 to Lemma B.1 thus proves Theorem B.2. □
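The unbiasedness condition of Theorem A.6 can be checked numerically on a toy problem. The sketch below is not the paper's implementation: it assumes an affine click model in which the click probability at rank k is alpha_k times the item preference plus beta_k, and an IPS-style estimator that sums (omega_d / rho_hat_d) * (click_d - beta_hat_k) over items. All numeric values, the two-ranking logging policy, and the helper names are hypothetical. Instead of sampling clicks, the code computes the exact expectation over the logging policy.

```python
# Per-rank bias parameters and per-item preferences (hypothetical values).
ALPHA = [0.9, 0.6, 0.3]
BETA = [0.05, 0.03, 0.01]
R = [0.8, 0.4, 0.1]       # true preference per item
OMEGA = [1.0, 0.6, 0.3]   # weight of each item under the evaluated policy

# Logging policy: probability of each displayed ranking (tuple of item ids).
LOGGING_POLICY = {(0, 1, 2): 0.6, (1, 2, 0): 0.4}


def rho(policy, alpha, item):
    """Expected alpha of the rank at which `item` is displayed."""
    return sum(p * alpha[ranking.index(item)] for ranking, p in policy.items())


def expected_ips(policy, rho_hat, beta_hat):
    """Exact expectation of sum_d (omega_d / rho_hat_d) * (click_d - beta_hat_k)."""
    total = 0.0
    for item, pref in enumerate(R):
        corrected = 0.0
        for ranking, p in policy.items():
            k = ranking.index(item)
            click_prob = ALPHA[k] * pref + BETA[k]  # assumed affine click model
            corrected += p * (click_prob - beta_hat[k])
        total += OMEGA[item] / rho_hat[item] * corrected
    return total


true_reward = sum(w * r for w, r in zip(OMEGA, R))
rho_true = [rho(LOGGING_POLICY, ALPHA, d) for d in range(len(R))]

# Correct parameters and no clipping: the expectation equals the true reward.
ips_correct = expected_ips(LOGGING_POLICY, rho_true, BETA)

# Clipping the propensities from below at tau alters rho_hat and adds bias.
tau = 0.5
rho_clipped = [max(value, tau) for value in rho_true]
ips_clipped = expected_ips(LOGGING_POLICY, rho_clipped, BETA)
```

In this toy setup only the third item has a propensity below `tau`, so clipping shrinks its contribution and the expected estimate falls slightly below the true reward, which illustrates why the theorem requires that clipping has no effect.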

C BIAS AND VARIANCE OF DR ESTIMATOR
Theorem C.1. The DR estimator (Eq. 36) has the following bias:

Proof. Using Eq. 7, 32 and 36 we make the following derivation:

By the definition of $\rho$ (Eq. 23):

Corollary C.3. The bias of the DR estimator (Eq. 36) can be simplified when $\hat{\alpha}$ and $\hat{\beta}$ are correctly estimated:

Proof. Apply Lemmas A.
Proof. From Corollary C.3 it clearly follows that the DR estimator is unbiased when the $\hat{\alpha}$ and $\hat{\beta}$ bias parameters are correct and per item $d$ either $\hat{\rho}_d$ or $\hat{R}_d$ is correct: Applying Lemma A.3 to Eq. 63 provides proof for Theorem C.4. □

Theorem C.5. If $\hat{\alpha}$ and $\hat{\beta}$ are correct and the regression model predicts each preference $\hat{R}_d$ between 0 and twice the true $R_d$ value, then the bias of the DR estimator (Eq. 36) is less than or equal to that of the IPS estimator (Eq. 25):

Proof. This follows from comparing Corollary A.5 with C.3. □

Theorem C.6. The DR estimator (Eq. 36) has the variance:

Proof. This follows from Eq. 1 and 36. □

Lemma C.7. The covariance between clicks on an item  () and   () is:

Proof.

□
Corollary C.8. If $\hat{\alpha}$ and $\hat{\beta}$ are correct then the variance of the DR estimator (Eq. 36) is:

Combining Eq. 76 and 77 directly proves Theorem D.3. □
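The double robustness stated in Theorem C.4 and the bias comparison of Theorem C.5 can be illustrated on a toy problem. The following is a sketch under stated assumptions, not the paper's implementation: it assumes the affine click model (click probability alpha_k times preference plus beta_k), and a DR estimator of the form "DM estimate plus propensity-weighted click residuals", with correct alpha and beta throughout. All numbers, the logging policy, and the function names are hypothetical. Under this assumed form, a regression model that always predicts zero reduces the estimator to the IPS-style estimator, which is used below to obtain the IPS bias.

```python
# Hypothetical toy problem: three items, three ranks, two logged rankings.
ALPHA = [0.9, 0.6, 0.3]
BETA = [0.05, 0.03, 0.01]
R = [0.8, 0.4, 0.1]       # true preference per item
OMEGA = [1.0, 0.6, 0.3]   # weight of each item under the evaluated policy
LOGGING_POLICY = {(0, 1, 2): 0.6, (1, 2, 0): 0.4}


def expected_dr(rho_hat, r_hat):
    """Exact expectation of the DM estimate plus weighted click residuals."""
    dm = sum(w * pred for w, pred in zip(OMEGA, r_hat))
    correction = 0.0
    for item, pref in enumerate(R):
        residual = 0.0
        for ranking, p in LOGGING_POLICY.items():
            k = ranking.index(item)
            observed = ALPHA[k] * pref + BETA[k]          # true click probability
            predicted = ALPHA[k] * r_hat[item] + BETA[k]  # implied by regression
            residual += p * (observed - predicted)
        correction += OMEGA[item] / rho_hat[item] * residual
    return dm + correction


true_reward = sum(w * r for w, r in zip(OMEGA, R))
rho_true = [
    sum(p * ALPHA[ranking.index(d)] for ranking, p in LOGGING_POLICY.items())
    for d in range(len(R))
]
rho_wrong = [max(value, 0.7) for value in rho_true]  # heavily clipped
r_hat_wrong = [0.6, 0.7, 0.15]  # inaccurate, but each within [0, 2 * R_d]

# Either ingredient being correct is enough for an unbiased expectation.
dr_rho_correct = expected_dr(rho_true, r_hat_wrong)
dr_regression_correct = expected_dr(rho_wrong, R)

# Both wrong: DR is biased, but less than IPS (regression fixed at zero).
dr_bias = expected_dr(rho_wrong, r_hat_wrong) - true_reward
ips_bias = expected_dr(rho_wrong, [0.0, 0.0, 0.0]) - true_reward
```

In this setup the per-item DR bias is proportional to the regression error $R_d - \hat{R}_d$, whereas the IPS bias is proportional to $R_d$ itself. The condition that $\hat{R}_d$ lies between 0 and twice $R_d$ is exactly the condition $|R_d - \hat{R}_d| \leq R_d$, which is one way to read why that interval appears in Theorem C.5.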

E MAIN RESULTS EVALUATED WITH DISCOUNTED CUMULATIVE GAIN
The main results presented in Section 8 used ECP (Eq. 7) as the metric of performance. As argued in Section 3.2, we think ECP is the most appropriate metric as it is based on the actual user model in the simulation, i.e., it utilizes the true $\alpha$ and $\beta$ values. Nevertheless, this makes it harder to compare our results with previous work that relies on more traditional metrics.
To better enable such comparisons, and to verify whether our conclusions translate to other metrics, Table 2 reports the same results as Table 1 but in terms of normalized discounted cumulative gain (NDCG) [15]:

$$\text{NDCG@}K(\pi) = \frac{\text{DCG@}K(\pi)}{\max_{\pi'} \text{DCG@}K(\pi')}. \tag{78}$$

Comparing Table 1 with Table 2 reveals that both tables show the same trends and relative differences between the different methods. This confirms that the conclusions that were made from comparisons in Section 8 are still valid when measuring with NDCG instead of ECP. In other words, Table 2 shows that even when evaluating with NDCG, the performance improvements of DR and DM over IPS and other baselines remain very clear.
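For readers who want to reproduce this kind of comparison, NDCG for a single deterministic ranking can be sketched as follows. This is an illustration under assumptions: it uses the common exponential gain and logarithmic discount, which the text above does not specify, and it normalizes by the ideal ordering of the same labels rather than by a maximum over policies. The function names are introduced here for illustration.

```python
import math


def dcg_at_k(labels_in_ranked_order, k):
    """DCG@k with gain 2^label - 1 and discount log2(rank + 1), ranks from 1."""
    return sum(
        (2 ** label - 1) / math.log2(position + 2)
        for position, label in enumerate(labels_in_ranked_order[:k])
    )


def ndcg_at_k(labels_in_ranked_order, k):
    """DCG@k divided by the DCG@k of the ideal (label-sorted) ranking."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items: define NDCG as zero
    return dcg_at_k(labels_in_ranked_order, k) / ideal


# Relevance labels listed in the order in which a ranker displayed the items.
perfect = ndcg_at_k([3, 2, 1, 0], k=4)
reversed_order = ndcg_at_k([0, 1, 2, 3], k=4)
```

A ranking that already sorts items by label attains the maximum value of 1, while any other ordering of the same items scores lower.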

ACM Trans. Inf. Syst., Vol. 0, No. 0, Article 0. Publication date: 2023.

position-bias is correctly estimated ∧ regression estimates between zero and twice the true relevances ⟹ bias of DR is less than or equal to bias of IPS.

Fig. 1. Policy performance in terms of ECP (Eq. 7) reached on three datasets and several settings. Top row: top-5 setting with known $\alpha$ and $\beta$ bias parameters; middle row: top-5 setting with estimated $\hat{\alpha}$ and $\hat{\beta}$; bottom row: full-ranking setting (no cutoff) with known $\alpha$ and $\beta$ bias parameters. Results are means over 20 independent runs, shaded areas indicate the 90% confidence intervals; y-axis: ECP on the held-out test-set; x-axis: the number of displayed rankings in the simulated training set.

Fig. 4. Effect of very incorrect bias parameters $\hat{\alpha}$ and $\hat{\beta}$ on the ECP (Eq. 7) of three estimators in the top-5 known-bias setting. Incorrect bias estimates are the mean of the true values across all positions: $\hat{\alpha} = \sum_{k=1}^{5} \alpha_k / 5$ and $\hat{\beta} = \sum_{k=1}^{5} \beta_k / 5$, as if there is no position-bias effect within the top-5. Results are means over 20 independent runs; y-axis: policy performance in terms of ECP (Eq. 7) on the held-out test-set; x-axis: the number of displayed rankings in the simulated training set.