Exploration of Unranked Items in Safe Online Learning to Re-Rank

Bandit algorithms for online learning to rank (OLTR) problems often aim to maximize long-term revenue by utilizing user feedback. From a practical point of view, however, such algorithms carry a high risk of hurting the user experience due to their aggressive exploration, and the demand for safe exploration has thus been rising in recent years. One approach to safe exploration is to gradually improve an original ranking whose quality is already guaranteed to be acceptable. In this paper, we propose a safe OLTR algorithm that efficiently exchanges one of the items in the current ranking with an item outside the ranking (i.e., an unranked item) to perform exploration. We optimistically select an unranked item to explore based on Kullback-Leibler upper confidence bounds (KL-UCB) and safely re-rank the items, including the selected one. Through experiments, we demonstrate that the proposed algorithm improves long-term regret over baselines without any safety violation.


INTRODUCTION
Learning-to-rank (LTR) methods play a key role in delivering attractive content to users living in the era of information overload, wherein new content rushes into databases every day. Systems must provide such new content quickly and accurately to the users who will be interested in it. However, we often lack information about new content, and exploration is therefore essential, even though it inevitably entails initially collecting information through largely uninformed predictions, which can damage user satisfaction. Thus, systems face a dilemma: exploration is necessary, yet its safety is indispensable.
Online learning-to-rank (OLTR) [5,19] is a promising approach to combat this lack of information about new items by immediately reflecting fresh user feedback collected through a prediction-observation loop. Several conventional studies have developed OLTR methods using item features [15], and this approach is also effective for handling new items. On the other hand, features of new items may be unreliable in practice, particularly when there is an unexpected craze for a new item; in this situation, features that were previously effective will not perform well. Recent OLTR studies have explored methods based on ranking bandits [2, 8-12, 16, 21]. While this approach enables learning to rank new items without relying on item features, a click model is often assumed in order to learn item attractiveness from biased click feedback. However, accurately specifying the click model behind user feedback is generally challenging, and hence such a model-specific approach may be unsafe: misspecified models lead to inaccurate predictions and can hurt user satisfaction. Recent studies have proposed click-model-agnostic algorithms [4,13,14,20], which are safe with respect to model misspecification. Li et al. [14] proposed a click-model-agnostic method, BubbleRank, which explicitly considers the safety of exploration in an OLTR setting where algorithms can leverage an original ranking generated by a method previously deployed in the production system. Based on the definition of Li et al. [14], BubbleRank is "safe" in the sense that the ranking shown to a user does not substantially underperform the original ranking with high probability. Nevertheless, BubbleRank is designed for re-ranking the originally ranked items and cannot efficiently handle unranked items, which do not appear in the original ranking; although an extension with random exploration of unranked items is discussed, this naïve strategy is statistically inefficient, as we show in this paper.
In this paper, we develop an OLTR algorithm that can safely explore unranked items by extending BubbleRank [14]. To achieve safe exploration of unranked items without any preliminary information, we utilize the Kullback-Leibler upper confidence bound (KL-UCB) [1] as an optimistic confidence measure of item attractiveness. To examine the effectiveness of the proposed algorithm in various scenarios, we conduct semi-simulated experiments on a real-world dataset.

RELATED WORK
Conventional OLTR algorithms can be classified into the click-model-specific approach [2, 8-12, 16, 21] and its click-model-agnostic counterpart [4,13,14,20]. Model-specific algorithms assume a certain click model behind user feedback data to efficiently learn optimal rankings when users follow the assumed model, e.g., the position-based model (PBM) [8,12] and the cascade model (CM) [3,10,11,21]. However, such algorithms can be unsafe in the sense that their theoretical guarantees do not hold when the assumed click model does not fit actual user behavior [4,14]. By contrast, the model-agnostic counterpart requires only weak assumptions. UniRank [4] is the state-of-the-art model-agnostic algorithm with excellent performance in terms of regret, but it does not consider safety constraints. To achieve safe re-ranking without assuming a click model, Li et al. [14] proposed BubbleRank, which nevertheless has a severe limitation in handling unranked items because its random exploration does not consider statistical efficiency.
The notion of safety in OLTR is related to that in conservative bandit algorithms. In conservative bandits, safety is defined as a constraint on cumulative rewards [6,18]; notably, at each round, algorithms are allowed to select arms that can cause high regret, as long as the constraint is respected over the entire horizon. Beyond such a "coarse-grained" definition, some recent studies consider stage-wise safety, which requires algorithms to be conservative in every round [7,17]. This stage-wise definition is more closely related to the safety of interest in this study. Unfortunately, the existing algorithms are designed for linear bandit settings [7,17] and thus cannot be applied efficiently to OLTR settings.
In this paper, we propose a model-agnostic algorithm inspired by BubbleRank and UniRank that enables stage-wise safe re-ranking and exploration of unranked items.

PROBLEM FORMULATION
An instance of a stochastic click bandit is a tuple (K, L, P_α, P_χ), where L ∈ ℕ is the size of the set of all items D, P_α is a distribution over binary attraction vectors {0, 1}^L, and P_χ is a distribution over binary examination matrices {0, 1}^{Π_K(D)×K}, with Π_K(D) the set of all permutations of K (≤ L) items from D.
For n ∈ ℤ₊, the set of positive integers, let [n] := {1, . . ., n}. At each round t ∈ [T], an algorithm shows a ranking R_t ∈ Π_K(D) to a user and observes the user's clicks {c_t(k)}_{k=1}^K ∈ {0, 1}^K on all positions in R_t; note that R_t depends on the history up to round t − 1. A position is clicked if and only if it is examined and the item at that position is attractive, that is, for any k ∈ [K], c_t(k) = χ_t(R_t, k) α_t(R_t(k)), where α_t ∼ P_α is the attraction vector and χ_t ∼ P_χ is the examination matrix in round t. The reward in round t is the number of clicks, r(R_t, α_t, χ_t) := Σ_{k=1}^K c_t(k), and the goal of the algorithm is to minimize the cumulative expected regret

R(T) := Σ_{t=1}^T E[r* − r(R_t, α_t, χ_t)],  (1)

where r* := max_{R ∈ Π_K(D)} E[r(R, α_t, χ_t)] is the highest expected reward and the expectation is taken with respect to the rankings from the algorithm and the clicks.
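As a concrete illustration, the interaction above can be simulated in a few lines. The sketch below assumes a position-based examination model (one special case of P_χ); the instance sizes, attraction values, and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

L, K = 10, 5                                       # hypothetical instance: L items, K positions
alpha = rng.uniform(0.05, 0.6, size=L)             # mean attractions (unknown to the learner)
chi = np.array([1.0 / (k + 1) for k in range(K)])  # PBM-style examination probability per position

def play_round(ranking):
    """Show `ranking` (a tuple of K item ids) and observe binary clicks per position:
    a position is clicked iff it is examined and its item is attractive."""
    attract = rng.random(L) < alpha    # attraction vector drawn from P_alpha
    examine = rng.random(K) < chi      # examination vector (PBM special case of P_chi)
    return [int(examine[k] and attract[ranking[k]]) for k in range(K)]

def expected_reward(ranking):
    """Expected number of clicks on `ranking` under the PBM assumption."""
    return sum(chi[k] * alpha[ranking[k]] for k in range(K))

r_star = expected_reward(tuple(np.argsort(-alpha)[:K]))       # best K items, best order
per_round_regret = r_star - expected_reward(tuple(range(K)))  # regret of showing items 0..K-1
```

Because the examination probabilities decrease with position, r* is attained by the top-K items sorted in descending attraction, which is what the `argsort` at the end relies on.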
Our problem requires the assumptions introduced by Li et al. [14]. These assumptions hold in the CM, and they also hold in the PBM when the examination probability decreases with the position. Under these assumptions, the optimal ranking R* := argmax_{R ∈ Π_K(D)} E[r(R, α_t, χ_t)] places the item with the k-th highest attractiveness at position k.

PROPOSED METHOD
Our algorithm, called KL-UCB-BR, is described in Algorithm 1. KL-UCB-BR maintains the following three rankings in each round t: a leader ranking R_t^LDR, a temporary ranking R'_t, and a display ranking R_t. The leader ranking R_t^LDR is the interim best ranking estimated at round t; we initialize it to the original ranking R_0 if t = 1 and, if t > 1, to the top-K partial ranking of the previous temporary ranking, R'_{t−1}([K]). The temporary ranking R'_t is used to exchange items so as to safely reorder item pairs; we initialize it with the current leader ranking R_t^LDR, place a single unranked item to explore at the (K + 1)-th position, and exchange items at consecutive positions to ensure the correct order. The display ranking R_t is presented to the user, and clicks can be observed on its items; we initialize it to the current temporary ranking R'_t and update it by randomly exchanging items at consecutive positions whenever KL-UCB-BR is not confident about the order of their attractiveness.
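The choice of which unranked item to place at the (K + 1)-th position is made optimistically via KL-UCB. The sketch below is a generic Bernoulli KL-UCB index computed by bisection; the click/pull statistics and the omission of lower-order log-log terms are illustrative assumptions, not the paper's exact index.

```python
import math

def kl_bern(p, q):
    """Bernoulli KL divergence KL(p || q), clipped away from {0, 1} for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(clicks, pulls, t):
    """Largest q with pulls * KL(mean || q) <= log(t), found by bisection.
    Items never explored get the most optimistic index, 1.0."""
    if pulls == 0:
        return 1.0
    mean = clicks / pulls
    bound = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    for _ in range(50):            # bisection on the monotone KL boundary
        mid = (lo + hi) / 2.0
        if kl_bern(mean, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# hypothetical click/pull statistics for three unranked items
stats = {"a": (3, 20), "b": (0, 0), "c": (10, 40)}
t = 100
chosen = max(stats, key=lambda i: kl_ucb_index(*stats[i], t))  # optimism favors "b"
```

The never-explored item "b" receives index 1.0 and is explored first; as pulls accumulate, the index shrinks toward the empirical click rate.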
As the criterion to safely reorder item pairs so that they become correctly ordered, we utilize the following statistics for the exchanged items in the temporary ranking [14]:

s_t(i, j) := Σ_{ℓ=1}^{t−1} ( c_ℓ(R_ℓ^{−1}(i)) − c_ℓ(R_ℓ^{−1}(j)) ) U_ℓ(i, j),  n_t(i, j) := Σ_{ℓ=1}^{t−1} U_ℓ(i, j),

where R^{−1}(i) is the position of item i in a ranking R and U_t(i, j) := 1{(i, j) ∈ P_t(R_t)} 1{c_t(R_t^{−1}(i)) ≠ c_t(R_t^{−1}(j))}, with P_t(R) the set of item pairs at odd-even (even-odd) consecutive positions in a ranking R in even (odd) rounds t. From Lemma 9 of Li et al. [14], when s_t(i, j) > √(n_t(i, j) log(1/δ)) for sufficiently small δ, item i is superior to item j with high probability.
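Maintaining s_t(i, j) and n_t(i, j) and applying the Lemma-9-style test reduces to a few lines. The dictionary layout and function names below are illustrative.

```python
import math

def update_pair_stats(stats, compared_pairs, click_of):
    """Update (s, n) for each compared pair (i, j): a round counts only when
    exactly one of the two items is clicked (U_t(i, j) = 1); s accumulates
    c(i) - c(j) and n counts such rounds."""
    for i, j in compared_pairs:
        ci, cj = click_of[i], click_of[j]
        if ci != cj:
            s, n = stats.get((i, j), (0, 0))
            stats[(i, j)] = (s + ci - cj, n + 1)
    return stats

def confidently_better(stats, i, j, delta):
    """Declare i superior to j when s_t(i, j) > sqrt(n_t(i, j) * log(1 / delta))."""
    s, n = stats.get((i, j), (0, 0))
    return s > math.sqrt(n * math.log(1.0 / delta))
```

For instance, if i is clicked and j is not in 100 compared rounds, then s = n = 100 exceeds the threshold √(100 log(1/δ)) ≈ 26.3 for δ = 10⁻³, so the order (i, j) is fixed.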
EXPERIMENTS
To quantify the performance gain over the original ranking, we also consider the original ranking itself as a non-adaptive baseline method; hereafter, it is referred to as OriginalRank. As we are interested in OLTR problems with unranked items, we use the extension of BubbleRank described by Li et al. [14] as a baseline, in which the item at the bottom of the current ranking is exchanged randomly with a random unranked item that has not yet been determined to be superior to the bottom-ranked item.
In our experiments, we use the Yandex click dataset to simulate user clicks on the rankings displayed by the algorithms. This dataset includes over 30 million user sessions, which contain over 20 million unique search queries, extracted from Yandex search logs. We basically follow the experimental protocol of conventional studies [14,20]. To simulate user behavior in the dataset, we estimate the parameters of a click model from the user sessions of the top-100 most frequent queries by using the PyClick library. Throughout the experiments, we consider two click models implemented in PyClick: the position-based model (PBM) and the cascade model (CM). In a single simulated user session, we generate user clicks on a ranking displayed by an algorithm according to the learned click model and then evaluate the algorithm.
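Once attraction and examination parameters have been estimated, the two click models can be simulated as follows; the parameter values and function names are placeholders for the quantities fitted with PyClick.

```python
import numpy as np

rng = np.random.default_rng(1)

def pbm_clicks(ranking, attract, exam):
    """Position-based model: position k is examined with probability exam[k],
    independently of the other positions and of the items."""
    return [int(rng.random() < exam[k] and rng.random() < attract[item])
            for k, item in enumerate(ranking)]

def cm_clicks(ranking, attract):
    """Cascade model: the user scans top-down and stops at the first click."""
    clicks = [0] * len(ranking)
    for k, item in enumerate(ranking):
        if rng.random() < attract[item]:
            clicks[k] = 1
            break                      # positions below the click are not examined
    return clicks
```

Note the structural difference a model-agnostic algorithm must cope with: the PBM can produce several clicks per session, whereas the CM produces at most one.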
For each query, we use the most frequent ranking with 10 items and take the top-5 items as the original ranking and the remaining 5 items as unranked ones. The goal of the simulation for each query is to rank the top-5 most attractive items in descending order of attractiveness among the 10 ranked/unranked items. We measure the performance of each algorithm by the cumulative expected regret defined in Eq. (1) and the number of safety violations for display rankings. A safety violation is a round in which an algorithm violates the safety constraint defined in Definition 5.1. Violating the safety constraint often leads to user disengagement from applications by displaying a ranking that is far inferior to the original one. Notably, KL-UCB-BR and BubbleRank are guaranteed not to violate the safety constraint up to round T with high probability.

Definition 5.1. Let V(R) be the number of incorrectly-ordered item pairs of which one or both items are in a display ranking R. Then, the safety constraint for a display ranking R_t in round t is V(R_t) ≤ V(R_0) + (K − 1)/2, where R_0 is the original ranking.
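The safety-violation count can be checked per round as sketched below. For brevity, this sketch counts only incorrectly-ordered pairs among displayed items (the definition above also counts pairs with one unranked item), and the slack term of the constraint is left as a free parameter.

```python
def v_incorrect(display, attract):
    """V(R): number of incorrectly-ordered pairs among displayed items, i.e. pairs
    where a less attractive item is placed above a more attractive one."""
    v = 0
    for a in range(len(display)):
        for b in range(a + 1, len(display)):
            if attract[display[a]] < attract[display[b]]:
                v += 1
    return v

def is_safe(display, original, attract, slack):
    """Stage-wise safety check V(R_t) <= V(R_0) + slack; the paper ties the slack
    to the ranking length K."""
    return v_incorrect(display, attract) <= v_incorrect(original, attract) + slack
```

For example, with attractions [0.5, 0.4, 0.3, 0.2, 0.1], swapping the top two items of the sorted original ranking adds one incorrect pair and stays within a slack of 2, while fully reversing the ranking adds ten and violates it.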

Results
In our experiments, we compare KL-UCB-BR with TopRank, UniRank, BubbleRank, and OriginalRank under the PBM and CM click models. As evaluation measures, the cumulative expected regret defined in Eq. (1) and the safety violation defined in Definition 5.1 are computed by averaging the measurements obtained from 100 repeated experiments, each with T = 10^5 rounds. The shaded regions present standard errors of the measurements.