Fast and Accurate User Cold-Start Learning Using Monte Carlo Tree Search

We revisit the cold-start task for new users of a recommender system, whereby a new user is asked to rate a few items with the aim of discovering the user's preferences. This is a combinatorial stochastic learning task, and so difficult in general. In this paper we propose using Monte Carlo Tree Search (MCTS) to dynamically select the sequence of items presented to a new user. We find that this new MCTS-based cold-start approach consistently and quickly identifies the preferences of a user with significantly higher accuracy than either a decision-tree or a state-of-the-art bandit-based approach, without incurring higher regret, i.e. the learning performance is fundamentally superior to that of the state of the art. This boost in recommender accuracy is achieved in a computationally lightweight fashion.


INTRODUCTION
In this paper we revisit the cold-start task for new users of a recommender system, which remains a core challenge. When a new user joins the system it initially has no knowledge of the preferences of the user and so would like to learn these quickly. The recommender system therefore initially starts in an "exploration" phase where the first few items that it asks the new user to rate are chosen with the aim of discovering the user's preferences. We focus on the simplest setup where a user explicitly rates items presented to them, e.g. on a 1-5 scale or with binary like/dislike feedback, and the aim of the recommender system is to predict other items that the user may like.
One common approach to this new user cold-start task is to take ratings already collected from a population of users, use these to cluster users into groups and then train a decision-tree to learn a mapping from item ratings to the user group, see for example Figure 1(a). When a new user joins the system this decision-tree is used to decide which items the user is initially asked to rate, and in this way the group to which the user belongs is initially estimated. Once the group is estimated, the system recommends items liked by members of that group, e.g. using matrix factorisation or another collaborative filtering approach.
However, typically users clustered in the same group do not give identical ratings to an item. Rather there is a spread of ratings, and this intra-cluster variability between users can be thought of as adding noise to the ratings. Unfortunately, decision trees can easily make mistakes in the face of such noise. For example, Figure 1(b) shows the measured decision-tree accuracy for Netflix data clustered into 16 groups (see later for more details). It can be seen that the accuracy is as low as 50-60% for a number of groups.
The user cold-start task can be viewed as a form of single-player game. In a single-player game a sequence of moves is selected with the aim of maximising the reward or score generated by an environment (which may have some randomness). Supposing t moves have already been made, lookahead over the next sequence of d moves can be represented by a path in a tree of depth d. Selecting the next move then involves exploring this tree to find a good future sequence of moves and then making the first move from that sequence, and the process then repeats starting from move t + 1.
We can directly map this to user cold-start as follows. A move consists of presenting a user with an item and observing their rating. After t moves we have presented a user with t items and observed t ratings. The next sequence of d moves consists of a path in a tree of depth d where each node is an item. The score of a sequence is the estimated probability of learning the correct user group by presenting that sequence of items to the user (we discuss the details of this calculation below). The first item of the sequence of d items with the highest score is then presented to the user, and the process then repeats. Since we have a budget of t_max items, the maximum lookahead d at step t is equal to t_max − t.
Observe that selecting the next move/item-to-present involves searching a lookahead tree of depth d. Monte Carlo Tree Search (MCTS) replaces exhaustive exploration of this tree with targeted exploration of only the most promising paths. This allows deeper tree exploration and better solutions to be found, even when the tree branching factor (the number of possible next moves/items) is large. MCTS has now largely replaced alpha-beta search in computer game playing and has contributed to breakthrough performance in Go (e.g. see AlphaGo [16]) and Chess (e.g. see [17]). We find that an MCTS-based approach to cold-start is able to achieve extremely fast learning of user preferences. We demonstrate that the group of a user is consistently and quickly identified with significantly higher accuracy than with either a decision-tree or a state-of-the-art bandit-based approach, without incurring higher regret, i.e. the learning performance is fundamentally superior to that of the state of the art. This boost in recommender accuracy is achieved in a computationally lightweight fashion, with our MCTS-based implementation able to perform over 30 recommendations per second using a single CPU core for the Netflix dataset with 8 distinct user groups. It is trivially parallelisable and so linearly scalable with the number of CPU cores.

RELATED WORK
For a recent survey of solutions to the cold-start problem see [7, 8]. Passive approaches include recommending popular items, use of item-based recommendation (once a user starts rating items, an item-based approach is used to recommend similar items), transfer learning from another recommender system previously used by a user, and asking new users to rate a fixed list of items. Examples of early work on active learning include IGCN (information gain through clustered neighbors), which uses a decision tree with user clusters as leaves [11], and the ternary decision-tree approach of [10]. More recently, [1] uses representative items, i.e. after completing the ratings matrix R, k columns of R are selected, the ratings of the other items are represented as a linear combination of these, and during cold start a new user is asked to rate these representative items. This approach is extended to a decision-tree approach by [15]. In [18] a matrix factorisation approach is proposed whereby a decision-tree is trained to map from item ratings to the latent feature vector for a user.
Use of multi-arm bandits (MABs) for user cold-start has also received attention. In [9], after completing the ratings matrix R its rows are clustered and the average ratings vector for each cluster is used as a representative user. During cold start a MAB is used to select the average ratings vector to use, and the user is asked to rate the next highest item in the vector. In [3] a MAB is used to select between recommender strategies, typically recommending popular items initially for a new user and later switching to a kNN or matrix factorisation model. Note that naive application of standard bandit algorithms to the cold-start task leads to poor performance. If we think of each recommender system item as an arm of a MAB then we run into the difficulty that (i) there are many arms and so learning is slow, and (ii) repeated pulls of the same arm tend to be highly correlated. One remedy is to associate an arm with each group rather than each item [9]. For each group the available items are sorted in descending order of their predicted rating by users in that group. Pulling the arm for a group then corresponds to asking the user to rate the next item from this sorted list, i.e. the unrated item predicted to have the highest rating for members of the group. While this greatly reduces the number of arms in the MAB, the learning rate remains very slow [14]. This is because items rated highly by members of one group tend to also be rated highly by members of at least some of the other groups, and so the user ratings for these items do not serve to strongly distinguish between groups and so allow rapid learning. To address this, [14] identify so-called distinguisher items that tend to have distinct ratings by users in different pairs of groups. Using these, they propose a Cluster-based Bandit algorithm that we use as a baseline for comparison in the present paper.
Contextual bandits have also received attention for user cold-start. These make use of contextual information, e.g. the user's location, gender and demographics. In this paper we assume that such contextual information is absent, but using techniques similar to contextual bandits our approach can be readily extended to take advantage of it when it is available, see Section 5.
Monte Carlo Tree Search was introduced by [5] and quickly adopted within the two-player game community. See [2] for a survey of MCTS methods developed in the first 5 years after its introduction. MCTS was used by AlphaGo [16] to defeat the world champion in the game of Go, and also in the later AlphaZero [17] game-playing engine. Use of MCTS in single-player games was introduced in [12, 13]. For more recent work, see for example [6] and references therein.

MONTE CARLO TREE SEARCH FOR USER COLD-START

Preliminaries
We have a set G of user groups. Each user belongs to one group g ∈ G. We also have a set of items V. Given a new user, our task is to quickly learn which group they belong to by asking the user to rate t_max items in V, using the fact that the distribution of item ratings varies depending on the user's group.
Let V^(t) = {v_1, ..., v_t}, with v_i ∈ V, i = 1, ..., t, be the set of items rated by a user and R(v_i) be the user's rating of item v_i. Initially, for a new user, t = 0 and V^(0) is the empty set. Let p_g^(t), g ∈ G, be an estimate of the probability that the user belongs to group g given that the user has rated the items V^(t), with Σ_{g∈G} p_g^(t) = 1. When t = 0 these probabilities can be initialised to the uniform distribution p_g^(0) = 1/|G|, or alternatively to a distribution derived from population data.
Our MCTS approach is agnostic to how the probabilities p_g are calculated, but for concreteness in our examples we will assume that for users belonging to group g the rating R(v) of item v is i.i.d. Gaussian with mean μ(g, v) and variance σ²(g, v). Denoting the user's group by the random variable G, Bayes' rule then gives, for the observed sequence of ratings R(v_1), ..., R(v_t),

p_g^(t) = P(G = g | R(v_1), ..., R(v_t)) = p_g^(0) Π_{i=1..t} f_g(R(v_i)) / Σ_{h∈G} p_h^(0) Π_{i=1..t} f_h(R(v_i))    (1)

for t = 1, 2, ..., where f_g(R(v_i)) denotes the Gaussian density with mean μ(g, v_i) and variance σ²(g, v_i) evaluated at R(v_i).
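To make this update concrete, the following is a minimal sketch of the Bayes update (1) under the Gaussian rating model; the function and argument names are our own illustration, not taken from the paper's implementation:

```python
import math

def group_posterior(prior, ratings, mu, sigma2):
    """Bayes update of the group-membership probabilities p_g^(t) under the
    i.i.d. Gaussian rating model. Names and signatures are our own.

    prior   : dict g -> p_g^(0)
    ratings : list of (item, rating) pairs rated so far
    mu      : dict (g, item) -> mean rating of the item within group g
    sigma2  : dict (g, item) -> rating variance of the item within group g
    """
    post = {}
    for g, p0 in prior.items():
        log_lik = 0.0
        for v, r in ratings:
            m, s2 = mu[(g, v)], sigma2[(g, v)]
            # log of the Gaussian density N(r; m, s2)
            log_lik += -0.5 * math.log(2 * math.pi * s2) - (r - m) ** 2 / (2 * s2)
        post[g] = p0 * math.exp(log_lik)
    z = sum(post.values())  # normalise so the probabilities sum to 1
    return {g: p / z for g, p in post.items()}
```

A rating close to one group's mean and far from the other's drives the posterior quickly towards the first group, which is the concentration behaviour exploited throughout the paper.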

Efficiently Searching The Lookahead Tree
Suppose at step t a new user has rated items V^(t). The lookahead tree at step t embodies all item sequences of length d = t_max − t, where t_max is the total number of cold-start items to be presented to a new user, see e.g. Figure 2. Note that we only ask a user to rate a given item once, since repeated ratings of the same item tend to be highly correlated. To select the next item to present to the user, our MCTS-based approach proceeds by repeatedly executing the following actions: (i) select a sample lookahead path S = {s_{t+1}, ..., s_{t_max}} of d = t_max − t items from the tree, (ii) draw a random user group ĝ according to the probability distribution {p_g^(t), g ∈ G}, (iii) draw synthetic ratings R(s_{t+1}), ..., R(s_{t_max}) for a user from group ĝ, (iv) use V^(t) and the synthetic ratings R(s_{t+1}), ..., R(s_{t_max}) to estimate the group probabilities p_g^(t_max), (v) if ĝ is the group with highest probability generate reward +1 (the user group is correctly identified), else generate reward 0, (vi) backpropagate this reward along path S in the lookahead tree. We give more details on these steps below.
In principle, in step (i) we would like to select a different path at every turn until all paths in the lookahead tree have been visited. We could then select the path with highest reward, present the first item in that path to the user, observe the user's rating, update t to t + 1 and V^(t) to V^(t+1), and repeat the MCTS steps. However, the lookahead tree is generally far too large for an exhaustive search over all paths to be feasible. To see this, observe that in a lookahead tree of depth d at time t = 0 there is a root node with |V| children (corresponding to each possible item in set V), each child in turn has |V| − 1 children, and so on. The tree therefore has |V|!/(d!(|V| − d)!) leaf nodes and when, for example, |V| = 1000 and d = 5 the tree has around 10^12 leaf nodes. Rather than trying to carry out an exhaustive search of the lookahead tree, in step (i) we instead try to identify good paths that are likely to generate high reward and focus on exploring these.
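The leaf-count estimate above can be checked directly, using the combination formula quoted in the text:

```python
import math

# Leaf count |V|!/(d!(|V|-d)!) for |V| = 1000 items and lookahead depth d = 5.
leaves = math.factorial(1000) // (math.factorial(5) * math.factorial(995))
assert leaves == math.comb(1000, 5)  # same quantity via the builtin
print(f"{leaves:.3e}")  # roughly 8.3e12, i.e. around 10^12 leaf nodes
```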

Finding Good Paths.
Since the full lookahead tree may be extremely large, our aim is to construct a subtree containing paths that tend to generate high reward. Each node n in this subtree lies at the end of a path (i.e. a sequence of items) S(n) that leads from the tree root to the node. Associated with each node n in the tree is: (i) an item s(n) (namely, the last item in path S(n)), (ii) a count N(n) of the number of times that s(n) has been included in the MCTS lookahead set S, and (iii) a reward Q(n) which is the sum of the rewards generated by lookahead sets S that include s(n).
We construct this subtree (and in the process also generate sample lookahead paths) as follows. Initially create a tree consisting of a root node and |V| child nodes, i.e. each child node is associated with a different item in V. Initialise counters N(·) = 0 and Q(·) = 0 for these child nodes. Pick a child node n uniformly at random and set item s_{t+1} = s(n). We need to extend this to obtain a sequence S of d items, and the simplest way to do this is just to pick d − 1 items s_{t+2}, ..., s_{t_max} uniformly at random from V. Generate the reward for this path S and add this to the counter Q(n) associated with the selected child node; also increment the N(n) counter associated with the child node. At the next turn, select a child node uniformly at random from the child nodes with N(·) = 0 and repeat this process.
Once there are no child nodes with N(·) = 0, pick a child node n with high reward, set item s_{t+1} = s(n) and add |V \ {s(n)}| child nodes to this node. Select one of these children uniformly at random and set this as item s_{t+2}. Extend the sequence {s_{t+1}, s_{t+2}} to obtain a sequence S of length d by selecting d − 2 items uniformly at random. Generate the reward for this path S and add this to the counters Q(·) of the nodes corresponding to items s_{t+1} and s_{t+2}; also increment the N(·) counters associated with these nodes.
At the next turn, repeat this process. That is, traverse the current tree picking nodes with high reward until either a leaf node is reached or a node with unvisited children is encountered, and then complete the sequence with randomly selected items as before. Algorithm 1 gives pseudo-code for this process of growing a lookahead subtree containing paths that tend to generate high reward.
Algorithm 1 MCTS For User Cold-Start
  Initialise lookahead tree T ← root node
  for turn i = 1; i ≤ max_turns; i = i + 1 do      ▷ Construct lookahead subtree
      j ← t + 1
      n_j ← best_child(T_root); s_j ← s(n_j)       ▷ Select good path
      while (n_j ≠ null) and (j < t_max) do
          n_{j+1} ← best_child(n_j); s_{j+1} ← s(n_{j+1})
          j ← j + 1
      end while
      expand n_j with child nodes; generate reward for sequence {s_{t+1}, ..., s_j}
      backpropagate reward along the selected path  ▷ Update Q(·), N(·)
  end for
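As an illustration, the subtree-growing loop described above might be sketched in Python as follows. This is our own minimal implementation (not the authors' released code): it uses the no-rollout variant discussed later, an exploration constant matching the UCB derivation below, and names of our choosing.

```python
import math
import random

class Node:
    """Lookahead-subtree node: an item plus visit count N and total reward Q."""
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.children, self.N, self.Q = [], 0, 0.0

def ucb(node):
    # Mean reward plus an exploration bonus that shrinks with visits;
    # the constant follows the Hoeffding argument in the text.
    return node.Q / node.N + math.sqrt(math.log(node.parent.N) / (4 * node.N))

def mcts_next_item(items, reward_fn, depth, max_turns, rng=random):
    """Grow a lookahead subtree (no rollouts) and return the most-visited
    child of the root. reward_fn maps an item sequence to a 0/1 reward,
    e.g. the Monte Carlo group-identification reward of Section 3."""
    root = Node(None)
    root.children = [Node(v, root) for v in items]
    for _ in range(max_turns):
        node, path = root, []
        # Selection: descend through fully visited nodes via UCB.
        while node.children and all(c.N > 0 for c in node.children):
            node = max(node.children, key=ucb)
            path.append(node)
        if node.children:
            # Some child has never been tried: visit one at random.
            node = rng.choice([c for c in node.children if c.N == 0])
            path.append(node)
        elif len(path) < depth:
            # Expansion: add children for the unused items and visit one.
            used = {n.item for n in path}
            node.children = [Node(v, node) for v in items if v not in used]
            node = rng.choice(node.children)
            path.append(node)
        # Score the (possibly short) sequence directly -- no rollout.
        r = reward_fn([n.item for n in path])
        # Backpropagation: update counters along the selected path.
        root.N += 1
        for n in path:
            n.N += 1
            n.Q += r
    return max(root.children, key=lambda c: c.N).item
```

With a reward function that favours sequences starting with a particular item, the search quickly concentrates its visits on that first item, which is exactly the behaviour used to pick the next item to present.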

Selecting a Node With High Reward.
Once the tree has started to grow, the process described above requires selecting a sequence of child nodes with high reward until a leaf node is reached. In doing this we would like to balance exploration (trying new items) and exploitation (selecting an existing item) and so we adopt an optimistic upper confidence bound (UCB) approach. Namely, we estimate the reward of a node n that is a child of parent node p as

Reward(n) = Q(n)/N(n) + sqrt( log N(p) / (4 N(n)) ).

The first term Q(n)/N(n) is the average reward observed so far for sequences that include item s(n) (i.e. node n). However, when the number of samples N(n) on which this average reward is based is small then the value may well be inaccurate. The second term aims to compensate for this. Intuitively, when N(n) is small the second term is large, encouraging exploration of node n. As N(n) grows, however, the second term becomes smaller, capturing the fact that the first term Q(n)/N(n) is then more reliable. Another way to derive the expression for Reward(n) is to note that the reward is a Bernoulli random variable taking value 0 or 1. Applying Hoeffding's inequality, P(Q(n)/N(n) ≥ E[Reward] + x) ≤ e^(−2x²N(n)), and choosing x = sqrt(log N(p) / (4 N(n))) gives e^(−2x²N(n)) = e^(−0.5 log N(p)) = 1/√N(p). Hence, Reward(n) can be thought of as the upper limit of a confidence interval which increases in power as the number of times N(p) that the parent node is visited grows.
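The UCB estimate and the Hoeffding identity behind it can be written down directly; this is a sketch with variable names of our own choosing:

```python
import math

def exploration_bonus(n_visits, parent_visits):
    """Bonus x = sqrt(log(N(p)) / (4 N(n))) from the Hoeffding argument."""
    return math.sqrt(math.log(parent_visits) / (4 * n_visits))

def ucb_reward(Q, n_visits, parent_visits):
    """Optimistic reward estimate Q(n)/N(n) plus the exploration bonus."""
    return Q / n_visits + exploration_bonus(n_visits, parent_visits)

# Check the identity: exp(-2 x^2 N(n)) = N(p)^(-1/2) for this choice of x.
x = exploration_bonus(10, 100)
assert abs(math.exp(-2 * x * x * 10) - 100 ** -0.5) < 1e-12
```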
Algorithm 2 gives pseudo-code using this UCB approach to select a child node with high reward.

Calculating The Reward.
Given the items V^(t) = {v_1, ..., v_t} already rated by a user, plus a sample path {s_{t+1}, ..., s_{t_max}} from the lookahead tree, we need to calculate the reward associated with the sequence of items {v_1, ..., v_t, s_{t+1}, ..., s_{t_max}}. We have the user ratings for items v_1, ..., v_t but we lack ratings for items s_{t+1}, ..., s_{t_max}.
We therefore adopt a Monte Carlo sampling approach. Namely, we select a user group ĝ according to the probability distribution {p_g^(t), g ∈ G}, i.e. according to our best estimate of the user group at step t. Initially we are unsure of the user group and {p_g^(0)} is the uniform distribution, but as the user rates items we hope that the distribution {p_g^(t)}, and so ĝ, will start to concentrate on the true group of the user. This concentration behaviour can be seen, for example, in Figure 3. Given user group ĝ, we can generate synthetic ratings R(s_{t+1}), ..., R(s_{t_max}) by making a draw from the multivariate Gaussian distribution of ratings by users in group ĝ, with mean μ(ĝ, v) and variance σ²(ĝ, v), v ∈ V. Note that, alternatively, we could draw ratings from the empirical distribution of ratings for a group (so relaxing the Gaussian assumption), or generate user ratings via a water-filling approach: split the data into training and test data, pick a user from the test data and use their ratings, and when we need a rating for an item that the user has not rated, pick a second user from the same group who has rated the item and merge the pair of user ratings. We found the performance of these setups to be very similar to simply drawing a new user from a Gaussian distribution.
With the user ratings R(v_1), ..., R(v_t) and synthetic ratings R(s_{t+1}), ..., R(s_{t_max}) in hand, equation (1) can now be used to calculate the estimated probabilities {p_g^(t_max)} when the user rates the items {v_1, ..., v_t, s_{t+1}, ..., s_{t_max}}. If the "true" group ĝ is the group with highest estimated probability then the reward for the sequence is +1, else 0.
Averaging over multiple such samples, the average reward estimates E[Prob(true group is correctly estimated | true group = ĝ, user ratings of items V^(t), next items are s_{t+1}, ..., s_{t_max})], where the expectation is over the groups G.
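A single Monte Carlo reward sample might be sketched as follows, assuming the Gaussian rating model; the helper posterior and all names here are our own illustration:

```python
import math
import random

def gaussian_posterior(prior, ratings, mu, sigma2):
    # Minimal Bayes update under the Gaussian rating model (illustrative).
    w = {}
    for g, p in prior.items():
        ll = sum(-0.5 * math.log(sigma2[(g, v)])
                 - (r - mu[(g, v)]) ** 2 / (2 * sigma2[(g, v)])
                 for v, r in ratings)
        w[g] = p * math.exp(ll)
    z = sum(w.values())
    return {g: x / z for g, x in w.items()}

def sample_reward(p_t, lookahead, mu, sigma2, rng=random):
    """One Monte Carlo reward sample, i.e. steps (ii)-(v) of the search:
    draw a group from the current belief p_t, synthesise Gaussian ratings
    for the lookahead items, update the belief, and score 1 iff the drawn
    group ends up with the highest probability."""
    groups = list(p_t)
    g = rng.choices(groups, weights=[p_t[h] for h in groups])[0]
    synthetic = [(v, rng.gauss(mu[(g, v)], math.sqrt(sigma2[(g, v)])))
                 for v in lookahead]
    p_end = gaussian_posterior(p_t, synthetic, mu, sigma2)
    return 1 if max(p_end, key=p_end.get) == g else 0
```

Averaging many such 0/1 samples for a candidate item sequence estimates the probability that the sequence identifies the user's group.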

Choosing The Next Item.
A lookahead subtree consisting of sequences of items that tend to generate high reward is constructed by executing the Monte Carlo search steps (i)-(vi) max_turns times, where max_turns is a hyperparameter. We need to select max_turns to be large enough that promising sequences of items are discovered and are visited sufficiently frequently that the estimated reward for each sequence is reasonably accurate.
In the lookahead subtree the children of the tree root are the first items of the set of lookahead sequences. Using the lookahead subtree generated at step t, we choose the next item v_{t+1} that the user is asked to rate to be the child of the tree root that has been most visited during the Monte Carlo search, i.e. v_{t+1} = s(n*) where n* ∈ arg max_{n ∈ children(T_root)} N(n). While we might alternatively choose the child with highest estimated reward Q(n)/N(n), we found that selecting the most frequently visited child tended to achieve slightly better performance.

Best Distinguisher Items Are Enough.
The lookahead tree expands by a factor of roughly |V| at each level. Even with the Monte Carlo search approach described above, the computational and memory burden can therefore quickly become substantial for large |V|. Following [14], we note that some items are more effective than others for distinguishing between groups. For example, popular items rated highly by members of one group can tend to also be rated highly by members of at least some of the other groups, and so the user ratings for these items do not serve to strongly distinguish between groups.
Intuitively, an item v helps to distinguish whether a user belongs to group g rather than group h when (i) the mean rating of v by users in group g is very different from that of users in group h, i.e. (μ(g, v) − μ(h, v))² is large, and (ii) the ratings tend to be consistent/reliable, i.e. the variance σ²(g, v) is small. That is, we expect that

Γ_{g,h}(v) = (μ(g, v) − μ(h, v))² / σ²(g, v)

is a measure of the ability of item v to distinguish group g from group h, i.e. the larger Γ_{g,h}(v) the better item v is at distinguishing group g from group h. Another way to arrive at the same conclusion is to assume that for users belonging to group g the rating R(v) of item v is i.i.d. Gaussian with mean μ(g, v) and variance σ²(g, v), and to consider the pairwise similarity measure R_t(g, h) = p_g^(t) / (p_g^(t) + p_h^(t)). Suppose the user belongs to group g. We expect that the deviations R(v_i) − μ(g, v_i), i = 1, ..., t, tend to fluctuate around 0, as otherwise there would be a consistent offset between the user's ratings and the group ratings, in which case the user would better be assigned to a different group. Therefore R_t(g, h) → 1 as t → ∞ for h ≠ g and, similarly, R_t(h, g) → 0 as t → ∞ for h ≠ g. Hence, by thresholding R_t(g, h) we can identify the user group g. By standard concentration inequalities, for fast convergence we want Σ_{i=1..t} Γ_{g,h}(v_i) to be large. Using this observation, for each pair of groups g, h ∈ G we pick the t_max items v for which Γ_{g,h}(v) is largest and add these to a set Ṽ. Set Ṽ is generally much smaller than V, e.g. Table 1 shows the size of Ṽ for the standard Netflix dataset as the number of groups is varied. In our performance evaluation below we select the items that the user is asked to rate from the subset Ṽ ⊂ V of distinguisher items rather than from the full set of items V.
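The distinguisher-item selection can be sketched as follows; this is our own illustrative code, with Γ computed as defined above and the per-pair top-t_max selection following the description in the text:

```python
def gamma(mu, sigma2, g, h, v):
    """Distinguishing power of item v for the ordered group pair (g, h):
    squared mean-rating gap scaled by the within-group-g variance."""
    return (mu[(g, v)] - mu[(h, v)]) ** 2 / sigma2[(g, v)]

def distinguisher_items(items, groups, mu, sigma2, t_max):
    """Reduced candidate set: for each ordered pair of groups, keep the
    t_max items with the largest Gamma score (a sketch of the selection)."""
    chosen = set()
    for g in groups:
        for h in groups:
            if g != h:
                top = sorted(items, key=lambda v: gamma(mu, sigma2, g, h, v),
                             reverse=True)[:t_max]
                chosen.update(top)
    return chosen
```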

No Rollouts.
In Algorithm 1, after reaching a leaf node at depth j < t_max in the lookahead tree T, we select t_max − j items at random to obtain a sample item sequence of length t_max. A rollout step of this sort is standard in MCTS. However, we also measured the performance of our MCTS-based approach when this rollout step is omitted, i.e. a sample item sequence of length j ≤ t_max is generated from the lookahead tree. Figure 4 shows typical performance measurements for the Netflix and Goodreads datasets. This plots the accuracy with which the user group is estimated vs the number of items t_max that are rated by the user. Data is shown both with and without the rollout step. It can be seen that, perhaps somewhat surprisingly, the rollout step tends to degrade performance. What we suspect is happening here is that the rollout step is effectively just adding unhelpful noise, since we may only sample a relatively small number of random item sequences from the full set of possible sequences of length t_max − j, especially during the initial stages when j is small. Omitting the rollout step therefore offers the double advantage of reduced computation and improved performance.

Number Of Turns.
The performance of the MCTS-based algorithm depends on the max_turns parameter that determines the number of Monte Carlo samples used to generate the lookahead subtree. This needs to be sufficiently large that promising sequences of items are discovered and are visited sufficiently frequently that the estimated reward for each sequence is reasonably accurate. Initially we used a simple linear heuristic max_turns = k(t_max − t), where parameter k is varied depending on the dataset but is typically around 200. Table 2 shows the accuracy of the MCTS algorithm as the value of k is varied. Data is shown for the three datasets with |G| = 8 groups and t_max = 5. It can be seen that the accuracy initially improves as k is increased. This is as expected, since the lookahead subtree is being more thoroughly explored. However, once k reaches a value of about 200 the accuracy starts to level off, indicating that the lookahead tree is now sufficiently large and accurate.
The value of k at which the accuracy levels off depends on the depth d = t_max − t of the lookahead tree (d = 5 in Table 2), a smaller value of k being admissible when d is smaller, and vice versa. In our tests we therefore used max_turns = k(1.25 + (t_max − t)²).

Software
Our MCTS-based cold start implementation and data is available on github at https://github.com/dilina-r/mcts-rec.
Clustering Users. We use training data to cluster users into groups and to estimate the mean μ(g, v) and variance σ²(g, v) of the ratings by each group g for item v. We use the BLC matrix-factorisation clustering algorithm [4] for this, although other clustering algorithms might also be used. We vary the number of groups/clusters from 4 to 32 and report results for each.
Baseline Algorithms. We compare the performance of the proposed MCTS-based approach against (i) an optimised CART decision tree and (ii) the cluster-based bandit (CBB) algorithm of [14]. These are strong baselines, with good performance for cold-start active learning. Decision-trees are often considered for use in cold-start, while the recently proposed CBB algorithm offers state-of-the-art performance [14].
Modelling New Users. We generate the item ratings of a new user from group g by making a single draw from the multivariate Gaussian distribution with, for each item, mean μ(g, v) and variance σ²(g, v) equal to those estimated from the training data. This has the advantage that we can easily generate large numbers of new users in a clean, reproducible manner. In addition, we also evaluated performance when drawing ratings by splitting the data into training and test data, picking a user from the test data and using their ratings. We found the performance of these setups to be very similar to simply drawing a new user from a Gaussian distribution.
Performance Metrics. We report the accuracy with which the group of a new user is estimated, i.e. the fraction of times the correct group is estimated, vs the number of items rated by the new user. Statistics are calculated over 1000 new users per group.
Hardware. Tests were carried out on an 8-core Intel i7-9700 CPU @ 3.00GHz with 8GB RAM. Computational performance was measured using only a single core of the CPU.

Results
Figure 5 shows measurements of the mean accuracy vs the number of items rated by a new user for the Netflix, Goodreads and Jester datasets with users clustered into 4 and 16 groups. Data is shown when using a decision tree (DT), the CBB algorithm of [14] and our proposed MCTS-based approach (UCT). To calculate the mean accuracy, 1000 new users are generated for each group and the accuracy averaged over the users and groups, i.e. averaged over 4,000 users for 4 groups and 16,000 for 16 groups. It can be seen that the MCTS approach uniformly achieves a higher accuracy than the decision-tree and CBB approaches for a given number of items rated.
Figure 6 shows typical (i.e. representative of the full data) measurements of the per-group accuracy of the DT, CBB and MCTS approaches. It can be seen that the MCTS approach consistently achieves higher accuracy for every group. The variation in the accuracy across groups is also lower, particularly in comparison to that of the decision-tree approach, e.g. it can be seen in Figure 6(b) that the accuracy can range from about 30% to about 90% when using the decision-tree. Table 3 shows the standard deviations of the per-group accuracies vs the number of items rated by a new user for the Netflix, Goodreads and Jester datasets with 16 groups. Once again, it can be seen that the variation in accuracy across groups is significantly lower with the MCTS approach. Tables 4-6 summarise the measured group estimation accuracy vs the number of items rated by a new user and the number of user groups. Data is shown for 5-25 items and 4-32 groups for the Netflix, Goodreads and Jester datasets. It can be seen that the MCTS-based approach achieves uniformly superior performance to the decision-tree and CBB approaches, i.e. better performance regardless of the number of items the user rates, the number of user groups and the dataset used. The performance improvement is often considerable. For example, with 4 groups and asking the user to rate 5 items, the MCTS accuracy on the Netflix dataset is 0.951 vs a decision-tree accuracy of 0.672 and a CBB accuracy of 0.705. Similarly, for 16 groups and the user rating 15 items, the accuracies are 0.920, 0.631 and 0.854 for MCTS, DT and CBB respectively.
Since the performance of the MCTS approach dominates that of the other algorithms, this performance data indicates that it should always be used in preference to them if higher accuracy is desired.

MCTS Computation Time
Table 7 shows measurements of the average time taken to recommend the next item to a new user. This is the time taken at step t to construct a lookahead subtree containing item sequences that tend to generate high reward; this subtree is then used to select the next item the new user is asked to rate. The data shown makes use of a single CPU core. The MCTS algorithm is massively parallelisable in the sense that computations for different new users can be trivially run in parallel, and so it is linearly scalable with the number of CPU cores. The time taken depends on the lookahead tree expansion factor (i.e. the size of the set of items that a user can be asked to rate), which is why the time is somewhat longer for the Netflix dataset than for Goodreads. It also depends on the MCTS max_turns parameter that determines the number of Monte Carlo samples used to generate the lookahead subtree, shown in Table 7.
In summary, for 8 groups the MCTS-based approach takes around 30ms to recommend an item for a new user to rate, allowing over 30 recommendations per second using a single commodity CPU core. Performance scales linearly with the number of CPU cores.

DISCUSSION

Including Contextual Information
Typically a recommender system will have some contextual information regarding a new user, for example the country/city they are located in, the user's gender and demographics, or data from other services used by the user. The proposed MCTS-based approach can be directly extended to incorporate such contextual information. For example, users sharing the same context (e.g. located in the same country) can be gathered together, the number of user groups and the item rating means and variances for each group estimated, and then the MCTS-based cold-start approach applied to this sub-population. This is similar to the approach used in contextual bandits. Contextual information can also be used to adjust the prior probabilities p_g^(0), g ∈ G, of the group to which the user is initially estimated to belong.

Other Types of Feedback
In this paper we assume that new users provide item ratings as feedback, but our approach can be readily extended to encompass other types of user feedback. For example, a new user might be presented with two items and asked which they prefer. In the MCTS-based approach the nodes in the lookahead tree now correspond to pairs of items, but the approach is otherwise unchanged in principle. It is necessary to be able to estimate the group probabilities p_g^(t), g ∈ G, from user feedback up to time t and to generate synthetic user feedback for the Monte Carlo step in the MCTS algorithm, but standard models can be used for this, e.g. the Bradley-Terry model for paired comparisons.
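For instance, under a Bradley-Terry model each group assigns a positive strength to every item, and one pairwise observation updates the group probabilities by Bayes' rule. The sketch below is our own illustration of this idea, not part of the paper:

```python
def bt_prob(strength_i, strength_j):
    """Bradley-Terry probability that item i is preferred to item j,
    given positive per-item strengths."""
    return strength_i / (strength_i + strength_j)

def update_belief(prior, strengths, winner, loser):
    """One Bayes step over groups from a single observed pairwise
    preference, using group-specific Bradley-Terry strengths.
    prior: dict g -> p_g; strengths: dict g -> dict item -> strength."""
    post = {g: p * bt_prob(strengths[g][winner], strengths[g][loser])
            for g, p in prior.items()}
    z = sum(post.values())
    return {g: p / z for g, p in post.items()}
```

Synthetic feedback for the Monte Carlo step can then be drawn by sampling the preferred item of a pair with probability bt_prob under the sampled group's strengths.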

Offline Training
In the approach considered here a new lookahead subtree is constructed at each step t in order to select the next item to ask a new user to rate. While our measurements show that this tree can be constructed rather quickly, a further computational saving might be possible by storing the generated lookahead subtrees for new users and using these as training data for a neural net. That is, it might be possible to train a neural net offline to capture the decisions made by the MCTS approach and then use this neural net to make online predictions. However, we leave investigation of this interesting topic to future work.

CONCLUSIONS
In this paper we revisit the cold-start task for new users of a recommender system, whereby a new user is asked to rate a few items with the aim of discovering the user's preferences. We propose using a Monte Carlo Tree Search (MCTS) based approach to dynamically select the sequence of items presented to a new user. We find that this MCTS-based cold-start approach is able to consistently and quickly identify the preferences of a user with significantly higher accuracy than either a decision-tree or a state-of-the-art bandit-based approach, without incurring higher regret, i.e. the learning performance is fundamentally superior to that of the state of the art. This boost in recommender accuracy is achieved in a computationally lightweight fashion.

Figure 3: Example illustrating the evolution of the probabilities p_g^(t), g = 1, 2, ..., 8 vs the number of items t rated by a new user in group g = 4. Netflix dataset, |G| = 8 user groups. After rating t = 3 items the estimated probability p_4^(t) of the user being in group 4 is higher than that of the other groups, and p_4^(t) → 1 as t increases.

Figure 3 plots the probabilities p_g^(t) vs the number of items t rated by the user for the Netflix dataset with |G| = 8 user groups. In this example the new user belongs to group g = 4, and p_4^(t) → 1 as t increases while p_g^(t) → 0 for g ≠ 4. Given user group g, we can generate synthetic ratings R(s_{t+2}), ..., R(s_{t_max}) by making a draw from the multivariate Gaussian distribution with mean µ(g, v) and variance σ²(g, v), v ∈ V, of the ratings by users in group g. Note that, alternatively, we could draw ratings from the empirical distribution of ratings for a group (so relaxing the Gaussian assumption), or generate user ratings via a water-filling approach: split the data into training and test data, pick a user from the test data and use their ratings, and when we need a rating for an item that the user has not rated, pick a second user from the same group who has rated the item and merge the pair of user ratings. We found the performance of these setups to be very similar to simply drawing a new user from a Gaussian distribution. With the user ratings R(v_1), ..., R(v_t) and synthetic ratings R(s_{t+2}), ..., R(s_{t_max}) in hand, equation (1) can now be used to calculate the reward

E_G[ Prob(true group is correctly estimated | true group = g, user ratings of items V^(t), next items are s_{t+2}, ..., s_{t_max}) ]

where the expectation is over the groups G.
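The Gaussian draw of synthetic ratings can be sketched as follows; the function name, clipping to a 1-5 rating scale, and the toy per-item statistics are illustrative assumptions:

```python
import numpy as np

def synthetic_ratings(mu_g, sigma_g, items, rng, lo=1.0, hi=5.0):
    """Draw synthetic ratings R(s_{t+2}), ..., R(s_{t_max}) for `items`
    from per-item Gaussians with mean mu_g[v] and std sigma_g[v], clipped
    to the rating scale. mu_g and sigma_g are the per-item rating
    statistics of the drawn group g, estimated from historical users."""
    draws = rng.normal(mu_g[items], sigma_g[items])
    return np.clip(draws, lo, hi)

# usage: hypothetical statistics for one group over three items
mu_g = np.array([4.5, 2.0, 3.2])
sd_g = np.array([0.5, 0.8, 1.0])
r = synthetic_ratings(mu_g, sd_g, items=np.array([0, 2]),
                      rng=np.random.default_rng(0))
```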
Algorithm 3 gives pseudo-code for this Monte Carlo-based reward calculation.

Algorithm 3 calculate_reward
  Input: item sequences V^(t), S
  Draw user group g according to the probability distribution {p_g^(t), g ∈ G}
  Using g, generate a synthetic user rating R(s) for each item s ∈ S
  Calculate p_g^(t_max), g ∈ G using equation (1)
  if g ∈ arg max_{g'∈G} p_{g'}^(t_max) then
      return +1    ▷ Estimated group matches "true" group g
  else
      return 0
  end if
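A minimal Python sketch of Algorithm 3, assuming an independent-Gaussian per-group likelihood as a stand-in for equation (1); the function signature and toy data below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def calculate_reward(p_t, mu, sigma, rated_items, rated_vals, future_items, rng):
    """One Monte Carlo sample of the reward in Algorithm 3: draw a "true"
    group g from p^(t), generate synthetic ratings for the remaining items,
    recompute the posterior over groups, and return 1 if the MAP group
    matches g. mu[g, v] and sigma[g, v] are the per-group item rating
    statistics; the posterior uses an independent-Gaussian likelihood
    per group (a stand-in for equation (1))."""
    g = rng.choice(len(p_t), p=p_t)  # draw the "true" group
    synth = rng.normal(mu[g, future_items], sigma[g, future_items])
    items = np.concatenate([rated_items, future_items])
    vals = np.concatenate([rated_vals, synth])
    # log-likelihood of the full rating sequence under each group
    loglik = -0.5 * (((vals - mu[:, items]) / sigma[:, items]) ** 2
                     + 2 * np.log(sigma[:, items])).sum(axis=1)
    post = np.exp(loglik - loglik.max()) * p_t  # Bayes, stably
    post /= post.sum()
    return 1 if post.argmax() == g else 0

# usage: two well-separated groups; group 0 is certain a priori
mu = np.array([[5.0, 5.0], [1.0, 1.0]])
sigma = np.ones((2, 2))
reward = calculate_reward(np.array([1.0, 0.0]), mu, sigma,
                          np.array([0]), np.array([5.0]), np.array([1]),
                          np.random.default_rng(0))
```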

Figure 4: Group estimation accuracy with and without the rollout step.

Figure 5: Measured mean estimation accuracy vs the number of items rated by a new user. Data is shown for the decision tree (DT), CBB and MCTS-based approaches for the Netflix, Goodreads and Jester datasets with 4 and 16 groups.

Table 1: Number of good distinguisher items vs the number of groups for the Netflix and Jester datasets.

Table 2: Group estimation accuracy vs choice of the max_turns parameter k. Netflix, Goodreads and Jester datasets, |G| = 8 groups, t_max = 5 items rated by a new user.

Table 3: Standard deviation of the measured per-group accuracy for the decision tree (DT) and MCTS approaches. Netflix, Jester and Goodreads data with 16 groups.

Table 4: Mean estimation accuracy vs the number of items rated by a new user and the number of user groups. Netflix movie dataset.

Table 5: Mean estimation accuracy vs the number of items rated by a new user and the number of user groups. Goodreads books dataset.

Table 6: Mean estimation accuracy vs the number of items rated by a new user and the number of user groups. Jester jokes dataset.

Table 7: Mean MCTS computation time to recommend the next item to a new user.