Towards Simulation-Based Evaluation of Recommender Systems with Carousel Interfaces

Offline data-driven evaluation is considered a low-cost and more accessible alternative to the online empirical method of assessing the quality of recommender systems. Despite their popularity and effectiveness, most data-driven approaches are unsuitable for evaluating interactive recommender systems. In this article, we attempt to address this issue by simulating the user interactions with the system as a part of the evaluation process. Particularly, we demonstrate that simulated users find their desired item more efficiently when recommendations are presented as a list of carousels compared to a simple ranked list.


INTRODUCTION
For many years, user studies have been the key approach to evaluating all types of user-adaptive systems, i.e., interactive systems that can adapt their behavior to individual users [10]. While user studies are the ultimate way to assess user-centered systems, these studies are very expensive. It is also a challenge to obtain user study data on a sufficient scale to reliably compare specific user modeling and personalization approaches. In response to these challenges, several research areas focused on user-adaptive and personalized systems established data-driven approaches for evaluating systems in these areas. For example, data-driven evaluation of learner modeling in personalized education systems is based on large collections of student problem-solving traces. The ability to better predict a learner's success in these traces is considered a sign of better quality learner modeling [16,52]. Similarly, data-driven evaluation of recommender systems is based on large volumes of past user rating data. The ability to better approximate user ratings or to place positively rated items higher on the ranked list is considered a sign of better quality recommendation [27,42].
The establishment of data-driven evaluation approaches was very important for the field of recommender systems. Promoted by the Netflix Prize, these approaches helped to engage a large [...] evaluation approach through two case studies. The first case study demonstrates the use of a relatively simple qualitative model of user behavior to compare the navigability of a carousel-based interface and a ranked list. The second case study demonstrates how a more complex quantitative model of user behavior supports more elaborate simulation-based studies that could answer a wider range of research questions. To perform the latter study, we developed and validated a carousel click model, which generalizes the traditional cascade models [14] that model user behavior in a ranked list. Using these click models, we demonstrate how simulation-based evaluation could be used to compare several ranking approaches within the carousel-based interface, as well as to compare the performance of a carousel-based interface and a traditional ranked list.
The article is structured as follows: In Section 2, we review previous research on simulation-based evaluation, interactive recommender systems, and user behavior models. In Section 3, we present our first case study, in which a simulation-based evaluation is performed using a simple qualitative model. Section 4 presents key components of our second case study, including the carousel click model (Section 4.2), its validation (Section 4.3 and Section 4.4), and two examples of simulation-based evaluation enabled by this model (Section 4.5 and Section 4.6). Finally, we discuss our results in Section 5.

RELATED WORK
In this section, we review the related work in three areas of recommender systems: interactive recommender systems, simulation-based studies of recommender systems, and click models in information retrieval and recommender systems. Our goal is to provide a comprehensive overview of relevant research in these areas, highlighting key findings, approaches, and challenges.

Interactive Recommender Systems
For many years, a ranked list of items was the de facto standard for recommender systems to present recommendations and results to their users. With this approach, the power of a recommender system is fully based on the power of its algorithm, leaving the users almost no opportunity to affect the behavior of the system. However, an alternative stream of research on more complex recommender interfaces that enable humans and AI to work together to discover the most relevant items has been present in the field since its early days [5]. Now that the need for human-AI collaboration is broadly accepted in a range of AI-based systems, research on interactive recommender systems, as this group of systems is often called [33], is rapidly expanding. Among the most popular (and not fully disjoint) groups of interactive recommender systems are critiquing-based systems [9], conversational recommenders [50], user-controlled recommenders [37], and visual recommenders that present recommendation results in more than one dimension, such as a grid [69] or a more complex visualization [44,67]. Over the last 20 years, these groups of recommender systems have been explored and their effectiveness has been convincingly demonstrated [3,53,63,68].
Today, the most noticeable group of interactive recommender interfaces is arguably carousel-based interfaces or multilists [2, 20-22, 47, 58]. While the interface with multiple carousels looks relatively complex (it presents several ranked lists, each marked with a category, in place of a single ranked list), it was embraced by end users and is now rapidly replacing the ranked list as the de facto standard for presenting recommendations in e-commerce systems. From the perspective of recommender systems, the carousel-based interface provides an excellent example of human-AI collaboration in the recommendation context. While a single ranked list attempts to be "perfect", in reality, the intent of the user is often uncertain. Most importantly, in many real-life applications users might have multiple interests, and recommender systems rarely know which specific interest (for example, a movie genre) the users want to pursue at the given moment. A carousel-based interface leaves the task of choosing the most timely topic of interest (e.g., British documentaries) to the users. As a result, a user can quickly locate a ranked sub-list of the most relevant items while also indirectly informing the recommender system about the kind of items they prefer right now.
The popularity of carousel-based interfaces has not been ignored by researchers in recommender systems. A growing number of articles focused on carousel-based interfaces have been published in recent years [2,20,24,36,64,69]. However, the evaluation of carousel-based interfaces and, more specifically, their comparison with other kinds of recommenders is still a bottleneck, since it is typically based on expensive user studies. Our work attempts to bridge this gap. In this article, we choose carousel-based recommender interfaces for the two case studies, which we present to demonstrate the application of a simulation-based approach to the evaluation and comparison of recommender systems. To enable simulation-based evaluation of carousel-based interfaces, we also developed a novel carousel click model that can simulate how users interact with topic-labeled carousel interfaces. We hope that our new click model and the presented example of its use for simulation-based evaluation of carousel-based interfaces will facilitate future research in this area.

Simulation-Based Studies of Recommender Systems
A simulation-based approach has been used for exploration and evaluation in a number of fields where sufficiently detailed models of user behavior could be built. In particular, a simulation-based study is a recognized approach for evaluating various types of personalized interactive systems, from adaptive learning systems [7,18] to personalized information access systems [49,55]. The goals of simulation-based evaluation differ between application areas and often depend on the reliability of the behavior models that support simulation. On one end of the spectrum are cognitively grounded behavior models that are supported by studies of human cognition and confirmed by empirical studies. A well-known example is the SNIF-ACT model [55], which simulates user behavior in hypertext navigation. This model is based on Information Scent theory [6] and was used to assess the quality and navigability of Web sites without real users. Popular "simulated student" models [7,48] used for the evaluation of adaptive educational systems also belong to this group. On the other end, there is a range of simple behavior models [19,70] that might not be able to reliably predict the details of user behavior but could be useful to explore a range of "what if" scenarios in assessing the impact of various interface enhancements.
Early attempts to use simulations to explore information filtering and recommender systems were made in the first decade of the 2000s [18,49]. However, it took another 10 years for this approach to become truly noticed in this field [19,32]. Although the volume of simulation-based research in the recommender system context is gradually increasing, simulations are most frequently used to explore the impact of recommender systems on various aspects of user behavior rather than to assess their performance and effectiveness in a comparative way. The most popular research direction enabled by simulation is to examine the impact of a recommender system on various aspects of user behavior as a whole [4,18,31,70]. This work is typically enabled by user choice models [32]. While research on click models reviewed in Section 2.3 offers solid ground for simulation-based studies, there are very few cases where models of user click behavior were used for comparative offline evaluation of recommender system design options. A notable exception is the work of Dzyabura and Tuzhilin [19], who used simulation to compare an interface based on a combination of search and recommendation to interfaces based on search or recommendation alone. However, this work used a relatively simple behavior model that was not based on empirical observations or theory. In our work, we perform simulation-based studies using more complex and empirically grounded models, which increases our chances of obtaining useful and reliable results.
Several other works used simulation-based evaluation for different purposes and in different contexts. Zhang and Balog [71] use simulations to evaluate a conversational recommender system. They take into account both individual preferences and the general flow of engagement with the system to build a simulator that produces replies that a real human would provide. Rohde et al. [61] utilize OpenAI Gym, a popular framework for simulating Reinforcement Learning agents, and create a recommendation environment that is based on a model of user visiting patterns on e-commerce sites and of how people react to recommendations. In a different stream of work, Zou et al. [73] develop a customer simulator known as the World Model, which is designed to imitate the environment and address the selection bias of logged data. Finally, Ie et al. [35] develop a programmable authoring tool for simulation environments for recommender systems called RecSim, which supports sequential user interaction.

Click Models in Information Retrieval and Recommender Systems
The research on click models focuses on modeling and explaining user interaction with a ranked list of search or recommendation results. It started in the field of information retrieval and was originally motivated by the need to improve the performance of Web search engines by applying user click-through data accumulated in search engine logs [72]. While "old school" information retrieval considered item relevance as the only factor determining the user's decision to click on a specific result, it became evident that the position of items in a ranked list has to be considered as well [39]. Moreover, creative experiments demonstrated that a high item position in a ranked list could have a greater impact on click probability than item relevance [40]. A sequence of eye-tracking studies with users of search engines [23,25,26,51] helped to understand how users process a ranked list of results and to measure the impact of item position in the list on the click probability.
This research provided solid ground for developing click models for user interaction with ranked lists [11], which are now actively used in both information retrieval and recommender systems research. There are many click models [1,8,11,15,28,29,60]. Essentially, all of them try to explain the user behavior by a generative model which can be learned from data. As an example, the cascade model [15,60] assumes that the user examines the list of recommended items from top to bottom until they find an attractive item. After that, they click on that item and leave satisfied. This seemingly simple model explains the observed position bias in recommender systems: lower-ranked items are less likely to be clicked than higher-ranked items. Click models can be used to debias click data, and in turn to learn better ranking policies either offline [11,46] or online [12,45]. In this work, we simulate click models to compare the utility of recommendations in carousels with more traditional approaches.

CASE STUDY 1: EXPLORING THE NAVIGABILITY OF CAROUSEL-BASED INTERFACES WITH SIMPLE INTERACTION MODELS
In this section, we demonstrate the importance of using user behavior models in simulation-based studies to gain valuable insights into the behavior of users in interactive recommendation settings. The study compares user interactions with two types of recommendation interfaces (a carousel-based multi-list and a traditional ranked-list interface) from the perspective of navigability, which refers to the ease and efficiency with which users can explore and access information.
In this case study, we demonstrate that relatively simple user behavior models could empower simulation-based studies that produce important and interesting results. Even with relatively simple models, we gain valuable insight into the behavior of users in interactive recommendation settings and identify important factors that influence their dynamics. As the complexity of these systems continues to increase, the use of user behavior models is likely to become an increasingly important tool for researchers and practitioners alike. In this study, we introduce a basic model of user interaction with a carousel-based recommender interface and use it along with a traditional ranked list interaction model (Section 2.3) to compare user work with two types of recommendation interfaces: a carousel-based multi-list and a traditional ranked-list interface. The comparison is performed from the perspective of navigability [17], a popular research stream in the broader area of information access, which explores navigation properties of various information access artifacts and compares different approaches to create these artifacts. In this context, navigability refers to the ease and efficiency with which users can explore and access information. By comparing different approaches to creating navigable information access artifacts, researchers aim to identify the most effective strategies for supporting user navigation and improving overall user experience. The choice of navigability was important to introduce the idea of simulation-based studies to the hypertext research community, where an earlier version of this evaluation was presented [58]. However, navigability is relevant today in a broader set of information-rich environments, where users face an increasing need to efficiently locate and access relevant information amidst a sea of available options. For example, navigability studies have been performed in the past to compare the navigability of different approaches to generate tag clouds [66], to examine the effect of automatic linking on navigability [34], and to compare the navigability of regular and faceted tag clouds [65].
We perform this comparison in a typical modern recommendation context where items could be associated with multiple interests and users could favor several of these interests in parallel (although probably to a different extent and at a different time). Depending on the domain, these interests could have different semantic natures. For example, it could be a movie genre such as action movies, a topic of interest such as context-aware recommendation, or even a specific approach to select items such as most popular or recently added. In all these cases, each carousel represents a specific dimension of user interests. For uniformity, we refer to these generalized interests as topics. Note that some recommender systems could model interests as latent categories rather than explicitly presented, understandable topics. In this article, we focus on domains with explicitly represented interests to separate the problem of latent interest discovery from the problems of user modeling and item ranking. As our data show, even relatively simple models of user behavior could clearly reveal the benefits of carousel-based interfaces in this modern multi-interest context, explaining the rapidly increased popularity of these interfaces.
The presentation of the first case study is structured as follows: We start by introducing basic user behavior models for carousel-based interfaces and 2D ranked lists in Section 3.1. In Section 3.2, we detail our experimental setup. In Section 3.3, we introduce different settings under which we perform comparisons of user navigation in carousel-based interfaces and 2D ranked lists. Finally, we present and discuss our results in Section 3.4.

A Basic Interaction Model for Carousels and 2D Ranked Lists
To quantify the benefit of carousels, we formalize the problem of carousel recommendation using a simple mathematical model, which we call a carousel interaction model. We have a matrix of $m \times K$ recommended items, where $m$ is the number of rows (carousels) and $K$ is the number of columns (items per carousel). Each carousel is associated with some topic, such as a movie genre. To simplify the exposition, we assume that each item belongs to a single topic. We refer to the item at row $i \in [m]$ and column $j \in [K]$ as $(i, j)$.
The user preferences are defined by two sets of probabilities. The first are topic preferences. Specifically, $p_i \geq 0$ is the probability that the user is interested in topic $i$, for any $i \in [m]$. The second set are topic-conditioned item preferences. Specifically, $p_{j|i} \geq 0$ is the conditional probability that the user is interested in item $j$ given that they desire topic $i$, for any $i \in [m]$ and $j \in [K]$. We assume that $\sum_{i=1}^{m} p_i = 1$, and that $\sum_{j=1}^{K} p_{j|i} = 1$ for any topic $i \in [m]$.

The user interacts in the carousel model as follows: First, the desired topic and the item in that topic are realized in the mind of the user, and then the user seeks them. In particular, the desired topic is sampled as $I \sim \mathrm{Cat}((p_i)_{i=1}^{m})$ and the desired item is sampled as $J \sim \mathrm{Cat}((p_{j|I})_{j=1}^{K})$, where $\mathrm{Cat}(\theta)$ is a categorical distribution with outcome probabilities $\theta$. In plain English, exactly one topic is chosen with probability $p_i$, and exactly one item is chosen with probability $p_{j|I}$ conditioned on that topic. An equivalent way to think about this process is that exactly one $(i, j)$ is chosen with probability $p_{i,j} = p_{j|i}\, p_i$. The user seeks the item $(I, J)$ as follows: They start by examining the first carousel. If its topic does not match topic $I$, they proceed to the next carousel. The user examines the carousels, from top to bottom, until they stop at carousel $I$. After that, the user examines the items in carousel $I$, from left to right, until they find the desired item in column $J$.
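The carousel interaction model above can be illustrated with a minimal Python sketch. The function names, the toy preference values, and the choice to count every examined carousel and every examined item as one interaction are assumptions made for illustration; they are not taken from the article.

```python
import random

def sample_target(topic_prefs, item_prefs, rng):
    """Sample the desired (topic, item) pair, both 1-indexed.

    topic_prefs[i] plays the role of p_i and item_prefs[i][j] of p_{j|i}.
    """
    i = rng.choices(range(len(topic_prefs)), weights=topic_prefs)[0]
    j = rng.choices(range(len(item_prefs[i])), weights=item_prefs[i])[0]
    return i + 1, j + 1

def carousel_interactions(i, j):
    """Assumed cost: i carousels examined top to bottom, then j items left to right."""
    return i + j

# Toy example (hypothetical values): 2 topics, 3 items per carousel.
rng = random.Random(0)
topic_prefs = [0.7, 0.3]
item_prefs = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
I, J = sample_target(topic_prefs, item_prefs, rng)
cost = carousel_interactions(I, J)
```

Under this counting convention, the desired item in row 2, column 3 costs 5 interactions regardless of K, which is the property that the comparison with the ranked list relies on.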
To make the comparison of a two-dimensional multi-list and a one-dimensional ranked list clearer and fairer, we represent a traditional single ranked list in a comparable 2D format as the matrix of $m \times K$ recommended items introduced above, which is examined row by row. This approach to presenting a ranked list is becoming popular in modern recommenders due to its space-saving format [57,69]. Applying the traditional models of user work with a ranked list reviewed in Section 2.3 to this 2D presentation format, we obtain the following simple model of user behavior in a 2D ranked list. This user behavior model is a variant of the cascade model [14]. The user starts at position $(1, 1)$. If that item is not desired, the user proceeds to the next item $(1, 2)$. The user examines row 1, from left to right, until the desired item is found or the end of the row is reached. If the end of the row is reached, the user moves to item $(2, 1)$, the first item in the next row. The user then examines this row, from left to right, and this process continues until the desired item is found.
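The row-by-row scan of the 2D ranked list can be sketched in the same way. The sketch below is illustrative only: the function names are hypothetical, and counting one interaction per examined item is an assumption. It walks the matrix explicitly and checks the walk against the closed form $(i - 1)K + j$.

```python
def ranked_list_interactions(i, j, K):
    """Closed form: items examined before reaching (i, j) in a row-major scan."""
    return (i - 1) * K + j

def ranked_list_walk(i, j, m, K):
    """Explicit walk over an m x K matrix, row by row, left to right."""
    steps = 0
    for row in range(1, m + 1):
        for col in range(1, K + 1):
            steps += 1
            if (row, col) == (i, j):
                return steps
    raise ValueError("target is outside the m x K matrix")

# Hypothetical example: the desired item sits at row 2, column 3 of a 4 x 5 matrix.
walk_cost = ranked_list_walk(2, 3, m=4, K=5)
closed_form_cost = ranked_list_interactions(2, 3, K=5)
```

In this example the ranked-list scan needs 8 examinations, since the whole first row must be passed before the second row is reached.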
In case study 1, we do not perform a separate validation of the suggested user behavior models against historical data, since these models are relatively simple and intuitive extensions of previous empirical research [14,60]. However, as shown later in Section 4.3, these models fit well with data obtained in recent studies of carousel-based interfaces [36]. Section 4.3 also shows an example of fitting more complex behavior models to real-world user data.

Experimental Setup: Navigability Simulation
We conduct a series of data-driven experiments to evaluate how our proposed carousel model performs against a standard baseline (single ranked list) model from the perspective of navigability. For our experiments, we choose the domain of movie recommendation. The choice of domain was motivated by two reasons. First, movie recommendation is a good example of a modern context where users can have multiple interests and favor different interests at different times. Second, it is the context where carousels are currently very popular, which makes it easier to simulate realistic carousel-based recommendations.
We use the MovieLens 1M Dataset [30], which consists of 1 million ratings applied to 4,000 movies in 18 genres by 6,000 users. In our experiments, we only utilize information about user ratings and movie genres. We apply a pre-processing step to remove movies with no genres or no ratings.
We assume that the user adopts two distinct browsing behaviors when searching for a movie $(I, J)$, depending on whether the results are presented as a single ranked list or a set of carousels. To generate the recommendations, we consider two sets of probabilities: the topic preferences and the topic-conditioned item preferences. The preferences are computed as follows. The dataset of ratings is a set of tuples $\{(u_t, j_t, r_t)\}_{t=1}^{n}$, where $u_t$ is the index of the user in data point $t$, $j_t$ is the index of the rated movie in data point $t$, and $r_t$ is the corresponding rating. The topic-conditioned item preference reflects how representative the movie is of a genre. We compute it as the sum of all the ratings of the movie over the sum of all the ratings in its genre. Formally, let $G_i$ be the set of all movies of genre $i$. Then, for any movie $j \in G_i$, the topic-conditioned item preference of movie $j$ in genre $i$ is
$$p_{j|i} = \frac{\sum_{t=1}^{n} \mathbb{1}\{j_t = j\}\, r_t}{\sum_{t=1}^{n} \mathbb{1}\{j_t \in G_i\}\, r_t}.$$
We set $p_{j|i} = 0$ for any $j \notin G_i$. For any user $u$, the topic preference reflects how much the user prefers a genre. We compute it as the sum of all ratings of a user in a given genre over all ratings by that user. Formally, the topic preference of user $u$ for genre $i$ is
$$p_i = \frac{\sum_{t=1}^{n} \mathbb{1}\{u_t = u,\ j_t \in G_i\}\, r_t}{\sum_{t=1}^{n} \mathbb{1}\{u_t = u\}\, r_t}.$$
Having a user profile assigned to each user, we generate two sets of recommendations as follows. For the first set, the carousels, we sort the carousels by the topic preferences and then populate each one with movies using the topic-conditioned item preferences. This approach generates a set of carousels, each representing a genre (18 carousels for the 18 genres in the dataset). Each carousel contains all the movies within the represented genre. With an average of more than 335 movies in each genre, we assume that it is realistic for the user to scroll down or right, examine all items, and find the desirable movie. For the second set, the single ranked list, movies are sorted by their scores, where the score of movie $j$ is $\sum_{i=1}^{m} p_{i,j}$. Given the large number of movies in the dataset, we assume that users will be able to navigate through the list by scrolling down to find what they are looking for. This assumption is based on the expectation that users are familiar with browsing behavior and comfortable scrolling through long lists to find relevant items. In this evaluation, the user profile and the recommendations were not affected by further user interactions and remained unchanged throughout all sessions.
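The two preference computations described above can be sketched directly from their verbal definitions. This is a minimal illustration on toy data, not the authors' pipeline: the function names, the `genres_of` mapping, and the toy ratings are all hypothetical. Note that, read literally, the topic preferences of a user sum to one only when every rated movie has a single genre, which is the case in the toy data below.

```python
from collections import defaultdict

def item_preferences(ratings, genres_of):
    """p_{j|i}: summed ratings of movie j over summed ratings of all movies in genre i."""
    movie_sum = defaultdict(float)
    for _, movie, rating in ratings:
        movie_sum[movie] += rating
    genre_sum = defaultdict(float)
    for movie, total in movie_sum.items():
        for genre in genres_of[movie]:
            genre_sum[genre] += total
    return {(movie, genre): movie_sum[movie] / genre_sum[genre]
            for movie in movie_sum for genre in genres_of[movie]}

def topic_preferences(ratings, genres_of, user):
    """p_i for one user: ratings given in genre i over all ratings by that user."""
    total = sum(r for u, _, r in ratings if u == user)
    per_genre = defaultdict(float)
    for u, movie, rating in ratings:
        if u == user:
            for genre in genres_of[movie]:
                per_genre[genre] += rating
    return {genre: s / total for genre, s in per_genre.items()}

# Toy data (hypothetical): tuples (user, movie, rating) and single-genre movies.
ratings = [(0, "a", 4), (0, "b", 2), (0, "c", 2), (1, "a", 2)]
genres_of = {"a": ["Action"], "b": ["Action"], "c": ["Drama"]}
p_item = item_preferences(ratings, genres_of)
p_topic = topic_preferences(ratings, genres_of, user=0)
```

Sorting the genres by `p_topic` and the movies within each genre by `p_item` then gives the carousel layout described in the text.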
We define a session as a single instance of evaluation in which the user seeks a movie $(I, J)$ from the set of recommended results, which can be displayed as a single ranked list or as carousels. The process of simulation is as follows: For each setting, we first generate two sets of recommendations (one using a single ranked list and another using carousels) for every user in the dataset. Next, we run 100 independent sessions for every user. Each session includes selecting a genre, selecting a movie within that genre, and calculating the number of interactions required to reach that movie in both models. We consider the average value of these 100 sessions as the outcome of the experiment for the given user in the given setting.
To simulate user navigation in each session, we assume that the desired genre and a movie of that genre are "realized" in the mind of the user. The desired genre is sampled as $I \sim \mathrm{Cat}((p_i)_{i=1}^{m})$ and the desired movie is sampled as $J \sim \mathrm{Cat}((p_{j|I})_{j=1}^{K})$. In each session, the user is only interested in a single genre and a single movie within that genre.
There are many ways to measure the complexity of the interaction with the recommended items in a single ranked list and in carousels. In this case study, we employ a custom metric to evaluate our proposed approach. We define the exiting probability, which determines on average what proportion of users left the session after a certain number of interactions.
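One way to read the exiting probability is as the empirical distribution of the interaction count at which sessions end. The sketch below follows that reading; the function name, the input format (one exit step per session), and the toy values are assumptions, since the article does not give a formula for the metric.

```python
from collections import Counter

def exiting_probability(exit_steps, max_steps):
    """Fraction of sessions that ended after exactly s interactions, for s = 1..max_steps."""
    counts = Counter(exit_steps)
    n = len(exit_steps)
    return [counts.get(s, 0) / n for s in range(1, max_steps + 1)]

# Hypothetical exit steps recorded for four simulated sessions.
probs = exiting_probability([1, 2, 2, 4], max_steps=4)
```

Plotting such a vector for the carousel and the ranked-list simulations gives curves of the kind compared in the results.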

Experimental Conditions
We compare carousel-based and 2D ranked-list interfaces in three increasingly realistic settings, described below.

Ideal Setting.
In the first setting, we assume that the user continues to examine topics and items until the desired item is found. The behavior of such a user is described in Section 3.

We are aware that this browsing behavior is unlikely to occur in a realistic situation due to the position bias effect [14]. However, we include this setting in our evaluation to highlight the difference between this and other more realistic behavioral patterns.

Impatient User.
To better model the browsing behavior of a real user, we assume that the user has limited patience to find the desired item. We implemented this behavior as follows: The user starts by examining the first topic or the item at position $(1, 1)$. The user exits with a probability of $p_q = 0.02$ after examining a carousel or item. This means that users abandon the session after 50 interactions on average when no items or topics are desirable. This is the same as the ideal setting except for exiting with probability $p_q = 0.02$ upon each examination of a carousel or an item.
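The impatient user can be sketched as a scan that may terminate after every unsuccessful examination. The sketch is illustrative: the function names are hypothetical, and it assumes that the exit check happens only after an examination that did not reach the target. The value 50 quoted above is the mean $1/p_q$ of the resulting geometric distribution when nothing is desirable.

```python
import random

def impatient_session(target_step, p_q, rng):
    """Scan until the target (at examination number target_step) or an early exit.

    Returns (number of examinations, whether the target was reached).
    """
    step = 0
    while True:
        step += 1
        if step == target_step:
            return step, True
        if rng.random() < p_q:  # assumed: exit check after each unsuccessful examination
            return step, False

def prob_reach(target_step, p_q):
    """Probability of surviving the target_step - 1 exit checks before the target."""
    return (1.0 - p_q) ** (target_step - 1)

p_q = 0.02
expected_patience = 1.0 / p_q  # mean number of examinations if nothing is desirable
rng = random.Random(0)
steps, found = impatient_session(target_step=10, p_q=p_q, rng=rng)
```

Because the survival probability decays geometrically with the position of the target, an interface that reaches the target in fewer examinations benefits directly under this setting.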

Distracted User.
We initially assumed that the user always knew which carousel (with a genre as a topic) includes the desired movie. However, in reality, the user might get distracted and, as a result, begin browsing the wrong carousel or pass the correct carousel and miss out on finding the desired item. We consider this assumption to be an extension of the previous assumption described in Section 3.3.2.
In both previous settings, when the user examines an undesirable carousel, they move to the next carousel with probability 1. We define $p_d = 0.05$ as the distraction probability. Here, the user moves to the next carousel with probability $1 - p_d$ and starts examining items in the undesirable carousel with probability $p_d$. Similarly, when the user examines a desirable carousel, they move to the next carousel with probability $p_d$ and begin examining items in the desirable carousel with probability $1 - p_d$. The distraction assumption applies only to carousels. Including this assumption in carousels allows us to capture the complexity that comes with providing additional information to the user in the form of carousel topics. Due to the lack of a large enough dataset to accurately estimate the parameters of our proposed settings, we set the values of $p_q$ and $p_d$ intuitively based on how we presume the user would behave under those settings.
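The carousel-level decision of a distracted user can be sketched as follows. The sketch only models which carousel the user enters; what happens after entering a wrong carousel is not specified in the text and is deliberately left out. The function names are hypothetical.

```python
import random

def entered_carousel(target, m, p_d, rng):
    """Return the 1-indexed carousel the user starts browsing, or None if all are passed."""
    for c in range(1, m + 1):
        # Desired carousel: entered with probability 1 - p_d; any other: with probability p_d.
        enter_prob = (1.0 - p_d) if c == target else p_d
        if rng.random() < enter_prob:
            return c
    return None

def prob_enter_correct(target, p_d):
    """Skip target - 1 wrong carousels, then enter the desired one."""
    return (1.0 - p_d) ** target

p_d = 0.05
rng = random.Random(0)
choice = entered_carousel(target=3, m=18, p_d=p_d, rng=rng)
```

With $p_d = 0.05$ and the desired carousel in the third row, the user enters the correct carousel with probability $0.95^3 \approx 0.857$ under this sketch.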

Results
To compare the behavior of our model in more realistic settings, we visualize the average exiting probability of users after a certain number of interactions with the recommendations in Figures 1(a) and 1(b).
In Figure 1(a), we observe a significant difference (independent t-test, p < 0.001) between the carousels and the single ranked list under the ideal setting. In the ideal setting, the user continues the examination until they reach the desirable item. We limit the number of interactions to 50, meaning that the user exits unsatisfied if they cannot find the desirable item in the first 50 interactions. The higher exiting probability in carousels (blue line) shows that more users exit the system satisfied by finding their desirable item. A larger spike in exiting probability for the single ranked list at the end indicates a higher number of users who left without finding their desired item. It should be noted that, based on the result of this experiment, a significantly larger portion of users (just under 80%) exit the system after finding their desired item when recommendations are presented as carousels. This number drops to close to 40% when recommendations are presented in the form of a ranked list.
The exiting behavior of the simulated Impatient and Distracted users is shown in Figure 1(b). It is worth noting that in the ideal setting, the exiting probability can be considered a positive metric when the user finds the desired item after examination, but in the distracted and impatient settings, the exiting probability could be an indication of either satisfactory or unsatisfactory results. In our experiments, we only compare the exiting probability under comparable settings. Although the gap between the probability of exiting the session in the carousel and single ranked list models is less prominent, the former still performs better. The results of an independent t-test did not show a statistically significant difference between the models. Comparing the Impatient and Distracted exiting behavior indicates an insignificant difference between the two settings but shows a slight decrease in performance in carousels. Unlike in Figure 1(a), where the exiting probability reflects a positive event (satisfaction of finding the desirable item), in Figure 1(b), there can also be adverse reasons for exiting a session, such as "impatience" and "distraction". Therefore, the improvement in this metric compared to the ideal setting is not necessarily a positive sign. Despite this, since we compare carousels and the single ranked list in Figure 1(b) under the same setting, where the probabilities of "distraction" and "impatience" are similar, an improvement in the metric likely signals a positive event.

CASE STUDY 2: EXPLORING CAROUSEL-BASED INTERFACES WITH ADVANCED CLICK MODELS OF USER BEHAVIOR
Our second case study demonstrates the application of click models, which are more advanced and precise models of user behavior, to simulation-based studies of carousel-based recommender systems. Extending our earlier work [56], the second study advances our first case study in three important directions. First, expanding the work on traditional click models, we develop a novel Carousel Click Model (CCM) that enables more advanced simulation-based studies of carousel interfaces. Second, we demonstrate how click models of user behavior could be validated using both existing empirical data and simulations. Third, we present several examples of simulation-based studies enabled by CCM. In particular, we demonstrate how the application of a more advanced behavior model could expand the range of research questions to be answered by a simulation-based study and enable the application of more precise (and traditional) evaluation metrics such as click probability.
The presentation of the second case study is structured as follows. First, in Section 4.1, we introduce two click models that simulate user behavior while browsing a ranked list: the standard Cascade Model [15,60] and its extension, a Terminating Cascade Model (TCM), which introduces dependence on the order of items in the ranked list. Next, in Section 4.2, we introduce a Carousel Click Model (CCM), which allows us to simulate the user browsing behavior in a two-dimensional and topic-oriented presentation of recommendation results used by carousel-based interfaces. Then, we validate our model based on the fit to the real-world data and robustness in Sections 4.3 and 4.4, respectively. Finally, we demonstrate how the developed models of user behavior could be used to perform a simulation-based evaluation of specific recommender interfaces. In Section 4.5, we use CCM as a simulator to compare several ways to rank items in carousels. In Section 4.6, we use both CCM and TCM for a fine-grained comparison of the user behavior in the carousel and ranked list interfaces.

Ranked List Click Models
In this section, we introduce two models that allow one to simulate user behavior in a ranked list. We start by explaining the traditional Cascade Model and suggest its extension into a Terminating Cascade Model, which enables a more realistic simulation.

Cascade Model.
The cascade model (CM) is a popular model of user behavior in a ranked list. In this model, the user is recommended a list of $K$ items. The user examines the list from the first item to the last, and clicks on the first attractive item in the list. The items before the first attractive item are not attractive, because the user examines them, but does not click on them. Items after the first attractive item are not observed because the user never examines them.
The user's preference for the item $a \in E$ in the cascade model is represented by its attraction probability $p_a \in [0, 1]$. The attraction of item $a$ is a random variable defined as $Y_a \sim \mathrm{Ber}(p_a)$. Fix list $A$. The click on position $k$ is denoted by $C_k$ and defined as
$$C_k = E_k Y_{A_k},$$
where $E_k$ is an indicator that position $k$ is examined. By definition, the position is examined only if none of the higher-ranked items is attractive. Thus,
$$E_k = \prod_{\ell=1}^{k-1} (1 - Y_{A_\ell}),$$
and the probability of a click on position $k$ is
$$P_{\mathrm{cm}}(A, k) = \mathbb{E}[C_k] = p_{A_k} \prod_{\ell=1}^{k-1} (1 - p_{A_\ell}).$$
In turn, the probability of a click on list $A$ is
$$P_{\mathrm{cm}}(A) = \sum_{k=1}^{K} P_{\mathrm{cm}}(A, k) = 1 - \prod_{k=1}^{K} (1 - p_{A_k}). \tag{1}$$
This model has two notable properties. First, since (1) increases whenever a less attractive item in $A$ is replaced with a more attractive item from $E$, the optimal list in the CM, $A^* = \arg\max_A P_{\mathrm{cm}}(A)$, contains the $K$ most attractive items. Therefore, it can be easily computed. Second, the click probability $P_{\mathrm{cm}}(A, k)$ can be used to assess whether a model reflects the ground truth. In particular, let $\mu \in [0, 1]^K$ be the frequency of observed clicks on all positions in list $A$. Then the click model is a good model of reality if $\mu$ resembles the output of the model. This similarity can be measured in many ways, and we adopt the total variation distance of click probabilities, $\frac{1}{2} \| P_{\mathrm{cm}}(A, \cdot) - \mu \|_1$, in this work.
A click model is a mapping from items in a ranked list to probabilities of interaction with them. For a click model $M$ and list $A$, let $P_M(A)$ denote how engaging the list $A$ is under model $M$, such as the click probability $P_{\mathrm{cm}}(A)$ in (1). Click models can be used to answer several types of queries. Computation of $P_M(A)$ is the evaluation of how engaging list $A$ is under model $M$. A natural extension is a comparison of two lists under a fixed model. Specifically, if $P_M(A) > P_M(A')$ for lists $A$ and $A'$, list $A$ is more engaging than list $A'$ under model $M$. Finally, we can also compare the same list under two different models. In particular, if $P_M(A) > P_{M'}(A)$ for models $M$ and $M'$, list $A$ is more engaging under model $M$ than under $M'$.
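The cascade model quantities above can be sketched in a few lines of Python. This is only an illustration: the function names and the example probabilities are ours, and the code assumes independent attractions, so that the click probability at position $k$ is $p_{A_k} \prod_{\ell<k} (1 - p_{A_\ell})$.

```python
from math import prod


def cm_click_probs(p):
    """Per-position click probabilities of the cascade model.

    p[k] is the attraction probability of the item at (0-based) position k.
    """
    probs, none_attractive_so_far = [], 1.0
    for p_k in p:
        # Position k is clicked if it is attractive and nothing above it was.
        probs.append(none_attractive_so_far * p_k)
        none_attractive_so_far *= 1.0 - p_k
    return probs


def cm_list_click_prob(p):
    """Probability of a click anywhere on the list: 1 - prod(1 - p_k)."""
    return 1.0 - prod(1.0 - p_k for p_k in p)


def total_variation(model_probs, mu):
    """Total variation distance between model and observed click frequencies."""
    return 0.5 * sum(abs(a - b) for a, b in zip(model_probs, mu))


# Hypothetical attraction probabilities and observed click frequencies.
attraction = [0.5, 0.2, 0.1]
observed = [0.45, 0.15, 0.05]
fit = total_variation(cm_click_probs(attraction), observed)
```

Reversing `attraction` changes the per-position probabilities but leaves `cm_list_click_prob` unchanged, since the list-level click probability in the cascade model depends only on which items are shown.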

Terminating Cascade Model.
One shortcoming of the cascade model is that the order of the items in the optimal list does not affect the click probability.This is why extending this model to structured problems is difficult, because the position of the item does not matter.To introduce dependence on the order of items, we modify the CM as follows: When the examined item is not attractive, the user leaves unsatisfied with termination probability p q ∈ [0, 1].It models a situation where the user gets tired after examining unsatisfactory items (Figure 2).We call this model a terminating cascade model (TCM).
TCM is one of many extensions of the cascade model [11], such as the user browsing model.The closest related extension is the dependent click model [29], where the user may not leave satisfied after an item is clicked.This model explains multiple clicks.In comparison, we model a user that may leave unsatisfied without clicking.Our model can also be viewed as an instance of the dynamic Bayesian network model [8], where the click probability decreases with the number of examined items.
Fix list $A$. Let $Q_k$ be an indicator that the user leaves unsatisfied at position $k$, which is defined as $Q_k \sim \mathrm{Ber}(p_q)$. Then the click on position $k$ is defined as
$$C_k = E_k Y_{A_k},$$
where $E_k$ is an indicator that position $k$ is examined. Since the position is examined only if none of the higher-ranked items is attractive, and the user does not leave unsatisfied upon examining these items, we have
$$E_k = \prod_{\ell=1}^{k-1} (1 - Y_{A_\ell})(1 - Q_\ell).$$
Thus, the probability of a click on position $k$ is
$$P_{\mathrm{tcm}}(A, k) = \mathbb{E}[C_k] = (1 - p_q)^{k-1} p_{A_k} \prod_{\ell=1}^{k-1} (1 - p_{A_\ell}), \tag{2}$$
and that on the list $A$ is
$$P_{\mathrm{tcm}}(A) = \sum_{k=1}^{K} (1 - p_q)^{k-1} p_{A_k} \prod_{\ell=1}^{k-1} (1 - p_{A_\ell}). \tag{3}$$
This model behaves similarly to the CM and has all of its desired properties. First, the optimal list in the TCM, $A^* = \arg\max_A P_{\mathrm{tcm}}(A)$, contains the $K$ most attractive items in descending order of their attraction probabilities. Order matters because the position $k$ in (3) is discounted by $(1 - p_q)^{k-1}$. Interestingly, this list is invariant to the value of the termination probability, as long as $p_q \in (0, 1)$. Second, since $P_{\mathrm{tcm}}(A, k)$ can be easily computed for any list $A$ and position $k$, the fit of the TCM to empirical click probabilities can be evaluated as in Section 4.1, using the total variation distance.
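To make the role of the termination probability concrete, the following Python sketch evaluates the TCM click probabilities. It assumes the per-position form $(1 - p_q)^{k-1} p_{A_k} \prod_{\ell<k} (1 - p_{A_\ell})$ implied by the model description; the function names and numbers are illustrative only.

```python
def tcm_click_probs(p, p_q):
    """Per-position click probabilities of the terminating cascade model.

    p[k] is the attraction probability at (0-based) position k and p_q is
    the probability of leaving after examining an unattractive item.
    """
    probs, reach = [], 1.0  # reach = probability that position k is examined
    for p_k in p:
        probs.append(reach * p_k)
        # To move on, the item must be unattractive and the user must stay.
        reach *= (1.0 - p_k) * (1.0 - p_q)
    return probs


def tcm_list_click_prob(p, p_q):
    """Probability of a click anywhere on the list."""
    return sum(tcm_click_probs(p, p_q))


# The same two hypothetical items in descending and ascending order.
descending = tcm_list_click_prob([0.5, 0.1], p_q=0.1)
ascending = tcm_list_click_prob([0.1, 0.5], p_q=0.1)
```

With `p_q=0.1` the descending order yields the higher click probability, whereas with `p_q=0` the model reduces to the cascade model and the two orders tie.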

Carousel Click Model
In this section, we introduce a carousel click model (CCM), which we developed to simulate user behavior in carousel-based interfaces (Figure 3). CCM is a natural extension of the single-list cascade models to the multi-list interfaces (i.e., carousels). For the purpose of modeling, a carousel-based interface could be represented as a matrix $A = (A_{i,j})_{i \in [m], j \in [K]}$, where $m$ is the number of carousels (rows), $K$ is the number of items per carousel (columns), and $A_{i,j}$ is the recommended item at position $(i, j)$. To simplify notation, we denote carousel $i$ in matrix $A$ by $A_{i,:} = (A_{i,j})_{j=1}^{K}$. We assume that no item is in more than one carousel, that is, $A_{i,j} \neq A_{i',j'}$ for any $(i, j) \neq (i', j')$.

User Behavior Assumptions.
The assumptions of the overall user behavior in CCM extend the assumptions established for the simpler navigation-focused model in Section 3.1. The user examines the recommended matrix $A$ from the first carousel until the last. When carousel $i$ is attractive, meaning that at least one item in $A_{i,:}$ is attractive, the user starts to examine it and clicks on the first attractive item in it. To guarantee that the user can recognize an attractive carousel without examining it, we label the carousels with the topics of items in them. In this case, we can think of the user as having topics of interest on their mind and examining the first carousel with a matching topic. When carousel $i$ is not attractive, meaning that no item in $A_{i,:}$ is attractive, the user proceeds to the next carousel $i + 1$. When the user examines an unattractive carousel or item, they leave unsatisfied with probability $p_q \in [0, 1]$, similar to the terminating cascade model in Section 4.1.2. Since each carousel is associated with a topic, the items in each carousel need to be semantically related. This amounts to a constraint on the items that can be in $A_{i,:}$.
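These behavior assumptions can be turned into a small Monte Carlo simulator. The sketch below is one possible reading of the assumptions, not a reference implementation: the function names, the session count, and the example matrix are ours, and attractions and termination events are assumed to be independent Bernoulli draws.

```python
import random


def simulate_session(attraction, p_q, rng):
    """Simulate one user session on a matrix of carousels.

    attraction[i][j] is the attraction probability of the item at (i, j).
    Returns the clicked position (i, j), or None if the user leaves
    without clicking.
    """
    for i, row in enumerate(attraction):
        # Draw the attraction of every item in carousel i.
        attractive = [rng.random() < p for p in row]
        if not any(attractive):
            # Unattractive carousel: leave with probability p_q,
            # otherwise proceed to the next carousel.
            if rng.random() < p_q:
                return None
            continue
        # Attractive carousel: scan its items from left to right.
        for j, is_attractive in enumerate(attractive):
            if is_attractive:
                return (i, j)
            # Unattractive item: leave with probability p_q.
            if rng.random() < p_q:
                return None
    return None


def estimate_click_rate(attraction, p_q, sessions=20000, seed=0):
    """Fraction of simulated sessions that end with a click."""
    rng = random.Random(seed)
    clicks = sum(
        simulate_session(attraction, p_q, rng) is not None
        for _ in range(sessions)
    )
    return clicks / sessions


# Hypothetical 2 x 2 matrix of attraction probabilities.
rate = estimate_click_rate([[0.5, 0.2], [0.4, 0.1]], p_q=0.1)
```

For this small matrix, the simulated click rate should land close to the analytical click probability of the model (about 0.753), which gives a quick sanity check of both the simulator and the closed form.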

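Click Probability.
Fix matrix $A$. Let $Q_i \sim \mathrm{Ber}(p_q)$ be an indicator that the user leaves unsatisfied after examining the carousel $i$. Let $Q_{i,j} \sim \mathrm{Ber}(p_q)$ be an indicator that the user leaves unsatisfied after examining the item at position $(i, j)$. Then the indicator of a click on matrix $A$ can be written as
$$C = \sum_{i=1}^{m} E_i \sum_{j=1}^{K} Y_{A_{i,j}} \prod_{\ell=1}^{j-1} (1 - Y_{A_{i,\ell}})(1 - Q_{i,\ell}),$$
where
$$E_i = \prod_{\ell=1}^{i-1} (1 - Q_\ell) \prod_{j=1}^{K} (1 - Y_{A_{\ell,j}})$$
is an indicator that carousel $i$ is examined. The algebraic form of $E_i$ follows from the fact that event $E_i$ can occur only if all higher-ranked carousels are unattractive and the user does not leave unsatisfied after examining them. Thus, the probability of a click on matrix $A$ is
$$P_{\mathrm{ccm}}(A) = \sum_{i=1}^{m} P(E_i = 1) \, P_{\mathrm{tcm}}(A_{i,:}), \tag{4}$$
where
$$P(E_i = 1) = (1 - p_q)^{i-1} \prod_{\ell=1}^{i-1} \prod_{j=1}^{K} (1 - p_{A_{\ell,j}}) \tag{5}$$
is the probability that carousel $i$ is examined.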

Empirical Fit.
Similarly to the CM and TCM, we also have a closed form for the click probability on position $(i, j)$,
$$P_{\mathrm{ccm}}(A, (i, j)) = P(E_i = 1) \, P_{\mathrm{tcm}}(A_{i,:}, j). \tag{6}$$
This can be used to assess if a model reflects the ground truth. In particular, let $\mu \in [0, 1]^{m \times K}$ be the frequency of observed clicks on all entries of matrix $A$. Then, if we treat $P_{\mathrm{ccm}}(A, \cdot)$ and $\mu$ as vectors, the total variation distance of the click probabilities $\frac{1}{2} \| P_{\mathrm{ccm}}(A, \cdot) - \mu \|_1$ measures whether the CCM is a good model of reality.
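A possible implementation of the per-position CCM click probabilities is sketched below. It assumes the factorization described above: the probability of a click at $(i, j)$ is the probability that carousel $i$ is examined times the TCM click probability at position $j$ within that carousel. The function names and example values are illustrative.

```python
def ccm_click_probs(attraction, p_q):
    """Per-position click probabilities of the carousel click model.

    attraction[i][j] is the attraction probability of the item at (i, j);
    p_q is the termination probability for both carousels and items.
    """
    probs, examine = [], 1.0  # examine = probability carousel i is examined
    for row in attraction:
        row_probs, reach, none_attractive = [], 1.0, 1.0
        for p in row:
            # TCM within the carousel, scaled by the examination probability.
            row_probs.append(examine * reach * p)
            reach *= (1.0 - p) * (1.0 - p_q)
            none_attractive *= 1.0 - p
        probs.append(row_probs)
        # The next carousel is reached only if this one had no attractive
        # item and the user did not leave after seeing it.
        examine *= (1.0 - p_q) * none_attractive
    return probs


def total_variation(model, mu):
    """Total variation distance between two click-probability matrices."""
    return 0.5 * sum(
        abs(a - b) for row_a, row_b in zip(model, mu) for a, b in zip(row_a, row_b)
    )


# Hypothetical 2 x 2 example.
click_probs = ccm_click_probs([[0.5, 0.2], [0.4, 0.1]], p_q=0.1)
list_click_prob = sum(sum(row) for row in click_probs)
```

Summing the returned matrix gives the probability of a click anywhere on the matrix, and `total_variation` compares the matrix with empirical click frequencies of the same shape.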

Optimal Solution.
The optimal solution in the CCM does not have a closed form anymore. Nevertheless, we can still characterize some of its properties. Specifically, in the optimal matrix $A^* = \arg\max_A P_{\mathrm{ccm}}(A)$, the items in each carousel must be ordered from the highest attraction probability to the lowest. This can be seen as follows. For any matrix $A$ and carousel $i$ in it, $P_{\mathrm{tcm}}(A_{i,:})$ in (4) has the highest value when the attraction probabilities in $A_{i,:}$ are in descending order. This argument is analogous to that in the TCM (Section 4.1.2). Reordering items within a carousel has no impact on the examination probability of any carousel in (5), because that probability depends only on which items the higher-ranked carousels contain and not on their order. Regarding the order of carousels, we approximate $A^*$ by presenting the carousels in the descending order of their total attraction probabilities, $\sum_{j=1}^{K} p_{A_{i,j}}$. This guarantees that carousels with more attractive items are presented first, which minimizes the probability of users leaving unsatisfied.
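The ordering heuristic described above can be sketched as follows. The assignment of items to carousels is taken as given (it is fixed by the topics); only the order within and across carousels is changed. The helper names are ours, and `ccm_click_prob` is a compact evaluation of the CCM click probability under the assumptions of this section.

```python
def order_carousels(attraction):
    """Approximate the optimal matrix: sort items within each carousel by
    attraction probability, then sort carousels by total attraction."""
    rows = [sorted(row, reverse=True) for row in attraction]
    return sorted(rows, key=sum, reverse=True)


def ccm_click_prob(attraction, p_q):
    """Probability of a click anywhere on the matrix under the CCM."""
    total, examine = 0.0, 1.0
    for row in attraction:
        reach, none_attractive = 1.0, 1.0
        for p in row:
            total += examine * reach * p
            reach *= (1.0 - p) * (1.0 - p_q)
            none_attractive *= 1.0 - p
        examine *= (1.0 - p_q) * none_attractive
    return total


# Hypothetical unordered matrix of attraction probabilities.
unordered = [[0.1, 0.3], [0.5, 0.2]]
ordered = order_carousels(unordered)
before = ccm_click_prob(unordered, p_q=0.1)
after = ccm_click_prob(ordered, p_q=0.1)
```

In this example the reordered matrix has a higher click probability than the original one. Sorting carousels by total attraction is a heuristic, so it is not guaranteed to be optimal in general.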

CCM vs. TCM.
To stress the difference between CCM and TCM, it is useful to show that carousel click model can lead to higher click probabilities than the TCM, under the assumption that the attraction and termination probabilities in both models are comparable.Because the parameters of the models are comparable, this shows that the structure can be beneficial in recommendations.
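We compare the TCM and CCM under the assumption that all attraction probabilities are identical and small. Specifically, let $p_a = p$ for all items $a$ and $p = O(1/Km)$. Then $(1 - p)^k = O(1)$ for any $k \in [Km]$. In the TCM, we view $A$ as a single ranked list of $Km$ items, read carousel by carousel, so that the item at position $(i, j)$ has rank $K(i - 1) + j$. Then
$$P_{\mathrm{tcm}}(A) = \sum_{i=1}^{m} \sum_{j=1}^{K} (1 - p_q)^{K(i-1) + (j-1)} \, p \, (1 - p)^{K(i-1) + (j-1)},$$
$$P_{\mathrm{ccm}}(A) = \sum_{i=1}^{m} \sum_{j=1}^{K} (1 - p_q)^{(i-1) + (j-1)} \, p \, (1 - p)^{K(i-1) + (j-1)}.$$
The two expressions differ only in the exponent of $1 - p_q$. Now when we bound $i - 1$ in $P_{\mathrm{ccm}}(A)$ as $i - 1 \leq K(i - 1)$, the two objectives become equal. Since $1 - p_q \leq 1$ and $i - 1 \geq 0$, we get $P_{\mathrm{tcm}}(A) \leq P_{\mathrm{ccm}}(A)$. The improvement is due to the fact that the user is much less likely to leave unsatisfied with the CCM.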
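The inequality can also be checked numerically. The sketch below evaluates both models for identical attraction probabilities, under the assumption that the TCM reads the matrix as one list of $Km$ items, carousel by carousel. The function names and the chosen values of $m$, $K$, $p$, and $p_q$ are illustrative.

```python
def tcm_uniform(p, p_q, m, K):
    """TCM click probability on a single list of m * K identical items."""
    return sum(p * ((1.0 - p) * (1.0 - p_q)) ** k for k in range(m * K))


def ccm_uniform(p, p_q, m, K):
    """CCM click probability on m carousels of K identical items."""
    total = 0.0
    for i in range(m):
        # Carousel i is examined if the i carousels above it were all
        # unattractive and the user did not leave after any of them.
        examine = (1.0 - p_q) ** i * (1.0 - p) ** (K * i)
        for j in range(K):
            total += examine * p * ((1.0 - p) * (1.0 - p_q)) ** j
    return total


m, K = 5, 4
p = 1.0 / (K * m)
gaps = {p_q: ccm_uniform(p, p_q, m, K) - tcm_uniform(p, p_q, m, K)
        for p_q in (0.0, 0.01, 0.1, 0.5)}
```

The gap is zero for `p_q = 0`, where neither model loses users, and positive otherwise, which matches the argument that the carousel structure reduces the number of termination opportunities before a click.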

Validation of ccm: Fit to Real-World Data
To study how well ccm and tcm model user behavior, we fit them to an existing dataset of real user interactions.The dataset was collected by Jannach et al. [36] to assess the effect of different design choices on human decision-making.It contains n = 776 instances of clicks on recommendations presented in two settings, ranked list and carousels, with a comparable number of clicks in each setting.Despite its small size, we found that this dataset is the only publicly available data source of human interaction with a carousel-based interface, which can be used to validate ccm.
The dataset has two parts.In both parts, the recommended items are presented in m = 5 rows and K = 4 items per row.In the first part (conditions 1 to 4), the items are presented as a single ranked list, row by row.We hypothesize that the user scans these items row by row, from left to right, and clicks on the first attractive item.We call these data ranked list.The second part of the data (conditions 5 to 8) is similar to the first except that each row is labeled with the topic of items in that row, such as "Action Movies".This can be viewed as a list of carousels.We hypothesize that the user scans the carousels from top to bottom and stops at the first attractive topic.After that, they examine the items within that carousel from left to right and click on the first attractive item.We call these data carousel.
In both the ranked list and the carousel data, we compute empirical click probabilities for all positions and plot their logarithm in Figure 4.A closer look at these probabilities reveals different user interaction patterns in the two settings.In the carousel data, we observe more clicks in the first column, which represents the first items in all carousels.This is in contrast to ranked list data, where clicks are more concentrated in the first row, which represents the highest positions in the ranked list.The difference between the interactions is consistent with our proposed mathematical models, and we provide more quantitative evidence below.
One challenge in evaluating our mathematical models is that most items in our dataset appear only once. Therefore, it is impossible to accurately estimate their attraction probabilities. However, we know that the items are recommended in decreasing relevance order. Therefore, we parameterize the attraction probabilities of the items as follows. In the TCM, the click probability at position $(i, j)$ is computed as in (2), where $k = K(i - 1) + j$ and $p_{A_k} = p_0 \gamma^{k-1}$. Here $p_0 \in [0, 1]$ is the highest attraction probability and $\gamma \in [0, 1]$ is its discount factor. Note that the attraction probability decreases with the rank of the item in the list, which is presented as a matrix. We denote the click probability at position $(i, j)$ by $\mu_{\mathrm{tcm}}(i, j; p_0, \gamma, p_q)$, with parameters $p_0$, $\gamma$, and $p_q$. In the CCM, the click probability at position $(i, j)$ is calculated as in (6), where $p_{A_{i,j}} = p_0 \gamma^{i+j-2}$ and all parameters are defined as in the TCM. Again, the attraction probability decreases with the number of rows and columns, and we denote the click probability at position $(i, j)$ by $\mu_{\mathrm{ccm}}(i, j; p_0, \gamma, p_q)$.
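Let $\mu_{\mathrm{list}} \in [0, 1]^{m \times K}$ and $\mu_{\mathrm{carousels}} \in [0, 1]^{m \times K}$ denote the matrices of empirical click probabilities estimated from the ranked list and carousel data (Figure 4), respectively. For each model $M \in \{\mathrm{tcm}, \mathrm{ccm}\}$ and dataset $D \in \{\mathrm{list}, \mathrm{carousels}\}$, we compute
$$\min_{(p_0, \gamma, p_q) \in [0, 1]^3} \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{K} \left| \mu_M(i, j; p_0, \gamma, p_q) - \mu_D(i, j) \right|.$$
This quantity measures the fit between the hypothesized model, represented by $\mu_M(\cdot, \cdot; p_0, \gamma, p_q)$ and optimized over $(p_0, \gamma, p_q) \in [0, 1]^3$, and the empirical evidence, $\mu_D$. We approximate the exact minimization over $[0, 1]^3$ by grid search, where the grid resolution is 0.01. We report all total variation distances in Table 1.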
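The fitting procedure can be sketched as a plain grid search. The code below is a simplified illustration and not the code used for Table 1: it uses a coarser grid (0.05 instead of 0.01) to keep the example fast, and it fits synthetic click probabilities generated from the TCM itself rather than the real dataset. All function names are ours.

```python
from itertools import product


def tcm_matrix(p0, gamma, p_q, m=5, K=4):
    """Model click probabilities when the matrix is read as one ranked
    list, row by row, with attraction p0 * gamma**(k - 1) at rank k."""
    out, reach = [], 1.0
    for i in range(m):
        row = []
        for j in range(K):
            p = p0 * gamma ** (K * i + j)
            row.append(reach * p)
            reach *= (1.0 - p) * (1.0 - p_q)
        out.append(row)
    return out


def ccm_matrix(p0, gamma, p_q, m=5, K=4):
    """Model click probabilities for carousels, with attraction
    p0 * gamma**(i + j - 2) at 1-based position (i, j)."""
    out, examine = [], 1.0
    for i in range(m):
        row, reach, none_attractive = [], 1.0, 1.0
        for j in range(K):
            p = p0 * gamma ** (i + j)
            row.append(examine * reach * p)
            reach *= (1.0 - p) * (1.0 - p_q)
            none_attractive *= 1.0 - p
        out.append(row)
        examine *= (1.0 - p_q) * none_attractive
    return out


def tv(a, b):
    """Total variation distance between two matrices of the same shape."""
    return 0.5 * sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def fit(model, mu, steps=20):
    """Grid search over (p0, gamma, p_q); returns (distance, p0, gamma, p_q)."""
    grid = [s / steps for s in range(steps + 1)]
    return min(
        (tv(model(p0, g, pq), mu), p0, g, pq)
        for p0, g, pq in product(grid, repeat=3)
    )


# Synthetic "observed" click probabilities, generated from the TCM.
mu_synthetic = tcm_matrix(0.5, 0.8, 0.1)
best_tcm = fit(tcm_matrix, mu_synthetic)
best_ccm = fit(ccm_matrix, mu_synthetic)
```

On this synthetic input the TCM fit reaches a distance of essentially zero, while the CCM fit cannot do better. With real data, the same two calls would be made for each of the two empirical matrices.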
Our results in Table 1 show that tcm fits the ranked list data better (smaller total variation distance 0.086) than ccm (larger total variation distance 0.095).They also show that ccm fits the carousel data better (smaller total variation distance 0.128) than the TCM (larger total variation distance 0.133).In summary, our mathematical models of click probabilities in tcm and ccm match the observed clicks.We also use the results of this experiment to set the value of the termination probability p q in the remaining experiments.This value is p q = 0.01.

Validation of ccm: Robustness
The purpose of our second validation experiment is to show that ccm generalizes to an unseen test set. Specifically, we show that the optimal recommendation under ccm in the training set also has a high value in the test set.
This experiment is carried out on the MovieLens 1M dataset [30], which consists of 1 million ratings applied to 4,000 movies (items) in 18 genres (topics) by 6,000 users. For simplicity, we assume that each movie is associated only with one genre. For a movie with more than one genre, we assign the genre with the highest popularity among all users. The recommended movies in ccm are organized in 18 carousels. Each carousel represents a genre and has a label that shows the topic of the carousel, such as "Action Movies". We denote by $n_u$ the number of users and by $n_a$ the number of items.
We randomly split the dataset into two equal sets, which we call the training set $\hat{D}$ and the test set $D$. Then, for all users and items, we estimate the ratings using matrix factorization, which is a standard approach for rating estimation in recommender systems [43]. The approach involves decomposing a sparse rating matrix into a low-rank matrix that captures latent factors representing user and movie preferences. This approach has been shown to achieve high prediction accuracy and has been extensively studied by researchers. In our work, we used non-negative matrix factorization with $d = 5$ latent factors to estimate the ratings for movies that the user has not seen or ranked. We denote the estimated rating of item $a \in [n_a]$ by user $u \in [n_u]$ in the training (test) set by $\hat{r}_{u,a}$ ($r_{u,a}$). Next, we apply a softmax transformation to the estimated ratings and convert them into attraction probabilities in both the training and test sets,
$$\hat{p}_{u,a} = \frac{\exp(\hat{r}_{u,a})}{\sum_{a'=1}^{n_a} \exp(\hat{r}_{u,a'})}, \qquad p_{u,a} = \frac{\exp(r_{u,a})}{\sum_{a'=1}^{n_a} \exp(r_{u,a'})}.$$
This is just a monotone transformation that transforms the estimated ratings of each user into a probability vector.
We evaluate ccm as follows. Let $\hat{P}_{\mathrm{ccm}}(A, u)$ ($P_{\mathrm{ccm}}(A, u)$) be the click probability on recommendation $A$ by user $u$ on the training (test) set, calculated using (4) and attraction probabilities $\hat{p}_{u,a}$ ($p_{u,a}$). First, we compute the best recommendation for user $u$ on the training set, $\hat{A}_u = \arg\max_A \hat{P}_{\mathrm{ccm}}(A, u)$, where maximization is performed as described in Section 4.2.4. Second, we evaluate $\hat{A}_u$ on the test set, by computing the test click probability $P_{\mathrm{ccm}}(\hat{A}_u, u)$. Third, we calculate the best recommendation for user $u$ on the test set, $A_{u,*} = \arg\max_A P_{\mathrm{ccm}}(A, u)$, where the maximization is done as described in Section 4.2.4. Finally, we compare the average test click probabilities, for all users $u$, of the best training and test recommendations, formally given by
$$\frac{1}{n_u} \sum_{u=1}^{n_u} P_{\mathrm{ccm}}(\hat{A}_u, u) \quad \text{and} \quad \frac{1}{n_u} \sum_{u=1}^{n_u} P_{\mathrm{ccm}}(A_{u,*}, u).$$
Our results are shown in Figure 5. In addition to reporting the average click probability over all users, we break it into five groups based on the number of ratings per user, from "very low" to "very high", which provides an intuitive understanding of the number of ratings associated with each bin. This categorization allows us to analyze and compare different subsets of users based on the number of ratings they provided. We choose the number of ratings because it represents the size of the user profile. The results of this experiment confirm that ccm can generate recommendations comparable to the best recommendations on the test set, both overall (3.96% difference) and in the five user groups (2.9-4.3% difference). Our independent t-test did not yield a statistically significant difference between the two groups compared. This testifies to the generalizability of our proposed model. In particular, we show that our model does not overfit and performs well on the test set.
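The conversion from estimated ratings to attraction probabilities can be illustrated with a short Python sketch. It assumes a standard softmax over the ratings of a single user; the function name and the example ratings are ours, and subtracting the maximum rating is only a numerical-stability detail that does not change the result.

```python
from math import exp


def softmax(ratings):
    """Turn the estimated ratings of one user into a probability vector."""
    top = max(ratings)
    # Shifting by the maximum avoids overflow and leaves the output unchanged.
    weights = [exp(r - top) for r in ratings]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical estimated ratings of one user for four items.
estimated_ratings = [4.5, 3.0, 1.0, 3.0]
attraction = softmax(estimated_ratings)
```

The transformation is monotone, so the ranking of items by attraction probability equals their ranking by estimated rating, and the resulting probabilities of each user sum to one.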

Using ccm Simulator: Comparing Different Levels of Personalization
This experiment investigates the effect of personalization on the click probability of recommendations. We compare the click probability of recommendations generated by ccm using personalized and two non-personalized attraction probabilities.
The setup of this experiment is the same as in Section 4.4. The only difference is in the definitions of ratings in the training set. This definition affects how the optimal recommendation $\hat{A}_u$ is chosen, but not how it is evaluated. That is, $P_{\mathrm{ccm}}(\hat{A}_u, u)$ is the value of $\hat{A}_u$ under the personalized ratings on the test set, as defined in Section 4.4. The first approach, personalized, uses the definition of training set ratings in Section 4.4. The second approach, popular, calculates the rating of item $a$ for user $u$ as $\hat{r}_a = \frac{1}{n_u} \sum_{u=1}^{n_u} \hat{r}_{u,a}$. This is not personalized because the rating of item $a$ is the same for all users. The last approach, random, assigns random ratings $\hat{r}_a \in [1, 5]$ to all items.
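The two non-personalized baselines can be sketched as follows. The function names and the toy rating matrix are ours. The text does not state how the random ratings are drawn, so sampling them uniformly from $[1, 5]$ is an assumption of this sketch.

```python
import random


def popular_ratings(ratings):
    """Average rating of each item over all users.

    ratings[u][a] is the estimated training rating of item a by user u.
    The result is shared by all users, hence not personalized.
    """
    n_users, n_items = len(ratings), len(ratings[0])
    return [sum(ratings[u][a] for u in range(n_users)) / n_users
            for a in range(n_items)]


def random_ratings(n_items, rng):
    """Random item ratings in [1, 5] (assumed uniform), shared by all users."""
    return [rng.uniform(1.0, 5.0) for _ in range(n_items)]


# Hypothetical estimated ratings of two users for three items.
training = [[1.0, 5.0, 3.0],
            [3.0, 3.0, 4.0]]
popular = popular_ratings(training)
shuffled = random_ratings(n_items=3, rng=random.Random(0))
```

Either vector can then replace the personalized training ratings when choosing the recommendation, while the evaluation on the test set stays personalized.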
Our results are reported in Figure 6. In addition to reporting the average click probability over all users, we break it down into the same five user groups as in Figure 5. As explained before, this segmentation helps to better visualize the effect of user profile size on click probability. We observe that the personalized model outperforms the non-personalized baselines by a large margin. Specifically, the average click probability over all users in popular decreases by 23.8% from that in personalized; and the average click probability over all users in random decreases by 32.3% from that in personalized. Our independent t-test results indicate that the improvement gained from the use of personalized recommendations compared to the non-personalized baselines is statistically significant with a p-value of less than 0.05. These improvements are consistent across all five user groups, indicating stability with respect to profile size. We conduct this experiment because the non-personalized baselines depend less on the training data of a given user, and thus may generalize better to the test set. This experiment shows that this is not the case.

Using ccm and tcm Simulators: Comparing User Behavior in Carousel and Ranked List
Earlier in this section, we demonstrated that our carousel click model is robust and fits the real-world data. Additionally, we showed that a personalized model outperforms the random and popularity-based models of user behavior in carousels. In this part, we compare the user behavior in a carousel and a ranked list under the standard user behavior assumption introduced in Section 4.2.1. We also repeat the same set of simulations under a more realistic user behavior assumption that considers the initial visibility of items in carousels. Finally, we utilize the log click probability to demonstrate how users browse a carousel compared to a ranked list.

Comparing Carousels and Ranked List under Standard User Behavior Assumption.
This experiment compares the click probabilities under ccm with two other click models: tcm and a variant of ccm called ccm-nl. The goal of this experiment is to show the gain in click probabilities when using ccm.
tcm (Section 4.1.2) is a cascade click model [15] that models a user that may terminate unsatisfied. ccm-nl, which stands for a carousel click model with no labels, is a variant of ccm that models the scenario when topic labels are removed. Specifically, we calculate the optimal list using ccm and then remove the labels of the carousels. This means that the user browses the recommendations as if they were a single ranked list, and we adopt tcm for this simulation. We do this to study the importance of labels for carousels and to see to what extent they affect the average click probability in ccm. We use the same evaluation protocol as in Section 4.4. Our results are reported in Figure 7. Click probabilities are reported for three different sets of items: top 100, top 1,000, and all, where items are ranked by the sum of their ratings. This experiment shows that the click probability in ccm is significantly (independent t-test, p-value < 0.01) higher than in tcm and ccm-nl. Specifically, when recommending the top 100 items, the click probability in tcm decreases by 27.9% from that in ccm; and the click probability in ccm-nl decreases by 33.5% from that in ccm. The improvement increases as the number of items increases. Specifically, when recommending all items, the click probability in tcm decreases by 90.02% from that in ccm; and the click probability in ccm-nl decreases by 91.9% from that in ccm. In summary, ccm attracts about 10 times more clicks than both tcm and ccm-nl when recommending all items. This improvement trend indicates that ccm is a good candidate for practice, where a large number of recommended items is common.

A More Realistic User Behavior Assumption.
ccm assigns the same termination probability to all items in the carousel. However, in practice, items that are not initially displayed are less likely to be examined because the user needs to scroll to see them. To study this behavior, we assign different termination probabilities to different parts of the carousel: $p_q = 0.01$ to the first 10 columns and $p_q = 0.1$ to the rest. Selecting the first 10 items as the visible part of the carousel is an intuitive choice based on real-life recommender systems. In other aspects, the setting is the same as in Section 4.6.
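One way to encode the position-dependent termination probability is sketched below. The function name, parameter names, and example matrix are ours. The text specifies the termination probability per column only, so using the lower value (0.01) for the carousel-level termination is an assumption of this sketch.

```python
def ccm_click_prob_scrolling(attraction, p_q_visible=0.01, p_q_hidden=0.1,
                             visible=10):
    """CCM click probability when items beyond the first `visible` columns
    have a higher termination probability."""
    total, examine = 0.0, 1.0
    for row in attraction:
        reach, none_attractive = 1.0, 1.0
        for j, p in enumerate(row):
            total += examine * reach * p
            # Items that require scrolling are more likely to end the session.
            p_q = p_q_visible if j < visible else p_q_hidden
            reach *= (1.0 - p) * (1.0 - p_q)
            none_attractive *= 1.0 - p
        # Assumption: skipping a whole carousel uses the lower probability.
        examine *= (1.0 - p_q_visible) * none_attractive
    return total


# Hypothetical matrix: 3 carousels of 20 equally attractive items.
matrix = [[0.05] * 20 for _ in range(3)]
standard = ccm_click_prob_scrolling(matrix, p_q_hidden=0.01)
realistic = ccm_click_prob_scrolling(matrix)
```

Setting `p_q_hidden` equal to `p_q_visible` recovers the standard model, so the difference between `standard` and `realistic` isolates the effect of the scrolling assumption.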
Figure 9 compares ccm, ccm-nl, and tcm in the new setting. We observe that the performance of ccm worsens. In absolute terms, the click probability for top 100 items is comparable to that in Figure 7. However, for all items, it is about 5 times lower. Relative to the baselines, when recommending top 100 items, the click probability in tcm decreases by 23.9% from that in ccm; and the click probability in ccm-nl decreases by 27.81% from that in ccm. When recommending all items, the click probability in tcm decreases by 40.11% from that in ccm; and the click probability in ccm-nl decreases by 51.4% from that in ccm. Although the improvement for all items is not as impressive as in Figure 7, ccm still outperforms both baselines by a healthy margin.

Visualizing the Log Click Probability.
To visualize the reason behind the considerably better performance in ccm, we plot the average log click probability for all users using (6) in the first ten rows and columns of the optimal recommendation in ccm and the two baselines in Figure 8.The light color (yellow) in the plots corresponds to high average click probabilities, whereas the dark color (blue) corresponds to low average click probabilities.In ccm, we observe that the items at the beginning of each row (the first few items in each carousel) are more likely to be clicked by the user.We can also observe the effect of removing labels from the carousels in ccm-nl.This is manifested by overall darker colors because the user struggles to find the carousel with attractive items and leaves unsatisfied.The last plot shows that the average click probability in tcm decreases uniformly.This is expected when the recommended items are ranked in the order of decreasing relevance.

CONCLUSIONS, LIMITATIONS, AND FUTURE WORK
In this article, we advocate the use of a simulation-based approach for the evaluation of recommender interfaces. To demonstrate the power of this approach, we presented two case studies in which a comparative evaluation of carousel-based interfaces was performed using an offline simulation. The two case studies address research questions of different complexity and use considerably different models of user behavior to answer them. The first case study demonstrates the use of a relatively simple qualitative model of user behavior to compare the navigability of a carousel-based interface and a ranked list. The second case study demonstrates how a more complex quantitative model of user behavior supports more elaborate simulation-based studies that could answer more complex research questions, which require a higher precision of simulation. To perform the latter study, we developed and validated a carousel click model, which generalizes the traditional cascade models [14] that model user behavior in a ranked list. Taken together, the two case studies demonstrate that user behavior could be modeled at different levels of complexity and that the complexity of the simulation model should be selected by taking into account the complexity of research questions to be answered through a simulation-based study with this model.
Although we consider this demonstration to be the main contribution of the article, we stress two additional contributions that carry a separate value for the field.First, to enable simulationbased studies of carousel-based interfaces, we developed and validated a CCM.To quantify user work with multiple carousels, the model proposes that the user examine carousels proportionally to the attraction of items in them and then clicks on the items within the examined carousel proportionally to their individual attraction.This model is motivated by the cascade model developed for the traditional "single" ranked list.We consider the carousel click model as a valuable theoretical and practical contribution, which is important well beyond the scope of this article.In particular, this model enables further simulation-based quantitative evaluation of carousel-based interfaces.
Second, as part of our second case study, we were able to compare a carousel-based interface and a traditional ranked list on equal ground.In this simulation-based comparison, we used the traditional cascade model to simulate user behavior in a ranked list and a carousel click model to simulate user behavior in a carousel-based interface.Our results demonstrated that a structured examination of a large item space supported by carousels is more efficient than scanning a single ranked list.These findings help to explain a rapid rise of carousel-based recommendation interfaces, which have become a de-facto standard approach to recommending items to end users in many real-life recommenders.
We consider this work as the first attempt to explore the application of a simulation-based approach to study recommender system interfaces "beyond the ranked list", such as carousel-based interfaces. As the first attempt, the work has a number of limitations that point to several directions for future work. First, the goal of this article was to make a case for the simulation-based evaluation approach and demonstrate its applicability to answer valuable research questions (for example, to compare a carousel-based and a ranked list presentation of results). While we provided two cases to demonstrate that the simulation-based approach could be used to answer different research questions and that the user behavior could be modeled at different levels of complexity, we used only one recommender algorithm in each case for the demonstration. In real life, the same behavior model could be used for simulation-based studies of different algorithms. Moreover, it could be used to compare different algorithms "on an equal ground". While using the simulation-based approach to compare a set of advanced recommendation algorithms was not the goal of this article, we plan to explore this opportunity in our future work. In particular, we want to use simulation-based evaluation to assess the value of multi-armed bandits-based methods in a carousel context.
Second, in the process of validating our models, we observed that there is no large-scale public dataset of user interactions with carousel interfaces.To our knowledge, the dataset used in Section 4.3 is the largest such dataset, yet it is too small to fully validate our model, which is a limitation of our study.We believe that it is important to release a large-scale public carousel interaction dataset to encourage more research in this area, similar to what the Yandex dataset [41] did for regular click models [11].We plan to collect and publish such a dataset in the future.
Third, every behavior simulation model has its limits, even if it is developed on the basis of past empirical data.To better understand the limits of simulation-based evaluation, it is important to periodically compare the results obtained by offline simulation studies with the results obtained in empirical studies with end users.This comparison has been made for the traditional ranked list interfaces leading to valuable insights [27,54,62].In our future work, we plan to combine simulation-based evaluation of carousel interfaces with empirical studies of these interfaces.
Finally, to obtain more reliable results, it is important to perform studies in multiple domains.While the study presented in this article has been carried out in the popular domain of movie recommendations, we plan to expand it to other domains, such as food recommendations.Specifically, we want to understand how carousel-based interfaces influence user preferences in culinary choices and how this might promote a healthier lifestyle.Furthermore, we plan experiments in the domain of health recommendations, particularly focusing on health-related document recommendations for patients and their caregivers.By expanding our research into these domains, we aim to gain a deeper understanding of the advantages and challenges of incorporating carousels in presenting recommendation results.
Going beyond our own future work, two important future directions should be acknowledged. First, our carousel click model is only one potential model for carousels, motivated by the cascade model in ranked lists. An alternative model could be based on the position-based model, where the probability that the user examines the item at position (i, j) is proportional to the probability of examining row i and the probability of examining column j. We believe that many such models could be developed in the future, similar to the countless click models developed for ranked lists [8].
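To make the position-based alternative concrete, the following minimal sketch computes click probabilities on a carousel grid. It is an illustration only, not a model developed or validated in this article: it assumes that row and column examination are independent, so the examination probability at (i, j) is the product of the two, and that a click occurs when an examined item is attractive. The function names, the parameter names (row_exam, col_exam, attraction), and all numbers are hypothetical.

```python
from typing import List


def examination_probs(row_exam: List[float], col_exam: List[float]) -> List[List[float]]:
    """Probability of examining each grid position (i, j).

    Assumption (not from the article): row and column examination are
    independent, so P(examine (i, j)) = row_exam[i] * col_exam[j].
    """
    return [[r * c for c in col_exam] for r in row_exam]


def click_probs(
    row_exam: List[float],
    col_exam: List[float],
    attraction: List[List[float]],
) -> List[List[float]]:
    """Click probability = examination probability * item attraction."""
    exam = examination_probs(row_exam, col_exam)
    return [
        [exam[i][j] * attraction[i][j] for j in range(len(col_exam))]
        for i in range(len(row_exam))
    ]


# Hypothetical 2 x 3 carousel grid: examination decays down the rows
# and along the columns; attraction values are made up for illustration.
row_exam = [1.0, 0.5]
col_exam = [1.0, 0.5, 0.25]
attraction = [
    [0.2, 0.4, 0.8],
    [0.6, 0.2, 0.4],
]
clicks = click_probs(row_exam, col_exam, attraction)
```

In this toy setting, the three items in the first row end up with the same click probability despite very different attraction values, because the decay in column examination offsets the increase in attraction. Such a model would still need its row and column examination parameters estimated from interaction logs before it could be used in a simulation study.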
Finally, the essence of the approach we advocate is to use reliable models of user behavior to simulate user interaction with recommender interfaces and to perform various types of evaluation offline, without engaging real users. In this article, we made the case for simulation-based evaluation of carousel-based recommender interfaces. However, as our knowledge of user behavior in other types of recommender interfaces advances, reliable behavior models could be built for more sophisticated interfaces. This would considerably widen the application scope of the simulation-based approach.

Fig. 1. Comparing the cumulative exiting probability in a single ranked list and in carousels across different experimental settings reveals the advantages of a carousel-based representation over a ranked list-based representation. In all experimental settings, users leave after 50 interactions regardless of success in finding the desirable item.

Fig. 5. Average test click probabilities of the best recommendations on the training and test sets.

Fig. 6. Comparing the click probability of CCM in different settings.

Fig. 8. Sample distribution of log click probability in CCM, CCM-NL and TCM for all users. Darker colors indicate lower click probability.

Fig. 9. Comparing the average click probability in a more realistic CCM, CCM-NL and TCM.

Table 1. Comparing the Total Variation Distance between TCM and CCM Models on Ranked List and Carousel Data