Reproducing Popularity Bias in Recommendation: The Effect of Evaluation Strategies

The extent to which popularity bias is propagated by media recommender systems is a current topic within the community, as is its uneven propagation among users with varying interest in niche items. Recent work focused on exactly this topic, with movies being the domain of interest. Later on, two different research teams reproduced the methodology in the domains of music and books, respectively. The results across the different domains diverge. In this paper, we reproduce the three studies and identify four aspects that are relevant in investigating the differences in results: data, algorithms, division of users into groups, and evaluation strategy. We run a set of experiments in which we measure general popularity bias propagation and unfair treatment of certain users with various combinations of these aspects. We conclude that all aspects account to some degree for the divergence in results and should be carefully considered in future studies. In particular, we find that the divergence in findings can in large part be attributed to the choice of evaluation strategy.


INTRODUCTION
In this article we reproduce three papers that study popularity bias in media recommender systems. Websites that host media content are known to employ recommender systems to filter through the content and provide the user with personalized suggestions. In the case of collaborative filtering, neither explicit demographics of the user nor information about the content are needed to encode
their taste, but only consumption and browsing history. Despite the lack of explicit input of user or item characteristics in the system, collaborative filtering approaches are still known to suffer from bias [34]. Popularity bias has been identified as a relevant issue with negative implications from a multi-stakeholder perspective [2]. In short, popularity bias is the algorithmic phenomenon where items that are already popular in the users' profiles tend to become even more popular because they are disproportionately recommended.
In the context of media recommenders, different research teams have studied the impact of popularity bias on users. Specifically, they focused on the extent to which it impacts users disproportionately based on their interest in popular items, which they refer to as unfairness. In 2019, Abdollahpouri et al. [3] published a paper called "The Unfairness of Popularity Bias in Recommendation", which reported the propagation of popularity bias by different algorithms and for different user groups in the setting of movie recommendation. Two subsequent papers reproduced the work to evaluate the same phenomenon in music recommendation (Kowald et al. [26]) and book recommendation (Naghiaei et al. [27]). The papers used a similar process and metrics to evaluate the unfairness of popularity propagation, but different datasets, data preprocessing, and partly different algorithms. While all studies reported on the same types of results, the results they presented diverge. Both Kowald et al. [26] and Naghiaei et al. [27] proposed that differences in data characteristics might be the cause.
The studies take significant steps toward understanding the unfairness of popularity bias, and a new metric for popularity bias is proposed by Abdollahpouri et al. [3]. The effect of popularity bias on several baseline and state-of-the-art collaborative filtering algorithms is analyzed. The view of unfairness is user-centric, unlike previous work on the matter, which contributes to the significance of these studies in the field of recommender systems. It is, therefore, pertinent to comprehend the source of divergence in their reported results, both on whether certain algorithms propagate popularity bias and on the extent to which certain user groups are unfairly treated.
The following results were presented by all three studies:

- Overall Popularity Bias Propagation: The studies observed whether the frequency of an item in the users' profiles and in an algorithm's recommended lists correlated (the higher the correlation, the larger the bias), and whether only a few items were recommended to all users. Their results diverged in the following ways:
  - Abdollahpouri et al. [3] and Naghiaei et al. [27] found a positive correlation only for certain algorithms, with the correlation found by the latter being stronger than that found by the former.
  - Kowald et al. [26] found that this correlation exists for all tested algorithms rather than for only some of them.
- Popularity Bias Propagation per User Group: The studies evaluated unfairness of popularity bias propagation using the %Δ Group Average Popularity (%ΔGAP) metric, defined as the relative difference between the average popularity of items in a user's profile and in an algorithm's recommended list, averaged over a group of users. Specifically, they calculated %ΔGAP for different user groups defined by their propensity for popular items, and then compared the results between groups and algorithms. The higher the %ΔGAP for a group, the more unfairly this group is treated.
  - Abdollahpouri et al. [3] and Naghiaei et al. [27] found that popularity bias is larger for users who prefer niche (i.e., not popular) items than for other user groups. Certain algorithms examined by Naghiaei et al. [27] did not propagate popularity bias for any of the user groups, which was not the case for Abdollahpouri et al. [3].
  - Kowald et al. [26] observed no such clear difference between the groups for any algorithm.
In this article, we perform an extensive reproduction study of the three above-mentioned papers (Abdollahpouri et al. [3], Kowald et al. [26], Naghiaei et al. [27]). We are motivated by the divergence in results and wish to investigate and comprehensively report on its source. We not only attempt to replicate and verify the individual claims, but also locate properties of the recommendation and evaluation process that have an impact on popularity bias. Therefore, our reproduction study allows us to draw conclusions beyond the transferability of a claim to a different domain, which was the goal of the two reproduction studies (Kowald et al. [26], Naghiaei et al. [27]). By zooming in on these studies and understanding how their differences in implementation resulted in divergent results, we can offer insights and suggestions on the topic of popularity bias evaluation, and on reproducibility in recommender systems in general.
First, we study the choices made by Kowald et al. [26] and Naghiaei et al. [27] when reproducing Abdollahpouri et al. [3]. We recognize the challenges that stem from either a lack of clarity around, or differentiation in, the strategies the three studies used to evaluate popularity bias. We identify four aspects that are relevant in investigating the differences in results between the three studies:

(1) Data: the studies use three datasets with different characteristics, such as size, sparsity, and distribution of item popularity.
(2) Algorithms: the studies evaluate mostly different algorithms, with some exceptions.
(3) Division of users into groups: the studies define propensity for popular items differently and divide the users accordingly.
(4) Evaluation strategy: the studies make different choices in the testing process.
Second, we run a set of experiments in which we measure popularity bias with various combinations of these aspects. For example, we run the evaluation strategy of one paper on the dataset of another paper. We find that the evaluation strategy has a significant impact on the reported results. Specifically, certain algorithms trained on the same data and with the same view of popularity either do or do not show propagation of popularity bias depending on the evaluation strategy. We believe that our observations can prove useful for future efforts to reproduce evaluations of recommender systems, as they give insight into how strategic choices affect the outcome even when the same evaluation metric is used.

RELATED WORK

Reproducibility in Recommender Systems
The ability to reproduce and examine published studies is valuable, as reproducibility of experimental results is a cornerstone of science [17]. However, the field of AI as a whole is known to face reproducibility issues [22]. Machine learning is highly dependent on the characteristics of the chosen training data, as well as on parameter tuning and the inherent randomness of the algorithms [36]. Therefore, a lack of published code and data renders the replication of existing work by other researchers challenging [18]. Many reported results are irreproducible, though Gundersen and Kjensmo [17] found a significant increase in documentation over time. In the context of recommender systems, lack of reproducibility is recognized as one of the key factors that have led the field into "a state of stagnation" [11]. Reviewing recommender systems papers published in major conferences such as KDD, IJCAI and SIGIR, Ferrari Dacrema et al. [14] found fewer than 50% of them to be reproducible. Cremonesi and Jannach [11] consider lack of incentive to be the main reason behind limited reproducibility, as the academic system often motivates researchers to publish more instead of putting in the work to provide sufficient code, data, and documentation.
Recently, AI conferences have started providing checklists to ensure that submitted papers are sufficiently reproducible [28]. Additionally, many conferences such as RecSys, ECIR and SIGIR have initiated reproducibility tracks where researchers can submit studies that reproduce, analyze or reflect on prior work. Gundersen et al. [16] provide a set of specific recommendations on how to ensure reproducibility in AI research. They motivate researchers by highlighting the importance of reproducibility for the improvement of science overall, as well as its benefits for the researchers themselves. Beel et al. [5] outline actions that can make recommender systems research more reproducible, such as adopting practices from the medical, social, and natural sciences, and conducting more comprehensive experiments, for example by varying model parameters and observing the effect. In this work, we attempt to further motivate reproducibility in recommender systems research by experimenting with various aspects of the recommendation process and showcasing their effect on the phenomenon of popularity bias.

Evaluation Strategy
In recommender systems research, reproducing the evaluation process followed in previous studies is not always trivial. To assist researchers in this endeavour, Said and Bellogín [29] describe the dimensions of evaluating recommender systems as dataset, data splitting, evaluation strategies, and metrics. Specifically, they identify and benchmark four stages of designing an evaluation protocol: data splitting, item recommendation, candidate item generation, and performance measurement. Application of metrics takes place in the final stage of evaluation, namely when measuring the performance of the scores produced during the recommendation process. Before measuring performance, it is necessary to define a crucial aspect of the evaluation strategy: generating the candidate items for recommendation for each user, so that scores can be produced for them. The process of evaluating recommender systems includes reporting results of commonly used metrics, which allows comparison between different algorithms to estimate their success in the context of the task at hand. For the metrics to meaningfully represent the success of the system, the evaluation strategy must be suitable both for the system and for the metrics to be measured.
In terms of metrics, Gunawardana et al. [15] list a set of properties frequently taken into account when evaluating recommenders, such as accuracy, coverage, novelty, and diversity. They point out that different applications have different needs, and thus the metrics should be chosen to appropriately reflect the desired property. Said and Bellogín [29] argue that comparing recommendation quality between different studies requires careful consideration of all stages of an evaluation protocol. They experimentally show that even when recommendation algorithms are implemented similarly, the results are not comparable when different evaluation strategies are used. In our study, we follow a similar approach of employing different evaluation strategies on the same task and comparing the results. We are prompted by variations in the evaluation strategy choices made by published reproducibility studies, which highlights the importance of taking into account and addressing all evaluation steps in recommender systems research.

Popularity Bias
Studying and evaluating a specific phenomenon requires understanding its roots and effects. In the context of item popularity, it is known that consumption of media often follows a long-tail distribution, where a few items are very popular and the rest are located in the heavy tail [4]. Recommender systems attempt to help users discover items appropriate for them, regardless of the items' position in the long tail. However, it has long been noted that collaborative filtering techniques can be prone to popularity bias, the phenomenon where popular items tend to be recommended over long-tail ones [8]. When an algorithm shows such a tendency, it may produce ranked lists in which items are not equally covered along the popularity tail [35]. Bellogín et al. [6] argue that common ranking metrics calculate the overall assumed satisfaction of a population, and therefore produce high values when the recommended list consists of mostly popular items, even when it is not personalized. Item popularity is not a de facto bad criterion for recommendation and can be leveraged to track item quality [9,37]. However, very popular items are often more likely to be already known, which renders the recommendation of mostly popular items potentially not useful for the user and the system in general [1].
In addition to the three studies examined in this article, other studies evaluate popularity bias in various contexts and propose methods to mitigate it [7,20,24,38]. These studies vary in terms of the methods used to measure popularity bias and in their findings. Elahi et al. [13] find that propagation of popularity bias is dependent on the scenario and domain; in certain cases, popularity bias is reduced through the recommendation process. In a follow-up paper to Kowald et al. [26], Kowald and Lacic [25] study popularity bias of four collaborative filtering algorithms on all three datasets used by the three studies we are reproducing, namely MovieLens1M, LastFM and Book-Crossing. Interestingly, their results are similar across the three datasets, while they diverge from the results reported by Abdollahpouri et al. [3] and Naghiaei et al. [27]. This observation implies that the discrepancy in results shown by the three studies cannot be explained solely by the differences between the datasets. In this work, we wish to investigate which aspects of the process account for the discrepancy. Our results and conclusions can contribute to consistency and reproducibility when the research community evaluates popularity bias and other recommender systems phenomena.

OVERVIEW OF THE STUDIES TO BE REPRODUCED
In order to comprehend the differences in results between the three studies, we studied and collected the details of each approach to accurately reproduce their processes. Note that while Kowald et al. [26] and Naghiaei et al. [27] made their code publicly available, we could not find a public repository for the code developed by Abdollahpouri et al. [3]. Therefore, we describe the characteristics of their study based on the text of the published paper and private correspondence with the authors. Throughout the paper, we explicitly state whenever our understanding is based on this correspondence.

Core Process
The studies followed similar processes to compare the popularity distribution in the data and in the recommendations:

(1) Analyze the distribution of popularity among items in the given dataset.
(2) Label items as "popular" if they belong to the 20% most frequently rated items in the dataset.
(3) Analyze the distribution of user propensity for popular items and divide the users into three groups (Niche, Diverse and Blockbuster-focused) accordingly.
(4) Set aside 80% of the ratings for training and 20% for testing.
(5) Train the algorithms on the training data.
(6) Recommend 10 items to each candidate user for each algorithm.
(7) Report on overall popularity bias propagation: compare the number of times each item is recommended with the number of times it is rated. Repeat for each algorithm.
(8) Report on popularity bias propagation per user group: for every user, compare the average popularity of the items in the profile with that of the items in the recommendation list, and average per user group. Repeat for each algorithm.
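Steps 2 and 4 of the core process can be sketched in a few lines. The function and variable names below are ours, not taken from any of the studies' code bases, and ratings are simplified to (user, item) pairs:

```python
from collections import Counter
import random

def label_popular(ratings, top_fraction=0.2):
    """Label the top_fraction most frequently rated items as 'popular'
    (step 2 of the core process)."""
    counts = Counter(item for _, item in ratings)
    ranked = [item for item, _ in counts.most_common()]
    n_popular = max(1, int(len(ranked) * top_fraction))
    return set(ranked[:n_popular])

def train_test_split(ratings, test_fraction=0.2, seed=42):
    """Random 80-20 split of the rating tuples (step 4), with a fixed
    random seed as in our experiments."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed makes the split itself reproducible, which matters when comparing bias metrics across runs.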

Data
The studies used publicly available datasets which are commonly used in the recommender systems literature. Abdollahpouri et al. [3] used MovieLens1M [19], Kowald et al. [26] used LastFM-1b [32], and Naghiaei et al. [27] used Book-Crossing [39]. Kowald et al. [26] and Naghiaei et al. [27] processed their respective datasets in order to approach the size of MovieLens1M. Specifically:

- Kowald et al. [26] extracted 3,000 users from the original dataset, which contains 120,000 users. They also grouped the listening events into user-artist pairings. Subsequently, they scaled the number of times a user listened to an artist into a preference score from 0 to 1,000. Therefore, the algorithms in Kowald et al. [26] do not try to predict an explicit rating, but rather the preference of a user towards an artist, represented by the number of times the user listened to this artist.
- Naghiaei et al. [27] only kept the explicit ratings from the original dataset. Afterwards, they removed users with more than 200 ratings. Finally, they removed users and items with very few ratings, until all users in the dataset had rated at least 5 items and all items in the dataset had been rated by at least 5 users.
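The final filtering step of Naghiaei et al. [27] is a k-core-style filter that has to be applied iteratively, since dropping a user can push an item below the threshold and vice versa. A minimal sketch, with our own naming and ratings as (user, item, rating) tuples:

```python
from collections import Counter

def k_core_filter(ratings, k=5):
    """Iteratively drop users and items with fewer than k ratings until
    every remaining user and item has at least k ratings."""
    current = list(ratings)
    while True:
        user_counts = Counter(u for u, _, _ in current)
        item_counts = Counter(i for _, i, _ in current)
        kept = [(u, i, r) for u, i, r in current
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(current):  # fixed point: nothing left to drop
            return kept
        current = kept
```

The loop terminates because every pass either removes at least one rating or reaches a fixed point.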
Abdollahpouri et al. [3] do not explicitly mention whether they used the dataset intact or processed it before the analysis and recommendation process, and this was not clarified in our correspondence with the authors. Table 1 shows the characteristics of each dataset, namely the number of users, items, and ratings, and the sparsity. Sparsity in the context of rating data refers to the percentage of possible user-item ratings that are missing from the data.
The resulting datasets differ in terms of size and sparsity. At the same time, item consumption is distributed differently among them. Figure 1 shows the number of users that rated each item in every dataset. While every plot shows that the rating data is skewed towards more popular items, the long-tail distribution is clearer in the subset of LastFM-1b than in the subset of Book-Crossing, and especially than in MovieLens1M.

Algorithms
Each study examined whether several different algorithms propagated popularity bias into their recommendations, as seen in Table 2. They all tested two baseline algorithms, MostPopular and Random. Beyond these, the studies tested mostly different algorithms. Abdollahpouri et al. [3] tested four well-known collaborative filtering algorithms. Kowald et al. [26] also tested four collaborative filtering algorithms, but partly deviated from the choices of Abdollahpouri et al. [3]. Specifically, they excluded SVD++ and ItemKNN to reduce computational cost, presumably due to the large number of items in the LastFM dataset. Finally, Naghiaei et al. [27] tested a total of nine collaborative filtering algorithms in order to cover a wider range of state-of-the-art approaches.
Overall, the list of algorithms consists of Nearest Neighbour-based, Matrix Factorization-based, and Neural Network-based approaches. The only collaborative filtering algorithm common to all studies is UserKNN [31]. The divergence in the choice of algorithms between the original study and the reproduction studies raises the question of whether a different overall conclusion would be drawn if all studies tested the same set of algorithms.

Division of Users in Groups
The notion of item popularity is central to the three studies, and it is defined as the frequency of either rating by the users (popularity in profile) or recommendation by the algorithms (popularity in recommendation). Abdollahpouri et al. [3] deem an item popular in profile if it is one of the 20% most frequently rated items in the entire dataset. This choice is important since Abdollahpouri et al. [3] use this label to divide the users based on their propensity for popular items. Therefore, the fact that item popularity is distributed differently across the three datasets, as shown in Figure 1, affects which users are considered Niche, Diverse or Blockbuster-focused. Specifically, the user division in Abdollahpouri et al. [3] happens as follows:

(1) The popularity of every item is calculated as the percentage of users who have rated it.
(2) The items are sorted based on their popularity.
(3) The 20% of items with the highest popularity are labelled as popular.
(4) The average propensity towards popular items of every user is calculated as the average popularity of the items that this user has rated.
(5) The popularity fraction of every user is calculated as the percentage of items in this user's profile that have the label popular.
(6) The users are sorted based on their popularity fraction.
(7) The top 20% of users are labeled "Blockbuster-focused", the bottom 20% "Niche", and the remaining 60% "Diverse".
While Naghiaei et al. [27] follow the same process, Kowald et al. [26] differ; they do not use the popularity fraction to divide the users into groups. Instead, they use the mainstreaminess score, which is available for the users in the LastFM dataset. Mainstreaminess is defined as the overlap between a user's listening history and the aggregated listening history of all users in the original dataset. The 3,000 users are strategically extracted by Kowald et al. [26]: 1,000 users with low mainstreaminess, 1,000 with medium, and 1,000 with high. In other words, the mainstreaminess score is used as a proxy for user propensity for popular items, and the user groups are divided based on it instead of on user preference for items labeled "popular" in the final dataset. Note that the mainstreaminess score cannot be computed on MovieLens1M or Book-Crossing, as it requires multiple interactions between users and items (i.e., play counts) [25].
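The popularity-fraction division used by Abdollahpouri et al. [3] and Naghiaei et al. [27] can be sketched as follows. Names are illustrative, and ties and rounding at the 20% boundaries are handled naively here:

```python
def divide_users(profiles, popular_items):
    """Split users 20-60-20 by the fraction of items in their profiles
    that carry the 'popular' label. `profiles` maps user -> set of
    rated items."""
    fraction = {user: len(items & popular_items) / len(items)
                for user, items in profiles.items()}
    ranked = sorted(fraction, key=fraction.get, reverse=True)
    cut = int(len(ranked) * 0.2)
    blockbuster = ranked[:cut]          # top 20%: Blockbuster-focused
    niche = ranked[len(ranked) - cut:]  # bottom 20%: Niche
    diverse = ranked[cut:len(ranked) - cut]
    return blockbuster, diverse, niche
```

Because the division depends on which items carry the "popular" label, the skew of the underlying popularity distribution directly shapes the group boundaries.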

Evaluation Strategy
A key difference between the studies' approaches is the strategy they adopt to evaluate the propagation of popularity bias. In all studies, the dataset of ratings is divided into a training and a test set with an 80-20% split. However, the studies differ in terms of which users they recommend items to, and in candidate item generation, a step of the evaluation process that was benchmarked by Said and Bellogín [29]. Said and Bellogín [29] argue that this aspect of the evaluation protocol is crucial for the result, since changing the candidate items for recommendation means that a different ranking is evaluated. This is likely to have an effect on the measured popularity bias propagation. Table 3 shows an overview of the above-mentioned characteristics of each study's evaluation strategy.
Let U_r denote the set of users that an algorithm recommends items to, and L_u the list of items that it ranks and chooses from to recommend to a user u in U_r.
Kowald et al. [26] recommend items to every user in the test set. To generate candidate items, they adopt the UserTest strategy [29]: the system only considers items that the user has rated in the test set. In other words, in their system U_r contains all users in the test set, and L_u for every user u in U_r contains every item i for which the rating (u, i) exists in the test set.
Naghiaei et al. [27] recommend items to all users in the training set. They adopt a version of the TrainItems strategy [29]: the system considers every possible item as a candidate for a given user. In this case, U_r consists of all users in the training set, and L_u for every user u in U_r contains every item i. Note that Said and Bellogín [29] describe TrainItems as disregarding the items that the user has actually rated, but Naghiaei et al. [27] consider all items instead.
In our correspondence with Abdollahpouri et al. [3], they stated that they recommend items to all users in the test set. They also adopt the TrainItems strategy, but differently from Naghiaei et al. [27]. Specifically, for every user u in U_r, which contains all the users in the test set, L_u contains every item i that u has not rated in the training set. According to Said and Bellogín [29], this approach is suitable when simulating a real system where no test set is available.
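Our reading of the three candidate generation strategies can be summarized in one function. The strategy keys are our own shorthand, not terminology from the papers; `train` and `test` map each user to the set of items they rated in the respective split:

```python
def candidate_items(user, strategy, train, test, all_items):
    """Candidate list L_u for one user under each study's strategy."""
    if strategy == "user_test":    # Kowald et al. [26]: items rated in test
        return set(test.get(user, set()))
    if strategy == "all_items":    # Naghiaei et al. [27]: every item
        return set(all_items)
    if strategy == "train_items":  # Abdollahpouri et al. [3]:
        # every item the user has not rated in the training set
        return set(all_items) - set(train.get(user, set()))
    raise ValueError(f"unknown strategy: {strategy}")
```

Even on identical data, the three branches rank very different item pools, which is exactly why the measured bias can differ per strategy.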

Other Variations
While in our experiments we consider the aforementioned four aspects that vary across the three studies, there are other variations that we do not consider:

- Kowald et al. [26] and Naghiaei et al. [27] both used Python to run their experiments, but used different Python-based libraries to perform the training and recommendation process. Kowald et al. [26] used Surprise, a toolkit for building and analyzing recommender systems that deal with explicit rating data [21]. Naghiaei et al. [27] used Cornac, a framework for multimodal recommender systems [30]. Abdollahpouri et al. [3] stated through our private correspondence that they used Librec-auto, a Python tool for running recommender systems experiments [33].
- In their paper, Abdollahpouri et al. [3] state that they tuned all collaborative filtering algorithms to reach a precision of 0.1, so that the results are comparable. On the other hand, neither Kowald et al. [26] nor Naghiaei et al. [27] tuned the algorithms. Instead, they used the default hyperparameters of the Python libraries.

EXPERIMENTAL SETUP
In order to investigate the cause of the differences in results, we define a set of experiments that allows us to track which of the aspects that vary across the three studies affect whether popularity bias is propagated and whether the propagation unfairly impacts certain user groups. Each experiment consists of the following phases.

Training.
- Divide the dataset with ratings into a training and a test set using an 80-20% split.
- Train the algorithms on the training set.

Prediction.
- For every user u in U_r, predict ratings for every item i in L_u with each trained algorithm (see the notation in Section 3.5).
- Rank the items based on the predicted rating.
- Recommend the top 10 items to the user.

Evaluation.
- Overall popularity bias propagation: Note that the three studies indirectly use the concepts of correlation and item coverage to report on overall popularity bias, by plotting the frequency of an item in profiles versus in recommendations for every algorithm and visually observing whether there is a correlation and whether certain items are almost never recommended. In this study, we quantify these concepts as follows:
  - For the set of items I, calculate

    Correlation = r(P_I, F_I),

    where r is the Pearson correlation coefficient, P_I is the list of popularities of each item in the users' profiles, and F_I is the list of frequencies of each item in the users' recommended lists (see [10]).
  - Calculate

    Item Coverage = |R| / |I|,

    where R is the list of items recommended at least once.
- Popularity bias propagation per user group:
  - For every user group g, calculate

    %ΔGAP(g) = (GAP(g)_r - GAP(g)_p) / GAP(g)_p * 100%,

    where GAP(g)_p is the average popularity of the items in the users' profiles and GAP(g)_r is the average popularity of the items in the users' recommended lists (see the notation in Abdollahpouri et al. [3]).
  - For every user group g, calculate NDCG@10 [23]. Items for which no rating is available in the test set are assumed to have a utility of 0, as implemented by Ekstrand et al. [12].
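A sketch of how these three quantities can be computed; the naming is ours, and the Pearson coefficient is written out to keep the snippet dependency-free:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def overall_bias(profile_freq, rec_freq):
    """Correlation between item frequency in profiles and in recommended
    lists, plus item coverage |R| / |I|. Both arguments map item -> count
    over the same item set I."""
    items = sorted(profile_freq)
    p = [profile_freq[i] for i in items]
    f = [rec_freq.get(i, 0) for i in items]
    coverage = sum(1 for i in items if rec_freq.get(i, 0) > 0) / len(items)
    return pearson(p, f), coverage

def delta_gap(gap_profile, gap_rec):
    """%ΔGAP for one user group: relative change in group average
    popularity from profiles to recommended lists."""
    return (gap_rec - gap_profile) / gap_profile * 100
```

A positive %ΔGAP means a group receives, on average, more popular items than it consumed; the larger the value, the more unfairly the group is treated.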
We design the experiments by considering different choices for every aspect in which the studies deviated, and combining them in all possible ways. To train the models, we use the Cornac library, since it contains almost all the algorithms needed. Similarly to Kowald et al. [26] and Naghiaei et al. [27], for every dataset we fix the random seed when splitting into training and test set for reproducibility. The code we developed for our experiments has been made open source. In the subsequent subsections we summarize the choices for each of the four aspects.

Data
We train the algorithms using MovieLens1M, LastFM, and Book-Crossing. We use the entire MovieLens1M dataset, and the Book-Crossing subset that is used by Naghiaei et al. [27]. For computational reasons, we do not experiment with the entire LastFM subset that Kowald et al. [26] use. The large number of items makes some algorithms crash when combined with certain evaluation strategies, and we wish to evaluate as many combinations as possible. Instead, we sample by removing items with fewer than 20 ratings. This sampling does not decrease the number of users, but greatly decreases the number of items, from 352,805 to 12,690. The characteristics of the resulting dataset can be seen in Table 4. To assess whether the sampling has a large impact on the conclusions, we compare our results with the results reported by Kowald et al. [26] and find that they are consistent.
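Unlike the iterative k-core filtering of Book-Crossing, this sampling is a single pass over the items, since only items are dropped and no user loses their entire profile in our subset. A sketch with our own naming, ratings again simplified to (user, item) pairs:

```python
from collections import Counter

def drop_rare_items(ratings, min_ratings=20):
    """Keep only the ratings of items with at least min_ratings ratings;
    a one-pass item filter, not an iterative k-core."""
    counts = Counter(item for _, item in ratings)
    return [(u, i) for u, i in ratings if counts[i] >= min_ratings]
```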

Algorithms
We train all collaborative filtering algorithms in Table 2, with the exception of UserItemAvg and SVD++, which are not available in Cornac. In total, we train 11 collaborative filtering algorithms. Note that we use the same default hyperparameters as Kowald et al. [26] and Naghiaei et al. [27].

Division of Users in Groups
We divide the users into groups in the following ways:

(1) PopularPercentage: Divide the users based on the percentage of items in their profile that have the label "popular", with a 20%-60%-20% scheme, in the same way as Abdollahpouri et al. [3] and Naghiaei et al. [27]. An item gets the label "popular" if it is one of the 20% most frequently rated items in the dataset.
(2) AveragePopularity: Divide the users based on the average popularity of the items in their profile, with a 20%-60%-20% scheme. This approach is not followed by any of the studies. Given that it may result in differently divided groups, and as it does not depend on items being labeled "popular", it is interesting to investigate how it impacts the results in terms of whether certain user groups are unfairly treated by a recommender.
(3) Mainstreaminess: Divide the users based on the mainstreaminess score, in the same way as Kowald et al. [26]. This division is only tried on LastFM, given that the mainstreaminess score can only be calculated on this dataset.

Note that while each study used domain-specific terms to refer to each user group, for simplicity we will use the terms Niche, Diverse and Blockbuster-focused regardless of the user division and dataset.

Evaluation Strategy
We apply all the different evaluation strategies adopted by the studies. Specifically, the strategies vary with regard to which users and items are candidates for generating recommendations.

RESULTS
In this section, we present the results across all popularity bias metrics for every experiment. First, we discuss overall popularity bias propagation, and then we focus on propagation per user group.

Overall Popularity Bias Propagation
To evaluate overall popularity bias propagation for every combination of aspects, we report on correlation and coverage as described in Section 4. Tables 5-7 show these values for each algorithm and each evaluation strategy, for MovieLens1M, LastFM and Book-Crossing, respectively. In addition to the correlation and coverage tables, we choose one algorithm, namely NMF, and plot frequency in profile versus frequency in recommendation for every dataset and evaluation strategy. Figure 2 includes the resulting scatter plots. Given the large number of experiments, we include the scatter plots for the other algorithms in the Appendix, Figures 4 to 13. We analyze how the different aspects affect the results. The type of user division into groups is not relevant here, as it does not relate to whether an algorithm propagates popularity bias overall.

Data.
The resulting correlation and coverage vary across the three datasets. There is not a clear pattern of a dataset being more or less prone to popularity bias across all algorithms. However, we observe in Table 7 that the correlation averaged over all algorithms is higher than in Tables 5 and 6 for a given evaluation strategy. In other words, when trained on the Book-Crossing dataset, the algorithms on average result in recommended lists where item frequency is more highly correlated with item popularity than when trained on the other two datasets. By this metric, the subset of Book-Crossing used by Naghiaei et al. [27] is on average more prone to popularity bias than MovieLens1M and LastFM.
Table 7 shows that the mean item coverage is also higher for Book-Crossing than for the other two datasets in Tables 5 and 6 for a given evaluation strategy. For example, given Modified TrainItems, the mean item coverage for Book-Crossing is 0.2892, for MovieLens1M 0.1762, and for LastFM 0.1145. Looking at Figure 1 and Table 4, we see that Book-Crossing is very sparse and popularity is scattered. Even very popular books have been rated by fewer than 350 users, in contrast with MovieLens1M, where some movies have been rated by more than half of the users. It is therefore expected that item coverage is higher, as popularity is less concentrated. Note also that when UserTest is applied, the item coverage is higher for Book-Crossing than for MovieLens1M, and especially for LastFM, not just on average but for every individual algorithm.

Algorithms.
The algorithms propagate popularity bias to different degrees. Tables 5 to 7 show that BPR, HPF, NeuMF and VAECF consistently result in relatively high positive correlation, as well as low item coverage for TrainItems and Modified TrainItems. In this sense, BPR, HPF, NeuMF and VAECF are consistently prone to popularity bias. Note that these algorithms were only tested by Naghiaei et al. [27], which means that the choice of algorithms may have a large impact on whether popularity bias is observed by a study. On the other hand, Figure 2 shows that NMF shows no correlation for any dataset for TrainItems and Modified TrainItems, as is the case for UserKNN.

Evaluation Strategy.
Evaluation strategy has a large impact on the reported result. Across all datasets and algorithms, there is a strong positive correlation between popularity in profile and in recommendation when UserTest is employed. This observation is in line with the conclusions of Kowald et al. [26], who used UserTest in their study, as well as their follow-up paper [25]. This is due to UserTest only recommending to a user items that they have already consumed in the test set. For example, if a test user has rated 8 items in the test set, only these items are candidates for recommendation and they will all be recommended to that user given a 10-item recommendation. In this case, it is reasonable that popularity in profile and in recommendation correlate; popular items are more likely to be in the users' test sets and therefore be candidates for recommendation.
It is also the case that for every dataset, item coverage is on average higher when UserTest is employed. For example, Table 5 shows that for MovieLens1M average coverage is 0.6469 when UserTest is employed, while only 0.1762 and 0.1914 when Modified TrainItems and TrainItems are employed, respectively. Given that candidates for recommendation are only the items from a user's test set, it follows that more items will be covered by the recommendation process as users have different tastes. This is especially the case for Book-Crossing given its sparsity. The consistency in results when UserTest is deployed prompts us to consider that the popularity bias reported might mostly be a result of the evaluation process instead of the algorithm's functionality or data characteristics.
For TrainItems and Modified TrainItems, the results fluctuate per algorithm. Figure 2 shows that for all datasets, NMF propagates popularity bias when evaluated with UserTest, but does not with the other two strategies. TrainItems and Modified TrainItems show similar results overall. However, in some cases the fact that TrainItems excludes items that the user has rated in the training set does impact the conclusion on whether popularity bias is propagated. For example, ItemKNN trained on LastFM shows no correlation when TrainItems is employed, but positive correlation with Modified TrainItems.
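The difference between the strategies reduces to which (user, item) pairs enter the candidate set. A minimal sketch of the three candidate-generation rules (the function is our own; `train` and `test` map users to the sets of items they rated in each split):

```python
def candidates(strategy, all_items, train, test):
    """Return, per user, the items eligible for recommendation under
    each of the three evaluation strategies."""
    if strategy == "ModifiedTrainItems":
        # Naghiaei et al. [27]: all items, for every training user.
        return {u: set(all_items) for u in train}
    if strategy == "UserTest":
        # Kowald et al. [26]: only the items the user rated in the test set.
        return {u: set(test[u]) for u in test}
    if strategy == "TrainItems":
        # Abdollahpouri et al. [3]: all items the user has not rated in training.
        return {u: set(all_items) - train.get(u, set()) for u in test}
    raise ValueError(strategy)
```

Under UserTest, a 10-item list for a user with 8 test items simply ranks those 8 items, which is why profile and recommendation popularity correlate almost by construction.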

Popularity Bias Propagation per User Group
To assess popularity bias propagation per user group, we report on the %ΔGAP metric and on NDCG@10, as described in Section 4. Given the large number of experiments, we choose to present the results on LastFM, since all three ways of dividing users into groups were possible on this dataset. Tables 8-10 show the %ΔGAP value for each user group, algorithm and user division, using Modified TrainItems, UserTest and TrainItems, respectively. Tables 11-13 show NDCG@10 for each user group, algorithm and user division, using Modified TrainItems, UserTest and TrainItems, respectively. The results on the other datasets are included in the Appendix, Tables 15 to 26.

Algorithms.
It is apparent in Tables 8 to 10 that certain algorithms consistently produce recommended lists with higher average popularity, while others do not. BPR, HPF, NeuMF and VAECF showcase high %ΔGAP values for all groups, regardless of the evaluation strategy. For example, Table 8 shows that when Modified TrainItems is deployed and users are divided into groups based on PopularPercentage or AveragePopularity, NeuMF recommends to Niche users items almost 4 times more popular than they have rated in their profiles (i.e., the %ΔGAP is 399.1 and 399.6, respectively). Note that the same algorithms consistently result in high correlation and low coverage, as discussed in Section 5.1.2. Tables 11 and 13 show that given Modified TrainItems and TrainItems, BPR, HPF, NeuMF and VAECF also result in high NDCG@10 regardless of the user division. However, the NDCG@10 values are significantly lower for the Niche group. We can conclude that these algorithms tend to recommend mostly popular items to all users when trained on LastFM, at least when their default hyperparameters are used. On the contrary, MF mostly results in negative %ΔGAP values across Tables 8 to 10, meaning that the average popularity in the recommended lists is reduced compared to the users' profiles. Finally, some algorithms, like WMF, are inconsistent when it comes to the average popularity of the items they recommend to each group. Additionally, Tables 8 to 10 show that BPR's and NeuMF's recommendations generally result in higher %ΔGAP for the Niche and Diverse groups compared to the Blockbuster-focused group. However, other algorithms do not showcase such consistency, as %ΔGAP fluctuates across evaluation strategies and user divisions. Consequently, it is challenging to deduce that some algorithms tend to treat Niche users unfairly while others do not. It might be that such conclusions can be drawn given a specific context, but not about an algorithm's inherent functionality.
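The %ΔGAP values discussed here follow the usual group-average-popularity formulation (a sketch under that assumption; Section 4 of the paper gives the exact definition): the GAP of a group is the mean, over its users, of the mean popularity of the items in a list, and %ΔGAP is the relative change from profiles to recommendations.

```python
def gap(item_pop, lists_by_user, users):
    """Group Average Popularity: mean over the group's users of the mean
    popularity of the items in their list (profile or recommendation)."""
    per_user = [sum(item_pop[i] for i in lists_by_user[u]) / len(lists_by_user[u])
                for u in users]
    return sum(per_user) / len(per_user)

def delta_gap(item_pop, profiles, recommendations, group):
    """%ΔGAP: relative change (in %) of average popularity,
    recommendations versus profiles, for one user group."""
    gap_p = gap(item_pop, profiles, group)
    gap_r = gap(item_pop, recommendations, group)
    return 100 * (gap_r - gap_p) / gap_p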

User Division in Groups.
The way the users are divided into groups in some cases determines whether the Niche user group is unfairly treated, as well as to what extent. For example, we see in Tables 8 and 10 that for Modified TrainItems and TrainItems, HPF results in higher %ΔGAP for Niche users with PopularPercentage and AveragePopularity, whereas with Mainstreaminess the Blockbuster-focused group receives the highest increase in average popularity. Additionally, PopularPercentage and AveragePopularity also sometimes result in different conclusions on which group is unfairly treated. Table 8 shows that given Modified TrainItems, WMF recommends more or less popular items to the Niche group depending on which user division is assumed (%ΔGAP of −2.8 for PopularPercentage and 0.6 for AveragePopularity). PopularPercentage and AveragePopularity mostly result in similar trends in terms of which group receives unfair treatment. On the other hand, Mainstreaminess often leads to different conclusions than PopularPercentage and AveragePopularity. This is somewhat expected, since Mainstreaminess does not depend on the propensity for popularity within the given data, but on the mainstreaminess score that characterizes the underlying population from which this subset of LastFM stems. Therefore, in order to conclude unfair treatment of a group, it is necessary to define the characteristics of the group in question and whether we refer to specific behavior within the training dataset or incorporate external information as well.

Evaluation Strategy.
As is the case for overall popularity bias propagation, the choice of evaluation strategy largely influences %ΔGAP. When UserTest is employed, almost all algorithms provide recommendations with increased average popularity compared to the user profile, for all user groups and regardless of the type of division (see Table 9). Even in cases where %ΔGAP is negative, the decrease is small. Overall, when UserTest is deployed, the %ΔGAP values do not vary between algorithms as much as with the other two evaluation strategies.
On the other hand, when Modified TrainItems and TrainItems are used, the results fluctuate between algorithms. Furthermore, the recommended lists of the algorithms which consistently perpetuate popularity bias for all groups (BPR, HPF, NeuMF and VAECF) showcase higher %ΔGAP with Modified TrainItems compared to TrainItems across all types of user divisions (see Tables 8 and 10). This likely signifies that excluding items that a user has rated from the candidate list reduces popularity bias. Given that, by definition, many users have rated the popular items, these items are excluded from many users' candidate lists when TrainItems is deployed.
The choice of evaluation strategy also has a large effect on NDCG@10. Table 25 shows that UserTest results in higher NDCG@10 than the other two strategies. Since the pool of candidate items is limited to a user's test items when UserTest is deployed, it follows that the recommended list will likely be similar to the 'ideal' list. Additionally, the produced NDCG@10 values are generally higher for TrainItems than for Modified TrainItems across algorithms and user divisions. This is due to the exclusion of training items from a user's candidate list when TrainItems is deployed. In our calculation of NDCG@10, we only considered items to have a nonzero utility for the user if they were consumed unbeknownst to the system, i.e., the test items (see [12]).
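The binary-utility NDCG@10 described above can be sketched as follows (log2 discounting assumed; an item has utility 1 only if it appears in the user's test set, 0 otherwise):

```python
import math

def ndcg_at_k(recommended, test_items, k=10):
    """NDCG@k with binary utility: an item is relevant only if it appears
    in the user's test set, i.e., was consumed unseen by the system."""
    rec = recommended[:k]
    dcg = sum(1 / math.log2(rank + 2)
              for rank, item in enumerate(rec) if item in test_items)
    # The ideal list places all (up to k) relevant items at the top.
    ideal_hits = min(len(test_items), k)
    idcg = sum(1 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

Under UserTest the candidate pool is exactly the test set, so the recommended list can only reorder relevant items, which mechanically pushes NDCG@10 toward 1.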

DISCUSSION AND FUTURE WORK
The results show that data, algorithms, user division in groups, and evaluation strategy influence the conclusion on whether popularity bias is propagated and whether the users that have lesser propensity for popular items are disproportionately affected. The datasets have impactful differences, especially when it comes to the distribution of item popularity; Book-Crossing is a very sparse dataset, which results in higher item coverage on average compared to the other two datasets, even when the correlation between popularity in profile and in recommendation is also high. At the same time, the choice of which algorithms to include in the study influences the results; some matrix factorization algorithms (HPF, NeuMF, VAECF) consistently propagate popularity bias, and in most cases recommend disproportionately many popular items to the Niche users. User division in groups often defines whether unfairness can be concluded, especially when a proxy measure of propensity towards popular items is used that does not stem from the specific training dataset. Finally, evaluation strategy, and specifically the generation of user and item candidates, is crucial given the effect it has on whether both overall popularity bias propagation and user group unfairness can be observed.
The results indicate that perceived propagation of popularity bias is sensitive to various aspects of the evaluation process that are often unaddressed. The fact that tweaking these aspects determines whether propagation of popularity bias can be concluded or not renders the individual conclusions of the studies unique to their context and setup, and less generalizable. We conclude that it is challenging to report on popularity bias as a phenomenon that persists as a result of an algorithm's functionality. While some algorithms we studied do consistently propagate popularity bias given our metrics, for most algorithms the results largely fluctuate depending on the other aspects. Similarly, even though the choice of dataset does have an impact, we showed that the divergence in results across the reproduced studies could not be solely attributed to the datasets' characteristics.
Consequently, we recommend careful consideration of each of these aspects when popularity bias is evaluated. Future studies on the topic should reflect on:
- Data: Which domain does the data come from? What is the size of the dataset? How sparse is it? What is the long-tail item distribution?
- Algorithms: What type of optimization is performed? How sensitive is it to item popularity?
- User division in groups: How is user propensity for popular items defined in this domain? What characteristics would deem a user Niche? Is behavior in the dataset the relevant factor, or is external information needed?
- Evaluation strategy: Which exact aspect of popularity bias is the study evaluating? How can it be translated into choices for every step of the evaluation process?
Evaluation in the case of recommender systems is generally challenging. Offline evaluation, in particular, is restricted by the lack of data or lack of absolute ground truth. The fact that a user has rated an item highly does not necessarily mean that they prefer it to an unrated one; it can simply mean lack of awareness. In other words, high accuracy in offline evaluation does not equate to a successful system. As a result, the choice of evaluation strategy is often application-specific, which extends further than the choice of metrics. For example, the same algorithm might be evaluated differently when used in an application which recommends 5 items to a user instead of 10. Some metrics can only be evaluated with certain strategies because of the data they require. For example, Mean Absolute Error can only be calculated for user-item pairings that do exist in the dataset, and thus UserTest would be appropriate in this case. However, other metrics leave room for different choices, and there might not be one right way to evaluate them.
To appropriately evaluate a metric, it is important to consider the exact phenomenon it is intended to measure. While popularity bias is not a recent topic within recommender systems research, there is a certain lack of specificity around its exact meaning. Correlation, coverage, and difference in average popularity can all be useful in measuring some facets of popularity bias, among others. Having said that, their interpretation requires careful design of an evaluation process that answers the question under investigation. In the case of the three studies, each evaluation strategy answers a different question:
- Modified TrainItems does not disregard the items a user has already consumed from the recommendation process. Such a strategy is unsuitable when assessing general performance, as it leads to information leakage. It could be used to assess popularity bias in a system that does recommend to users items that they have already consumed and positively rated. However, the information leakage needs to be considered and accounted for.
- UserTest only recommends to a user items that they have already consumed unknowingly to the system (i.e., from the test set), which renders it inappropriate for evaluating overall popularity bias. However, it might still be interesting to observe whether the items a user has rated, which are unseen to the system, are ranked differently by the system than by the user because of popularity bias.
- TrainItems measures whether a learned model propagates popularity bias into unseen user-item combinations. It is our belief that this strategy is the most appropriate to measure popularity bias, as it more closely resembles a real-world scenario of learning the preferences of the users and recommending items to them that they have not yet rated.
Studies on the topic of popularity bias should account for these differences in evaluation strategy by specifying the research question, as well as the evaluation strategy that accompanies it. To ensure reproducibility of such studies, a way ahead could be to adapt the dimensions of evaluation benchmarked by Said and Bellogín [29] into a checklist for submissions to conferences and journals. Such an initiative can motivate researchers to think critically about the evaluation strategy they employ in their experimentation, and allow for easier comparison between the reported results of different studies.
Our study has limitations that future work should address. It is our intention to approach the phenomenon of popularity bias fundamentally, by locating specific data and recommendation characteristics that instigate its propagation. While the aspects we identified through reproducing these studies are very important, there are additional ones we plan to consider. First, our reported results on UserKNN differ significantly from Abdollahpouri et al. [3] and Naghiaei et al. [27], even when the same evaluation strategy and data are used. The reason is presumably tuning and implementation; Abdollahpouri et al. [3] state that they tuned all the algorithms to reach similar precision in order to compare them appropriately. On the other hand, Naghiaei et al. [27] manually trained UserKNN instead of using Cornac as they did for the other algorithms. This prompts us to consider tuning an important aspect of the process as well, and future work should focus on the effect it has on the propagation of popularity bias. In practice, algorithms are tuned to satisfy some accuracy metric instead of being trained with their default hyperparameters, and thus it is realistic to assume that tuning takes place. Second, by measuring propagation of popularity bias with cross-validation instead of one-shot prediction, we could account for small random differences between algorithms and generalize their comparison. Finally, all three studies used different Python libraries to perform the recommendation process. For future work, we plan to explore the effect of potentially different implementations of the same algorithm across these libraries.

CONCLUSION
In this paper, we reproduced the analysis of the propagation of popularity bias by commonly used collaborative filtering algorithms performed by three studies using different datasets from the media domain. The studies evaluated overall popularity bias propagation, as well as whether users with niche tastes were unfairly treated by the system. The reported results differed for both evaluation tasks. We identified four aspects which varied across the three studies and could potentially account for the divergence in results: data, algorithms, division of users in groups, and evaluation strategy. We designed and carried out experiments to investigate to what extent each aspect impacted the results by combining all possible choices for each aspect. We found that all aspects affected the result to some degree. Evaluation strategy specifically accounted for much of the divergence, as the one employed by the study in the music domain resulted in reporting propagation of popularity bias for all datasets and all algorithms. We conclude that clarity around the evaluation strategy employed during the recommendation process is necessary to reproduce and compare analyses for different algorithms and datasets, especially for phenomena like popularity bias, whose evaluation is not yet standardized in the literature.

A APPENDIX
This section serves as supplementary material.

A.1 LastFM: Comparison between Complete and Filtered Dataset
As described in Section 4.1, for computational reasons we filtered the LastFM dataset. We assessed whether the sampling has a large impact on the conclusions by comparing our results with the results reported by Kowald et al. [26]. In order to ensure sufficient comparability between the results, we applied the code provided by Kowald et al. [26] on our sampled dataset. We plot the relation between popularity in profile and recommendation frequency for every item in the sampled dataset, given the algorithms UserKNN, UserKNN with means, and NMF, which are the algorithms reported by Kowald et al. [26]. Figure 3 shows that there is a consistent correlation between popularity in profile and frequency of recommendation. This conclusion aligns with the results reported by Kowald et al. [26] and supports our argument that the consistent correlation stems from the evaluation strategy deployed, regardless of the filtering of the dataset. Additionally, we calculated the MAE metric for each user group of the sampled dataset, as seen in Table 14. While the exact numbers differ from the ones reported by Kowald et al. [26], the trends are consistent, with the Niche user group always receiving the worst rating predictions for the given algorithms, evaluation strategy and division of users in groups (statistically significant according to a t-test with p < 0.005, as indicated by **). Across the algorithms, the best results are provided by NMF.

A.2 Extensive Results
A.2.1 Overall Popularity Bias Propagation. We plot item popularity in profile versus frequency of appearance in the recommended lists of each algorithm, for every dataset and evaluation strategy, in Figures 4 to 13.
A.2.2 Popularity Bias Propagation per User Group. We include the %ΔGAP values and NDCG@10 for the datasets MovieLens1M and Book-Crossing, for every algorithm, evaluation strategy and user division, in Tables 15 to 26. ** indicates that the corresponding result is significantly higher than for the two other groups, while * indicates that it is significantly higher than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Fig. 1 .
Fig. 1. The long-tail distribution of item popularity in all datasets.

(1) Modified TrainItems (Naghiaei et al. [27]): (a) Recommend items to every user in the training set. (b) Choose out of all the items.
(2) UserTest (Kowald et al. [26]): (a) Recommend items to every user in the test set. (b) Choose out of all the items the user has rated in the test set.
(3) TrainItems (Abdollahpouri et al. [3]): (a) Recommend items to every user in the test set. (b) Choose out of all the items the user has not rated in the training set.

Fig. 2 .
Fig. 2. Item popularity in profile versus frequency of recommendation by the algorithm NMF, for every dataset and evaluation strategy.

Fig. 3 .
Fig. 3. Relation between recommendation frequency and popularity in profile for the sampled LastFM dataset.

Fig. 4 .
Fig. 4. Item popularity in profile versus frequency of recommendation by the algorithm UserKNN, for every dataset and evaluation strategy.

Fig. 6 .
Fig. 6. Item popularity in profile versus frequency of recommendation by the algorithm UserKNN with means, for every dataset and evaluation strategy.

Fig. 7 .
Fig. 7. Item popularity in profile versus frequency of recommendation by the algorithm MF, for every dataset and evaluation strategy.

Fig. 8 .
Fig. 8. Item popularity in profile versus frequency of recommendation by the algorithm PMF, for every dataset and evaluation strategy.

Fig. 10 .
Fig. 10. Item popularity in profile versus frequency of recommendation by the algorithm HPF, for every dataset and evaluation strategy.

Fig. 12 .
Fig. 12. Item popularity in profile versus frequency of recommendation by the algorithm BPR, for every dataset and evaluation strategy.

Fig. 13 .
Fig. 13. Item popularity in profile versus frequency of recommendation by the algorithm VAECF, for every dataset and evaluation strategy.

Table 1 .
The Characteristics of the Datasets used by the Three Studies

Table 2 .
The Algorithms Tested by the Three Studies

Table 3 .
Overview of the User and Item Candidates on which Each Study based the Evaluation of Popularity Bias Propagation

Table 4 .
The Characteristics of the Datasets used in Our Experiments

Table 5 .
Correlation and Item Coverage for the MovieLens1M Dataset

Table 6 .
Correlation and Item Coverage for the LastFM Dataset

Table 7 .
Correlation and Item Coverage for the Book-Crossing Dataset

Table 8 .
%ΔGAP for Every User Group, Algorithm and Way of Dividing Users, when Modified TrainItems is Used. N, D and BF signify Niche, Diverse and Blockbuster-focused. ** indicates that the corresponding result is significantly higher than for the two other groups, while * indicates that it is significantly higher than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 9 .
%ΔGAP for Every User Group, Algorithm and Way of Dividing Users, when UserTest is Used. N, D and BF signify Niche, Diverse and Blockbuster-focused. ** indicates that the corresponding result is significantly higher than for the two other groups, while * indicates that it is significantly higher than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 10 .
%ΔGAP for Every User Group, Algorithm and Way of Dividing Users, when TrainItems is Used. N, D and BF signify Niche, Diverse and Blockbuster-focused. ** indicates that the corresponding result is significantly higher than for the two other groups, while * indicates that it is significantly higher than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 11 .
NDCG@10 for Every User Group, Algorithm and Way of Dividing Users, when Modified TrainItems is Used. N, D and BF signify Niche, Diverse and Blockbuster-focused. ** indicates that the corresponding result is significantly lower than for the two other groups, while * indicates that it is significantly lower than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 12 .
NDCG@10 for Every User Group, Algorithm and Way of Dividing Users, when UserTest is Used. N, D and BF signify Niche, Diverse and Blockbuster-focused. ** indicates that the corresponding result is significantly lower than for the two other groups, while * indicates that it is significantly lower than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 13 .
NDCG@10 for Every User Group, Algorithm and Way of Dividing Users, when TrainItems is Used. N, D and BF signify Niche, Diverse and Blockbuster-focused. ** indicates that the corresponding result is significantly lower than for the two other groups, while * indicates that it is significantly lower than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 14 .
MAE Results for UserKNN, UserKNN with Means and NMF, Applied on the Sampled LastFM Dataset

Table 15 .
Percentage of Increase in Average Popularity of Items in Recommendation Versus in the MovieLens1M Dataset (%ΔGAP) for Every User Group, Algorithm and Way of Dividing Users, when Modified TrainItems is Used. ** indicates that the corresponding result is significantly higher than for the two other groups, while * indicates that it is significantly higher than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 16 .
Percentage of Increase in Average Popularity of Items in Recommendation Versus in the MovieLens1M Dataset (%ΔGAP) for Every User Group, Algorithm and Way of Dividing Users, when UserTest is Used. ** indicates that the corresponding result is significantly higher than for the two other groups, while * indicates that it is significantly higher than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 17 .
Percentage of Increase in Average Popularity of Items in Recommendation Versus in the MovieLens1M Dataset (%ΔGAP) for Every User Group, Algorithm and Way of Dividing Users, when TrainItems is Used. ** indicates that the corresponding result is significantly higher than for the two other groups, while * indicates that it is significantly higher than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 18 .
Percentage of Increase in Average Popularity of Items in Recommendation Versus in the Book-Crossing Dataset (%ΔGAP) for Every User Group, Algorithm and Way of Dividing Users, when Modified TrainItems is Used

Table 19 .
Percentage of Increase in Average Popularity of Items in Recommendation Versus in the Book-Crossing Dataset (%ΔGAP) for Every User Group, Algorithm and Way of Dividing Users, when UserTest is Used. ** indicates that the corresponding result is significantly higher than for the two other groups, while * indicates that it is significantly higher than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 20 .
Percentage of Increase in Average Popularity of Items in Recommendation Versus in the Book-Crossing Dataset (%ΔGAP) for Every User Group, Algorithm and Way of Dividing Users, when TrainItems is Used. ** indicates that the corresponding result is significantly higher than for the two other groups, while * indicates that it is significantly higher than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.

Table 21 .
NDCG@10 in the MovieLens1M Dataset for Every User Group, Algorithm and Way of Dividing Users, when Modified TrainItems is Used. N, D and BF signify Niche, Diverse and Blockbuster-focused. ** indicates that the corresponding result is significantly lower than for the two other groups, while * indicates that it is significantly lower than for one of the other groups. Statistical significance was concluded based on a t-test with p < 0.005.