Mitigating Mainstream Bias in Recommendation via Cost-sensitive Learning

Mainstream bias, where some users receive poor recommendations because their preferences are uncommon or simply because they are less active, is an important aspect to consider regarding fairness in recommender systems. Existing methods to mitigate mainstream bias do not explicitly model the importance of these non-mainstream users or, when they do, it is in a way that is not necessarily compatible with the data and recommendation model at hand. In contrast, we use the recommendation utility as a more generic and implicit proxy to quantify mainstreamness, and propose a simple user-weighting approach to incorporate it into the training process while taking the cost of potential recommendation errors into account. We provide extensive experimental results showing that quantifying mainstreamness via utility is better able to identify non-mainstream users, and that they are indeed better served when training the model in a cost-sensitive way. This is achieved with negligible or no loss in overall recommendation accuracy, meaning that the models learn a better balance across users. In addition, we show that research of this kind, which evaluates recommendation quality at the individual user level, may not be reliable unless enough interactions are used when assessing model performance.


INTRODUCTION
One of the critical limitations of recommender systems based on collaborative filtering (CF) models [5] is that they are not fair in how they serve different groups of users [9,11]. This fairness issue is a result of the varying quality of users' neighborhoods (groups of users with similar preferences) from which information is taken to train a CF model [10,24]. The information collected from large, coherent, and information-rich neighborhoods will be the dominant one in steering the process of learning to recommend for all users. We refer to such dominant neighborhoods as mainstream. Because the users belonging to such neighborhoods (the mainstream users) are compatible with the learned model, they are optimally served. For the non-mainstream users, the neighborhoods cannot fully reflect their genuine preferences. Examples are niche groups who deviate from the mainstream and whose interaction information is therefore less rich [11], users who are less active than the mainstream users [15], and users whose preferences are not well pronounced. All this will make the non-mainstream users receive recommendations of a lower quality than the mainstream users. The difference in the quality of the CF model for these two user groups, further referred to as the mainstream bias, will result in the continuous improvement of the performance for the mainstream group, and a continuous decrease of the performance for the rest [12].
While the issue of treating users differently by a recommender system in general has been addressed by a number of approaches, making for example assumptions about the relation between users' gender [14] or demographics [4] and the quality of recommendation, not many approaches have focused specifically on addressing the mainstream bias. Li et al. [10] deployed an autoencoder [20] for feature reconstruction as an adversary to a traditional CF model, forcing it to deviate from the pure similarity-based learning and make the learned model more compatible with the non-mainstream users. More specifically, the autoencoder was deployed to steer the process of learning the user/item representation space for rating prediction via optimal reconstruction of the properties of all users, mainstream and otherwise, assuming this would lead to equal treatment of users during recommendation. Still, a more explicit focus on the mainstreamness of users is needed to ensure that the bias is effectively addressed.
Inspired by outlier detection techniques, Zhu and Caverlee [24] did focus on explicitly quantifying mainstreamness via similarities of user-preference profiles, and incorporated them to fine-tune the recommendation process for different user groups. However, in the absence of ground truth data about mainstreamness, it is difficult to assess how well these approaches identify non-mainstream users. In addition, these mainstreamness statistics are model-agnostic in the sense that they are independent of the recommendation strategy, effectively ignoring the model's own capability to reduce the mainstream bias or even amplify it. As a result, the learning process could be tailored to the wrong users. In this paper, we choose to focus where the effect of mainstreamness is directly observed, that is, the recommendation utility provided by the data and recommendation model at hand. If a user receives poor recommendations it could be because their preferences deviate from the rest, or because there is not enough data to properly quantify their similarity to other users or to fully exploit it. Therefore, we choose utility as an implicit proxy for mainstreamness. Through this quantification of user mainstreamness, we make the training process focus on the non-mainstream users by assigning them higher weights. We do so, however, in a cost-sensitive way [22], taking the cost of recommendation errors into account while training the CF model. Our results show that our implicit measurement of mainstreamness via utility is better able to differentiate niche users than an explicit approach, and that the cost-sensitive learning strategy does mitigate the bias by balancing the recommendation quality across users. Finally, we investigate data requirements for conducting research on mainstream bias at the individual user level, and provide suggestions for reliable experimentation in this area.

PROPOSED APPROACH
The basis of our approach is a weighted loss function where every user $u \in U$ is assigned a weight $w(u)$ that informs the learning process about the importance of every user's individual recommendation loss. The global loss is thus simply

$$\mathcal{L} = \sum_{u \in U} w(u) \cdot \mathcal{L}_u , \tag{1}$$

where the recommendation loss $\mathcal{L}_u$ is specific to the model and learning paradigm. This way, we explicitly tell the learning process which users to optimize for by means of $w$, which, in our case, should be high for non-mainstream users and low for mainstream users.
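As a minimal sketch of the weighted global loss described above, assuming the per-user losses have already been computed by whatever model is being trained, the aggregation amounts to a weighted sum. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def weighted_global_loss(user_losses, user_weights):
    """Global loss as the weighted sum of per-user losses: sum_u w(u) * L_u."""
    user_losses = np.asarray(user_losses, dtype=float)
    user_weights = np.asarray(user_weights, dtype=float)
    if user_losses.shape != user_weights.shape:
        raise ValueError("exactly one weight per user loss is required")
    return float(np.sum(user_weights * user_losses))

# Hypothetical per-user losses and weights (illustrative values only).
losses = [0.2, 0.5, 0.9]
weights = [1.0, 2.0, 5.0]  # larger weight for less mainstream users
total = weighted_global_loss(losses, weights)
```

With all weights equal to 1 the expression reduces to the plain sum of user losses, which corresponds to the unweighted baseline.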

Definition of Weights
As explained in the previous section, we define $w$ as a function of the user mainstreamness $m_u$. However, rather than simply using a naïve transformation of $m_u$, we introduce flexibility through a cost function that maps user mainstreamness onto a cost value.
In particular, and assuming $m_u$ ranges between 0 and 1, we use the density function of a Normal distribution truncated between 0 and 1, with zero mean and variance adjusted to achieve a contrast ranging between 5 (i.e., users with mainstreamness $m_u = 0$ have a cost 5 times as large as users with $m_u = 1$) and 80 (i.e., 80 times as much). This is a simple choice to make $w$ smooth and monotonically decreasing, but other cost functions that emphasize different levels of mainstreamness are of course possible; we leave this discussion for further work. Fig. 1 shows some examples. Nonetheless, the formulation of the cost function may consider various aspects tailored to the business case, as well as different magnitudes for the contrast between users with low and high mainstreamness. For example, it would be reasonable to assign very high weights to non-mainstream users with high activity, or to users with very low activity as an attempt to reduce the churn rate. An important point to consider when defining $w$ is the distribution of mainstreamness across users. It could be the case that, given the current data and model, the least mainstream users are actually fairly mainstream already, so their weight relative to the most mainstream users should be adjusted via a smaller contrast. It could also be the case that the dataset is very sparse and there are simply not enough neighbors around users for the model to learn a good representation. That is, the majority of users could be considered non-mainstream, and as a result the cost function would hardly differentiate among them. Lastly, one could decide to compute $m_u$ in several different ways (see next section), which could potentially lead to quite different mainstreamness score distributions altogether, ultimately leading to a different set of weight values even for the same users.
In order to minimize this dependence on the dataset and mainstreamness definition, and ensure that the full co-domain of the cost function is used, we first normalize the raw mainstreamness scores. Simply re-scaling between the minimum and maximum could still lead to a disproportionate use of small parts of the co-domain, and would also be very sensitive to outlier users. Instead, we use the rank statistic of $m_u$ normalized in [0, 1]. We achieve this by using the empirical cumulative distribution function (ecdf):

$$w(u) = \mathrm{cost}\left(\mathrm{ecdf}(m_u)\right),$$

where, as mentioned, cost is defined in terms of a truncated Normal density function.
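The weight definition can be sketched as follows. For a zero-mean Normal density, the ratio between its values at 0 and at 1 is $\exp(1/(2\sigma^2))$, so a target contrast $c$ gives $\sigma^2 = 1/(2 \ln c)$. The truncation only contributes a normalizing constant that rescales all weights equally, so this sketch assumes the convenient scaling cost(1) = 1 and cost(0) = c; that scaling, the function names, and the sample scores are assumptions for illustration. Note also that the ecdf computed this way takes values in (0, 1], so the smallest score maps to 1/n rather than exactly 0.

```python
import numpy as np

def ecdf_rank(scores):
    """Empirical CDF value of each score: fraction of users with a score <= it."""
    scores = np.asarray(scores, dtype=float)
    sorted_scores = np.sort(scores)
    # side="right" counts how many scores are <= each value, so ties share a value
    return np.searchsorted(sorted_scores, scores, side="right") / scores.size

def cost(m, contrast):
    """Zero-mean Normal density shape on [0, 1], scaled so that cost(1) = 1.

    The variance solves exp(1 / (2 * sigma^2)) = contrast, hence cost(0) = contrast.
    """
    m = np.asarray(m, dtype=float)
    sigma2 = 1.0 / (2.0 * np.log(contrast))
    return contrast * np.exp(-m ** 2 / (2.0 * sigma2))

def user_weights(mainstreamness, contrast):
    """w(u) = cost(ecdf(m_u)) for every user."""
    return cost(ecdf_rank(mainstreamness), contrast)

# Hypothetical raw mainstreamness scores for four users.
m = [0.1, 0.4, 0.9, 0.7]
w = user_weights(m, contrast=10)
```

Because only ranks enter the cost function, any monotone rescaling of the raw scores yields the same weights, which is the robustness to outliers and score distribution that the normalization is meant to provide.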

Measurement of Mainstreamness
An explicit approach to compute $m_u$ would ideally follow some notion of mainstreamness, but mainstreamness is itself a complex construct that is very hard to define formally [1,10,24]. Recently, Zhu and Caverlee [24] took inspiration from outlier detection techniques to propose four different definitions:
• Sim: users are mainstream to the extent that their interactions are similar to those of the other users. The Jaccard coefficient is used to measure the average similarity between a user and all the others.
• Den: users are mainstream to the extent that there are enough close neighbors to calculate similarity with. The local outlier factor algorithm (LOF) [2] is used to identify niche users.
• Dis: users are mainstream to the extent that their interactions are common in the dataset, that is, they interact with popular items. The cosine similarity is used to measure the similarity between a user and the average user interactions.
• Deep: similar to Den, niche users are identified by an outlier detection algorithm. In particular, the deep support vector data description algorithm (DeepSVDD) [19] is used.
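To make the Sim definition concrete, the following is a minimal sketch of an average Jaccard similarity between each user and all the others, assuming a dense binary user-item matrix. It is not the implementation of Zhu and Caverlee [24]; the function name and the handling of users without interactions are assumptions.

```python
import numpy as np

def sim_mainstreamness(interactions):
    """Mean Jaccard similarity between each user and every other user.

    interactions: binary (n_users, n_items) array, 1 = user interacted with item.
    """
    x = np.asarray(interactions, dtype=float)
    n_users = x.shape[0]
    intersection = x @ x.T                      # shared items per user pair
    sizes = x.sum(axis=1)                       # items per user
    union = sizes[:, None] + sizes[None, :] - intersection
    # Assumption: Jaccard is taken as 0 when neither user has interactions.
    jaccard = np.divide(intersection, union,
                        out=np.zeros_like(union), where=union > 0)
    np.fill_diagonal(jaccard, 0.0)              # exclude self-similarity
    return jaccard.sum(axis=1) / (n_users - 1)

# Users 0 and 1 share the same items; user 2 shares none with them.
toy = [[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
sim_scores = sim_mainstreamness(toy)
```

The dense pairwise computation grows quadratically with the number of users, which illustrates why such explicit scores can become expensive on large datasets.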
However, it is difficult to assess how well these, or any other definitions for that matter, correlate with the concept of mainstreamness. To illustrate, Fig. 2 compares these four definitions as applied to the MovieLens 1M dataset. Although they are somewhat correlated to one another, it is evident that they produce very different scores. For instance, Sim and Dis lead to nicely shaped distributions, suggesting few users with extreme (non-)mainstreamness. However, Den and Deep lead to very skewed distributions, even in the opposite direction, pointing to many users with extreme scores. This shows that the same user could be considered either mainstream or non-mainstream, depending on how we choose to define mainstreamness.
Furthermore, it should be noted that these four definitions of mainstreamness are agnostic to the recommendation model. However, the effect of mainstreamness ultimately depends on the model and how it is able to exploit the specifics of the dataset it is trained on. It is not far-fetched to think of a user, assessed as non-mainstream, who receives bad recommendations under one model but good recommendations under a more capable one. This leads us to consider an alternative, implicit way to quantify mainstreamness that is not model-agnostic. In particular, we decide to focus where the effect of mainstreamness is to be observed, that is, the recommendation utility provided by the recommendation model at hand. This is where mainstreamness will ultimately have an impact. The very nature of collaborative filtering tells us that if a user receives poor recommendations it is because they are non-mainstream under the current model: they cannot be properly represented, either because their preferences are somehow different from those of their closest neighbors, or because there is not enough data to properly quantify their similarity. Therefore, we use utility as a proxy for mainstreamness. Since utility, just like mainstreamness, is a complex concept that is difficult to measure, we decide to simply use the accuracy of the recommendation model for that user, measured through a ranking metric such as nDCG.

Table 1: Dataset statistics after pre-filtering.

| Dataset | #users | #items | #ratings | Density |
| MovieLens 1M [6] | 6,040 | 3,609 | 562,957 | 2.583% |
| BeerAdvocate [13] | 8,821 | 43,663 | 780,752 | 0.203% |
| Amazon Digital Music [16] | 14,057 | 379,171 | 619,673 | 0.011% |
| Amazon Musical Instruments [16] | 15,270 | 585,766 | 862,798 | 0.010% |
But there is the question of which accuracy scores we actually use. In principle, these scores should reflect user mainstreamness when there is no mechanism to minimize its effect, and they should be achieved by the recommendation model in the dataset at hand. Therefore, we decide to use the accuracy achieved, on a validation set, by the vanilla model whose loss function is as in Eq. (1) but using no weights. As intended, we thus first see how the model reacts to mainstreamness as reflected in the observed utility for users, and then act upon it in a cost-sensitive way.
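As an illustration of a per-user utility score, the sketch below computes nDCG with binary relevance for one user's ranked candidates. It assumes no rank cutoff and the standard logarithmic discount; the exact nDCG variant is not specified here, so these choices and the function name are assumptions.

```python
import numpy as np

def ndcg(scores, relevant):
    """nDCG of one user's ranking with binary relevance and no rank cutoff.

    scores: predicted scores for the candidate items.
    relevant: 0/1 labels aligned with scores.
    """
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=float)
    order = np.argsort(-scores, kind="stable")   # best-scored items first
    discounts = 1.0 / np.log2(np.arange(2, scores.size + 2))
    dcg = np.sum(relevant[order] * discounts)
    ideal = np.sum(np.sort(relevant)[::-1] * discounts)
    return float(dcg / ideal) if ideal > 0 else 0.0

# Hypothetical validation candidates: both relevant items are ranked on top.
perfect = ndcg([0.9, 0.1, 0.5], [1, 0, 1])
# The only relevant item is ranked second.
second = ndcg([0.1, 0.9], [1, 0])
```

Applying such a function to every user's validation ranking under the unweighted model yields one utility score per user, which can then serve as the mainstreamness score fed to the cost function.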

EXPERIMENTAL DESIGN
We carried out a number of experiments to investigate the effectiveness of the proposed approach in mitigating the mainstream bias, as well as the effect of the contrast applied by the cost function. In particular, we study contrasts x5, x10, x20, x50 and x80, that is, the most non-mainstream user has a weight between 5 and 80 times larger than that of the most mainstream user. Fig. 1 details the cost functions. Regarding the measurement of mainstreamness, we consider both an explicit and an implicit quantification. For the former, we follow Zhu and Caverlee [24] and compute Sim scores. This choice is motivated by the time complexity of their four approaches (the computation of mainstreamness may quickly become intractable as the numbers of users and items increase; while their datasets include a few thousand items, ours span from a few thousand to over half a million), and their correlation to one another (Sim is also the one most correlated with the others, in particular with Deep). For the implicit quantification we compute utility scores using the metric nDCG as an exemplar of recommender systems research; hereafter, we will refer to this definition of mainstreamness as Util.
We selected four real-world datasets containing user-item rating interactions from various domains and with different densities, especially including some highly sparse datasets (see Table 1). In line with common practice in ranking-oriented recommender systems research, we treat all existing interactions in the datasets as relevant, and all other unseen interactions as irrelevant. We use LensKit [3] to evenly split the relevant items for each user into training, validation and test sets. To make the modeling of utility, and hence of mainstreamness, robust, each user has at least five relevant interactions in each of the three sets; we explain the rationale for this decision in Section 5. For training the model, we follow He et al. [7] and Wu et al. [23] and randomly sample four irrelevant items per relevant item in the training partition. For validation and test, we follow DaisyRec [21] and evaluate the model for each user by ranking a total of 500 items consisting of their relevant items in the validation/test partition and a set of randomly sampled irrelevant items. Finally, to make sure relevant items are the minority, as happens in reality, we truncate the number of relevant interactions to 200. The dataset statistics after processing are shown in Table 1.
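The negative sampling step for training can be sketched as below: for one user, four irrelevant items are drawn per relevant training item. Sampling uniformly and without replacement, the function name, and its parameters are assumptions made for illustration; the cited works may differ in these details.

```python
import numpy as np

def sample_negatives(relevant_items, n_items, per_positive=4, rng=None):
    """Draw irrelevant item ids for one user, `per_positive` per relevant item.

    Assumes item ids are 0..n_items-1 and samples uniformly without replacement.
    """
    rng = np.random.default_rng() if rng is None else rng
    relevant = np.asarray(list(relevant_items), dtype=int)
    candidates = np.setdiff1d(np.arange(n_items), relevant)
    n_samples = min(per_positive * relevant.size, candidates.size)
    return rng.choice(candidates, size=n_samples, replace=False)

# Hypothetical user with two relevant training items in a 20-item catalog.
negatives = sample_negatives([1, 3], n_items=20, rng=np.random.default_rng(0))
```

The same routine could be reused with a different sample size to fill an evaluation candidate list up to a fixed length, such as the 500 items used for validation and test.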
Regarding the recommendation model, we deploy a simple but effective CF model that only utilizes user-item interactions. Specifically, we choose Factorization Machines (FM) [18], trained by minimizing the binary cross-entropy (BCE) loss with the Adaptive Moment Estimation (Adam) [8] optimizer, and leave the investigation of other training paradigms for future work. For each user, the BCE loss is normalized by dividing by the total number of relevant and irrelevant items used for training, so that all user losses are on the same scale in Eq. (1). After a fine-tuning process based on grid search, we fixed several key hyper-parameters including the dimension of vectors used for interaction (32), learning rate (0.0001), L2-regularization coefficient to avoid overfitting (0.001), and batch size (512).
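The per-user normalization of the BCE loss can be sketched as follows, assuming the model outputs one raw score (logit) per training item. The function name and example values are hypothetical; the point is only that dividing by the number of items puts users with few and many interactions on the same scale.

```python
import numpy as np

def user_bce_loss(logits, labels):
    """Binary cross-entropy averaged over one user's training items.

    logits: raw model scores; labels: 1 for relevant, 0 for sampled irrelevant.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # BCE with logits: log(1 + exp(z)) - y * z, computed stably via logaddexp
    per_item = np.logaddexp(0.0, logits) - labels * logits
    return float(per_item.sum() / labels.size)

# One relevant item and four sampled irrelevant items, all scored 0 (p = 0.5).
uninformed = user_bce_loss([0.0] * 5, [1, 0, 0, 0, 0])
```

An uninformed model that predicts probability 0.5 everywhere therefore gets the same loss, log 2, for every user regardless of how many items they have.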
All models are trained for 300 epochs to ensure full convergence, and with 3 different random initializations to minimize random effects due to the sampling process. The whole pipeline is implemented in PyTorch [17], and all experiments are run on one NVIDIA GeForce GTX 2080Ti GPU.

Mainstreamness and Utility
We first examine how Sim and Util differentiate between mainstream and non-mainstream users. In particular, we are interested in how well they correlate with the test nDCG scores obtained by the baseline FM model: non-mainstream users should receive recommendations with low nDCG scores, while mainstream users should receive higher scores. For each of the four datasets, Fig. 3 compares Sim and Util. We can first see that both approaches lead to similar distributions in the Amazon datasets, where there appear to be many non-mainstream users. However, they somewhat disagree in the BeerAdvocate dataset, where Util does not identify many non-mainstream users to benefit from the cost-sensitive approach. In terms of correlation with the test nDCG scores, we can see that Util is much better correlated, especially in the Amazon datasets. This points to the possibility that Sim identifies many non-mainstream users to whom the model is still able to offer good recommendations. If the training process increases their importance by assigning them a high weight $w$, we may lose the opportunity to focus on those users that still receive poor recommendations.
In order to assess the effectiveness of the cost-sensitive approach for the mitigation of the mainstream bias, we will look in the next section into different groups of users separated by their mainstreamness: group 'low' contains the 20% of users with lowest mainstreamness scores on the baseline model, group 'med-low' contains the next 20% of users, group 'med' contains the middle 20% of users, and so on with groups 'med-high' and 'high'. An effective mitigation of the mainstream bias would be reflected in increased performance for the lower groups, which ideally should be those with lowest test nDCG scores in the baseline model. Fig. 4 shows how well Sim and Util separate users in these five groups. We can first see that the groups are indeed correlated with nDCG, but we can notice that this correlation is stronger with Util, especially in the Amazon datasets (the low groups receive lower utility, and the higher groups receive higher utility). We can also see that groups tend to overlap substantially when separated by Sim, potentially misplacing users. This overlap can be quantified by an ANOVA model of nDCG with two factors: dataset, and user group nested within dataset. Indeed, the user-group effect has a much larger sum of squares (SS) with Util than with Sim (SS=440 vs SS=218; SS of the dataset effect is 843). Finally, Fig. 4 also suggests that the BeerAdvocate dataset may be hard to further optimize for because the utility scores are already relatively high.
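The split of users into five equally sized groups can be sketched as below. Ties are broken by order of appearance, which is an assumption, as are the function and constant names.

```python
import numpy as np

GROUPS = ["low", "med-low", "med", "med-high", "high"]

def quintile_groups(mainstreamness):
    """Assign each user to one of five groups of (nearly) equal size by rank."""
    m = np.asarray(mainstreamness, dtype=float)
    # Rank 0 is the least mainstream user; stable sort breaks ties by position.
    ranks = np.argsort(np.argsort(m, kind="stable"), kind="stable")
    group_index = (ranks * len(GROUPS)) // m.size
    return [GROUPS[i] for i in group_index]

# Ten hypothetical users with increasing mainstreamness scores.
labels = quintile_groups(np.arange(10) / 10.0)
```

Grouping by rank rather than by fixed score thresholds keeps the five groups balanced even when the score distribution is heavily skewed.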

Bias Mitigation
An effective mitigation of the mainstream bias would be reflected in increased performance for the lower groups (i.e., mainly 'low' and 'med-low'), ideally with no detriment to the higher groups and, especially, overall. In the previous section we separated users into groups by each of Sim and Util, but here we separate them directly by their test nDCG with the baseline FM model, because this better illustrates how non-mainstream users suffer from the bias. Table 2 reports the relative percentage improvement in nDCG scores per user group, as well as the overall mean score across all users in the dataset. We can clearly see that the use of Sim benefits the non-mainstream users only in the two Amazon datasets; in MovieLens and BeerAdvocate they are even hurt further. In contrast, Util is always able to improve the utility of non-mainstream users across datasets, achieving relative nDCG improvements of up to 5% in the Amazon datasets. Improvements on the lower user groups are generally higher than losses on the higher groups, where users already receive (very) high recommendation utility anyway and such minor losses probably go unnoticed. This redistribution of model performance has a negligible effect on the global performance of the models, as evidenced by the overall nDCG scores. This means that, with proper selection of the contrast in the cost function, Util can minimize the mainstream bias at virtually no overall loss in utility. On the other hand, the use of Sim for training leads to inferior overall performance on all four datasets.
Fig. 5 presents a more fine-grained picture with one of the three random initializations in our experiments. Curve segments above 0 represent an improvement by the cost-sensitive models, while segments below 0 represent a loss. We can confirm that the cost-sensitive approach indeed makes the models focus on the non-mainstream users, as shown by the nicely smooth correlation between observed utility and relative improvement, moderated by the contrast in the cost function. As expected though, this focus on the non-mainstream users comes at the cost of a utility loss for the mainstream users on the right-hand side of the plots. Nevertheless, when using Util the relative loss for those users is generally much smaller than the gain for the very non-mainstream users, which are our main target. The figure also shows that the actual relation between improvement and utility varies across datasets, as reflected by the different curve shapes. This is explained by the differences in the shape of their nDCG distributions (see Fig. 3); recall that we use the ecdf of the scores. In a side-by-side comparison between Sim and Util, we see that Util offers better performance nearly everywhere along the x-axis, but especially for the non-mainstream users.
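A per-group comparison against the baseline can be sketched as follows. The sketch assumes that the relative improvement is the percentage change of the group mean with respect to the baseline group mean; whether the reported figures are computed this way or as a mean of per-user changes is not stated, so this is an assumption, as are the names and the toy scores.

```python
import numpy as np

def relative_improvement(baseline_scores, model_scores):
    """Percentage change of the mean score with respect to the baseline mean."""
    base = float(np.mean(baseline_scores))
    return 100.0 * (float(np.mean(model_scores)) - base) / base

def improvement_by_group(groups, baseline_scores, model_scores):
    """Relative improvement computed separately for each user group."""
    groups = np.asarray(groups)
    baseline_scores = np.asarray(baseline_scores, dtype=float)
    model_scores = np.asarray(model_scores, dtype=float)
    return {
        str(g): relative_improvement(baseline_scores[groups == g],
                                     model_scores[groups == g])
        for g in np.unique(groups)
    }

# Hypothetical per-user scores for a baseline and a cost-sensitive model.
result = improvement_by_group(
    ["low", "low", "high", "high"],
    baseline_scores=[0.2, 0.4, 0.8, 1.0],
    model_scores=[0.3, 0.45, 0.79, 0.99],
)
```

In this toy case the low group gains 25% while the high group loses about 1%, the kind of asymmetric trade-off discussed above.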
In summary, we see that our cost-sensitive approach brings better balance across users, thus helping in the mitigation of the mainstream bias. In addition, we confirm that an implicit quantification of mainstreamness like Util works better than an explicit quantification like Sim in steering the learning process towards better recommendations for the users that receive low utility from the baseline model. Finally, we note that the mitigation effect via Util does not decay with increasing data sparsity (refer back to Table 1).
One could be tempted to argue that Util should obviously offer better results than Sim when analyzing test nDCG because it is based on validation nDCG scores; test and validation scores should be highly correlated (we will come back to this in Section 5). After all, both Table 2 and Fig. 5 analyze results by test nDCG. The argument made above is that differences between mainstream and non-mainstream users can be immediately identified by test scores, but for the sake of clarity and to avoid a potentially unfair assessment of Sim, Table 3 reports the same results but separating users by Sim, while Fig. 6 does so by plotting against Sim. While the results are less clear with this partition of users, the table confirms that models trained with Util are generally better at mitigating the bias than those trained with Sim. In particular, results for the BeerAdvocate dataset show that higher contrasts even lead to worse performance for the lower user groups, suggesting that Sim is perhaps not properly identifying non-mainstream users. The figure shows that Util improves over the baseline across all levels of mainstreamness in the Amazon datasets, further suggesting that Sim identifies as non-mainstream some users that probably are not. In summary, and even though this comparison could in turn be considered favorable to Sim (note that previously we assessed against test nDCG, not against the validation nDCG used by Util), the results again support the use of Util to quantify user mainstreamness and mitigate the bias.

DISCUSSION
A key assumption of our approach based on Util is that we can reliably use utility, measured as the accuracy on a validation set, to determine the weight that each user should have in the training process. This implies that the accuracy on the validation set is a good estimate of the accuracy on the test set, which is where the effect will ultimately be assessed. If there was a low correlation between validation and test accuracy, the loss function would apply high weights for users that do not really need it, limiting or even altogether canceling the potential of our approach.
Intuitively, how well validation and test scores correlate is mainly determined by the amount of data. If only a few interactions are involved in the calculation of accuracy, the resulting scores will bear a high degree of noise or random error, thus lowering the correlation. In principle, we would therefore use as much data as possible in the validation and test sets. On the other hand, we would generally prefer to use all that data to actually train the model, although we note that the validation scores are in a sense part of the training process itself, because they determine the weights.
A balance is therefore necessary, so we need to study the strength of the validation-test scores correlation as a function of the number of interactions in their data partitions. We did this by running the baseline FM model on different data partitions with varying minimum numbers of relevant items in the training set (3, 4, 5 and 10), and validation and test sets (1, 2, 3, 4 and 5 each). The actual split was conducted maintaining proportions (i.e., for the combination of 4/3/3 minimum items per set, a user has 40% of their relevant items for training, 30% for validation, and 30% for testing). We then measured the strength of the validation-test correlation via the RMSE of the scores and their Spearman ρ correlation. Fig. 7 shows that, as expected, the correlation increases (lower RMSE, higher ρ) with the number of relevant interactions used in the validation and test sets. More interestingly, it shows that the amount of training data has a much smaller and varying effect, so despite it being a major factor to maximize model performance, it is not so to robustly assess that performance. The plots indicate that requiring only one or two interactions in the validation set would lead to noisy scores; four interactions seem the bare minimum. As for the training set, the usual practice of having at least as much data as for validation and testing still applies in this context of non-time-aware recommendation. All in all, our suggestion for this line of research on mainstream bias that works at the individual user level is to have no less than four items per user in each of the three standard data partitions. Because the strength of the correlation is a key factor in our approach, we decided to require at least five to be on the safe side. In fact, we also observed that the effect of cost-sensitive learning in the validation sets is similar to what is reported in Figs. 5 and 6.
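The two agreement measures between validation and test scores can be sketched with NumPy alone. Spearman's ρ is computed here as the Pearson correlation of average ranks, which is its standard definition; the function names and the toy scores are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def rmse(a, b):
    """Root mean squared difference between two aligned score vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

# Hypothetical per-user validation and test scores.
validation = [0.30, 0.55, 0.80, 0.95]
test = [0.35, 0.50, 0.85, 0.90]
agreement = (rmse(validation, test), spearman(validation, test))
```

RMSE captures how far the two scores are from each other in absolute terms, while ρ only reflects whether users are ordered consistently, which is what matters for a rank-based weighting.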

CONCLUSIONS AND FUTURE WORK
In this paper, we tackled the challenge of mainstream bias in CF-based recommendation. The main aspect we focused on is to steer the process of mitigating this bias directly by the utility resulting from the recommendation model and data at hand. For this purpose, we proposed an approach that assigns each user an importance weight during training, with these weights defined in a cost-sensitive manner. By choosing to steer the model directly towards the users that receive low utility, and not towards those that appear to be non-mainstream, we prevent the model from focusing on users that already receive high utility even if they were not expected to. This way, the model does focus on the niche users that suffer from the bias.
Empirical results show that such models produce a more effective balance of the recommendation utility among the mainstream and non-mainstream users, in a way that is consistent across datasets with varying properties. In addition, we provide suggestions regarding the minimum number of interactions to require when partitioning datasets. Without enough interactions, research on mainstream bias at the level of individual users might produce unreliable results.
For future work, we will first explore other ways to quantify mainstreamness. In the implicit measurement sense, an evident question is whether other metrics, or even the combination of multiple metrics, work better at identifying niche users. Additionally, we can think of ways to make the validation-test correlation robust to issues like sample selection bias, for example via inverse propensity scoring. Another line is to explore more principled approaches for an explicit quantification through an extensive study of the factors that influence mainstreamness, such as the temporal dynamics.
Regarding our cost-sensitive learning approach, we will explore its generality, to see how it works for underlying models other than FM or other ranking frameworks such as pairwise and listwise. We will also investigate the combination of cost-sensitive and adversarial learning strategies to mitigate mainstream bias: cost-sensitive learning to tell the model where to focus, and adversarial learning to tell it how.
Finally, we note that our focus in this paper has been on the effect of mainstream bias mitigation on the users, but one could wonder about what effect it has on the items. One hypothesis is that non-mainstream users are better served because the less popular items are now more likely to be recommended, so it would be interesting to study whether mitigating one bias amplifies or mitigates other biases, such as popularity or position.

Figure 1: Cost functions used in the paper. The contrast denotes the relative cost between users with mainstreamness 0 and users with 1 (i.e., x10 means 10 times as much).

Figure 2: Comparison of the four mainstreamness definitions proposed by Zhu and Caverlee [24], as applied to the MovieLens 1M dataset. Density plots illustrate the distributions of mainstreamness. Scatter plots show the relationship between pairs of definitions, quantified in the upper-right half via Pearson correlation scores. Scores are standardized to zero mean and unit variance for better comparison.

Figure 3: Correlation between mainstreamness and test nDCG in the baseline model (FM), for each mainstreamness definition. Density plots illustrate the distribution of mainstreamness. Scatter plots show their relationship with nDCG, quantified via Pearson correlation scores. Mainstreamness scores are standardized to zero mean and unit variance for better comparison.

Figure 4: Correlation between user groups, split by mainstreamness, and test nDCG in the baseline model (FM).

Figure 5: Mean nDCG relative percentage improvement between cost-sensitive models and baseline model, as a function of ecdf(test nDCG) in the baseline FM model, for a sample data split. Curves fitted by a LOESS model. Ribbons indicate 95% confidence intervals.

Figure 7: Correlation between validation and test scores as a function of the amount of data used for validation and testing, for two sample datasets (most and least dense).

Table 2: Mean nDCG of the baseline model (FM) per user group, and relative percentage improvement of each cost-sensitive model (e.g., users in group 'low' of MovieLens 1M received a score of .3284 with the baseline, and an improvement of +3.89% with the x80-contrast cost-sensitive model under the Util mainstreamness definition). Column 'Overall' lists the mean across all users. Green/red for statistically significant gain/loss with respect to the baseline (hierarchical linear model with seed and user random effects, Bonferroni correction).

Table 3: Same as Table 2, but user groups defined by Sim scores instead of test nDCG in the baseline model.