ProtoMF: Prototype-based Matrix Factorization for Effective and Explainable Recommendations

Recent studies show the benefits of reformulating common machine learning models through the concept of prototypes – representatives of the underlying data, used to calculate the prediction score as a linear combination of similarities of a data point to prototypes. Such prototype-based formulation of a model, in addition to preserving (sometimes enhancing) the performance, enables explainability of the model’s decisions, as the prediction can be linearly broken down into the contributions of distinct definable prototypes. Following this direction, we extend the idea of prototypes to the recommender system domain by introducing ProtoMF, a novel collaborative filtering algorithm. ProtoMF learns sets of user/item prototypes that represent the general consumption characteristics of users/items in the underlying dataset. Using these prototypes, ProtoMF then represents users and items as vectors of similarities to the corresponding prototypes. These user/item representations are ultimately leveraged to make recommendations that are both effective in terms of accuracy metrics, and explainable through the interpretation of prototypes’ contributions to the affinity scores. We conduct experiments on three datasets to assess both the effectiveness and the explainability of ProtoMF. Addressing the former, we show that ProtoMF exhibits higher Hit Ratio and NDCG compared to other relevant collaborative filtering approaches. As for the latter, we qualitatively show how ProtoMF can provide explainable recommendations and how its explanation capabilities can expose the existence of statistical biases in the learned representations, which we exemplify for the case of gender bias.


INTRODUCTION
Prototype-based models have introduced a novel paradigm for learning and characterizing latent factors, providing new possibilities, particularly for effective and explainable machine learning [3,14,28,30,35,71]. In this context, a prototype is defined as an entity (e.g., in the form of an embedding) that is representative of a set of similar instances and is either part of the observed data points or an artifact that summarizes a subset of instances with similar characteristics [28]. In principle, prototype-based models first identify a set of prototypes from the underlying data and then utilize them to make the prediction for a given data point by linearly combining the relatedness of the data point to the prototypes. This linear combination provides a clear separation of the contribution of each prototype to the final prediction and hence enables understanding of the model's decisions through analyzing these contributions and interpreting the prototypes.
A few recent studies have leveraged the concept of prototypes for recommender systems (RSs) in the context of cold/few-start scenarios [38,55], or as more effective recommendation algorithms [7]. Within this context, we present ProtoMF, a novel collaborative filtering algorithm based on prototypes. The ProtoMF model builds upon latent factor models [21] and particularly the seminal Matrix Factorization (MF) [31,53]. The proposed ProtoMF model, besides enhancing the performance of the model's recommendations, unlike previous work enables explainable recommendations by leveraging prototypical users and items that capture the item-consumption characteristics of the system's users and items. For example, a user prototype might personify an overall user preference for Drama and Romance movies, while an item prototype might represent specific musical genres or product categories. The ProtoMF approach utilizes these prototypes to define new user and item representations in terms of their similarities to the corresponding prototypes. By leveraging these new representations, ProtoMF finally computes the user-item affinity scores as a linear combination of user/item prototype similarities and the corresponding item/user embeddings.
Considering this design, our prototype-based approach enhances the model's transparency, as the predictions can be deconstructed into a (linear) composition of the contributions stemming from the prototypes, in a similar spirit to previous studies on classification tasks [3,28,35]. To further explain the model's decisions, one also requires an interpretation of the prototypes, whether of users or items. Related literature in the classification domain approaches this in two ways. The first approach interprets a prototype by observing the model's output given some crafted synthetic inputs that maximally activate the prototype [35,71]. The second interprets a prototype through a set of maximally close entities to it [3]. In ProtoMF, we adopt the first and second approach to interpret user and item prototypes, respectively. Using these interpretations, we explain the recommendations of ProtoMF through the contributions of the prototypes to the affinity score.
ProtoMF's approach to explainability is aligned with the algorithmic transparency [4,28,37] aspect of interpretability discussed by Arrieta et al. [4], namely "understanding the process followed by the model to produce any given output from its input data". This is achieved by combining the interpretable capacity of a linear model with the natural explanation-by-example provided by the prototypes. Our ProtoMF approach is also aligned with several works that leverage explainability to unveil the existence of societal biases and stereotypes [13,15,32,50,51], and to help mitigate unfair treatment of individuals and groups [26,42,43,49,57,68]. These aspects are particularly critical when abiding by regulations such as the EU Regulatory Framework for AI [19] or the EU Digital Services Act [18], which encourage the development of effective and transparent RS models that are able to explain their predictions and can offer a way to correct possible misconduct [54].
We carry out extensive experiments to assess ProtoMF's effectiveness against relevant baselines. In particular, we evaluate ProtoMF on three real-world datasets (MovieLens, Amazon Video Games, and the LFM2b music dataset), showing that ProtoMF significantly outperforms Matrix Factorization [31], as well as two prototype-based approaches [7,38], in terms of Hit Ratio and NDCG. In the context of transparency, we showcase ProtoMF's explanation capabilities in two steps: first, by qualitatively demonstrating that the learned prototypical users and items capture general item-consumption behaviors of real users (e.g., preference for movie genres or a specific movie storyline); and second, by exhibiting how the system leverages these learned prototypes to provide an explainable recommendation. Furthermore, utilizing the datasets containing male/female gender information of their users (MovieLens and LFM2b), we expose the existence of gender biases in the learned user prototypes. To this end, we identify prototypes with significant inclinations to either of the genders by analyzing the gender representations of their closest users.
Our contribution is three-fold:
• We propose ProtoMF, a novel collaborative filtering model which leverages user/item prototypes to provide effective and explainable recommendations.
• We perform extensive quantitative and qualitative experiments to assess, respectively, the accuracy and explainability of our model.
• We investigate and expose latent statistical (gender) biases in the learned user prototypes.
Our paper is structured as follows: in Section 2 we review the relevant literature. We introduce our method in Section 3, and the experiment setup in Section 4. We show the evaluation results and the explanation capabilities of ProtoMF, followed by showcasing the existence of gender bias in the model, in Section 5.

RELATED WORK
We review the relevant literature on prototypes in RSs, prototype-based explanations in machine learning, as well as explainability in RSs. Finally, we discuss some studies regarding bias and fairness in RSs.

Prototypes in Recommender Systems
A common application of prototypes in RSs is approaching the cold/few-start problem [2,38,55,58]. In this context, a few representative users/items are selected, whose consumption patterns are used to provide recommendations to new users or on new items. As a representative example of this line of work, Liu et al. [38] introduce Representative-based Matrix Factorization (RBMF), which proposes to align the latent factors resulting from matrix factorization with some specific users as the representatives of the system. With this alignment, RBMF also enables some degree of interpretability, as recommendations can be explained by user-to-representative similarity scores and the representatives' ratings. Our ProtoMF generalizes the concept of representatives offered by RBMF by learning prototypical users that incorporate general item consumption patterns.
In the context of prototype-based approaches to RS explainability, and closely related to the study at hand, Barkan et al. [7] recently introduced Anchor-based Collaborative Filtering (ACF). In this work, the authors define a set of anchor vectors (generic representatives of tastes and preferences) and use the same set to represent both users and items, based on which recommendations are made. In contrast to ACF, not only are the prototypes in our proposed models separately defined for users and items, but we also allow tracing back the contributions to the prototype vectors, facilitating a direct explanation for recommendations.
Finally, we should also mention clustering-based [45,56,61,63,65-67] and group-discovery [27,39,70] approaches in RSs, as they share conceptual similarities with prototype-based approaches in terms of benefiting from shared subtleties of users/items. In principle, these approaches cluster users/items into subgroups and then use the subgroups to provide recommendations using the information of in-group/neighboring entities. Differently from these, prototype-based methods, and particularly ProtoMF, (1) redefine users and items by employing the similarities of the users/items to prototypes instead of performing clustering-based assignments to subgroups, and (2) linearly aggregate the similarity scores in the final prediction, enabling the decomposition of the recommendation score and hence an easier interpretation.

Explainability
Outside of the RS literature, various prototype-based explanation methods have been proposed for a variety of machine/deep learning tasks. These methods particularly differ in the way the prototypes are identified in the first place. As an example of such studies, Li et al. [35] explore the utilization of prototypes in the context of image classification by showing that the decision of a network to classify the image of a digit can be explained by the similarity of the image to the prototypes that look like the digit. In their work, a decoder is trained to visualize and interpret the prototypes. Chen et al. [14] further extend this work by learning latent prototypes that match a portion of the latent representation of inputs, allowing for a more fine-grained explanation. Various approaches to learn prototypes are proposed in the literature. Bien and Tibshirani [11] select prototypes from training data by solving a set cover problem over the inputs, and perform classification based on top-1 nearest neighbor search. Wu and Tabak [64] find prototypes as a convex combination of inputs and utilize them in regression tasks. In contrast to the latter studies, and similar to Li et al. [35], we learn the prototypes from scratch, allowing us to flexibly measure the similarity between the prototypes and data instances in the latent space.
Explainability in RSs has been the focus of several works. Zhang et al. [69] and later Cheng et al. [17] exploit external information, such as opinionated reviews, to provide interpretable user/item representations in terms of aspects, namely attributes that characterize a user/item. Barkan et al. [6] propose to model a user via an attentive mixture of personas, which explains the recommendation of an item based on the affinity between the user's personas and the item itself. Another approach is to use statistical tools to extract post-hoc explanations from existing models in order to provide rationales for the recommendations. Some of these post-hoc methods include using association rules [48], influence functions [16], and linear models [44]. More related to the work at hand, Fusco et al. [25] and Pan et al. [46] focus on designing interpretable models, which can inherently provide explanations in terms of the contributions of the user/item features. Our ProtoMF model differs from the above-mentioned approaches in the following ways: (1) ProtoMF does not leverage fixed external features to explain the recommendations, allowing any type of external information to be used to interpret the prototypes, and (2) the ProtoMF models provide a novel explanation approach based on analyzing the contributions of user and item prototypes to the recommendation.

Fairness and Bias in Recommender Systems
Another topic related to explainability is fairness and bias in RSs. In this direction, recent studies show that RS algorithms deliver different recommendation performance to different groups of users (e.g., in terms of gender, age, or personality) [33,42,43], raising the concern that these algorithms may (unwantedly) encode personal/sensitive information. For example, Ekstrand et al. [22] and, more recently, Melchiorre et al. [42] show that a variety of common RS algorithms perform worse in terms of accuracy and beyond-accuracy metrics for female users. Motivated by the mentioned studies and further contributing to this line of research, we explore whether some of the learned user prototypes also capture the gender information of users.

METHODOLOGY
In this section, we describe our ProtoMF models in detail. We first introduce the User Prototype Matrix Factorization (U-ProtoMF) model, followed by its item-based equivalent Item Prototype Matrix Factorization (I-ProtoMF), and finally the User-Item Prototype Matrix Factorization (UI-ProtoMF) model, achieved by combining the first two models.
Let U = {u_i}_{i=1}^{N} and T = {t_j}_{j=1}^{M} be the sets of N users and M items, respectively. We assume that we only have access to the implicit interaction data I = {(u_i, t_j)}, where (u_i, t_j) indicates that user u_i has interacted with item t_j. For brevity, we omit the user and item indexes when referring to any user or item.
Our ProtoMF models build on top of the widely-used Matrix Factorization (MF) methods [31,53], which carry out recommendations by assigning an embedding vector u to each user and t to each item, both with embedding size d. This results in the set of user vectors {u_i}_{i=1}^{N} and item vectors {t_j}_{j=1}^{M}. Using these embeddings, MF defines the recommendation score as the dot product of the corresponding user and item embeddings. The models explained in what follows also utilize user/item embeddings and expand MF by including user/item prototypes.
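As a minimal sketch of the MF scoring just described (the toy sizes and random embeddings are purely illustrative; in practice the embeddings are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 5, 7, 8      # toy sizes; real N, M, d are much larger

U = rng.normal(size=(n_users, d))  # user embeddings u_i
T = rng.normal(size=(n_items, d))  # item embeddings t_j

def mf_score(u_idx, t_idx):
    """MF affinity score: dot product of the user and item embeddings."""
    return float(U[u_idx] @ T[t_idx])

scores = U @ T.T                   # all user-item affinity scores at once
```

Computing the full score matrix as a single matrix product is the usual vectorized shortcut for ranking all items for all users.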

U-ProtoMF
The U-ProtoMF model is founded on the assumption that there exist several prototypical users, characterized by the various patterns of item consumption shared among the users of the system. For example, in the context of music and movie recommendation, a user prototype may embody the preference of users for listening to Folk Metal music tracks or for highly enjoying Drama and Romance movies.
The U-ProtoMF model follows this idea by introducing a set of L_u user prototypes (L_u ≪ N), denoted with P_u, where each user prototype is defined as an embedding vector p^u with dimension d.
The model then provides a new representation of each user u as u*, defined as the vector of the similarities of u to each of the user prototype vectors p^u_l:

u*_l = sim(u, p^u_l),  l = 1, ..., L_u,    (1)

where the similarity function sim is defined as the shifted cosine similarity, sim(x, y) = x^⊤ y / (∥x∥ ∥y∥) + 1, and ∥x∥ is the L2-norm of x. This definition of the similarity function guarantees that all the values of u* are positive, in the range of 0 to 2. Lastly, U-ProtoMF measures the user-item affinity score (that u will interact with t) as a linear combination of the new user representation with the corresponding item embedding:

s^user = u* ⊙ t,  U-score(u, t) = Σ_{l=1}^{L_u} s^user_l,    (2)

where ⊙ indicates the element-wise multiplication, s^user is the resulting user score vector, and t ∈ R^{L_u} is the item embedding. The above formulation is in fact the dot product of u* and t, and can be written as U-score(u, t) = u*^⊤ t. We intentionally formulate U-score(u, t) as in Eq. 2, as in this form the vector s^user breaks down the final U-score(u, t) into separate user prototype scores. As we will see in Section 5.2, this characteristic is particularly beneficial to explain recommendations. A scheme of U-ProtoMF is shown on the left side of Figure 1. To train our model, we opt for the cross-entropy/softmax loss [52] given the data I over the model parameters Θ:

L_rec = − Σ_{(u,t) ∈ I} log ( exp(U-score(u, t)) / Σ_{t'} exp(U-score(u, t')) ) + λ_L2 ∥Θ∥²,    (3)

where t' ranges over the positive item and the sampled negative items, and ∥Θ∥ indicates the L2-norm, added to the loss through the hyperparameter λ_L2 as a regularization term. Inspired by Li et al. 
[35], we introduce two additional interpretability terms to the recommendation loss. These terms aim to ensure that each user is associated with at least one prototype and vice versa, achieved by increasing the similarity values of the most similar pairs. These terms in fact impose an inclusion criterion [7,35] by forcing each user (and each prototype) to "get matched" with at least one prototype (one user). The first term R_{P_u→U} defines this criterion from the side of user prototypes to users, by increasing the similarity of each user prototype to the corresponding user with the largest similarity value:

R_{P_u→U} = − (1/L_u) Σ_{l=1}^{L_u} max_{i=1..N} sim(u_i, p^u_l).    (4)

The second term R_{U→P_u} states the criterion from the side of users to user prototypes:

R_{U→P_u} = − (1/N) Σ_{i=1}^{N} max_{l=1..L_u} sim(u_i, p^u_l).    (5)

The final loss is therefore defined as:

L = L_rec + λ_1 · R_{P_u→U} + λ_2 · R_{U→P_u},    (6)

where λ_1 and λ_2 are hyperparameters tuning the degrees of the effects of the inclusion criteria. In practice, since the number of users (N) is commonly very high, the full computation of R_{P_u→U} and R_{U→P_u} over all users in every training batch is very costly. To mitigate this problem, we compute these terms over a sampled subset of users, namely the ones appearing in each given training batch.
Since the data is expected to be randomly shuffled, our in-batch sampling approach can be considered as an unbiased approximation of Eq. 4 and Eq. 5.
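A minimal sketch of the U-ProtoMF score and the in-batch inclusion terms described above (toy sizes, random embeddings, and the exact normalization of the inclusion terms are illustrative assumptions):

```python
import numpy as np

def shifted_cosine(X, P):
    """Shifted cosine similarity sim(x, p) = cos(x, p) + 1, with values in [0, 2]."""
    X_n = X / np.linalg.norm(X, axis=-1, keepdims=True)
    P_n = P / np.linalg.norm(P, axis=-1, keepdims=True)
    return X_n @ P_n.T + 1.0

def u_proto_score(u, t, P_u):
    """U-score(u, t) as the sum of per-prototype contributions s_user = u* ⊙ t,
    where u* holds the similarities of u to the user prototypes and t in R^{L_u}."""
    u_star = shifted_cosine(u[None, :], P_u)[0]  # new user representation u*
    s_user = u_star * t                          # per-prototype contributions
    return s_user.sum(), s_user

def inclusion_terms(U_batch, P_u):
    """In-batch approximation of the two inclusion terms: negated means of the
    column/row maxima, so minimizing them raises the largest similarities."""
    S = shifted_cosine(U_batch, P_u)             # (batch, L_u)
    return -S.max(axis=0).mean(), -S.max(axis=1).mean()

rng = np.random.default_rng(0)
d, L_u = 8, 4
u, t = rng.normal(size=d), rng.normal(size=L_u)
P_u = rng.normal(size=(L_u, d))
score, s_user = u_proto_score(u, t, P_u)
r_proto_to_user, r_user_to_proto = inclusion_terms(rng.normal(size=(16, d)), P_u)
```

Because the shifted cosine is bounded in [0, 2], both inclusion terms are bounded in [−2, 0], which keeps their gradient scale comparable to the similarity scale.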
The U-ProtoMF model enables an easier interpretation of the system and its recommendations. First, the representation of every user is now (re)defined as a vector u* of positive values. Each value of u* corresponds to a specific consumption characteristic, where the characteristics are defined by the user prototypes. Second, since the recommendation score is the dot product of u* and t (Eq. 2), the item embedding dimensions can be seen as weights of the corresponding characteristics (defined by the user prototypes). For example, in music recommendation, a Heavy Metal song will likely have a higher value (weight) for the feature corresponding to the user prototype representing Metal fans. Lastly, the definition of the recommendation score as a linear function (the summation of the weighted prototype similarities in s^user) provides a favorable characteristic for interpretability by allowing us to discern the different contributions of the user prototypes.

I-ProtoMF
The I-ProtoMF model follows the same structure as U-ProtoMF while introducing the concept of prototypes only from the item side. In particular, I-ProtoMF assumes the existence of several prototypical items intended to capture the different co-consumption patterns arising within the dataset. For example, an item prototype might be a representative of the items that fall within a specific musical genre or product category.
Following U-ProtoMF, I-ProtoMF first defines a set of L_t item prototypes P_t, each defined by an embedding p^t with dimension d (expectedly, L_t ≪ M). I-ProtoMF then provides a new representation of each item t as the vector of the similarities of its embedding to the item prototype vectors:

t*_l = sim(t, p^t_l),  l = 1, ..., L_t.    (7)

Using t*, the final score is computed as:

s^item = u ⊙ t*,  I-score(u, t) = Σ_{l=1}^{L_t} s^item_l,    (8)

where the user embeddings are in R^{L_t}. Similarly to U-ProtoMF, I-ProtoMF's score is in fact I-score(u, t) = u^⊤ t*, while defining the intermediate vector s^item supports the recommendation explainability, as discussed in Section 5.2. We show I-ProtoMF's architecture on the right of Figure 1. Similar to U-ProtoMF, I-ProtoMF is enriched with two inclusion criteria, R_{P_t→T} and R_{T→P_t}, defined analogously to Eq. 4 and Eq. 5.
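A sketch of the I-ProtoMF score, assuming the same shifted cosine similarity as in U-ProtoMF (function name and toy sizes are illustrative):

```python
import numpy as np

def i_proto_score(u, t, P_t):
    """I-score(u, t) = u^T t*: t* holds the shifted-cosine similarities of the
    item embedding t to the item prototypes; here u lives in R^{L_t}."""
    t_n = t / np.linalg.norm(t)
    P_n = P_t / np.linalg.norm(P_t, axis=1, keepdims=True)
    t_star = P_n @ t_n + 1.0        # item representation t*, values in [0, 2]
    s_item = u * t_star             # per-item-prototype contributions
    return s_item.sum(), s_item

rng = np.random.default_rng(2)
L_t, d = 6, 8
score, s_item = i_proto_score(rng.normal(size=L_t),       # user embedding in R^{L_t}
                              rng.normal(size=d),         # item embedding in R^d
                              rng.normal(size=(L_t, d)))  # item prototypes
```

The structure mirrors U-ProtoMF with the roles of users and items swapped: the item is re-represented against prototypes, and the user embedding supplies the per-prototype weights.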
Putting it all together, the loss function is defined as:

L = L_rec + λ_3 · R_{P_t→T} + λ_4 · R_{T→P_t},    (9)

where λ_3 and λ_4 are hyperparameters and L_rec is equivalent to Eq. 3, computed on the I-score. This formulation enables the interpretation of recommendation scores (in this case from the perspective of items), via the different contributions of the item prototypes.

UI-ProtoMF
The U-ProtoMF and I-ProtoMF models enable the explanation of recommendations in terms of prototypes from the user and item side, respectively. A natural extension is to simply combine these two models to exploit the benefits of both under the hood of one single model. We provide this by introducing the UI-ProtoMF model, which computes the recommendation score as the sum of the scores of both models. While UI-ProtoMF contains both U-ProtoMF and I-ProtoMF as two separate units, the embeddings of users and items can be shared across these two units. To this end, UI-ProtoMF defines two linear transformations, one from the user embeddings to the space of item prototypes, and the other from the item embeddings to the space of user prototypes:

ũ = W_u u,  t̃ = W_t t,  with W_u ∈ R^{L_t × d} and W_t ∈ R^{L_u × d}.    (10, 11)

Using these embeddings, the final score of UI-ProtoMF is computed as the sum of the dot products:

UI-score(u, t) = u*^⊤ t̃ + ũ^⊤ t*.    (12)

Figure 1 depicts a diagram of UI-ProtoMF. Accordingly, the loss is the sum of the loss functions of the two units, where L_rec is included only once and is based on the UI-score. In the definition of UI-score, we particularly opt for the sum of the scores and avoid any non-linear combinatorial function. This design choice enables us to easily separate the contribution of each unit (U-ProtoMF or I-ProtoMF) to the final recommendation score. Each score can then be traced back to its corresponding unit for providing interpretations.
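A sketch of the combined UI-score under these definitions (matrix names W_ut/W_tu and the toy sizes are assumptions for illustration):

```python
import numpy as np

def sims(x, P):
    """Shifted cosine similarities of vector x to the rows of P, in [0, 2]."""
    return (P / np.linalg.norm(P, axis=1, keepdims=True)) @ (x / np.linalg.norm(x)) + 1.0

rng = np.random.default_rng(1)
d, L_u, L_t = 8, 4, 6
u, t = rng.normal(size=d), rng.normal(size=d)   # shared user/item embeddings
P_u = rng.normal(size=(L_u, d))                 # user prototypes
P_t = rng.normal(size=(L_t, d))                 # item prototypes
W_ut = rng.normal(size=(L_t, d))                # maps u into item-prototype space
W_tu = rng.normal(size=(L_u, d))                # maps t into user-prototype space

u_star, t_star = sims(u, P_u), sims(t, P_t)
# UI-score = U-ProtoMF unit score + I-ProtoMF unit score
ui_score = float(u_star @ (W_tu @ t)) + float((W_ut @ u) @ t_star)
```

Keeping the score as a plain sum (rather than a learned non-linear fusion) is what lets each unit's contribution be read off separately at explanation time.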

EXPERIMENT SETUP
In this section, we describe our experiment setup, namely the datasets, baselines, training and evaluation methods, and hyperparameter tuning. To ensure reproducibility, we publicly share our code and settings on https://github.com/hcai-mms/ProtoMF.
Datasets. We conduct our experiments on three datasets, covering the movies, video games, and music domains. We consider an implicit feedback setting where user-item interactions are provided as binary values: 1 if the user interacted with the item and 0 otherwise. The statistics of the datasets are summarized in Table 1.
(1) MovieLens-1M (ml-1m) [29] contains 1 million movie ratings on a scale from 1 to 5. As commonly done, ratings above 3.5 are treated as positive interactions [7,36]. The dataset also contains demographic information of the users, including gender (see Table 1). Additionally, we perform 5-core filtering, namely we only keep the users that interacted with at least 5 distinct items and only the items that were consumed by at least 5 distinct users.
(2) LFM2b-1Month (lfm2b-1mon) [42] is a one-month extract of the large LFM-2b dataset, which contains music listening histories of Last.fm users. The considered subset corresponds to the last month of the dataset (20/02/2020 - 19/03/2020) and considers only users whose gender information is provided. We further filter the dataset by removing the outlier users that listened to more than the 99th percentile of all the users, keeping only users with age between 10 and 95, and performing 10-core filtering.
(3) Amazon Video Games (AmazonVid) [40] consists of the ratings in Amazon's Video Games category on a 1 to 5 scale. We consider ratings above 3.5 as positives and perform 5-core filtering.
Performance Comparison and Evaluation. We evaluate the recommendation performance of our introduced models to assess their effectiveness in practice. We evaluate U-ProtoMF, I-ProtoMF, and UI-ProtoMF, and compare them with three baseline algorithms, namely Matrix Factorization (MF) [31,53], Representative-based Matrix Factorization (RBMF) [38], and Anchor-based Collaborative Filtering (ACF) [7]. MF is the baseline matrix factorization model that computes the affinity score as the dot product of the learned latent user/item representations. RBMF and ACF are representative prototype-based methods, as explained in Section 2. We evaluate the performance of the algorithms with two standard accuracy metrics, namely Hit Ratio (HitRatio) and Normalized Discounted Cumulative Gain (NDCG), and report the results at a cutoff of 10 (the results for other cutoffs are provided in the repository). To obtain a final score, we average the metrics over all the users. We test the significance of improvements using the Mann-Whitney U test [41], correct p-values for multiple comparisons using the Bonferroni correction [12], and aggregate the p-values over the seeds using Fisher's method [24]. We consider an improvement significant if p < 0.01.
Data Splits. We split each dataset according to the leave-one-out strategy [20] for every user. More specifically, for each user we order their item interactions according to the timestamps (we keep only the earliest interaction if multiple ones with the same item exist). The last interaction of the user is used as the test set, while the penultimate one as the validation set. The rest of the interactions constitute the training set. During training and evaluation, for each positive user-item interaction we sample x negative items not interacted with by the user, and rank the positive item among the sampled ones. We then compute the loss and performance metrics on the resulting ranking. We fix the number x of negative samples (sampled uniformly at random) to 99 for evaluation, while we treat x as a hyperparameter for training.
Hyperparameter Tuning. We carry out an extensive hyperparameter optimization to evaluate the effectiveness of our approach. In summary, for all models we tune: optimization and loss-related hyperparameters, negative sampling hyperparameters for training, embedding size, and batch size. For ACF and ProtoMF we further tune the strength of the regularization losses and the number of anchors/prototypes. The complete table of hyperparameters and their respective value ranges is reported in the repository. We employ Tree-structured Parzen Estimators [8,9] and evaluate, for each model, 100 sampled hyperparameter configurations. We fix the number of epochs to 100; however, we prematurely stop training if we observe no improvement of HitRatio@10 on the validation set for 10 consecutive epochs. For the lfm2b-1mon dataset, we further employ the trial scheduler HyperBand [34] to speed up the experiments. Finally, we pick the model with the highest HitRatio@10. We repeat the whole procedure for three unique seeds and report the mean of the metrics on the test set.
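The sampled evaluation protocol (rank the positive item among itself and the 99 sampled negatives, then compute HitRatio@10 and NDCG@10) can be sketched as follows; the tie-handling convention is an illustrative assumption:

```python
import numpy as np

def rank_against_negatives(pos_score, neg_scores):
    """Rank of the positive item among itself and the sampled negatives (1 = best);
    ties count against the positive item here, an illustrative choice."""
    return int(1 + np.sum(np.asarray(neg_scores) >= pos_score))

def hit_ratio_at_k(rank, k=10):
    """1 if the positive item appears in the top k, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k=10):
    """With a single relevant item, DCG = 1 / log2(rank + 1) and IDCG = 1."""
    return 1.0 / np.log2(rank + 1) if rank <= k else 0.0

neg = np.array([0.3, 0.8, 0.1, 0.5])     # scores of sampled negatives (toy example)
r = rank_against_negatives(0.7, neg)     # one negative scores higher, so rank 2
```

Per-user values of these metrics are then averaged over all users to obtain the reported figures.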

RESULTS
In this section, we first report the obtained results in terms of accuracy metrics. We then explain the methods to interpret the learned prototypes and lay out our approach to provide explanations for recommendations using the ProtoMF models. Lastly, we showcase the existence of gender bias in the UI-ProtoMF model.

Evaluation Results
Table 2 shows the evaluation results of the models on the three datasets. The sign † marks significant improvements of a model over MF, and ‡ over ACF. Based on the results, we observe that all three ProtoMF models mostly provide significant improvements over MF, with UI-ProtoMF in particular showing consistent improvements on the three datasets and two metrics. Comparing among the baselines, ACF shows consistently better performance. Our proposed UI-ProtoMF method also significantly outperforms the ACF model on both accuracy metrics over all datasets (with the only exception of HitRatio on AmazonVid). These results indicate the high effectiveness of UI-ProtoMF for recommendations in comparison with the baselines, achieved by combining the benefits of the U-ProtoMF and I-ProtoMF models. To provide a full picture, we also compare the models in terms of parameter complexity. To this end, let us assume a simplified setting with N users, M items, K (user or item) prototypes, and dimension d for any latent vector. MF contains (N + M) × d parameters, while ACF and UI-ProtoMF add K × d and 4K × d parameters to MF, respectively. However, we should consider that in RS scenarios (as in our experiments) N and M are commonly much larger than both d and K, and therefore these extra parameters (for both ACF and UI-ProtoMF) only add a small portion to the parameters of the baseline MF. In the following sections, we use UI-ProtoMF and discuss how this model can provide explanations for its recommendations.
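The parameter-count comparison can be reproduced with a small helper; the example sizes below are hypothetical, chosen only to show how small the overhead is:

```python
def param_counts(N, M, d, K):
    """Parameter counts from the simplified analysis above."""
    mf = (N + M) * d
    return {"MF": mf, "ACF": mf + K * d, "UI-ProtoMF": mf + 4 * K * d}

# hypothetical sizes: 6,000 users, 3,000 items, d = 64, K = 50 prototypes
counts = param_counts(6_000, 3_000, 64, 50)
overhead = (counts["UI-ProtoMF"] - counts["MF"]) / counts["MF"]  # about 2.2%
```

Since N + M dominates K and d, the prototype machinery adds only a few percent on top of the MF parameter budget.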
Finally, let us have a look at a visualization of the space of prototypes. Figure 2a shows the learned embeddings of users (blue) and user prototypes (red) of the ml-1m dataset, projected onto a two-dimensional space using t-SNE [62]. Evidently, the prototypes appear at the center of the formed user clusters, and there are no outliers among the prototypes (as a result of the inclusion regularization terms; see Section 3). The visualization also indicates that users might be close to more than one prototype, enabling a higher capacity for the model to (re)define user embeddings, as users are inherently complex in their consumption behavior, whose encoding may therefore require more than one prototype. Figure 2b shows a similar visualization with respect to items and item prototypes on the lfm2b-1mon dataset.
Table 3: Top-5 related items of three representative user prototypes (left) and item prototypes (right) based on the UI-ProtoMF model on the ml-1m dataset.

Explaining UI-ProtoMF Recommendations
Our first step towards explaining recommendations is to interpret which patterns of item consumption the prototypes capture, considering the user and item prototypes separately. In particular, we approach the interpretation of user prototypes by observing ProtoMF's recommendations when fed with synthetic user inputs that maximally activate the prototypes, similar to previous studies [35,71]. We interpret item prototypes by identifying the items closest to each prototype, similar to Alvarez-Melis and Jaakkola [3].
The following examples are taken from the trained UI-ProtoMF model on the ml-1m dataset. Due to lack of space, more examples for lfm2b-1mon are provided in the repository (https://github.com/hcai-mms/ProtoMF/blob/main/protomf_appendix.pdf).
Interpreting User Prototypes. To interpret which item-consumption characteristics a user prototype embodies, we create a synthetic similarity vector u* of an imaginary user, where the maximum value is given to the corresponding user prototype in the vector, and all other values are set to zero. We then compute recommendations for this imaginary user. Adopting this method, the left part of Table 3 shows the recommendations of three representative user prototypes. Each of these captures a specific movie consumption behavior. For example, prototype 71 represents a prototypical user who enjoys action movies and thrillers, while prototype 55 prefers Sci-Fi movies, mostly of the same series; the last one's top movie recommendations mostly consist of animated movies.
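The synthetic-user procedure can be sketched as follows; the function name and the use of the shifted-cosine upper bound 2.0 as the maximum activation are assumptions for illustration:

```python
import numpy as np

def interpret_user_prototype(l, item_emb, top_k=5, max_sim=2.0):
    """Recommendations for a synthetic user that maximally activates prototype l:
    u* is all zeros except for max_sim (the shifted-cosine upper bound) at index l,
    so the U-score of each item reduces to max_sim * item_emb[:, l]."""
    L_u = item_emb.shape[1]
    u_star = np.zeros(L_u)
    u_star[l] = max_sim
    scores = item_emb @ u_star          # item_emb has shape (M, L_u)
    return np.argsort(-scores)[:top_k]  # indices of the top-k items

rng = np.random.default_rng(3)
item_emb = rng.normal(size=(50, 8))    # hypothetical item embeddings in R^{L_u}
top_items = interpret_user_prototype(l=2, item_emb=item_emb)
```

Since the score collapses to a single column of the item-embedding matrix, the returned items are exactly those most strongly weighted toward the chosen prototype.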
Interpreting Item Prototypes. Since item and item prototype embeddings lie in the same space, the interpretation of an item prototype can be achieved by simply identifying its nearest item neighbors. The right part of Table 3 shows three representative examples. As can be seen, each item prototype tends to match a specific movie genre. Here, item prototype 3 is close to Drama and Romance movies. Item prototype 6's closest neighbors are all part of the same Horror movie series, while item prototype 24 is a representative of Action movies.
Explaining Recommendations. Having elaborated the methods to interpret user and item prototypes, we now focus on the explainability of UI-ProtoMF's recommendations. Our approach to generate explanations utilizes the degree to which each prototype has contributed to the final affinity score. As discussed in Section 3, this score is the sum of the scores stemming from U-ProtoMF and I-ProtoMF (see Eq. 12).
Referring to U-ProtoMF, the final U-score is a sum of the user-prototype contributions s^user_l (see Section 3.1). Understanding the recommendation score of U-ProtoMF thus involves first assessing the user prototypes' contributions based on the values of s^user (for example, by focusing on the ones with the highest contributions). The recommendation is then explained based on the interpretation of user prototypes, described before. In addition, we can deepen the explanation by recalling how the score s^user_l is computed: as the product of a user- and an item-specific component. In fact, a value s^user_l can be high (or low) due to the corresponding values of the underlying user-to-prototype similarity u*_l and item embedding t_l, respectively. A similar procedure can be applied to the score of I-ProtoMF, by using s^item_l to detect the most contributing item prototypes. Let us clarify this procedure with an example, focusing on the UI-ProtoMF recommendations of an arbitrary user in the ml-1m dataset. As the first recommendation, the model predicts Pretty Woman, a movie in the Comedy and Romance genres. Figure 3a and Figure 3b show the values of the vectors involved in this prediction for U-ProtoMF and I-ProtoMF, respectively. In particular, the user-to-prototype/item-to-prototype similarity values, the values of the user/item embeddings, and the scores are shown in the top, middle, and bottom plots, respectively.
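Because the U-score is a plain sum of per-prototype terms, extracting the most contributing prototypes reduces to an elementwise product and a sort. The helper below is a minimal sketch (its name and array layout are ours, not from the paper's code):

```python
import numpy as np

def top_contributing_prototypes(u_star, t, top_k=3):
    """Break the U-score into per-prototype contributions
    s_user_l = u*_l * t_l and return the indices and values of the
    largest ones, i.e., the prototypes that explain the recommendation."""
    s_user = u_star * t                 # elementwise product: one term per prototype
    order = np.argsort(-s_user)[:top_k]
    return order, s_user[order]
```

The same helper applies verbatim on the I-ProtoMF side by passing the user embedding and the item-to-prototype similarities instead.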
On the U-ProtoMF side, the model detects that user prototypes 53, 40, and 37 have the highest contribution to the final score (highest values in s^user). The interpretation results of these three user prototypes are reported in Table 4, indicating similar movies in genres such as Comedy and Romance for prototypes 53 and 40, and mostly animated movies for prototype 37. We further observe that the high values of these three prototypes in s^user are caused by different components. In particular, the user-to-prototype similarities u* have high values for prototypes 40 and 37, and a relatively lower value for prototype 53 (see the lower plot in Figure 3a). On the other hand, the item embedding t (representing the movie) has a high value on prototype 53 and lower values on the other two (middle plot in Figure 3a), resulting in overall high values in the final scores.
Similarly, on the I-ProtoMF side, item prototypes 2, 3, and 13 represent the major contributors. The interpretation of these item prototypes is shown in Table 4, demonstrating a tendency toward the Romance and Drama genres, with prototype 2 further including the Comedy and Musical genres. Similarly, the scores in s^item can be traced back to the corresponding user embedding values û and item-to-prototype similarities t*.

Table 4: Top-5 related items of three user prototypes (left) and item prototypes (right) mentioned in Figure 3.

As a last remark, ProtoMF's explanations can be flexibly conveyed in different manners to different target audiences [1,4]. A global analysis of the prototype scores and similarity values can interest a more technical audience (e.g., engineers and data analysts) in understanding the general behavior of the recommender system and correcting possible issues (e.g., biases); we will shortly provide a case for this. At the same time, providing the system's end-users with an interactive visualization of the most contributing prototypes along with their descriptions can largely support the system's transparency for the users.

Showcasing Societal Biases
The trained UI-ProtoMF model can be used to study whether the learned prototypes encode existing gender biases in the datasets. For this study, we focus on the ml-1m and lfm2b-1mon datasets, since they provide users' gender information. We select the set of users most similar to each user prototype and consider them as the representatives of that prototype. This set comprises all users whose similarities to the prototype are above the 95th percentile. We then calculate the gender counts of these representatives and compare them with the corresponding gender distribution in the whole dataset, checking for statistically significant differences. This is done by carrying out several Fisher exact tests [23] with an alpha level of 1%, further correcting for multiple comparisons using the Bonferroni method [12].
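The test procedure above can be sketched as follows. This is an illustrative re-implementation under our own assumptions (function and variable names are ours; `sim_matrix` holds the user-to-prototype similarities, `is_male` a boolean gender indicator), using `scipy.stats.fisher_exact` on the 2x2 table of representatives vs. remaining users split by gender:

```python
import numpy as np
from scipy.stats import fisher_exact

def gendered_prototypes(sim_matrix, is_male, alpha=0.01):
    """Flag user prototypes whose representatives (users above the
    95th-percentile similarity) have a gender distribution that differs
    significantly from the dataset's, Bonferroni-corrected."""
    n_users, n_protos = sim_matrix.shape
    total_m, total_f = is_male.sum(), (~is_male).sum()
    corrected_alpha = alpha / n_protos          # Bonferroni correction
    flagged = []
    for p in range(n_protos):
        reps = sim_matrix[:, p] >= np.percentile(sim_matrix[:, p], 95)
        m, f = is_male[reps].sum(), (~is_male[reps]).sum()
        # 2x2 contingency table: representatives vs. the rest, by gender
        table = [[m, f], [total_m - m, total_f - f]]
        _, pval = fisher_exact(table)
        if pval < corrected_alpha:
            flagged.append(p)
    return flagged
```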
The results are depicted in Figure 4a and Figure 4b, where each bar represents the gender distribution, as the proportion of male to female users, of the representative users of a user prototype, sorted from the prototypes with the highest share of males (blue) to the ones with the highest share of females (orange).We also define a neutral area (gray) corresponding to the prototypes with non-significant differences in gender distributions.
We observe 36 male vs. 9 female user prototypes (among a total of 93) in ml-1m, and 7 male vs. 6 female prototypes (among 43) in lfm2b-1mon, evidencing the existence of user prototypes that encode specific gender attributes. In particular, we notice that ml-1m has a considerably higher number of male-related user prototypes than female-related ones. These observations are in accordance with the ones made in previous studies [22,42], demonstrating the existence of stereotypical biases in recommendations, which we show for specific user prototypes. Table 5a reports the interpretation results of the most female/male-related user prototypes in ml-1m. According to these results, the prototypical male users tend to watch Sci-Fi and Thriller movies, while the prototypical female users mostly watch Romance movies.
Steerable Bias Mitigation in Recommendation. In the following, we briefly discuss an interesting capability of ProtoMF, namely providing flexible and controllable recommendations. More specifically, ProtoMF's recommendations can be changed at run-time by manually adjusting the values of the user/item-to-prototype similarity vectors (u* or t*). In fact, since the affinity score is computed as an independent sum of prototype contributions, we are able to increase/decrease these values and thereby change the recommendation. This capability can potentially be exploited in various scenarios, such as user-centric bias mitigation [5,10,49] or diversifying item recommendations to counteract filter bubbles [47,60].
Let us showcase this capability with an example in the context of gender bias mitigation. Based on the results presented in Figure 4, we first find the top-3 most female-related user prototypes in ml-1m. We then alter the corresponding values of these user prototypes in u* by multiplying them with a factor λ. We apply this method to the recommendations of a female user and report the top-5 recommendations for the original case, λ = 0.33, and λ = 0 in Table 5b. As shown, the user's recommendations move from movies in the Romance and Comedy genres (more strongly associated with the female users in the dataset) to Action and Sci-Fi. This simple method suggests a potentially appealing framework to mitigate or adjust gender bias, as different degrees of intervention can be set at inference time according to the wishes of end-users or system designers.
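The λ-dampening intervention amounts to a single elementwise scaling before scoring. Below is a minimal sketch under our own naming (the paper specifies no implementation); λ = 1 reproduces the original recommendations, while λ = 0 removes the selected prototypes' influence entirely:

```python
import numpy as np

def dampen_prototypes(u_star, proto_ids, lam):
    """Scale the similarity values of the selected user prototypes by a
    factor lambda, leaving all other prototype contributions untouched."""
    adjusted = np.asarray(u_star, dtype=float).copy()
    adjusted[proto_ids] *= lam          # intervention on chosen prototypes only
    return adjusted
```

Since the affinity score is linear in u*, re-ranking the items with the adjusted vector directly yields the debiased recommendation list.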

CONCLUSION AND OUTLOOK
In this paper, we propose ProtoMF, a novel collaborative filtering approach that leverages user and item prototypes to provide accurate and explainable recommendations. As a result of its design, ProtoMF's recommendations can be explained in terms of contributions of user/item prototypes, the latter representing item-consumption characteristics of real users and items of the system. To this end, we provide an explanation framework that allows us to interpret the user/item prototypes and investigate their contributions to the predicted affinity scores. Furthermore, we show through extensive quantitative experiments that ProtoMF significantly outperforms Matrix Factorization and two prototype-based approaches in terms of Hit Ratio and NDCG. Moreover, we expose the existence of gender biases in the learned user prototypes by identifying prototypes with significant inclinations toward consumption behavior that is stereotypical of male or female users. We conclude with an idea for steering the amount of gender bias in recommendations made by ProtoMF.
As promising future research directions, we envision a thorough examination of the effects of gender-related (and possibly other demographic) user prototypes on the recommendations in terms of accuracy and beyond-accuracy metrics, similar to [42], and the mitigation of likely biases in the recommendations by exploiting ProtoMF's controllable recommendations. Furthermore, we believe that including external features of users and items, such as contextual information or audio features, in ProtoMF might further benefit the interpretability of the prototypes. Lastly, we would like to assess the usefulness of our explanations in terms of the goals defined by Tintarev and Masthoff [59] by involving real (technical and non-technical) audiences.
(a) User and user prototypes on ml-1m. (b) Item and item prototypes on lfm2b-1mon.

Figure 3: Visualizing the prototype similarities, weights, and scores of UI-ProtoMF for the recommendation of the movie "Pretty Woman" for an arbitrary user of ml-1m.

Figure 4: Gender distribution of the representative users per prototype. The orange/blue area indicates the proportion of females/males, and the gray area refers to gender-neutral prototypes.

Table 1: Statistics of the datasets after filtering.

Table 2: Evaluation results w.r.t. accuracy metrics at cutoff 10. The sign † indicates significant improvement over MF, while ‡ indicates significant improvement over ACF.

Table 5: (a) Most male-related (left) and female-related (right) user prototypes in ml-1m. (b) Example of applying the controllable bias-mitigation method to the recommendations of a sample female user. The effects of some female-related user prototypes are dampened with the factor λ.