
On the Relationship between Explanation and Recommendation: Learning to Rank Explanations for Improved Performance

Published: 16 February 2023


Abstract

Explaining to users why some items are recommended is critical, as it can help users to make better decisions, increase their satisfaction, and gain their trust in recommender systems (RS). However, existing explainable RS usually consider explanation as a side output of the recommendation model, which has two problems: (1) It is difficult to evaluate the produced explanations, because they are usually model-dependent, and (2) as a result, how the explanations impact the recommendation performance is less investigated.

In this article, explaining recommendations is formulated as a ranking task and learned from data, similarly to item ranking for recommendation. This makes it possible for standard evaluation of explanations via ranking metrics (e.g., Normalized Discounted Cumulative Gain). Furthermore, this article extends traditional item ranking to an item–explanation joint-ranking formalization to study if purposely selecting explanations could reach certain learning goals, e.g., improving recommendation performance. A great challenge, however, is that the sparsity issue in the user-item-explanation data would inevitably be more severe than that in traditional user–item interaction data, since not every user–item pair can be associated with all explanations. To mitigate this issue, this article proposes to perform two sets of matrix factorization by considering the ternary relationship as two groups of binary relationships. Experiments on three large datasets verify the solution’s effectiveness on both explanation ranking and item recommendation.


1 INTRODUCTION

Recommendation algorithms, such as collaborative filtering [38, 39] and matrix factorization [23, 34], have been widely deployed in online platforms, such as e-commerce and social networks, to help users find items of interest. Meanwhile, there is a growing interest in explainable recommendation [5, 9, 11, 12, 15, 17, 26, 29, 47, 52, 53], which aims at producing user-comprehensible explanations, as they can help users make informed decisions and gain users’ trust in the system [42, 52]. However, in current explainable recommendation approaches, explanation is often a side output of the model, which incurs two problems: first, the standard evaluation of explainable recommendation could be difficult, because the explanations vary from model to model (i.e., model-dependent); second, these approaches rarely study the potential impacts of explanations, mainly because of the first problem.

Evaluation of explanations in existing works can be generally classified into four categories, including case study, user study, online evaluation, and offline evaluation [52]. In most works, a case study is adopted to show how the example explanations are correlated with recommendations. These examples may look intuitive, but they are hardly representative of the overall quality of the explanations. Results of user studies [1, 16] are more plausible, but they can be expensive and are usually obtained in simulated environments that may not reflect real users’ actual perception. Though this is not a problem in online evaluation, it is difficult to implement as it relies on the collaboration with industrial firms, which may explain why only a few works [33, 49, 53] conducted online evaluation. Consequently, one may wonder whether it is possible to evaluate the explainability using offline metrics. However, as far as we know, there are no standard metrics that are well recognized by the community. Though BLEU [35] and ROUGE [30] have been widely adopted to evaluate text quality for natural language generation, text quality is not equal to explainability [6, 26].

In an attempt to achieve a standard offline evaluation of recommendation explanations, we formulate the explanation problem as a ranking task [31]. The basic idea is to train a model that can select appropriate explanations from a shared explanation pool for a recommendation. For example, when a movie recommender system suggests the movie “Frozen” to a user, it may also provide a few explanations, such as “great family movie” and “excellent graphics,” as shown in Figure 1. Notice that these explanations are available all the time, but their ranking orders differ for different movie recommendations, and only those ranked top are presented to the user. In this case, the explanations are also learned from data, similarly to recommendations. Moreover, this general formulation can be adapted to various explanation styles, such as sentences, images, and even new styles yet to be invented, as long as the user-item-explanation interactions are available. As an instantiation, we adopt three public datasets with textual explanations [27] for experimentation.


Fig. 1. A toy example of explanation ranking for a movie recommender system.

With the evaluation and data, we can investigate the potential impacts of explanations, such as higher chance of item click, conversion, or fairness [41], which are less explored but are particularly important in commercial systems. Without an appropriate approach to explanation evaluation, explanations have usually been modeled as an auxiliary function of the recommendation task in most explainable models [5, 11, 32, 40, 53]. Recent works that jointly model recommendation and text generation [12] or feature prediction [15, 45] find that the two tasks could influence each other. In particular, Reference [10] shows that fine-tuning the parallel task of feature ranking can boost the recommendation performance. Moreover, a user study shows that users’ feedback on explanation items could help to improve recommendation accuracy [16]. Based on these findings, we design an item–explanation joint-ranking framework to study if showing some particular explanations could lead to increased item acceptance rate (i.e., improving the recommendation performance). Furthermore, we are motivated to identify how the recommendation task and the explanation task would interact with each other, whether there is a tradeoff between them, and how to achieve the most ideal solution for both.

However, the above investigation cannot proceed without addressing the inherent data sparsity issue in the user-item-explanation interactions. In traditional pairwise data, each user may be associated with several items, but in the user-item-explanation triplet data, each user–item pair may be associated with only one explanation. In consequence, the data sparsity problem is more severe for explanation ranking. Therefore, how to design an effective model for such a one-shot learning scenario becomes a great challenge. Our solution is to separate user-item-explanation triplets into user–explanation and item–explanation pairs, which significantly alleviates the data sparsity problem. Based on this idea, we design two types of models. The first is a general model that only makes use of IDs and thus can accommodate a variety of explanation styles, such as sentences and images. The second is a domain-specific model based on BERT [14] that further leverages the semantic features of the explanations to enhance the ranking performance.

In summary, our key contributions are as follows:

To the best of our knowledge, our work is the first attempt to achieve standard evaluation of explainability for explainable recommendation via well-recognized metrics, such as Normalized Discounted Cumulative Gain (NDCG), precision, and recall. We realize this by formulating the explanation problem as a ranking-oriented task.

With the evaluation, we further propose an item–explanation joint-ranking framework that can reach our designed goal, i.e., improving the performance of both recommendation and explanation, as evidenced by our experimental results.

To that end, we address the data sparsity issue in the explanation ranking task by designing an effective solution, which is applied to two types of models (with and without semantic features of the explanations). Extensive experiments show their effectiveness against strong baselines.

In the following, we first summarize related work in Section 2 and then formulate the problems in Section 3. Our proposed models and the joint-ranking framework are presented in Section 4. Section 5 introduces the experimental setup, and the discussion of results is provided in Section 6. We conclude this work with outlooks in Section 7.


2 RELATED WORK

Recent years have witnessed a growing interest in explainable recommendation [4, 5, 9, 11, 12, 15, 25, 26, 29, 32, 40, 47, 53]. In these works, there is a variety of explanation styles for recommendations, including visual highlights [9], textual highlights [32, 40], item neighbors [16], knowledge graph paths [7, 20, 48], word cloud [53], item features [17], pre-defined templates [15, 25, 53], automatically generated text [12, 26, 28, 29, 50, 51], retrieved text [4, 5, 11, 46, 47], and so on. The last style is related to this article, but explanations in these works are merely side outputs of their recommendation models. As a result, none of these works measured the explanation quality based on benchmark metrics. In comparison, we formulate the explanation task as a learning to rank [31] problem, which enables standard offline evaluation via ranking-oriented metrics.

On the one hand, the application of learning to rank can also be found in other domains. For instance, References [19, 44] attempt to explain entity relationships in Knowledge Graphs. The major difference from our work is that they heavily rely on the semantic features of explanations, either constructed manually [44] or extracted automatically [19], while one of our models works well when leveraging only the relation of explanations to users and items, without considering such features.

On the other hand, the appropriateness of current evaluation for explanations is still under debate. Some works [12, 29] regard text similarity metrics (i.e., BLEU [35] in machine translation and ROUGE [30] in text summarization) as measures of explainability when generating textual reviews/tips for recommendations. However, text similarity does not equal explainability [6, 26]. For example, when the ground-truth is “sushi is good,” the two generated explanations “ramen is good” and “sushi is delicious” gain the same score on the two metrics. However, from the perspective of explainability, the latter is obviously more related to the ground-truth, as they both refer to the same feature “sushi,” but the metrics fail to reflect this. As a response, in this article we propose a new evaluation approach based on ranking.

Our proposed models are experimented on textual datasets, but they can be applied to a broad spectrum of other explanation styles, e.g., images, as discussed earlier. Concretely, on each dataset there is a pool of candidate explanations to be selected for each user–item pair. A recent online experiment [49] conducted on Microsoft Office 365 shows that these types of globally shared explanations are indeed helpful to users. The main focus of that work is to study how users perceive explanations, which is different from ours that aims to design effective models to rank explanations. Despite that, their research findings motivate us to provide better explanations that could lead to improved recommendations.

In more detail, we model the user-item-explanation relations for both item and explanation ranking. A previous work [17] similarly considers user-item-aspect relations as a tripartite graph, where aspects are extracted from user reviews. Another branch of related work is tag recommendation for folksonomy [22, 37], where tags are ranked for each given user–item pair. In terms of problem setting, our work is different from the preceding two, because they solely rank either items/aspects [17] or tags [22, 37], while besides that we also rank item–explanation pairs as a whole in our joint-ranking framework. Another difference is that we study how semantic features of explanations could help enhance the performance of explanation ranking, while none of them did so.


3 PROBLEM FORMULATION

The key notations and concepts for the problems are presented in Table 1. We use \(\mathcal {U}\) to denote the set of all users, \(\mathcal {I}\) the set of all items, and \(\mathcal {E}\) the set of all explanations. Then the historical interaction set is given by \(\mathcal {T} \subseteq \mathcal {U} \times \mathcal {I} \times \mathcal {E}\) (an illustrating example of such interaction is depicted in Figure 2). In the following, we first introduce item ranking and explanation ranking, respectively, and then the item–explanation joint-ranking.


Fig. 2. Illustration of user-item-explanation interactions.

Table 1.
Symbol | Description
\(\mathcal {T}\) | training set
\(\mathcal {U}\) | set of users
\(\mathcal {I}\) | set of items
\(\mathcal {I}_u\) | set of items that user \(u\) preferred
\(\mathcal {E}\) | set of explanations
\(\mathcal {E}_u\) | set of user \(u\)’s explanations
\(\mathcal {E}_i\) | set of item \(i\)’s explanations
\(\mathcal {E}_{u, i}\) | set of explanations that user \(u\) preferred w.r.t. item \(i\)
\(\mathbf {P}\) | latent factor matrix for users
\(\mathbf {Q}\) | latent factor matrix for items
\(\mathbf {O}\) | latent factor matrix for explanations
\(\mathbf {p}_u\) | latent factors of user \(u\)
\(\mathbf {q}_i\) | latent factors of item \(i\)
\(\mathbf {o}_e\) | latent factors of explanation \(e\)
\(b_i\) | bias term of item \(i\)
\(b_e\) | bias term of explanation \(e\)
\(d\) | dimension of latent factors
\(\alpha\), \(\lambda\) | regularization coefficient
\(\gamma\) | learning rate
\(T\) | iteration number
\(M\) | number of recommendations for each user
\(N\) | number of explanations for each recommendation
\(\hat{r}_{u, i}\) | score predicted for user \(u\) on item \(i\)
\(\hat{r}_{u, i, e}\) | score predicted for user \(u\) on explanation \(e\) of item \(i\)

Table 1. Key Notations and Concepts

3.1 Item Ranking

Personalized recommendation aims at providing a user with a ranked list of items that he/she never interacted with before. For each user \(u \in \mathcal {U}\), the list of \(M\) items can be generated as follows: (1) \(\begin{equation} \text{Top}(u, M) := \mathop {\arg \max }_{i \in \mathcal {I} / \mathcal {I}_u}^{M} \hat{r}_{u, \underline{i}}, \end{equation}\) where \(\hat{r}_{u, \underline{i}}\) is the predicted score for a user \(u\) on item \(i\) and \(\mathcal {I} / \mathcal {I}_u\) denotes the set of items on which user \(u\) has no interactions. In Equation (1), \(i\) is underlined, which means that we aim to rank the items.
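As a concrete illustration of Equation (1), the following is a minimal sketch that ranks a user’s unseen items by a predicted score and keeps the top \(M\). The scoring function and data structures are placeholders for illustration, not the paper’s implementation.

```python
import numpy as np

def top_m_items(score_fn, user, all_items, interacted_items, M=10):
    """Equation (1): rank the items the user has not interacted with and keep the top M.

    score_fn(user, item) is assumed to return the predicted score r_hat_{u,i};
    the function and argument names are illustrative, not from the paper's code.
    """
    candidates = [i for i in all_items if i not in interacted_items]  # I \ I_u
    scores = np.array([score_fn(user, i) for i in candidates])
    top_idx = np.argsort(-scores)[:M]      # indices of the M highest scores
    return [candidates[p] for p in top_idx]
```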

3.2 Explanation Ranking

Explanation ranking is the task of finding a list of appropriate explanations for a user–item pair to justify the recommendation. Formally, given a user \(u \in \mathcal {U}\) and an item \(i \in \mathcal {I}\), the goal of this task is to rank the entire collection of explanations \(\mathcal {E}\) and select the top \(N\) to reason why the item \(i\) is recommended. Specifically, we define this list of top \(N\) explanations as (2) \(\begin{equation} \text{Top}(u, i, N) := \mathop {\arg \max }_{e \in \mathcal {E}}^{N} \hat{r}_{u, i, \underline{e}}, \end{equation}\) where \(\hat{r}_{u, i, \underline{e}}\) is the estimated score of explanation \(e\) for a given user–item pair \((u, i)\), which could be given by a recommendation model or by the user’s true behavior.

3.3 Item-Explanation Joint-Ranking

The preceding tasks solely rank either items or explanations. In this task, we further investigate whether it is possible to find an ideal item–explanation pair for a user and to whom the explanation best justifies the item that he/she likes the most. To this end, we treat each pair of item–explanation as a joint unit and then rank these units. Specifically, for each user \(u \in \mathcal {U}\), a ranked list of \(M\) item–explanation pairs can be produced as follows: (3) \(\begin{equation} \text{Top}(u, M) := \mathop {\arg \max }_{i \in \mathcal {I} / \mathcal {I}_u, e \in \mathcal {E}}^{M} \hat{r}_{u, \underline{i, e}}, \end{equation}\) where \(\hat{r}_{u, \underline{i, e}}\) is the predicted score for a given user \(u\) on the item–explanation pair (\(i\), \(e\)).

We see that both the item ranking task and the explanation ranking task are special cases of this item–explanation joint-ranking task. Concretely, Equation (3) degenerates to Equation (1) when explanation \(e\) is fixed, while it reduces to Equation (2) if item \(i\) is already known.


4 OUR FRAMEWORK FOR RANKING TASKS

4.1 Joint-ranking Reformulation

Suppose we have an ideal model that can perform the aforementioned joint-ranking task. During the prediction stage as in Equation (3), there would be \(\left| \mathcal {I} \right| \times \left| \mathcal {E} \right|\) candidate item–explanation pairs to rank for each user \(u \in \mathcal {U}\). The runtime complexity is then \(O (\left| \mathcal {U} \right| \cdot \left| \mathcal {I} \right| \cdot \left| \mathcal {E} \right|)\), which makes this task impractical compared with the traditional recommendation task’s \(O (\left| \mathcal {U} \right| \cdot \left| \mathcal {I} \right|)\) complexity.

To reduce the complexity, we reformulate the joint-ranking task by performing ranking for items and explanations simultaneously but separately. In this way, we are also able to investigate the relationship between item ranking and explanation ranking, e.g., improving the performance of both. Specifically, during the testing stage, we first follow Equation (1) to rank items for each user \(u \in \mathcal {U}\), which has the runtime complexity of \(O (\left| \mathcal {U} \right| \cdot \left| \mathcal {I} \right|)\). After that, for \(M\) recommendations for each user, we can rank and select explanations to justify each of them according to Equation (2). The second step’s complexity is \(O (\left| \mathcal {U} \right| \cdot M \cdot \left| \mathcal {E} \right|)\), but since \(M\) is a constant and \(\left| \mathcal {E} \right| \ll \left| \mathcal {I} \right|\) (see Table 2), the overall complexity of the two steps is \(O (\left| \mathcal {U} \right| \cdot \left| \mathcal {I} \right|)\).
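A minimal sketch of this two-step test-time procedure is given below, assuming two placeholder scoring functions (e.g., Equations (12) and (5)); it is meant to show the control flow behind the complexity argument, not the authors’ implementation.

```python
import heapq

def joint_rank(user, all_items, interacted, all_explanations,
               item_score, expl_score, M=10, N=10):
    """Two-step joint-ranking at test time (Section 4.1):
    (1) rank items as in Equation (1); (2) for each of the M recommended items,
    rank explanations as in Equation (2). item_score(u, i) and expl_score(u, i, e)
    are placeholder scoring functions."""
    candidates = [i for i in all_items if i not in interacted]   # I \ I_u
    top_items = heapq.nlargest(M, candidates, key=lambda i: item_score(user, i))
    ranked = []
    for i in top_items:
        top_expls = heapq.nlargest(N, all_explanations,
                                   key=lambda e: expl_score(user, i, e))
        ranked.append((i, top_expls))
    return ranked
```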

Table 2.
Statistic | Amazon Movies & TV | TripAdvisor | Yelp
# of users | 109,121 | 123,374 | 895,729
# of items | 47,113 | 200,475 | 164,779
# of explanations | 33,767 | 76,293 | 126,696
# of \((u, i)\) pairs | 569,838 | 1,377,605 | 2,608,860
# of \((u, i, e)\) triplets | 793,481 | 2,618,340 | 3,875,118
# of explanations/\((u, i)\) pair | 1.39 | 1.90 | 1.49
Density (\(\times 10^{-10}\)) | 45.71 | 13.88 | 2.07
  • Density is #triplets divided by #users \(\times\) #items \(\times\) #explanations.

Table 2. Statistics of the Datasets


In the following, we first analyze the drawback of a conventional Tensor Factorization (TF) model when being applied to the explanation ranking problem and then introduce our solution, Bayesian Personalized Explanation Ranking (BPER). Second, we show how to further enhance BPER by utilizing the semantic features of textual explanations (denoted as BPER+). Third, we illustrate their relation to two typical TF methods, Canonical Decomposition (CD) and Pairwise Interaction Tensor Factorization (PITF). Last, we integrate the explanation ranking with item ranking into a multi-task learning framework as a joint-ranking task.

4.2 Bayesian Personalized Explanation Ranking

To perform explanation ranking, the score \(\hat{r}_{u, i, e}\) on each explanation \(e \in \mathcal {E}\) for a given user–item pair \((u, i)\) must be estimated. As the user-item-explanation ternary relations \(\mathcal {T} = \lbrace (u, i, e) | u \in \mathcal {U}, i \in \mathcal {I}, e \in \mathcal {E}\rbrace\) form an interaction cube, we are inspired to employ factorization models to predict this type of scores. There are a number of tensor factorization techniques [2, 21], such as Tucker Decomposition (TD) [43], CD [3], and High Order Singular Value Decomposition [13]. Intuitively, one would adopt CD because of its linear runtime complexity in terms of both training and prediction [37] and its close relation to Matrix Factorization (MF) [34], which has been extensively studied in recent years for item recommendation. Formally, according to CD, the score \(\hat{r}_{u, i, e}\) of user \(u\) on item \(i\)’s explanation \(e\) can be estimated by the sum over the element-wise multiplication of the user’s latent factors \(\mathbf {p}_u\), the item’s \(\mathbf {q}_i\), and the explanation’s \(\mathbf {o}_e\), (4) \(\begin{equation} \hat{r}_{u, i, e} = (\mathbf {p}_u \odot \mathbf {q}_i)^\top \mathbf {o}_e = \sum _{k = 1}^d p_{u, k} \cdot q_{i, k} \cdot o_{e, k}, \end{equation}\) where \(\odot\) denotes the element-wise multiplication of two vectors.
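For reference, Equation (4) amounts to the following one-line computation, sketched here with NumPy vectors of dimension \(d\):

```python
import numpy as np

def cd_score(p_u, q_i, o_e):
    """Equation (4): CD's triplet score is the sum over the element-wise
    product of the user, item, and explanation latent factor vectors."""
    return float(np.sum(p_u * q_i * o_e))
```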

However, this method may not be effective enough due to the inherent sparsity problem of the ternary data as we discussed before. Since each user–item pair \((u, i)\) in the training set \(\mathcal {T}\) is unlikely to have interactions with many explanations in \(\mathcal {E}\), the data sparsity problem for explanation ranking is more severe than that for item recommendation. Simply multiplying the three vectors would hurt the performance of explanation ranking, which is evidenced by our experimental results in Section 6.

To mitigate such an issue and to improve the effectiveness of explanation ranking, we propose to separately estimate the user \(u\)’s preference score \(\hat{r}_{u, e}\) on explanation \(e\) and the item \(i\)’s appropriateness score \(\hat{r}_{i, e}\) for explanation \(e\). To this end, we perform two sets of matrix factorization rather than employing one single TF model. In this way, the sparsity problem would be considerably alleviated, since the data are reduced to two collections of binary relations, both of which are similar to the case of item recommendation discussed above. Last, the two scores \(\hat{r}_{u, e}\) and \(\hat{r}_{i, e}\) are combined linearly through a hyper-parameter \(\mu\). Specifically, the score of user \(u\) for item \(i\) on explanation \(e\) is predicted as follows: (5) \(\begin{equation} \left\lbrace \begin{array}{l} \hat{r}_{u, e} = \mathbf {p}_u^\top \mathbf {o}_e^U + b_e^U = \sum _{k = 1}^d p_{u, k} \cdot o_{e, k}^U + b_e^U \\ \hat{r}_{i, e} = \mathbf {q}_i^\top \mathbf {o}_e^I + b_e^I = \sum _{k = 1}^d q_{i, k} \cdot o_{e, k}^I + b_e^I \\ \hat{r}_{u, i, e} = \mu \cdot \hat{r}_{u, e} + (1 - \mu) \cdot \hat{r}_{i, e} \end{array}, \right. \end{equation}\) where \(\lbrace \mathbf {o}_e^U, b_e^U\rbrace\) and \(\lbrace \mathbf {o}_e^I, b_e^I\rbrace\) are two different sets of latent factors for explanations, corresponding to users and items, respectively.
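A small sketch of Equation (5) follows, with \(\mu\) balancing the user-side and item-side scores; the function signature is illustrative.

```python
import numpy as np

def bper_score(p_u, q_i, o_e_U, b_e_U, o_e_I, b_e_I, mu=0.5):
    """Equation (5): two MF-style scores combined linearly by mu."""
    r_ue = float(np.dot(p_u, o_e_U)) + b_e_U   # user u's preference for explanation e
    r_ie = float(np.dot(q_i, o_e_I)) + b_e_I   # item i's appropriateness for explanation e
    return mu * r_ue + (1.0 - mu) * r_ie
```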

Since selecting explanations that are likely to be perceived helpful by users is inherently a ranking-oriented task, directly modeling the relative order of explanations is thus more effective than simply predicting their absolute scores. The Bayesian Personalized Ranking (BPR) criterion [36] meets such an optimization requirement. Intuitively, a user would be more likely to appreciate explanations that cater to her own preferences, while those that do not fit one’s interests would be less attractive to the user. Similarly, some explanations might be more suitable to describe certain items, while other explanations might not. To build such types of pairwise preferences, we use the first two rows in Equation (5) to compute the difference between two explanations for both user \(u\) and item \(i\) as follows: (6) \(\begin{equation} \left\lbrace \!\! \begin{array}{l} \hat{r}_{u, ee^{\prime }} = \hat{r}_{u, e} - \hat{r}_{u, e^{\prime }} \\ \hat{r}_{i, ee^{\prime \prime }} = \hat{r}_{i, e} - \hat{r}_{i, e^{\prime \prime }} \end{array}, \right. \end{equation}\) which respectively reflect user \(u\)’s interest in explanation \(e\) over \(e^{\prime }\) and item \(i\)’s appropriateness for explanation \(e\) over \(e^{\prime \prime }\).

With the scores \(\hat{r}_{u, ee^{\prime }}\) and \(\hat{r}_{i, ee^{\prime \prime }}\), we can then adopt the BPR criterion [36] to minimize the following objective function: (7) \(\begin{equation} \min _{\Theta } \sum _{u \in \mathcal {U}} \sum _{i \in \mathcal {I}_u} \sum _{e \in \mathcal {E}_{u, i}} \left[ \sum _{e^{\prime } \in \mathcal {E} / \mathcal {E}_u} - \ln \sigma (\hat{r}_{u, ee^{\prime }}) + \sum _{e^{\prime \prime } \in \mathcal {E} / \mathcal {E}_i} - \ln \sigma (\hat{r}_{i, ee^{\prime \prime }}) \right] + \lambda \left| \left| \Theta \right| \right|_F^2, \end{equation}\) where \(\sigma (\cdot)\) denotes the sigmoid function, \(\mathcal {I}_{u}\) represents the set of items that user \(u\) has interacted with, \(\mathcal {E}_{u, i}\) is the set of explanations in the training set for the user–item pair \((u, i)\), \(\mathcal {E} / \mathcal {E}_u\) and \(\mathcal {E} / \mathcal {E}_i\) respectively correspond to explanations that user \(u\) and item \(i\) have not interacted with, \(\Theta\) denotes the model parameters, and \(\lambda\) is the regularization coefficient.

From Equation (7), we can see that there are two explanation tasks to be learned respectively, corresponding to users and items. During the training stage, we allow them to be equally important, since we have a hyper-parameter \(\mu\) in Equation (5) to balance their importance during the testing stage. The effect of this parameter is studied in Section 6.1. After the model parameters are estimated, we can rank explanations according to Equation (2) for each user–item pair in the testing set. As we model the explanation ranking task under BPR criterion, we accordingly name our method BPER. To learn the model parameter \(\Theta\), we draw on the widely used stochastic gradient descent algorithm to optimize the objective function in Equation (7). Specifically, we first randomly initialize the parameters and then repeatedly update them by uniformly taking samples from the training set and computing the gradients w.r.t. the parameters, until the convergence of the algorithm. The complete learning steps are shown in Algorithm 1.
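Since Algorithm 1 is not reproduced here, the following is a simplified sketch of one stochastic gradient step on Equation (7). It assumes the biases are also L2-regularized; the variable names and data layout (latent factor matrices indexed by ID) are ours, not the authors’ exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bper_sgd_step(P, Q, OU, bU, OI, bI, u, i, e, e_neg_u, e_neg_i,
                  lr=0.01, lam=0.01):
    """One stochastic gradient step on the BPER objective in Equation (7).

    (u, i, e) is an observed triplet; e_neg_u and e_neg_i are sampled
    explanations that user u and item i, respectively, have not interacted with.
    P, Q, OU, OI are latent factor matrices; bU, bI are bias vectors.
    """
    # user side: x_u = r_{u,e} - r_{u,e'}  (first row of Equation (6))
    x_u = P[u] @ (OU[e] - OU[e_neg_u]) + bU[e] - bU[e_neg_u]
    c_u = sigmoid(x_u) - 1.0        # derivative of -ln(sigmoid(x_u)) w.r.t. x_u
    # item side: x_i = r_{i,e} - r_{i,e''}  (second row of Equation (6))
    x_i = Q[i] @ (OI[e] - OI[e_neg_i]) + bI[e] - bI[e_neg_i]
    c_i = sigmoid(x_i) - 1.0

    # gradients (computed before any update), with L2 regularization lam
    g_Pu  = c_u * (OU[e] - OU[e_neg_u]) + lam * P[u]
    g_OUe = c_u * P[u] + lam * OU[e]
    g_OUn = -c_u * P[u] + lam * OU[e_neg_u]
    g_Qi  = c_i * (OI[e] - OI[e_neg_i]) + lam * Q[i]
    g_OIe = c_i * Q[i] + lam * OI[e]
    g_OIn = -c_i * Q[i] + lam * OI[e_neg_i]

    P[u]        -= lr * g_Pu
    OU[e]       -= lr * g_OUe
    OU[e_neg_u] -= lr * g_OUn
    bU[e]       -= lr * (c_u + lam * bU[e])
    bU[e_neg_u] -= lr * (-c_u + lam * bU[e_neg_u])
    Q[i]        -= lr * g_Qi
    OI[e]       -= lr * g_OIe
    OI[e_neg_i] -= lr * g_OIn
    bI[e]       -= lr * (c_i + lam * bI[e])
    bI[e_neg_i] -= lr * (-c_i + lam * bI[e_neg_i])
```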

4.3 BERT-enhanced BPER (BPER+)

The BPER model only exploits the IDs of users, items, and explanations to infer their relation for explanation ranking. However, this leaves the rich semantic features of the explanations, which could also capture the relations among explanations, under-explored. For example, “the acting is good” and “the acting is great” for movie recommendation both convey a positive sentiment with a similar meaning, so their ranks are expected to be close. Hence, we further investigate whether such features could help to enhance BPER. As a feature extractor, we opt for BERT [14], a well-known pre-trained language model, whose effectiveness has been demonstrated on a wide range of natural language understanding tasks. Specifically, we first add a special [CLS] token at the beginning of a textual explanation \(e\), e.g., “[CLS] the acting is great.” After passing it through BERT, we can obtain the aggregate representation (corresponding to [CLS]) that encodes the explanation’s overall semantics. To match the dimension of latent factors in our model, we apply a linear layer to this vector, resulting in \(\mathbf {o}_e^{BERT}\). Then, we enhance the two ID-based explanation vectors \(\mathbf {o}_e^U\) and \(\mathbf {o}_e^I\) in Equation (5) by multiplying \(\mathbf {o}_e^{BERT}\), resulting in \(\mathbf {o}_e^{U+}\) and \(\mathbf {o}_e^{I+}\), (8) \(\begin{equation} \left\lbrace \begin{array}{l} \mathbf {o}_e^{U+} = \mathbf {o}_e^U \odot \mathbf {o}_e^{BERT} \\ \mathbf {o}_e^{I+} = \mathbf {o}_e^I \odot \mathbf {o}_e^{BERT} \end{array}. \right. \end{equation}\)

To predict the score for the \((u, i, e)\) triplet, we replace \(\mathbf {o}_e^U\) and \(\mathbf {o}_e^I\) in Equation (5) with \(\mathbf {o}_e^{U+}\) and \(\mathbf {o}_e^{I+}\). Then we use Equation (7) as the objective function, which can be optimized via back-propagation. In Equation (8), we adopt the multiplication operation simply to verify the feasibility of incorporating semantic features. The model may be further improved by more sophisticated operations, e.g., multi-layer perceptron, but we leave the exploration for future work.
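The feature extraction and the fusion in Equation (8) could look roughly as follows with the HuggingFace transformers library; the checkpoint name bert-base-uncased and the projection size \(d = 20\) are assumptions for illustration, not details confirmed by the paper.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")
proj = torch.nn.Linear(bert.config.hidden_size, 20)             # map 768 -> d

def explanation_features(texts):
    """Return o_e^{BERT} for a batch of textual explanations.

    The tokenizer prepends [CLS]; its final hidden state is taken as the
    aggregate representation and projected to the latent-factor dimension d."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = bert(**enc)
    cls = out.last_hidden_state[:, 0]   # [CLS] representation, shape (batch, 768)
    return proj(cls)                    # shape (batch, d)

def fuse(o_e_id, o_e_bert):
    """Equation (8): element-wise product of ID-based and BERT-based factors."""
    return o_e_id * o_e_bert
```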

Notice that BPER is a general method that only requires the IDs of users, items, and explanations, which makes it very flexible when adapted to other explanation styles (e.g., images [9]). However, it may suffer from the common cold-start issue as with other recommender systems. BPER+ could mitigate this issue to some extent, because besides IDs it also considers the semantic relation between textual explanations via BERT, which can connect new explanations with existing ones. As the first work on ranking explanations for recommendations, we opt to make both methods relatively simple for reproducibility purposes. In this way, it is also easy to observe the experimental results (such as the impact of the explanation task on the recommendation task) without the interference of other factors.

4.4 Relation among BPER, BPER+, CD, and PITF

In fact, our BPER model is a type of TF, so we analyze its relation to two closely related TF methods: CD [3] and PITF [37]. On the one hand, in theory BPER can be considered as a special case of the CD model. Suppose the dimensionality of BPER is \(2 \cdot d + 2\). We can reformulate it as CD in the following: (9) \(\begin{equation} \begin{aligned}p_{u, k}^{CD} &= {\left\lbrace \begin{array}{ll} \mu \cdot p_{u, k}, & \mbox{if}\ k \le d \\ \mu , & \mbox{else} \end{array}\right.}, \\ q_{i, k}^{CD} &= {\left\lbrace \begin{array}{ll} (1 - \mu) \cdot q_{i, k}, & \mbox{if}\ k \gt d \mbox{ and}\ k \le 2 \cdot d \\ 1 - \mu , & \mbox{else} \end{array}\right.}, \\ o_{e, k}^{CD} &= {\left\lbrace \begin{array}{ll} o_{e, k}^U, & \mbox{if}\ k \le d \\ o_{e, k}^I, & \mbox{else if}\ k \le 2 \cdot d \\ b_e^U, & \mbox{else if}\ k = 2 \cdot d + 1 \\ b_e^I, & \mbox{else} \end{array}\right.}, \end{aligned} \end{equation}\) where the parameter \(\mu\) is a constant.

On the other hand, PITF can be seen as a special case of our BPER. Formally, its predicted score \(\hat{r}_{u, i, e}\) for the user-item-explanation triplet \((u, i, e)\) can be calculated by (10) \(\begin{equation} \hat{r}_{u, i, e} = \mathbf {p}_u^\top \mathbf {o}_e^U + \mathbf {q}_i^\top \mathbf {o}_e^I = \sum _{k = 1}^d p_{u, k} \cdot o_{e, k}^U + \sum _{k = 1}^d q_{i, k} \cdot o_{e, k}^I. \end{equation}\)

We can see that our BPER degenerates to PITF if in Equation (5) we remove the bias terms \(b_e^U\) and \(b_e^I\) and set the hyper-parameter \(\mu\) to 0.5, which means that the two types of scores for users and items are equally important to the explanation ranking task.
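This degeneration can be checked numerically. The snippet below is a sketch with random latent factors: with the biases removed and \(\mu = 0.5\), BPER’s score is exactly half of PITF’s, so the two induce the same explanation ranking.

```python
import numpy as np

def pitf_score(p_u, q_i, o_e_U, o_e_I):
    """Equation (10): PITF's prediction for a (u, i, e) triplet."""
    return np.dot(p_u, o_e_U) + np.dot(q_i, o_e_I)

def bper_score_no_bias(p_u, q_i, o_e_U, o_e_I, mu):
    """Equation (5) with the bias terms b_e^U and b_e^I removed."""
    return mu * np.dot(p_u, o_e_U) + (1 - mu) * np.dot(q_i, o_e_I)

d = 20
rng = np.random.default_rng(0)
p_u, q_i, o_e_U, o_e_I = (rng.random(d) for _ in range(4))
# BPER (mu = 0.5, no biases) equals 0.5 * PITF, a monotone transform of the score.
assert np.isclose(bper_score_no_bias(p_u, q_i, o_e_U, o_e_I, 0.5),
                  0.5 * pitf_score(p_u, q_i, o_e_U, o_e_I))
```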

Although CD is more general than our BPER, its performance may be affected by the data sparsity issue as discussed before. Our BPER could mitigate this problem given its explicitly designed structure, which may be difficult for CD to learn from scratch. When compared with PITF, we find that the parameter \(\mu\) in BPER is able to balance the importance of the two types of scores, corresponding to users and items, which makes our BPER more expressive than PITF and hence likely to reach better ranking quality.

In a similar way, BPER+ can also be rewritten as CD or PITF. Concretely, by revising the last part of Equation (9) as the following formula, BPER+ can be seen as CD: (11) \(\begin{equation} \begin{aligned}o_{e, k}^{CD} &= {\left\lbrace \begin{array}{ll} o_{e, k}^U \cdot o_{e, k}^{BERT}, & \mbox{if}\ k \le d \\ o_{e, k}^I \cdot o_{e, k}^{BERT}, & \mbox{else if}\ k \le 2 \cdot d \\ b_e^U, & \mbox{else if}\ k = 2 \cdot d + 1 \\ b_e^I, & \mbox{else} \end{array}\right.}. \end{aligned} \end{equation}\) When \(\mathbf {o}_e^{BERT} = [1, \ldots , 1]^\top\), BPER+ is equal to BPER, so it can be easily converted into PITF. The graphical illustration of the four models is shown in Figure 3.


Fig. 3. Tensor Factorization models. The three matrices (i.e., \(\mathbf {P}\), \(\mathbf {Q}\), \(\mathbf {O}\)) are model parameters. Our BPER and BPER+ can be regarded as special cases of CD, while PITF can be seen as a special case of our BPER and BPER+.

4.5 Joint-Ranking on BPER (BPER-J)

Owing to BPER’s flexibility to accommodate various explanation styles as discussed before, we perform the joint-ranking on it. Specifically, we incorporate the two tasks of explanation ranking and item recommendation into a unified multi-task learning framework so as to find a good solution that benefits both of them.

For recommendation, we adopt the Singular Value Decomposition model [23] to predict the score \(\hat{r}_{u, i}\) of user \(u\) on item \(i\): (12) \(\begin{equation} \hat{r}_{u, i} = \mathbf {p}_u^\top \mathbf {q}_i + b_i = \sum _{k = 1}^d p_{u, k} \cdot q_{i, k} + b_i, \end{equation}\) where \(b_i\) is the bias term for item \(i\). Notice that the latent factors \(\mathbf {p}_u\) and \(\mathbf {q}_i\) are shared with those for explanation ranking in Equation (5). In essence, item recommendation is also a ranking task that can be optimized using the BPR criterion [36], so we first compute the preference difference \(\hat{r}_{u, ii^{\prime }}\) between a pair of items \(i\) and \(i^{\prime }\) for a user \(u\) as follows: (13) \(\begin{equation} \hat{r}_{u, ii^{\prime }} = \hat{r}_{u, i} - \hat{r}_{u, i^{\prime }}, \end{equation}\) which can then be combined with the task of explanation ranking in Equation (7) to form the following objective function for joint-ranking: (14) \(\begin{equation} \begin{split} \min _{\Theta } \sum _{u \in \mathcal {U}} \sum _{i \in \mathcal {I}_u} \Big [\sum _{i^{\prime } \in \mathcal {I} / \mathcal {I}_u} - \ln \sigma (\hat{r}_{u, ii^{\prime }}) + \alpha \sum _{e \in \mathcal {E}_{u, i}} \Big (\sum _{e^{\prime } \in \mathcal {E} / \mathcal {E}_u} \\ - \ln \sigma (\hat{r}_{u, ee^{\prime }}) + \sum _{e^{\prime \prime } \in \mathcal {E} / \mathcal {E}_i} - \ln \sigma (\hat{r}_{i, ee^{\prime \prime }})\Big)\Big ] + \lambda \left| \left| \Theta \right| \right|_F^2, \end{split} \end{equation}\) where the parameter \(\alpha\) can be fine-tuned to balance the learning of the two tasks.

We name this method BPER-J where J denotes joint-ranking. Similarly to BPER, we can update each parameter of BPER-J via stochastic gradient descent (see Algorithm 2).
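For clarity, the per-sample value of the joint objective in Equation (14) (excluding the regularization term) can be written as below; the score arguments are assumed to come from Equations (12) and (5) for an observed triplet \((u, i, e)\) and sampled negatives \(i^{\prime }\), \(e^{\prime }\), \(e^{\prime \prime }\). This is a sketch of the loss value, not of Algorithm 2 itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bper_j_sample_loss(r_ui, r_ui_neg, r_ue, r_ue_neg, r_ie, r_ie_neg, alpha):
    """Per-sample joint-ranking loss of Equation (14), without the L2 term.

    r_ui / r_ui_neg: scores of the observed item i and a sampled negative i'
    (Equation (12)); r_ue / r_ue_neg and r_ie / r_ie_neg: user- and item-side
    explanation scores for e and the sampled negatives e', e'' (Equation (5)).
    alpha weights the explanation task against the recommendation task."""
    rec_loss = -np.log(sigmoid(r_ui - r_ui_neg))
    exp_loss = (-np.log(sigmoid(r_ue - r_ue_neg))
                - np.log(sigmoid(r_ie - r_ie_neg)))
    return rec_loss + alpha * exp_loss
```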


5 EXPERIMENTAL SETUP

5.1 Datasets

To compare the ranking performance of different methods, the datasets are expected to contain user-item-explanation interaction triplets. The datasets could be manually constructed as in Reference [49], but we are not given access to such datasets. Therefore, we adopt three public datasets [27], where the explanations are automatically extracted from user reviews via near-duplicate detection, which ensures that the explanations are commonly used by users. Specifically, the datasets are from different domains, including Amazon Movies & TV, TripAdvisor for hotels, and Yelp for restaurants. Each record in the three datasets consists of a user ID, an item ID, and one or multiple explanation IDs, and thus results in one or multiple user-item-explanation triplets. Moreover, each explanation ID appears no fewer than 5 times. The statistics of the three datasets are presented in Table 2. As can be seen, the data sparsity issue on the three datasets is very severe.

Table 3 shows five example explanations taken from the three datasets. As we can see, all the explanations are quite concise and informative, which could prevent users from being overwhelmed, a critical issue for explainable recommendation [18]. Also, short explanations can be mobile-friendly, since it is difficult for a small screen to fit much content. Moreover, the explanations from different datasets suit the target application domains well, such as “a wonderful movie for all ages” for movies and “comfortable hotel with good facilities” for hotels. Explanations with negative sentiment can also be observed, e.g., “this place is awful,” which can be used to justify why some items are dis-recommended [53]. Hence, we believe that the datasets are very suitable for our explanation ranking experiment.

Table 3.
Amazon Movies & TV
Great story
Don’t waste your money
The acting is great
The sound is okay
A wonderful movie for all ages
TripAdvisor
Great location
The room was clean
The staff were friendly and helpful
Bad service
Comfortable hotel with good facilities
Yelp
Great service
Everything was delicious
Prices are reasonable
This place is awful
The place was clean and the food was good

Table 3. Example Explanations on the Three Datasets

5.2 Compared Methods

To evaluate the performance of the explanation ranking task, where the user–item pairs are given, we adopt the following baselines. Notice that we omit the comparison with TD [43], because it takes cubic time to run and we also find that it does not perform better than CD in our trial experiment.

RAND: This is a weak baseline that randomly picks explanations from the explanation collection \(\mathcal {E}\). It is devised to examine whether personalization is needed for explanation ranking.

Revised User-based Collaborative Filtering (RUCF): Because traditional CF methods [38, 39] cannot be directly applied to the ternary data, we make some modifications to their formula, following Reference [22]. The similarity between two users is measured by their associated explanation sets via Jaccard Index. When predicting a score for the \((u, i, e)\) triplet, we first find users associated with the same item \(i\) and explanation \(e\), i.e., \(\mathcal {U}_i \cap \mathcal {U}_e\), from which we then find the ones appearing in user \(u\)’s neighbor set \(\mathcal {N}_u\), (15) \(\begin{equation} \hat{r}_{u, i, e} = \sum _{u^{\prime } \in \mathcal {N}_u \cap (\mathcal {U}_i \cap \mathcal {U}_e)} s_{u, u^{\prime }}\ \mbox{where}\ s_{u, u^{\prime }} = \frac{\vert \mathcal {E}_u \cap \mathcal {E}_{u^{\prime }} \vert }{\vert \mathcal {E}_u \cup \mathcal {E}_{u^{\prime }} \vert }. \end{equation}\)

Revised Item-based Collaborative Filtering (RICF): This method predicts a score for a triplet from the perspective of items, whose formula is similar to Equation (15).

CD [3] as shown in Equation (4): This method only predicts one score instead of two for the triplet \((u, i, e)\), so its objective function shown below is slightly different from ours in Equation (7), (16) \(\begin{equation} \min _{\Theta } \sum _{u \in \mathcal {U}} \sum _{i \in \mathcal {I}_u} \sum _{e \in \mathcal {E}_{u, i}} \sum _{e^{\prime } \in \mathcal {E} / \mathcal {E}_{u, i}} - \ln \sigma (\hat{r}_{u, i, ee^{\prime }}) + \lambda \left| \left| \Theta \right| \right|_F^2, \end{equation}\) where \(\hat{r}_{u, i, ee^{\prime }} = \hat{r}_{u, i, e} - \hat{r}_{u, i, e^{\prime }}\) is the score difference between a pair of interactions.

PITF [37]: This makes prediction for a triplet based on Equation (10), and its objective function is identical to CD’s in Equation (16).

To verify the effectiveness of the joint-ranking framework, in addition to our method BPER-J, we also present the results of two baselines: CD [3] and PITF [37]. Since CD and PITF are not originally designed to accomplish the two tasks of item recommendation and explanation ranking together, we first allow them to make prediction for a user–item pair \((u, i)\) via the inner product of their latent factors, i.e., \(\hat{r}_{u, i} = \mathbf {p}_u^T \mathbf {q}_i\), and then combine this task with explanation ranking in a multi-task learning framework whose objective function is given as follows: (17) \(\begin{equation} \min _{\Theta } \sum _{u \in \mathcal {U}} \sum _{i \in \mathcal {I}_u} \Big [ \sum _{i^{\prime } \in \mathcal {I} / \mathcal {I}_u} - \ln \sigma (\hat{r}_{u, ii^{\prime }}) + \alpha \sum _{e \in \mathcal {E}_{u, i}} \sum _{e^{\prime } \in \mathcal {E} / \mathcal {E}_{u, i}} - \ln \sigma (\hat{r}_{u, i, ee^{\prime }}) \Big ] + \lambda \left| \left| \Theta \right| \right|_F^2, \end{equation}\) where \(\hat{r}_{u, ii^{\prime }} = \hat{r}_{u, i} - \hat{r}_{u, i^{\prime }}\) is the difference between a pair of records. We name them CD-J and PITF-J, respectively, where J denotes joint-ranking.
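Returning to the RUCF baseline above, a sketch of Equation (15) is given here to make the neighbor-based score concrete; the dictionary/set data structures (neighbors, users_of_item, users_of_expl, expl_of_user) are illustrative, not from the original implementation.

```python
def jaccard(a, b):
    """Jaccard index between two explanation sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rucf_score(u, i, e, neighbors, users_of_item, users_of_expl, expl_of_user):
    """Equation (15): sum the similarities between user u and those of u's
    neighbors who have interacted with both item i and explanation e."""
    candidates = neighbors[u] & users_of_item[i] & users_of_expl[e]
    return sum(jaccard(expl_of_user[u], expl_of_user[v]) for v in candidates)
```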

5.3 Evaluation Metrics

To evaluate the performance of both recommendation and explanation, we adopt four commonly used ranking-oriented metrics in recommender systems: NDCG, Precision, Recall, and F1. We evaluate on top-10 ranking for both recommendation and explanation tasks. For the former task, it is easy to find the definition of the metrics in previous works, so we define those for the latter. Specifically, the scores for a user–item pair on the four metrics are computed as follows: (18) \(\begin{equation} \begin{aligned}\text{rel}_p &= \delta (\text{Top}(u, i, N, p) \in \mathcal {E}_{u, i}^{te}) \\ \text{NDCG}(u, i, N) & = \frac{1}{Z} \sum _{p = 1}^N \frac{2^{\text{rel}_p} - 1}{\log (p + 1)}, \mbox{ where } Z = \sum _{p = 1}^N \frac{1}{\log (p + 1)} \\ \text{Precision}(u, i, N) & = \frac{1}{N} \sum _{p = 1}^N \text{rel}_p \mbox{ and } \text{Recall}(u, i, N) = \frac{1}{\left| \mathcal {E}_{u, i}^{te} \right|} \sum _{p = 1}^N \text{rel}_p \\ \text{F1}(u, i, N) & = 2 \times \frac{\text{Precision}(u, i, N) \times \text{Recall}(u, i, N)}{\text{Precision}(u, i, N) + \text{Recall}(u, i, N)}, \end{aligned} \end{equation}\) where \(\text{rel}_p\) indicates whether the \(p\)th explanation in the ranked list \(\text{Top}(u, i, N)\) can be found in the ground-truth explanation set \(\mathcal {E}_{u, i}^{te}\). Then, we can average the scores for all user–item pairs in the testing set.
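A sketch of Equation (18) for a single user–item pair follows; the logarithm base (2) is an assumption, and the per-pair scores would then be averaged over the testing set as described above.

```python
import numpy as np

def explanation_metrics(ranked, ground_truth, N=10):
    """NDCG/Precision/Recall/F1 at N for one (u, i) pair (Equation (18)).

    ranked: ordered list Top(u, i, N); ground_truth: set E_{u,i}^{te}."""
    rel = np.array([1.0 if p < len(ranked) and ranked[p] in ground_truth else 0.0
                    for p in range(N)])
    discounts = 1.0 / np.log2(np.arange(2, N + 2))   # 1 / log(p + 1), p = 1..N
    Z = discounts.sum()
    ndcg = float(np.sum((np.power(2.0, rel) - 1.0) * discounts) / Z)
    precision = rel.sum() / N
    recall = rel.sum() / len(ground_truth)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return ndcg, precision, recall, f1
```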

5.4 Implementation Details

We randomly divide each dataset into training (70%) and testing (30%) sets and guarantee that each user/item/explanation has at least one record in the training set. The splitting process is repeated 5 times. For validation, we randomly draw 10% of the records from the training set. After hyper-parameter tuning, the average performance on the five testing sets is reported.

We implemented all the methods in Python. For TF-based methods, including CD, PITF, CD-J, PITF-J, and our BPER and BPER-J, we search the dimension of latent factors \(d\) from [10, 20, 30, 40, 50], regularization coefficient \(\lambda\) from [0.001, 0.01, 0.1], learning rate \(\gamma\) from [0.001, 0.01, 0.1], and maximum iteration number \(T\) from [100, 500, 1000]. As to the joint-ranking of CD-J, PITF-J, and our BPER-J, the regularization coefficient \(\alpha\) on the explanation task is searched from [0, 0.1, \(\ldots ,\) 0.9, 1]. For the evaluation of joint-ranking, we first evaluate the performance of item recommendation for users, followed by the evaluation of explanation ranking on those correctly predicted user–item pairs. For our methods BPER and BPER-J, the parameter \(\mu\) that balances user and item scores for explanation ranking is searched from [0, 0.1, \(\ldots ,\) 0.9, 1]. After parameter tuning, we use \(d = 20\), \(\lambda = 0.01\), \(\gamma = 0.01\) and \(T = 500\) for our methods, while the other parameters \(\alpha\) and \(\mu\) are dependent on the datasets.

The configuration of BPER+ is slightly different, because of the textual content of the explanations. We adopted the pre-trained BERT from huggingface, and implemented the model in Python with PyTorch. We set the batch size to 128, \(d = 20\), and \(T = 5\). After parameter tuning, we set the learning rate \(\gamma\) to 0.0001 on Amazon, and 0.00001 on both TripAdvisor and Yelp.


6 RESULTS AND ANALYSIS

In this section, we first compare our methods BPER and BPER+ with baselines regarding explanation ranking. Then, we study the capability of our methods in dealing with varying data sparseness. Third, we show a case study of explanation ranking for both recommendation and disrecommendation, and also present a small user study. Last, we analyze the joint-ranking results of three TF-based methods.

6.1 Comparison of Explanation Ranking

Experimental results for explanation ranking on the three datasets are shown in Table 4. We see that each method’s performance on the four metrics (i.e., NDCG, Precision, Recall, and F1) is fairly consistent across the three datasets. The method RAND is among the weakest baselines, because it randomly selects explanations without considering user and item information, which implies that the explanation ranking task is non-trivial. CD performs even worse than RAND, because of the sparsity issue in the ternary data (see Table 2), which CD may not be able to mitigate, as discussed in Section 4.2. CF-based methods, i.e., RUCF and RICF, largely advance the performance of RAND, as they take into account the information of either users or items, which confirms the important role of personalization for explanation ranking. However, their performance is still limited due to data sparsity. PITF and our BPER/BPER+ outperform the CF-based methods by a large margin, as they not only address the data sparsity issue via their MF-like model structure but also take each user’s and item’s information into account using latent factors. Most importantly, our method BPER significantly outperforms the strongest baseline PITF, owing to its ability to produce two sets of scores, corresponding to users and items respectively, and its parameter \(\mu\) that can balance their relative importance to explanation ranking. Last, BPER+ further improves BPER on most of the metrics across the three datasets, especially on NDCG that cares about the ranking order, which can be attributed to the consideration of the semantic features of the explanations as well as BERT’s strong language modeling capability to extract them.

Table 4.
Method | NDCG@10 (%) | Precision@10 (%) | Recall@10 (%) | F1@10 (%) | Training Time
Amazon Movies & TV
CD | 0.001 | 0.001 | 0.007 | 0.002 | 1h48min
RAND | 0.004 | 0.004 | 0.027 | 0.006
RUCF | 0.341 | 0.170 | 1.455 | 0.301
RICF | 0.417 | 0.259 | 1.797 | 0.433
PITF | 2.352 | 1.824 | 14.125 | 3.149 | 1h51min
BPER | 2.630* | 1.942* | 15.147* | 3.360* | 1h56min
BPER+ | 2.877* | 1.919* | 14.936* | 3.317*
Improvement (%) | 22.352 | 5.229 | 5.739 | 5.343
TripAdvisor
CD | 0.001 | 0.001 | 0.003 | 0.001 | 5h32min
RAND | 0.002 | 0.002 | 0.011 | 0.004
RUCF | 0.260 | 0.151 | 0.779 | 0.242
RICF | 0.031 | 0.020 | 0.087 | 0.030
PITF | 1.239 | 1.111 | 5.851 | 1.788 | 7h9min
BPER | 1.389* | 1.236* | 6.549* | 1.992* | 9h43min
BPER+ | 2.096* | 1.565* | 8.151* | 2.515*
Improvement (%) | 69.073 | 40.862 | 39.314 | 40.665
Yelp
CD | 0.000 | 0.000 | 0.003 | 0.001 | 12h7min
RAND | 0.001 | 0.001 | 0.007 | 0.002
RUCF | 0.040 | 0.020 | 0.125 | 0.033
RICF | 0.037 | 0.026 | 0.137 | 0.042
PITF | 0.712 | 0.635 | 4.172 | 1.068 | 11h27min
BPER | 0.814* | 0.723* | 4.768* | 1.218* | 16h30min
BPER+ | 0.903* | 0.731* | 4.544* | 1.220*
Improvement (%) | 26.861 | 15.230 | 8.925 | 14.228
  • The best performing values are in bold, and the second best underlined. Improvements are made by BPER+ over the best baseline PITF (* indicates the statistical significance over PITF for \(p \lt\) 0.01 via Student’s \(t\)-test).

Table 4. Performance Comparison of all Methods on the Top-10 Explanation Ranking in Terms of NDCG, Precision, Recall, and F1 (%)


Besides the explanation ranking performance, we also present the training time comparison of the three TF-based methods in Table 4. For fair comparison, the runtime testing is conducted on the same research machine without GPU, because these methods are all implemented in pure Python without involving a deep learning framework. From the table, we can see that the training time of the three methods is generally consistent on different datasets. CD takes the least time to train, PITF needs a bit more training time, while the duration of training our BPER is the longest. This is quite expected, since the model complexity grows larger from CD to PITF and BPER. However, BPER’s slightly longer training time is quite acceptable, because the gap in training duration among the three methods is not very large, e.g., 5h32min for CD, 7h9min for PITF, and 9h43min for BPER on the TripAdvisor dataset.

Last, we further analyze the parameter \(\mu\) of BPER that controls the contributions of user scores and item scores in Equation (5). As can be seen in Figure 4, the curves of NDCG, Precision, Recall, and F1 are all bell-shaped, where the performance improves significantly with the increase of \(\mu\) until it reaches an optimal point, and then it drops sharply. Due to the characteristics of different application domains, the optimal points vary among the three datasets, i.e., 0.7 for both Amazon and Yelp and 0.5 for TripAdvisor. We omit the figures of BPER+, because the pattern is similar.


Fig. 4. The effect of \(\mu\) in BPER on explanation ranking in three datasets. NDCG@10, Precision@10, and F1@10 are linearly scaled for better visualization.

6.2 Results on Varying Data Sparseness

As discussed earlier, the sparsity issue of user-item-explanation triple-wise data is more severe than that of traditional user–item pairwise data. To investigate how different methods deal with varying sparseness, we further remove a certain ratio of the Amazon training set, so that the ratio of training triplets to the whole dataset ranges from 30% to 70%, while the testing set remains untouched. For comparison with our BPER and BPER+, we include the most competitive baseline PITF. Figure 5 shows the ranking performance of the three methods w.r.t. varying sparseness. The ranking results are quite consistent on the four metrics (i.e., NDCG, Precision, Recall, and F1). Moreover, with the increase of the amount of training triplets, the performance of all three methods goes up linearly. Particularly, the performance gap between our BPER/BPER+ and PITF is quite large, especially when the ratio of training data is small (e.g., 30%). These observations demonstrate our methods’ better capability in mitigating the data sparsity issue, and hence support the rationale of our solution that converts triplets into two groups of binary relations.


Fig. 5. Ranking performance of three TF-based methods w.r.t. varying sparseness of training data on Amazon dataset.

6.3 Qualitative Case Study and User Study

To better understand how explanation ranking works, we first present a case study comparing our method BPER and the most effective baseline PITF on Amazon Movies & TV dataset in Table 5. The two cases in the table respectively correspond to recommendation and disrecommendation. In the first case (i.e., recommendation), there are three ground-truth explanations, praising the movie’s “special effects,” “story,” and overall quality. Generally speaking, the top-5 explanations resulting from both BPER and PITF are positive, and relevant to the ground-truth, because the two methods are both effective in terms of explanation ranking. However, since PITF’s ranking ability is relatively weaker than our BPER, its explanations miss the key feature “story” that the user also cares about.

Table 5.
Recommendation
Ground-truth | BPER | PITF
Special effects | Special effects | Great special effects
Great story | Good acting | Great visuals
Wonderful movie | This is a great movie | Great effects
 | Great story | Special effects
 | Great special effects | Good movie
Disrecommendation
Ground-truth | BPER | PITF
The acting is terrible | The acting is terrible | Good action movie
 | The acting is bad | Low budget
 | The acting was horrible | Nothing special
 | It’s not funny | The acting is poor
 | Bad dialogue | The acting is bad
  • The ground-truth explanations are unordered. Matched explanations are emphasized in italic font.

Table 5. Top-5 Explanations Selected by BPER and PITF for Two Given User–Item Pairs, Corresponding to Recommendation and Disrecommendation, on Amazon Movies & TV Dataset


In the second case (i.e., disrecommendation), the ground-truth explanation is a negative comment about the target movie’s “acting.” Although the top explanations made by both BPER and PITF contain negative opinions regarding this aspect, their ranking positions are quite different (i.e., top-3 for our BPER vs. bottom-2 for PITF). Moreover, we notice that for this disrecommendation, PITF places a positive explanation in the first position, i.e., “good action movie,” which not only contradicts the other two explanations, i.e., “the acting is poor/bad,” but also mismatches the disrecommendation goal. Again, this showcases our model’s effectiveness for explanation ranking.

We further conduct a small scale user study to investigate real people’s perception toward the top ranked explanations. Specifically, we still compare our BPER with PITF on Amazon Movies & TV dataset. We prepared 10 different cases and hired college students to do the evaluation. In each case, we provide the movie’s title and the ground-truth explanations, and ask the participants to select one explanation list that is semantically closer to the ground-truth. There are two randomly shuffled options returned by BPER and PITF, respectively. A case is valid only when at least two participants select the same option. The evaluation results are shown in Figure 6. We can see that on 60% of cases our BPER’s explanations are closer to the ground-truth than PITF’s, which is quite consistent with their explanation ranking performance.


Fig. 6. Result of user study on explanations returned by two methods on Amazon Movies & TV dataset.

6.4 Effect of Joint-Ranking

We perform joint-ranking for three TF-based models, i.e., BPER-J, CD-J, and PITF-J. Because of the consistency in the experimental results on different datasets, we only show results on Amazon and TripAdvisor. In Figure 7, we study the effect of the parameter \(\alpha\) on both explanation ranking and item ranking in terms of F1 (results on the other three metrics are consistent). In each sub-figure, the green dotted line represents the performance of the explanation ranking task without joint-ranking, whose value is taken from Table 4. As we can see, all the points on the explanation curve (in red) are above this line when \(\alpha\) is greater than 0, suggesting that the explanation task benefits from the recommendation task under the joint-ranking framework. In particular, the explanation performance of CD-J improves dramatically under the joint-ranking framework, since its recommendation task suffers less from the data sparsity issue than the explanation task, as discussed in Section 4.2. This in turn helps to better rank the explanations. Meanwhile, for the recommendation task, all the three models degenerate to BPR when \(\alpha\) is set to 0. Therefore, on the recommendation curves (in blue), any point whose value is greater than that of the starting point also benefits from the explanation task. All these observations show the effectiveness of our joint-ranking framework in terms of enabling the two tasks to benefit from each other.


Fig. 7. The effect of \(\alpha\) in three TF-based methods with joint-ranking on two datasets. Exp and Rec respectively denote the Explanation and Recommendation tasks. F1@10 for Rec is linearly scaled for better visualization.

In Table 6, we make a self-comparison of the three methods in terms of NDCG and F1 (the other two metrics are similar). In this table, “Non-joint-ranking” corresponds to each model’s performance with regard to explanation or recommendation when the two tasks are individually learned. In other words, the explanation performance is taken from Table 4, and the recommendation performance is evaluated when \(\alpha = 0\). “Best Exp” and “Best Rec” denote the best performance of each method on the explanation task and the recommendation task, respectively, under the joint-ranking framework. As we can see, when the recommendation performance is the best for all the models with joint-ranking, the explanation performance is always improved. Although a small amount of recommendation accuracy is sacrificed when the explanation task reaches its best performance, we can always find points where both of the two tasks are improved, e.g., on the top left of Figure 7 when \(\alpha\) is in the range of 0.1 to 0.6 for BPER-J on Amazon. This again demonstrates our joint-ranking framework’s capability in finding good solutions for both tasks.

Table 6.
Setting | Amazon Exp NDCG (%) | Amazon Exp F1 (%) | Amazon Rec NDCG (‰) | Amazon Rec F1 (‰) | TripAdvisor Exp NDCG (%) | TripAdvisor Exp F1 (%) | TripAdvisor Rec NDCG (‰) | TripAdvisor Rec F1 (‰)
BPER-J
Non-joint-ranking | 2.6 | 3.4 | 6.6 | 8.1 | 1.4 | 2.0 | 5.3 | 7.1
Joint-ranking (Best Exp) | 3.3 \(\uparrow\) | 4.6 \(\uparrow\) | 5.7 \(\downarrow\) | 7.1 \(\downarrow\) | 1.6 \(\uparrow\) | 2.4 \(\uparrow\) | 5.0 \(\downarrow\) | 6.4 \(\downarrow\)
Joint-ranking (Best Rec) | 2.6 \(\updownarrow\) | 3.5 \(\uparrow\) | 7.1 \(\uparrow\) | 8.7 \(\uparrow\) | 1.5 \(\uparrow\) | 2.1 \(\uparrow\) | 6.3 \(\uparrow\) | 8.0 \(\uparrow\)
Improvement (%) | 26.9 | 35.3 | 7.6 | 7.4 | 14.3 | 20.0 | 18.9 | 11.3
CD-J
Non-joint-ranking | 0.0 | 0.0 | 6.5 | 7.9 | 0.0 | 0.0 | 4.5 | 4.8
Joint-ranking (Best Exp) | 2.6 \(\uparrow\) | 3.7 \(\uparrow\) | 5.5 \(\downarrow\) | 6.7 \(\downarrow\) | 1.7 \(\uparrow\) | 2.4 \(\uparrow\) | 4.6 \(\uparrow\) | 5.2 \(\uparrow\)
Joint-ranking (Best Rec) | 1.9 \(\uparrow\) | 2.9 \(\uparrow\) | 6.8 \(\uparrow\) | 8.2 \(\uparrow\) | 9.6 \(\uparrow\) | 1.5 \(\uparrow\) | 4.9 \(\uparrow\) | 5.6 \(\uparrow\)
Improvement (%) | Inf | Inf | 4.6 | 3.8 | Inf | Inf | 8.9 | 16.7
PITF-J
Non-joint-ranking | 2.4 | 3.2 | 6.5 | 7.7 | 1.2 | 1.8 | 4.3 | 4.7
Joint-ranking (Best Exp) | 3.0 \(\uparrow\) | 4.2 \(\uparrow\) | 6.4 \(\downarrow\) | 8.0 \(\uparrow\) | 2.0 \(\uparrow\) | 2.9 \(\uparrow\) | 6.0 \(\uparrow\) | 7.6 \(\uparrow\)
Joint-ranking (Best Rec) | 2.8 \(\uparrow\) | 3.7 \(\uparrow\) | 7.1 \(\uparrow\) | 8.5 \(\uparrow\) | 2.0 \(\uparrow\) | 2.8 \(\uparrow\) | 7.0 \(\uparrow\) | 8.9 \(\uparrow\)
Improvement (%) | 25.0 | 31.3 | 9.2 | 10.4 | 66.7 | 61.1 | 62.8 | 89.4
  • Top-10 results are evaluated for both explanation (Exp) and recommendation (Rec) tasks. The improvements are made by the best performance of each task under joint-ranking over that without it (i.e., in this case the two tasks are separately learned).

Table 6. Self-comparison of Three TF-based Methods on Two Datasets with and without Joint-ranking in Terms of NDCG and F1

  • Top-10 results are evaluated for both explanation (Exp) and recommendation (Rec) tasks. The improvements are made by the best performance of each task under joint-ranking over that without it (i.e., in this case the two tasks are separately learned).
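For reference, the “Improvement (%)” row in Table 6 is the relative gain of the best joint-ranking score over the corresponding non-joint baseline, reported as Inf when the baseline is zero (as for CD-J’s explanation task). A minimal sketch, with an illustrative function name of our own:

def relative_improvement(joint_best, non_joint):
    """Relative improvement (%) of the best joint-ranking score over the
    non-joint baseline; returns inf when the baseline is zero, matching the
    'Inf' entries for CD-J in Table 6."""
    if non_joint == 0:
        return float("inf")
    return 100.0 * (joint_best - non_joint) / non_joint

# Example with the BPER-J Amazon explanation NDCG figures from Table 6:
# best joint NDCG = 3.3%, non-joint NDCG = 2.6%  ->  ~26.9% improvement.
print(f"{relative_improvement(3.3, 2.6):.1f}%")   # 26.9%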


7 CONCLUSION AND FUTURE WORK

To the best of our knowledge, we are the first to leverage standard offline metrics to evaluate explainability for explainable recommendation. We achieve this goal by formulating the explanation problem as a ranking task. With this quantitative measure of explainability, we design an item–explanation joint-ranking framework that can improve the performance of both the recommendation and explanation tasks. To enable such joint-ranking, we develop two effective models that address the data sparsity issue and validate them on three large datasets.

As future work, we are interested in considering the relationships (such as coherence [24] and diversity) between suggested explanations to further improve explainability. We also plan to conduct experiments in real-world systems to validate whether recommendations and their associated explanations, as produced by the joint-ranking framework, could influence users’ behavior, e.g., clicking and purchasing. In addition, the joint-ranking framework in this article aims to improve recommendation performance by providing explanations; in the future, we will also consider improving other objectives based on explanations, such as recommendation serendipity [8] and fairness [41].


REFERENCES

[1] Balog Krisztian and Radlinski Filip. 2020. Measuring recommendation explanation quality: The conflicting goals of explanations. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 329–338.
[2] Bhargava Preeti, Phan Thomas, Zhou Jiayu, and Lee Juhan. 2015. Who, what, when, and where: Multi-dimensional collaborative recommendations using tensor factorization on sparse user-generated data. In Proceedings of the 24th International Conference on World Wide Web. 130–140.
[3] Carroll J. Douglas and Chang Jih-Jie. 1970. Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition. Psychometrika 35, 3 (1970), 283–319.
[4] Catherine Rose and Cohen William. 2017. TransNets: Learning to transform for recommendation. In Proceedings of the 11th ACM Conference on Recommender Systems. 288–296.
[5] Chen Chong, Zhang Min, Liu Yiqun, and Ma Shaoping. 2018. Neural attentional rating regression with review-level explanations. In Proceedings of the World Wide Web Conference. 1583–1592.
[6] Chen Hanxiong, Chen Xu, Shi Shaoyun, and Zhang Yongfeng. 2019. Generate natural language explanations for recommendation. In Proceedings of the SIGIR’19 Workshop on ExplainAble Recommendation and Search. ACM.
[7] Chen Hongxu, Li Yicong, Sun Xiangguo, Xu Guandong, and Yin Hongzhi. 2021. Temporal meta-path guided explainable recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 1056–1064.
[8] Chen Li, Yang Yonghua, Wang Ningxia, Yang Keping, and Yuan Quan. 2019. How serendipity improves user satisfaction with recommendations? A large-scale user evaluation. In Proceedings of the World Wide Web Conference. 240–250.
[9] Chen Xu, Chen Hanxiong, Xu Hongteng, Zhang Yongfeng, Cao Yixin, Qin Zheng, and Zha Hongyuan. 2019. Personalized fashion recommendation with visual explanations based on multimodal attention network: Towards visually explainable recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 765–774.
[10] Chen Xu, Qin Zheng, Zhang Yongfeng, and Xu Tao. 2016. Learning to rank features for recommendation over multiple categories. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 305–314.
[11] Chen Xu, Zhang Yongfeng, and Qin Zheng. 2019. Dynamic explainable recommendation based on neural attentive models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 53–60.
[12] Chen Zhongxia, Wang Xiting, Xie Xing, Wu Tong, Bu Guoqing, Wang Yining, and Chen Enhong. 2019. Co-attentive multi-task learning for explainable recommendation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’19). 2137–2143.
[13] Lathauwer Lieven De, Moor Bart De, and Vandewalle Joos. 2000. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. 21, 4 (2000), 1253–1278.
[14] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics.
[15] Gao Jingyue, Wang Xiting, Wang Yasha, and Xie Xing. 2019. Explainable recommendation through attentive multi-view learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3622–3629.
[16] Ghazimatin Azin, Pramanik Soumajit, Roy Rishiraj Saha, and Weikum Gerhard. 2021. ELIXIR: Learning from user feedback on explanations to improve recommender models. In Proceedings of the Web Conference 2021. 3850–3860.
[17] He Xiangnan, Chen Tao, Kan Min-Yen, and Chen Xiao. 2015. TriRank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 1661–1670.
[18] Herlocker Jonathan L., Konstan Joseph A., and Riedl John. 2000. Explaining collaborative filtering recommendations. In Proceedings of the ACM Conference on Computer Supported Cooperative Work. 241–250.
[19] Huang Jizhou, Zhang Wei, Zhao Shiqi, Ding Shiqiang, and Wang Haifeng. 2017. Learning to explain entity relationships by pairwise ranking with convolutional neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’17). 4018–4025.
[20] Huang Yafan, Zhao Feng, Gui Xiangyu, and Jin Hai. 2021. Path-enhanced explainable recommendation with knowledge graphs. World Wide Web 24, 5 (2021), 1769–1789.
[21] Ioannidis Vassilis N., Zamzam Ahmed S., Giannakis Georgios B., and Sidiropoulos Nicholas D. 2019. Coupled graphs and tensor factorization for recommender systems and community detection. IEEE Trans. Knowl. Data Eng. 33, 3 (2019), 909–920.
[22] Jäschke Robert, Marinho Leandro, Hotho Andreas, Schmidt-Thieme Lars, and Stumme Gerd. 2007. Tag recommendations in folksonomies. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 506–514.
[23] Koren Yehuda, Bell Robert, and Volinsky Chris. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[24] Le Trung-Hoang, Lauw Hady W., and Bessiere C. 2020. Synthesizing aspect-driven recommendation explanations from reviews. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI’20). 2427–2434.
[25] Li Lei, Chen Li, and Dong Ruihai. 2021. CAESAR: Context-aware explanation based on supervised attention for service recommendations. J. Intell. Inf. Syst. 57, 1 (2021), 147–170.
[26] Li Lei, Zhang Yongfeng, and Chen Li. 2020. Generate neural template explanations for recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 755–764.
[27] Li Lei, Zhang Yongfeng, and Chen Li. 2021. EXTRA: Explanation ranking datasets for explainable recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[28] Li Lei, Zhang Yongfeng, and Chen Li. 2021. Personalized transformer for explainable recommendation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics.
[29] Li Piji, Wang Zihao, Ren Zhaochun, Bing Lidong, and Lam Wai. 2017. Neural rating regression with abstractive tips generation for recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 345–354.
[30] Lin Chin-Yew. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81.
[31] Liu Tie-Yan. 2011. Learning to Rank for Information Retrieval. Springer Science & Business Media.
[32] Lu Yichao, Dong Ruihai, and Smyth Barry. 2018. Coevolutionary recommendation model: Mutual learning between ratings and reviews. In Proceedings of the World Wide Web Conference. 773–782.
[33] McInerney James, Lacker Benjamin, Hansen Samantha, Higley Karl, Bouchard Hugues, Gruson Alois, and Mehrotra Rishabh. 2018. Explore, exploit, and explain: Personalizing explainable recommendations with bandits. In Proceedings of the 12th ACM Conference on Recommender Systems. 31–39.
[34] Mnih Andriy and Salakhutdinov Russ R. 2007. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems. 1257–1264.
[35] Papineni Kishore, Roukos Salim, Ward Todd, and Zhu Wei-Jing. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[36] Rendle Steffen, Freudenthaler Christoph, Gantner Zeno, and Schmidt-Thieme Lars. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.
[37] Rendle Steffen and Schmidt-Thieme Lars. 2010. Pairwise interaction tensor factorization for personalized tag recommendation. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. 81–90.
[38] Resnick Paul, Iacovou Neophytos, Suchak Mitesh, Bergstrom Peter, and Riedl John. 1994. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work. 175–186.
[39] Sarwar Badrul, Karypis George, Konstan Joseph, and Riedl John. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web. 285–295.
[40] Seo Sungyong, Huang Jing, Yang Hao, and Liu Yan. 2017. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In Proceedings of the 11th ACM Conference on Recommender Systems. 297–305.
[41] Singh Ashudeep and Joachims Thorsten. 2018. Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2219–2228.
[42] Tintarev Nava and Masthoff Judith. 2015. Explaining recommendations: Design and evaluation. In Recommender Systems Handbook (2nd ed.), Shapira Bracha (Ed.). Springer, Chapter 10, 353–382.
[43] Tucker Ledyard R. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika 31, 3 (1966), 279–311.
[44] Voskarides Nikos, Meij Edgar, Tsagkias Manos, Rijke Maarten De, and Weerkamp Wouter. 2015. Learning to explain entity relationships in knowledge graphs. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 564–574.
[45] Wang Nan, Wang Hongning, Jia Yiling, and Yin Yue. 2018. Explainable recommendation via multi-task learning in opinionated text data. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’18). ACM, 165–174.
[46] Wang Peng, Cai Renqin, and Wang Hongning. 2022. Graph-based extractive explainer for recommendations. In Proceedings of the ACM Web Conference 2022. 2163–2171.
[47] Wang Xiting, Chen Yiru, Yang Jie, Wu Le, Wu Zhengtao, and Xie Xing. 2018. A reinforcement learning framework for explainable recommendation. In Proceedings of the IEEE International Conference on Data Mining (ICDM). IEEE, 587–596.
[48] Xian Yikun, Fu Zuohui, Muthukrishnan Shan, Melo Gerard De, and Zhang Yongfeng. 2019. Reinforcement knowledge graph reasoning for explainable recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 285–294.
[49] Xu Xuhai, Awadallah Ahmed Hassan, Dumais Susan T., Omar Farheen, Popp Bogdan, Rounthwaite Robert, and Jahanbakhsh Farnaz. 2020. Understanding user behavior for document recommendation. In Proceedings of the Web Conference 2020. 3012–3018.
[50] Yang Aobo, Wang Nan, Cai Renqin, Deng Hongbo, and Wang Hongning. 2022. Comparative explanations of recommendations. In Proceedings of the ACM Web Conference 2022. 3113–3123.
[51] Yang Aobo, Wang Nan, Deng Hongbo, and Wang Hongning. 2021. Explanation as a defense of recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 1029–1037.
[52] Zhang Yongfeng and Chen Xu. 2020. Explainable recommendation: A survey and new perspectives. Found. Trends Inf. Retriev. 14, 1 (2020), 1–101.
[53] Zhang Yongfeng, Lai Guokun, Zhang Min, Zhang Yi, Liu Yiqun, and Ma Shaoping. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. 83–92.
