Triple Dual Learning for Opinion-based Explainable Recommendation

Recently, with the aim of enhancing the trustworthiness of recommender systems, explainable recommendation has attracted much attention from the research community. Intuitively, users' opinions toward different aspects of an item determine their ratings (i.e., users' preferences) for the item. Therefore, rating prediction from the perspective of opinions can realize personalized explanations at the level of item aspects and user preferences. However, developing an opinion-based explainable recommendation poses several challenges: (1) the complicated relationship between users' opinions and ratings, and (2) the difficulty of predicting potential (i.e., unseen) user-item opinions because of the sparsity of opinion information. To tackle these challenges, we propose an overall preference-aware, opinion-based explainable rating prediction model that jointly models the multiple observations of a user-item interaction (i.e., review, opinion, and rating). To alleviate the sparsity problem and improve the effectiveness of opinion prediction, we further propose a triple dual learning framework with a novel triple dual constraint. Finally, experiments on three popular datasets demonstrate the effectiveness and strong explanation performance of our framework.


INTRODUCTION
Deep learning techniques have achieved remarkable success in recommender systems [3,35,53,56] by making accurate predictions of user preferences for unseen items. However, the large-scale signal passing in deep learning models also gives rise to the black-box problem, i.e., the reasons why a model recommends an item are unclear to the user. This lack of explanation greatly limits the persuasiveness and trustworthiness of recommender systems [1,14,36]. Therefore, explainable recommender systems [54], which give specific and intuitive explanations while providing high-quality recommendation results, have attracted the attention of the research community.
In practice, a user's rating on an item can be attributed to the user's multi-viewed opinions toward various properties of the item. As shown in Figure 1, the first user rates a movie highly because he/she feels its story is sweet, while the third user gives a movie a low rating due to its terrible acting. Thereby, rating prediction through modeling users' potential opinions of an item enables fine-grained explanations as recommendation evidence. With the intuition that user reviews contain rich information on user opinions and their latent relationship with ratings, efforts have been devoted to predicting review information as explanations for recommendation [25,27,40]. However, existing works model review generation and rating prediction as relatively independent tasks, which provide a heuristic explanation of general user preference but overlook the quantitative relationships between specific opinion terms and the rating. Moreover, reviews contain abundant noise that is not relevant to user preference (e.g., "I wondered just how old Fred Astaire was..." as shown in Figure 1). Thus, models tend to generate general words instead of the detailed opinion information that is truly effective for explanation. To the best of our knowledge, there is still no explainable recommendation method that can provide a quantitative explanation of the rating influence of multi-aspect opinions. Indeed, opinion-based recommendation explanation is a nontrivial task that faces the following challenges: (1) The relationship between a user's opinions and the rating is complex. Specifically, with various confounders for the rating, a similar opinion may have different influences on ratings under different contexts. For example, the opinion "interesting character" may have a greater impact on the ratings of comedies than on those of horror movies. Moreover, a user usually holds multiple opinions toward different aspects of an item (e.g., a user may be fond of the story of a movie while thinking its acting is terrible, as shown in Figure 1), and these opinions have complicated nonlinear combination effects on the rating.
(2) Predicting a user's potential opinions of an unseen item is difficult. Specifically, the candidate pool of opinions is large, and a user may mention only a fraction of their opinions about an item. This leads to the sparsity of opinion data, which in turn makes it difficult to train an opinion prediction model.
To tackle these challenges, we propose a novel framework that jointly models the interactive relationships among review, opinion, and rating to enable explainable recommendation. Specifically, we propose an overall preference-aware, opinion-based explainable rating prediction model. The model decouples the rating into the preference score of each involved opinion by modeling the overall preference-aware importance of each opinion. In particular, we utilize overall preferences extracted from review information to assist in capturing the importance of individual opinions. Moreover, to alleviate the sparsity problem and improve the effectiveness of review and opinion prediction, we propose a triple dual learning framework that jointly models the rating prediction, review generation, and opinion ranking tasks. The triple dual structure captures the inter-dependencies among the three tasks from a probabilistic perspective.
To sum up, our contributions are listed below:
- To the best of our knowledge, we are the first to propose a triple dual learning framework that jointly models opinion, review, and rating in recommender systems. In particular, we propose a triple dual probabilistic constraint based on the interactive conditional probabilities of the three tasks.
- We propose a novel overall preference-aware, opinion-based explainable rating prediction model, which has an intrinsically explainable network structure that decouples the rating into the contribution of each involved opinion.
- Comprehensive experiments on three publicly available recommendation datasets demonstrate the effectiveness of our framework on the rating, review, and opinion prediction tasks. In addition, experimental results show that the model can well explain the quantitative contribution of a user's opinions to the predicted rating.
Overview. The rest of this article is organized as follows: in Section 2, we briefly review the literature related to our study; in Section 3, we introduce the preliminaries, formally define our problem, and present an overview of the Triple dual learning framework for Explainable Opinion-based Rating prediction (TriEOR); in Section 4, we detail the overall preference-aware opinion-based rating prediction model; our triple dual learning framework is then introduced in Section 5; the performance of our models is evaluated in Section 6, with further discussions and case studies; finally, we conclude this article in Section 7.

Deep Learning-based Recommendation
With the rapid advancement of deep learning technology, deep neural network-based recommendation models have become increasingly popular, as they can significantly enhance the accuracy of recommendations [3,4,25,35,53]. For example, NeuMF [13] integrated traditional matrix factorization with a Multi-Layer Perceptron (MLP) to extract both low-order and high-order features for accurate rating prediction. Moreover, some recent works [8,15,37,55,57] have attempted to leverage sequence-based techniques, such as Recurrent Neural Networks (RNNs) [15,58] and attention mechanisms [51,57], to capture users' evolving interests from their historical behaviors. For instance, DIN [57] applied an attention mechanism between the user's historical behavior sequence and the current target item to model changing interests. In addition, most recommendation data imply graph structures [46]. For example, interaction data naturally form a bipartite graph between user and item nodes, where observed interactions are denoted by edges between the corresponding nodes. Given the remarkable ability of Graph Neural Networks (GNNs) in processing graph data, many works [12,32,46,52] have leveraged graph embedding techniques to learn the relations among interacting nodes, thus improving recommendation accuracy. For example, the Graph Convolutional Network (GCN) [20] adopted a localized filter to aggregate information from neighboring nodes, while LightGCN [12] simplified GCN by eliminating unnecessary feature transformation and non-linear activation components, relying solely on neighborhood aggregation to capture the collaborative filtering effect. Additionally, numerous studies have explored the integration of auxiliary information to further improve recommendation accuracy, such as user/item attributes [42], user reviews [3,25,56], knowledge graphs [10,43], and social networks [16]. For example, DeepCoNN [56] employed two parallel CNN-based content embedding modules that take the historical reviews of users and items as input, thus capturing the rich semantic information in user reviews for recommendation. While existing models typically rely on users' historical feedback or auxiliary information, few have attempted to predict users' ratings of items from the perspective of their opinions.

Review-based Explanation for Recommendation
With the aim of improving the transparency, persuasiveness, and trustworthiness of recommendations, explainable recommender systems [54] have attracted great attention from the research community [5,9,23,25,27,39,48]. Due to the rich opinion information in reviews, great efforts have been devoted to generating whole or partial reviews as explanations. Many studies [23,25,27,38,40,50] utilized natural language techniques such as LSTM, GRU, or Transformer [41] to generate reviews. However, the generated text explanations face several limitations: first, the information in a generated sentence may not always be meaningful for explanation; second, the overall traits (e.g., fluency) of the generated sentence can still largely hinder user understanding. Thus, some attempts [11,29,44,47,49,54] were devoted to extracting finer-grained aspect-level information from reviews to avoid the aforementioned generation issues. For example, EFM [54] added aspects extracted from reviews to the traditional user-item rating matrix and then applied sentiment analysis in the factorization model to improve the performance of explanation and recommendation. Besides, MTER [44] jointly modeled the rating prediction task and the aspect-level explanation task through a joint tensor factorization framework. Since aspect information was generally annotated manually before, Ex3 [49] developed an explainable aspect-aware item-set recommendation algorithm that learns important aspects automatically from the user's historical behaviors. In most of the above models, the recommendation and explanation tasks were loosely connected, e.g., by sharing user and item embeddings [25,40]. Some models [23,27,38,50] further attempted to explore the implicit relationships between recommendation and explanation. For example, in References [23,27,38], the output of the rating prediction task served as the input of the explanation task. DualPC [38] further modeled the rating prediction and explanation tasks in dual forms, with the input of one task being exactly the output of the other. However, few attempts have been made to model the explicit relationship between ratings and users' preferences for different item aspects (i.e., multi-aspect opinions). In contrast to existing works, we propose a triple dual learning framework for an explainable opinion-based rating prediction model, which provides a fine-grained and reliable form of explanation for the predicted rating.

PRELIMINARIES

Problem Formulation
In our work, we use U, I, R, C, S to denote the universal sets of users, items, ratings, reviews, and opinions, respectively. Each record in the training dataset X can be formulated as X = (u, i, r, C, S), where r ∈ R, C ∈ C, and S ⊂ S denote the rating, review, and opinion set for the given user-item pair of u ∈ U and i ∈ I, respectively. Specifically, the textual review C consists of T words (i.e., [c_1, c_2, ..., c_T]). The opinion set S consists of M opinions, denoted as {s_1, s_2, ..., s_M}, where each opinion is a short term that the user posts to express his/her preference for a certain aspect of the item. We follow existing work [24] to extract opinions from the review.
Based on the aforementioned concepts and notations, we formalize the opinion-based explainable recommendation problem as follows: given the training dataset X, the goal is to train a model that predicts appropriate opinions and the rating for each (u, i) input pair, and quantifies the contribution of each predicted opinion to the rating. To facilitate illustration, the important mathematical notations used in this article are listed in Table 1.
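The record structure X = (u, i, r, C, S) defined above can be sketched as a small container; the class and field names below are illustrative, not from the paper's code:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InteractionRecord:
    """One training record X = (u, i, r, C, S)."""
    user_id: int         # u
    item_id: int         # i
    rating: float        # r, on a 1-5 scale
    review: List[str]    # C = [c_1, ..., c_T], the tokenized review
    opinions: List[str]  # S = {s_1, ..., s_M}, short opinion terms

record = InteractionRecord(
    user_id=42, item_id=7, rating=4.0,
    review=["the", "story", "is", "sweet"],
    opinions=["sweet story", "great acting"],
)
```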

Framework Overview
With the aim of providing consistent attribution explanations for ratings from multi-aspect opinions, we propose TriEOR. In TriEOR, we design an overall preference-aware opinion-based rating prediction task to quantify the influence of user opinions on the rating, which is illustrated in detail in Section 4. Formally, the overall preference-aware opinion-based rating prediction task can be defined as f_r|cs: u, i, C, S → r, which approximates the conditional probability p(r|CS; θ_r|cs) with parameters θ_r|cs. Please note that we omit the known u, i conditions in all probability formulas for simplicity. To alleviate the sparsity problem and improve the effectiveness of review and opinion prediction, we further propose a triple dual learning framework, which contains review generation and opinion ranking tasks in addition to the rating prediction task. In detail, the review generation task based on the rating and opinion information can be formulated as f_c|rs: u, i, r, S → C, which approximates p(C|rS; θ_c|rs) with parameters θ_c|rs. Additionally, the opinion ranking task based on the rating and review information can be defined as f_s|rc: u, i, r, C → S, which estimates p(S|rC; θ_s|rc) with parameters θ_s|rc. In particular, we propose a triple dual constraint to model the probabilistic inter-dependencies among the three tasks. The triple dual learning framework with this constraint is elaborated in Section 5. Please note that, where no confusion arises, we omit the parameter terms in the probability formulas (e.g., p(r|CS; θ_r|cs) is denoted as p(r|CS)).

OVERALL PREFERENCE-AWARE OPINION-BASED RATING PREDICTION

Intuitively, a user's multi-aspect opinions jointly determine the rating value. Therefore, we decouple the rating value into the preference score of every involved opinion in the opinion set by modeling the importance value of the opinions. Furthermore, to more accurately model the importance of each opinion, we propose to extract the overall preference expressed in the review to assist in capturing the overall preference-aware importance of individual opinions. As a result, the predicted rating is modeled as r̂_cs = f_r|cs(u, i, C, S), which quantifies the contribution of individual opinions to the rating. In this section, we primarily focus on how to estimate the preference score of each opinion and determine the importance of each opinion in the opinion set to the rating. The detailed structure of this model, except for the input layer, is shown in Figure 2.

Input Layer
In the f_r|cs module, the opinion and review information as well as personalized user and item information are known. Below, we describe how to learn the representations of these input data, which are shared throughout the whole framework:
- User/Item. The user/item representation includes the user/item embedding and the information of the user's/item's historical reviews: after turning the user ID and item ID into the embeddings E_u and E_i, the historical reviews of the user and the item are further encoded to obtain the review-aware representations V_u and V_i.
- Review. Many review generation models, such as NETE [23] and DualPC [38], employ GRU or LSTM to process review data. As GRU is a variant of LSTM and is often faster to train, we utilize GRU to process the review data. Specifically, with the fusion of the user and item representations E_u ⊕ E_i (⊕ denotes the concatenation operation) as the initial state, a Bidirectional GRU (BiGRU) [6] is adopted to capture the bidirectional information of the review, where E_t^c ∈ R^{d_c} is the embedding of the review word c_t (d_c denotes the dimension of the word embedding vectors), and →h_t^c and ←h_t^c are the forward and backward outputs of the BiGRU.
- Opinion Set. Each opinion s_j in the opinion set {s_1, s_2, ..., s_M} is a short sentence. Following BPER+ [26], we leverage BERT [18], a well-known pre-trained language model, to extract the textual features of each opinion. To align with the dimension of the latent factors in our model, we further apply a linear layer to this feature vector, resulting in s_j^e for j = 1, 2, ..., M.
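The bidirectional review encoding described above can be sketched in NumPy. This is a toy GRU cell (biases omitted, random toy weights, and a zero initial state standing in for the learned map of the fused E_u ⊕ E_i); it only illustrates the forward/backward passes and the concatenation of their outputs, not the trained model:

```python
import numpy as np

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update (biases omitted for brevity)."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(Wz @ x + Uz @ h)                  # update gate
    r = sig(Wr @ x + Ur @ h)                  # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_cand

def bigru_encode(E, h0, params_f, params_b):
    """Run one GRU forward and one backward over the word embeddings
    E (T x d_c), both starting from h0, and concatenate the per-step
    forward/backward outputs into the review states (T x 2d)."""
    T = E.shape[0]
    hf, hb = h0.copy(), h0.copy()
    fwd, bwd = [], [None] * T
    for t in range(T):
        hf = gru_step(E[t], hf, *params_f)
        fwd.append(hf)
    for t in reversed(range(T)):
        hb = gru_step(E[t], hb, *params_b)
        bwd[t] = hb
    return np.concatenate([np.stack(fwd), np.stack(bwd)], axis=1)

rng = np.random.default_rng(0)
T, d_c, d = 5, 8, 8                            # toy sizes
make = lambda: [rng.normal(scale=0.1, size=(d, d_c if k % 2 == 0 else d))
                for k in range(6)]             # Wz, Uz, Wr, Ur, Wh, Uh
E = rng.normal(size=(T, d_c))                  # word embeddings E_t^c
h0 = np.zeros(d)   # stands in for the linear map of the fused E_u || E_i
H = bigru_encode(E, h0, make(), make())        # T x 2d review states
```

In practice this would be a learned recurrent layer (e.g., a bidirectional GRU in a deep learning framework) rather than hand-rolled NumPy.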

Opinion Preference Score
Every opinion naturally reflects the user's preference for the item; for example, the opinion "Great Acting" indicates the user's fondness for the movie. Besides, there generally exist complicated nonlinear combination effects among the multiple opinions in an opinion set. We thus adopt the Multi-head Attention mechanism [41], which uses ensemble self-attention to capture these combination effects. Taking these factors into account, the preference score of each opinion is computed as follows:
- Multi-head Attention Layer. The self-attention function in this layer is formulated as Attention(Q, K, V) = SoftMax(QK^T / √d)V, where Q, K, V correspond to the results of three linear transformations of s^e = [s_1^e, s_2^e, ..., s_M^e] ∈ R^{M×d}. This equation involves the MatMul, Scale, SoftMax, and MatMul operations shown in Figure 2. After a residual connection, we obtain the ensemble feature of the opinion set for each s_j, denoted x_j^{r|cs}.
- MLP Layer. With x_j^{r|cs} as the input of an MLP, we obtain v_j^{r|cs}, the preference score of opinion s_j. This score considers both the emotional preference of the opinion s_j itself and the influence of the other opinions in the opinion set on its preference score.
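A minimal single-head NumPy sketch of this self-attention step over the opinion embeddings, including the residual connection; the weight matrices here are random stand-ins for the learned linear transformations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(S_e, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the M opinion embeddings
    s^e in R^{M x d}: SoftMax(Q K^T / sqrt(d)) V, plus a residual
    connection, yielding the ensemble feature x_j of each opinion."""
    Q, K, V = S_e @ Wq, S_e @ Wk, S_e @ Wv     # three linear transformations
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))          # M x M combination weights
    return S_e + A @ V                         # residual connection

rng = np.random.default_rng(1)
M, d = 4, 16
S_e = rng.normal(size=(M, d))                  # toy opinion embeddings
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
X = self_attention(S_e, Wq, Wk, Wv)            # ensemble opinion features
```

The multi-head variant runs several such heads in parallel on split projections and concatenates their outputs.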

Overall Preference-aware Importance
The importance of an opinion indicates the extent to which the opinion affects the overall rating, and it can be influenced by multiple factors. For example, given the review "The acting is good but the story is poor, I don't like it," which shows an overall negative preference, the importance of the second opinion in the opinion set {"The acting is good," "the story is poor"} may be higher than that of the first opinion. Besides, even the same opinion can have different effects on the rating for different users and items. For example, "low price" may be more important for user A, who pays attention to item price, than for user B, who cares about item quality.
Therefore, to capture the personalized overall preference-aware importance feature of the opinion s_j, we fuse the representations of the user, item, review, and opinion (i.e., V_u ⊕ V_i ⊕ h_c ⊕ s_j^e) as the input of an MLP and obtain the importance feature m̃_j^{r|cs}. By feeding the importance features of all opinions in the opinion set (i.e., m̃_1^{r|cs}, m̃_2^{r|cs}, ..., m̃_M^{r|cs}) to a SoftMax layer, we obtain the final importance vector [m_1^{r|cs}, m_2^{r|cs}, ..., m_M^{r|cs}]. Eventually, with the overall preference-aware importance vector m^{r|cs} and the preference score vector v^{r|cs} of the opinion set, the rating can be predicted as r̂_cs = Σ_{j=1}^{M} m_j^{r|cs} v_j^{r|cs}. Thereby, the contribution of each involved opinion to the user's rating is explicitly and quantitatively modeled. The corresponding loss term L_{r|cs} is defined in Equation (13).
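The importance-weighted decomposition can be sketched as follows; the numbers are toy values, and the point is that SoftMax-normalized importance weights make each opinion's contribution m_j · v_j directly readable:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predicted_rating(importance_feats, preference_scores):
    """r_hat = sum_j m_j * v_j, where the importance weights m are the
    SoftMax of the fused importance features and therefore sum to one,
    so each opinion's contribution m_j * v_j is directly interpretable."""
    m = softmax(np.asarray(importance_feats, dtype=float))
    v = np.asarray(preference_scores, dtype=float)
    return m, float(m @ v)

# toy example: two opinions; the second (negative) one is more important
m, r_hat = predicted_rating([0.5, 2.0], [4.5, 1.5])
contributions = m * np.array([4.5, 1.5])   # per-opinion share of the rating
```

Because the weights sum to one, the predicted rating always lies within the span of the individual preference scores, and the contribution vector itself serves as the explanation.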

TRIPLE DUAL LEARNING FRAMEWORK
To enable the prediction of potential opinions from the large candidate opinion pool and realize consistent modeling, we design the opinion ranking task f_s|rc and the review generation task f_c|rs. The review generation task provides additional supervision for modeling user-item interactions, including the opinion. With the input of each task being exactly the output of the remaining two tasks, these tasks, together with f_r|cs, form a triple dual structure. Therefore, we propose a triple dual learning framework, as depicted in Figure 3. Instead of simply correlating the three tasks by sharing user and item embeddings, we further propose a triple dual constraint to capture the probabilistic relations among the tasks. In Section 5.1, we describe this constraint in detail. We also give a detailed account of the probabilistic modeling of the tasks and the optimization process in Section 5.2. Moreover, since opinions, reviews, and ratings are unavailable for test data, an adaptive prediction process that uses only user and item information is described in Section 5.3.

Triple Dual Constraint
Ideally, if the f_r|cs, f_c|rs, f_s|rc tasks perform well, then the probabilistic inter-dependencies among them, as illustrated in the constraint part of Figure 3, are better satisfied. Specifically, the product of the probabilities along every path starting from "∅" and ending at "rCS" should equal the same value p(rCS) as closely as possible. This can be formulated as p(a_1) p(a_2|a_1) p(a_3|a_1 a_2) = p(rCS) for all (a_1, a_2, a_3) ∈ π(r, C, S), where π(r, C, S) represents the full permutation of {r, C, S}. Specifically, the six permutations correspond to the six paths in the constraint part of Figure 3, three of which are indicated by dotted lines. While it is nontrivial to directly model the equality of all these permutations, we can summarize them as the following three types of equations, for all (a_1, a_2, a_3) ∈ π(r, C, S):

p(a_1) p(a_2|a_1) = p(a_1 a_2),   (6)
p(a_1 a_2) p(a_3|a_1 a_2) = p(a_1 a_2 a_3),   (7)
p(a_1) p(a_2|a_1) p(a_3|a_1 a_2) = p(a_1 a_2 a_3).   (8)

Encouragingly, we find that when the constraints in Equations (6) and (7) are satisfied, the constraint in Equation (8) is also satisfied, since p(a_1) p(a_2|a_1) p(a_3|a_1 a_2) = p(a_1 a_2) p(a_3|a_1 a_2) = p(a_1 a_2 a_3). Therefore, the constraints in Equations (6) and (7) suffice to capture the probabilistic inter-dependencies among the tasks. Since it is infeasible in practice to directly incorporate the probabilistic constraints into the objective function, we further eliminate the shared terms in Equations (6) and (7), resulting in two loss terms that penalize the discrepancies between the corresponding decompositions across permutations. Eventually, the triple dual loss term is formulated as the average of these two terms over the training records, where |X| denotes the total number of records in the training dataset X.
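The constraint can be checked numerically on a toy example: for an exact joint distribution p(r, C, S), every one of the six chain-rule path products recovers the same joint, which is precisely the equality the triple dual loss pushes the three learned models toward (the array shapes below are arbitrary toy cardinalities):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
joint = rng.random((3, 4, 5))
joint /= joint.sum()   # a toy exact joint distribution p(r, C, S)

def path_product(joint, order):
    """Chain-rule product p(a1) p(a2|a1) p(a3|a1,a2) along one permutation
    (a1, a2, a3) of the axes (r, C, S); every path must recover p(r, C, S)."""
    p = np.moveaxis(joint, order, (0, 1, 2))    # reorder axes to (a1, a2, a3)
    p1 = p.sum(axis=(1, 2), keepdims=True)      # p(a1)
    p12 = p.sum(axis=2, keepdims=True)          # p(a1, a2)
    prod = p1 * (p12 / p1) * (p / p12)          # p(a1) p(a2|a1) p(a3|a1,a2)
    return np.moveaxis(prod, (0, 1, 2), order)  # back to (r, C, S) axis order

errs = [np.abs(path_product(joint, o) - joint).max()
        for o in permutations((0, 1, 2))]
```

For an exact joint the six errors are zero by the chain rule; in training, each factor is produced by a separate approximate model, so the equalities become soft constraints to be penalized.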

Probabilistic Modeling and Optimization
For clarity, we describe the tasks and their corresponding probabilistic modeling based on their predicted targets as follows:
Rating Prediction and Rating Probabilities. Since the triple dual constraint involves the marginal and conditional probabilities of the rating p(r|τ_r), ∀τ_r ⊂ {C, S}, we design a unified probability model of the rating. We first predict the rating values in the corresponding modules as:
- In f_r, a Factorization Machine (FM) [33] predicts the rating from the representations of the user and item as r̂ = FM(V_u ⊕ V_i);
- In f_r|c, the fusion of the user, item, and review representations (i.e., V_u ⊕ V_i ⊕ h_c) is fed into an MLP to obtain r̂_c;
- In f_r|s, a Multi-head Attention layer and an MLP layer similar to those in Section 4.2 estimate the preference score v_j^{r|s} of each opinion s_j, and the rating is then predicted as r̂_s = (1/M) Σ_{j=1}^{M} v_j^{r|s}.
Also, r̂_cs has been modeled in the f_r|cs task detailed in Section 4. Similar to DualPC [38], we assume the probability of each rating record r follows a Gaussian distribution [30] with the mean of the predicted rating r̂_{τ_r} and an estimated variance δ²_{τ_r}, i.e., p(r|τ_r) = N(r; r̂_{τ_r}, δ²_{τ_r}). Minimizing the negative log-likelihood of the Gaussian distribution is equivalent to minimizing the Mean Squared Error (MSE) loss.
Review Generation and Probabilities. The review generation task not only provides additional supervision for modeling user-item interactions, including the opinion, but also better bonds opinion and rating as an extra influencing factor through the interactive conditional probability modeling in the triple dual framework. Below, we introduce the unified modeling of the review generation tasks in our framework. Since GRU handles long sequences well, we adopt it to model the probabilities p(C|τ_c), ∀τ_c ⊂ {r, S}. The main difference among these modules lies in the different initial states of the corresponding GRUs. Specifically, the initial states g_0 in the f_c, f_c|r, f_c|s, f_c|rs modules are the linear transformations of the fused representations of their respective conditions. With E_t^c (the embedding of the t-th word in the review), the GRU recursively updates the current state as g_t = GRU(g_{t-1}, E_t^c) for t = 1, 2, ..., T. The states are then mapped into the vocabulary space, and a SoftMax layer models the conditional probability of each review word as SoftMax(W_c g_t), where W_c ∈ R^{|V|×d_c} is a trainable parameter and |V| denotes the size of the word vocabulary V of the review set C. Further, the probability of the review can be approximated as the product of the per-word conditional probabilities. Minimizing the negative log-likelihood of all training reviews corresponds to the Negative Log-Likelihood (NLL) loss.
Opinion Ranking and Probabilities. Intuitively, multiple factors influence the ranked opinions. For example, a high rating value or a review showing an overall positive preference may indicate that a positive opinion is more appropriate than a negative one under the same conditions. Therefore, to address the sparsity of opinion information, we take the influence of all of u, i, r, C on opinion ranking into consideration. To reduce complexity, we estimate the influence of u, i, r, C on opinions in a decoupled way. Specifically, each candidate opinion o ∈ S is fed to the BERT network and a linear layer to obtain the textual representation o^e, as in Section 4.1. Applying four sets of matrix factorization, the corresponding correlation scores are estimated with {Q_*^o, b_*^o}, four different sets of latent factors for opinions, corresponding to u, i, r, C, respectively, where ⊙ denotes the element-wise multiplication of two vectors. In Q_*^o ⊙ o^e, both the opinion ID and the opinion semantic information are taken into consideration. Further, ∀τ_s ⊂ {r, C}, let Y = τ_s ∪ {u, i}, which represents the condition set in which u, i are not omitted. The overall correlation score z_{τ_s,o} is unified as the average sum of the corresponding correlation scores, and the Top-N opinion ranking can be formulated accordingly. With a high correlation score z_{τ_s,o}, opinion o is very likely to match the ground-truth opinions. Therefore, we can assume the probability of an opinion set S based on σ(z_{τ_s,o}), where σ(·) denotes the sigmoid function, mapping the score to the range (0, 1). To optimize the Top-N ranking problem, we adopt the Bayesian Personalized Ranking (BPR) [34] loss.
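A minimal NumPy sketch of the BPR objective used for the Top-N opinion ranking; the scores are toy stand-ins for the correlation scores z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpr_loss(pos_scores, neg_scores):
    """Bayesian Personalized Ranking: for each (positive, negative) opinion
    pair, maximize sigmoid(z_pos - z_neg), i.e. rank ground-truth opinions
    above sampled negatives."""
    diff = np.asarray(pos_scores) - np.asarray(neg_scores)
    return float(-np.log(sigmoid(diff)).mean())

# a correctly ranked pair yields a much lower loss than a mis-ranked one
good = bpr_loss([3.0, 2.5], [0.5, -1.0])
bad = bpr_loss([0.5, -1.0], [3.0, 2.5])
```

Because BPR only compares score differences, it directly optimizes the ranking of opinions rather than their absolute scores, which suits the sparse-supervision setting.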
Finally, the overall loss of TriEOR is formulated as the sum of the task losses above plus the triple dual loss weighted by a discount factor λ; the detailed training procedure of TriEOR is given in Algorithm 1.

Adaptive Prediction for Test Data
In practical recommender systems, the opinion, review, and rating data of test samples are unavailable. Therefore, we propose an adaptive prediction process with only the user and item information (i.e., User/Item in Section 4.1) as input. First, inspired by DeepCoNN [56], we simulate the review representation h̃_c with TextCNN processors Γ_u and Γ_i applied to the user and item information.
With the aim of replacing h_c with h̃_c, the optimization function is defined over the discrepancy between the two representations. To model the effect of individual opinions on the rating, we first predict the opinion set and then utilize it to predict the rating. Specifically, feeding h̃_c to the f_r|c module, we first approximate the rating r̂. As the predicted rating is a floating-point number, the rating value is rounded to the nearest integer as round(r̂). Next, based on h̃_c and round(r̂), the opinion set S̃ is generated by feeding them into the opinion ranking module (i.e., f_s|rc). In other words, given the approximate rating and review information (i.e., r̂ and h̃_c), we obtain the result of the Top-N opinion ranking in Equation (18) as S̃ = {s̃_1, s̃_2, ..., s̃_N}. Finally, feeding h̃_c and S̃ to the rating prediction module (i.e., f_r|cs), and round(r̂) and S̃ to the review generation module (i.e., f_c|rs), the final predicted rating and generated review are inferred. The objective optimization function combines L_{r|c} and L_{r|cs}, calculated as in Equation (13), with L_{c|rs}, defined in Equation (16). The detailed algorithm of the adaptive prediction is shown in Algorithm 2. Eventually, TriEOR obtains S̃, r̂, and C̃ for test data and simultaneously obtains the quantitative influence of opinions on the rating in the f_r|cs module.
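The inference chain above can be sketched with hypothetical stand-in callables for the trained modules; the lambdas below are placeholders for the real networks, used only to show the data flow:

```python
def adaptive_predict(h_c_sim, f_r_c, f_s_rc, f_r_cs, f_c_rs, top_n=2):
    """Chain the modules as in the adaptive prediction: coarse rating from
    the simulated review representation, then Top-N opinions, then the
    final rating and the generated review."""
    r0 = round(f_r_c(h_c_sim))                 # round(r_hat) to an integer
    S = f_s_rc(r0, h_c_sim)[:top_n]            # Top-N opinion ranking
    return f_r_cs(h_c_sim, S), S, f_c_rs(r0, S)

# toy stubs standing in for the trained modules
r_hat, S, C_hat = adaptive_predict(
    h_c_sim=[0.1, 0.2],
    f_r_c=lambda h: 3.6,
    f_s_rc=lambda r, h: ["great acting", "sweet story", "slow pacing"],
    f_r_cs=lambda h, S: 4.0,
    f_c_rs=lambda r, S: "the acting is great and the story is sweet",
)
```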

Complexity of Triple Dual
Training. The complexity of the triple dual training mainly comes from the representation learning and the three kinds of tasks. First, in representation learning, we employ commonly used techniques such as TextCNN, Bidirectional GRU, and BERT. TextCNN is an efficient text-processing technique that can be accelerated through parallel computing and has been used in other successful models such as DeepCoNN [56], TransNet [2], and NARRE [3]. Second, the review generation task using GRU and the opinion ranking task using matrix factorization are similar to existing works [26,38], while the complexity of the rating prediction tasks is similar to that of related recommendation models [22,45] adopting the Multi-head Attention mechanism. The overall complexity of triple dual training is a linear combination of the complexities of the individual modules, which indicates that it is affordable and within reasonable limits.

Complexity of Adaptive Prediction.
The prediction process consists of three parts: (1) the simulation of the review representation, which also adopts the efficient TextCNN technique; (2) opinion ranking based on efficient matrix factorization, similar to that in Reference [26]; and (3) the rating prediction and review generation modules, which are the same as those in triple dual learning. Therefore, the complexity of our adaptive prediction process is reasonable and practical, as it is a linear combination of several efficient modules.
In summary, our framework's complexity is a linear combination of complexities similar to those of existing works, which ensures that our framework is practical and can be implemented in real-world scenarios.

Algorithm 2: Adaptive Prediction
submodule TextCNNTraining
1: Initialize the trainable parameters of the TextCNN part.
2: repeat
3:   Get a mini-batch of (u, i, C) triplets.
4:   for each (u, i, C) record in the mini-batch do
5:     Compute the real review representation h_c in Equation (2).
6:     Approximate the review representation h̃_c in Equation (22).
7:   end for
8:   Optimize the parameters of the TextCNN part.
9: until the trainable parameters have converged.
10: end submodule
11: submodule AdpTriEOR
12:   Feed h̃_c into the f_r|c module to get r̂.
13:   Feed h̃_c, round(r̂) into the f_s|rc module to get S̃.
14:   Feed h̃_c, S̃ into the f_r|cs module to get r̂.
15:   Feed S̃, round(r̂) into the f_c|rs module to get C̃.
16: end submodule
Experimental Settings
We evaluate our proposed framework on three publicly available explainable recommendation datasets, from Amazon (Movies and TV), Yelp (Restaurant), and TripAdvisor (Hotel). Each record in the datasets is composed of a user ID, an item ID, a rating on a scale of 1 to 5, an opinion set, and a review. Specifically, each opinion in the opinion set is a short term that the user posts to express preferences for specific aspects of an item. We follow existing works [24] to extract the user's opinions on the item from the corresponding review and preprocess the review data according to Reference [23]. The statistics of these datasets are summarized in Table 2, and detailed examples are shown in Table 3. We randomly split each of the three datasets into training and testing sets with a ratio of 7:3.
As for the preprocessing of the review data, we set the size of the vocabulary to 20,000 by keeping the most frequent words. Moreover, we keep each review at the same length of 15 words. Additionally, the dimension of the word embedding matrix of the review data was set to 100, and the embedding matrix was pre-trained with Gensim word2vec. As for the preprocessing of the opinion data, since the number of opinions per opinion set varies with the record, we padded or randomly dropped opinions to keep each opinion set at the same number of opinions. Furthermore, the pre-trained BERT from HuggingFace was adopted to extract the textual features of opinions.
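The pad-or-drop step for opinion sets can be sketched as follows; the pad token and the fixed seed are illustrative assumptions, not details from the paper:

```python
import random

def normalize_opinion_set(opinions, m, pad_token="<pad>", seed=0):
    """Pad with a placeholder token, or randomly drop opinions, so that
    every opinion set contains exactly m entries."""
    if len(opinions) >= m:
        return random.Random(seed).sample(opinions, m)   # random drop
    return opinions + [pad_token] * (m - len(opinions))  # pad

short = normalize_opinion_set(["sweet story"], m=3)
long_ = normalize_opinion_set(["a", "b", "c", "d", "e"], m=3)
```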
Our TriEOR framework is implemented with PyTorch, a well-known open-source deep learning library. The remaining network configurations are presented in Table 4.

Performance Evaluation
In the following, we evaluate the performance of TriEOR on the three main tasks to examine the effectiveness of our framework. For these tasks, we disable some components of TriEOR for ablation studies as follows:
-Multi-task MLP. This baseline only incorporates the f_r, f_c, f_s modules in TriEOR with only the information of users and items as input, aiming at examining the necessity of modeling the relationships among the rating, review, and opinion set.
-TriEOR (w/o TDL). This baseline directly performs the adaptive prediction as illustrated in Section 5.3 without the optimization of triple dual learning. The purpose is to validate that the adaptive prediction is ineffective without triple dual modeling, thereby further verifying the necessity of our proposed triple dual learning.
-TriEOR (pre-opinion). This baseline replaces S with the opinion set ranked only by the user and item information (i.e., the output of Multi-task MLP, Top({u, i}, N) in Equation (18)) in the adaptive prediction process. Then it predicts the corresponding review and rating, which is devised to validate the effectiveness of capturing the impact of review and rating data on the opinion ranking task, thus verifying the effectiveness of the triple dual structure. Because the results of the opinion ranking task in this baseline are exactly those of the Multi-task MLP baseline, we omit them in Table 7.
-TriEOR (w/o opinion). This baseline directly predicts the rating r with the approximated review representation h_c on the basis of the learned triple dual learning framework, which is designed to verify the necessity of opinion information for rating prediction.
-TriEOR (w/o review). This baseline models the rating prediction and opinion ranking tasks in a dual form (similar to the dual form in DualPC [38]) without the review information. To address the unavailability of rating and opinion data for test samples, TriEOR (w/o review) ranks opinions by simulating the rating representation and then utilizes the ranked opinions to predict the rating. This baseline is specifically designed to verify the auxiliary effect of the review information in our model.

Rating Prediction Task.
At present, there are few opinion-based rating prediction models. Based on whether auxiliary review information is utilized, we divide the existing rating prediction methods into two categories as follows: (1) Rating-only baselines. These models directly predict ratings using only the user and item information.
-MF [21] projects both the user and item into a low-dimensional latent space for rating prediction.
-NeuMF [13] combines traditional matrix factorization with MLPs to extract low-order and high-order features for prediction.
-NRT [27] is a multi-task learning model for both rating prediction and review generation, which directly leverages user and item latent factors to predict ratings.
-NETE [23] treats rating prediction and explanation generation as multi-task modeling and utilizes the user and item latent vectors for rating prediction.
-PETER [25] jointly models the rating prediction and review generation tasks by sharing user and item embeddings.
(2) Rating and Review baselines. These models predict ratings with auxiliary review data.
-DeepCoNN [56] uses two parallel TextCNN processors to obtain the user and item historical review semantic embeddings for rating prediction.
-TransNet [2] approximates the review representation of the user-item pair to predict the rating.
-NARRE [3] advances DeepCoNN by using the attention mechanism to select useful reviews from the user and item historical reviews for rating prediction.
-DualPC [38] models the rating prediction and review generation tasks in dual forms. To address the unavailability of rating and review data, DualPC directly predicts the rating by simulating the review representation for test data.
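To make the simplest baseline concrete, the following is a minimal MF-style sketch: ratings are approximated by the dot product of low-dimensional user and item latent vectors, fitted with plain SGD on squared error. The dimensions, learning rate, and toy ratings are arbitrary illustrative choices, not the configuration used in the experiments:

```python
import numpy as np

# Toy illustration of the MF baseline: approximate ratings by the dot product
# of low-dimensional user/item latent vectors, fitted by SGD on squared error.
rng = np.random.default_rng(0)
n_users, n_items, d = 5, 6, 3
data = [(0, 1, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 3, 1.0), (3, 4, 2.0)]  # (u, i, r)

U = 0.1 * rng.normal(size=(n_users, d))   # user latent factors
V = 0.1 * rng.normal(size=(n_items, d))   # item latent factors
lr, reg = 0.05, 0.01
for _ in range(200):                      # plain SGD over the observed triples
    for u, i, r in data:
        err = r - U[u] @ V[i]
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])

print(round(float(U[0] @ V[1]), 2))       # reconstructed rating for (user 0, item 1)
```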
For fairness of comparison, the configurations of the TextCNN processors in DeepCoNN, TransNet, and NARRE are the same as those in TriEOR. Moreover, the hidden state sizes of all LSTMs in the baselines for rating prediction are consistent with that of the GRU in the f_r|cs module. Besides, we adopt the same pre-trained word embedding matrix for review data in all methods. The accuracy of this task is measured by the root-mean-square error (RMSE):

RMSE = sqrt( (1/|X_test|) Σ_{(u,i)∈X_test} (r_{u,i} − r̂_{u,i})² ),

where r_{u,i} is the ground-truth rating, r̂_{u,i} is the predicted rating, X_test denotes the test dataset, and |X_test| represents the total number of records in X_test.
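The RMSE metric is straightforward to compute directly:

```python
import math

# RMSE over a test set: square root of the mean squared difference between
# ground-truth and predicted ratings.
def rmse(y_true, y_pred):
    assert len(y_true) == len(y_pred) and y_true
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(rmse([5, 3, 4], [4, 3, 2]))  # sqrt((1 + 0 + 4) / 3) ≈ 1.291
```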
A lower RMSE score indicates better performance. The RMSE results of this task are shown in Table 5. First, TriEOR outperforms all compared methods on all datasets for this task. Furthermore, with historical review information as auxiliary information, DeepCoNN, TransNet, and NARRE almost always show better rating prediction performance than MF, NeuMF, NRT, NETE, and PETER. TriEOR further incorporates the review and opinion information, which leads to better results. Moreover, DualPC, which models the rating prediction and review generation tasks in dual forms, does not perform as well as TriEOR, which may result from the fact that DualPC directly uses simulated review representations to predict ratings without adapting the model to the test data. Without opinions, the performance of TriEOR (w/o opinion) drops, demonstrating the importance of opinions in raising rating prediction performance.

Review Generation Task.
For this task, we compare our proposed TriEOR with the following baselines:
-Att2Seq [7] decodes the latent embeddings of the user and item to generate the review.
-NRT [27] leverages the predicted rating as a part of the initial state of LSTM for review generation.
-DualPC [38] models the rating prediction and review generation tasks in dual form, with the input of one task being exactly the output of the other task. It feeds the predicted ratings to an LSTM to generate reviews.
-NETE [23] uses the results of the rating prediction task as the input of a GRU to generate review explanations.
-PETER [25] employs a Transformer model [41] with a modified attention mask to use the user and item ID information to jointly model the review generation and rating prediction tasks.
To ensure fairness, we set the hidden state sizes of all LSTMs/GRUs used for review generation in these baselines to be the same as that of the GRU in the f_c|rs module in TriEOR. For evaluating the text quality of the generated review C, we adopt commonly used metrics, including BLEU [31] and ROUGE [28] scores. Specifically, BLEU scores are precision-oriented, while ROUGE scores include precision, recall, and F-measure parts based on the overlapping content of the generated reviews and the real reviews. Moreover, there are different types of BLEU and ROUGE scores corresponding to different overlapping content lengths, which can measure text quality from various aspects. For instance, the BLEU-1 score measures accuracy at the word level, while higher-order BLEU (e.g., BLEU-4) can measure the fluency of generated sentences. The results based on these metrics for the review generation task are presented in Table 6. Most importantly, TriEOR outperforms all compared methods on most metrics on the three datasets, especially achieving the best performance on BLEU-4 and the Precision and F1-measure of the ROUGE-1, ROUGE-2, and ROUGE-L metrics, which indicates that TriEOR is able to generate more suitable reviews. Unlike Att2Seq and PETER, which only encode user and item information, NRT, NETE, and DualPC also take rating information into account, resulting in these three models almost always performing better. Our TriEOR further encodes the opinion information, which can provide rich semantic information of opinions to generate reviews.
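To make the BLEU discussion concrete, here is a minimal sketch of what BLEU-1 measures: clipped unigram precision of the generated review against the reference. Full BLEU additionally combines higher-order n-grams and a brevity penalty, which this sketch omits:

```python
from collections import Counter

# Clipped unigram precision: the fraction of generated words that also appear
# in the reference, with each word's count clipped by its reference count.
def unigram_precision(generated, reference):
    gen = generated.split()
    ref = Counter(reference.split())
    clipped = sum(min(c, ref[w]) for w, c in Counter(gen).items())
    return clipped / len(gen)

gen = "the movie was very good"
ref = "the movie was really good fun"
print(unigram_precision(gen, ref))  # → 0.8 (4 of 5 generated words match)
```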

Opinion Ranking Task.
Since there are few studies on the opinion ranking task at present, we introduce three existing baselines for this task: a weak baseline, RAND, and two strong baselines, PITF [24] and BPER+ [26]. The details of these baselines are as follows:
-RAND randomly selects opinions from S for each user-item pair.
-PITF [24] utilizes matrix factorization to incorporate personalized information about both the user and item for opinion ranking.
-BPER+ [26] further considers the semantic information of opinions for opinion ranking.
As our opinion ranking task is actually a ranking task, we adopt four commonly used ranking-oriented metrics, Normalized Discounted Cumulative Gain (NDCG) [17], Precision (Pre), Recall (Rec), and F1, to evaluate the performance on Top-10 ranking. These metrics are computed based on the relevance indicator

rel_j = δ(s_j ∈ S),

where rel_j indicates whether the jth opinion in the ranked list (computed in Equation (24)) belongs to the ground-truth opinion set S and δ(·) is the indicator function. The experimental results for the opinion ranking task on the three datasets are shown in Table 7. First, TriEOR outperforms all baselines in the opinion ranking task. Besides, PITF, BPER+, and Multi-task MLP do not perform as well as our model, suggesting that it makes sense to consider the impact of reviews and ratings on opinion ranking. Note that with only the adaptive prediction, the performance of our model (i.e., TriEOR (w/o TDL)) decreases greatly due to the absence of the opinion ranking loss term in Equation (26), indicating the necessity of triple dual learning.
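The four Top-N metrics can be sketched as below. The 1/log2(j+1) DCG discount is the common convention and is our assumption about the exact formula used:

```python
import math

# rel_j = 1 if the j-th ranked opinion is in the ground-truth set S, else 0.
def ranking_metrics(ranked, truth, n=10):
    rel = [1 if o in truth else 0 for o in ranked[:n]]
    dcg = sum(r / math.log2(j + 2) for j, r in enumerate(rel))
    idcg = sum(1 / math.log2(j + 2) for j in range(min(len(truth), n)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    pre = sum(rel) / n
    rec = sum(rel) / len(truth) if truth else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return ndcg, pre, rec, f1

ranked = ["great acting", "bad plot", "nice music", "slow pacing"] + ["pad%d" % k for k in range(6)]
truth = {"great acting", "nice music", "boring story"}
ndcg, pre, rec, f1 = ranking_metrics(ranked, truth)
print(round(pre, 2), round(rec, 2))  # 2 hits in the Top-10; |S| = 3
```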
In conclusion, TriEOR outperforms all baselines on most evaluation metrics across the three tasks. In particular, TriEOR achieves the best results in the rating prediction and opinion ranking tasks, demonstrating its ability to improve the performance of tasks presented in the triple dual form. Furthermore, the performance of Multi-task MLP, with only the user and item information as the input of each task, is poorer than that of TriEOR, which suggests it is effective to model the relationships among the rating, review, and opinion set. Moreover, compared to TriEOR, the performance of our framework with only the adaptive prediction (i.e., TriEOR (w/o TDL)) decreases a lot, which proves the necessity of the triple dual modeling of these three tasks in advance. Finally, the performance of the framework that lacks the modeling of the impact of reviews and ratings on opinion ranking (i.e., TriEOR (pre-opinion)) drops a lot. This observation indicates the effectiveness of modeling the impact of review and rating information on opinion ranking, and further verifies the effectiveness of our proposed triple dual structure. As expected, TriEOR (w/o review) performs worse than TriEOR in both the rating prediction and opinion ranking tasks. This is possibly due to the fact that the auxiliary review provides additional supervision for modeling the user-item interaction (rating and opinion) and reveals the user's overall preference for an item. This auxiliary information can better assist in capturing the importance of individual opinions in opinion-based rating prediction.

Impact of the Triple Dual Constraint
In TriEOR, we propose a triple dual constraint in Equation (11) to capture the probabilistic inter-dependencies among the tasks. The parameter λ in Equation (21) controls the impact of the constraint. Specifically, when λ = 0, TriEOR degenerates to classical multi-task learning that simply shares embeddings without our proposed triple dual constraint. We tune λ among {0, 0.2, 0.4, 0.6, 0.8, 1.0} to investigate the impact of the constraint on the Amazon dataset, as shown in Figure 5(a). Overall, with the increase of λ, the performance of the rating prediction, review generation, and opinion ranking tasks increases first and then decreases. Furthermore, the constrained model at most constraint strengths (i.e., λ > 0) outperforms the unconstrained model that simply shares embeddings (i.e., λ = 0). This observation indicates that with the triple dual constraint, TriEOR is capable of obtaining more precise user and item embeddings for the tasks. Since the three tasks perform best with λ = 0.2, we set λ = 0.2 for the Amazon dataset. The impact of the triple dual constraint on the Yelp and TripAdvisor datasets is also presented in Figures 5(b) and 5(c). For the Yelp dataset, the rating prediction task performs best with λ = 0.2, while the review generation and opinion ranking tasks achieve optimal performance with λ = 0.4. Thus, we set λ = 0.4 for the Yelp dataset. Following a similar analysis, we set λ = 0.2 for the TripAdvisor dataset.
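The λ selection above amounts to a small grid search: train with each candidate value, record validation quality, and keep the best. In the schematic sketch below, evaluate() is a toy stand-in for the full train-and-validate cycle, with a peak placed at λ = 0.2 to mirror the trend reported for Amazon:

```python
def evaluate(lam):
    # Toy proxy for "train TriEOR with this λ and measure validation quality":
    # peaks at λ = 0.2, mirroring the reported trend. Not the real objective.
    return 1.0 - (lam - 0.2) ** 2

grid = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
best = max(grid, key=evaluate)  # pick the λ with the best validation score
print(best)  # → 0.2
```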

Evaluation of Opinion-based Explanation
In TriEOR, each generated opinion is assigned a preference score and an overall preference-aware importance value to quantify its contribution to the predicted rating. To evaluate the quality of the explanations, we provide the explanations generated by TriEOR and two competitive baselines (i.e., DualPC and BPER+) for three specific cases, as shown in Table 8. For clarity, we manually select the five most relevant opinions from the top 10 opinions ranked by TriEOR and BPER+. Through these cases, we can observe the following:
-Compared to the other baselines, the opinions and reviews generated by TriEOR are more relevant to the ground truth. For example, in case 1, almost all opinions resulting from BPER+ are positive, such as "Excellent drama," which contradicts the real negative opinion "The acting is terrible." In contrast, the opinions generated by TriEOR for case 1 are essentially negative and contain the real opinion. Besides, the preference words in the reviews generated by TriEOR align better with the actual reviews than those generated by DualPC.
-The explanations generated by TriEOR can effectively reveal the aspect-level preference.
For example, in case 3, "very good" in the generated review indicates the overall positive preference, while the opinions show positive preferences in different aspects such as "acting" and "entertainment."
-The preferences of the explanations are consistent with those of the corresponding predicted ratings (i.e., the consistent relationship) in TriEOR. For example, "bad movie" in the review generated by TriEOR is more consistent with the rating in terms of preference than "very good" for case 1. Besides, the preference scores of the generated opinions in TriEOR are relatively close to the ground truth and predicted ratings, such as preference scores of less than 2 for case 1 and around 5 for case 3.
Further, to verify the meaningfulness of the preference scores of opinions, we visualize the average preference scores of opinions on the Amazon dataset. Specifically, in Figure 6(a), the word size is positively correlated with the average preference score of opinion words, while it is negatively correlated in Figure 6(b). Obviously, opinions with high average preference scores, such as "Highest Recommendation," indicate the user's fondness for the item, while those with low average preference scores, like "bad movie," show the user's dislike of the item. This demonstrates that the preference score of an opinion can effectively indicate meaningful preference. Besides, we show the specific preference scores and importance values of the generated opinions for case 3 in Figure 7, where the matching opinions stand out. With their high importance values, the opinions well explain the aspect-level preference on the predicted rating. For instance, the opinions "Classic comedy" and "Great Entertainment," with importance values of 20.98% and 20.73% and preference scores of 5.37 and 5.08, respectively, indicate that the high predicted rating of 4.61 lies in the positive preference for the entertainment aspect of the movie.

CONCLUSION
In this article, we propose an opinion-based rating prediction model to realize quantitative explanation at the level of user preferences and item aspects. To address the sparsity problem and realize the consistent modeling of review, opinion, and rating, we propose a triple dual learning framework. A triple dual constraint is further proposed to capture the probabilistic inter-dependencies among the tasks, which is proven to improve the performance of each task. Besides, our framework assigns a meaningful preference score and an overall preference-aware importance value to each generated opinion to quantitatively explain the contribution of the user's aspect-level preference to the predicted rating. Finally, experiments on three real-world datasets validate the effectiveness and great explanation performance of our proposed framework. The existing opinion extraction process may result in some opinions having overlapping meanings or unclear information, which could potentially limit the quality of the explanation. In future work, we plan to further preprocess the opinion data to better enhance the quality of our explanation.

Fig. 1. Three records for different movies from Amazon (Movies and TV). Positive opinions are highlighted with red boxes while negative opinions are highlighted with blue boxes.

Fig. 2. Structure of the overall preference-aware opinion-based rating prediction module.
…and item embedding matrices U ∈ R^(d×|U|) and I ∈ R^(d×|I|), respectively (d denotes the dimension of the embedding vectors), we can obtain the embeddings of user u and item i as E_u, E_i. (2) Inspired by DeepCoNN [56], the TextCNN processors [19] Γ_u and Γ_i are used to preprocess the user's historical reviews rev_u and the item's historical reviews rev_i as…

Fig. 3. Overall structure of the triple dual learning framework. The f_r|cs, f_c|rs, f_s|rc tasks (i.e., τ_a = {r, C, S} − {a}) are presented in the triple dual form as shown in Figure 4.

4: for each (u, i, r, C, S) in the mini-batch do
...
7: Compute the triple dual loss in Equation (11) with all probabilities.
8: Compute the loss L in Equation (21).
...
10: Optimize the parameters of the Triple Dual Learning framework.
11: until the trainable parameters have converged.
...
14: Load the convergent trainable parameters of Triple Dual Learning in Algorithm 1.
15: repeat
16: Get a mini-batch of training data.
17: for each (u, i, r, C, S) record in mini-batch do
18: Feed u, i into the TextCNN submodule to get h_c.
19: Feed h_c into the f_r|c module to get r.
...

Fig. 7. Visualization of preference scores and importance values of the Top 10 opinions in case 3.

Table 1.
Mathematical Notations
N, M: the number of words c_* in C and the number of opinions s_* in S
τ_r, τ_c, τ_s: condition sets: ∀a ∈ {r, C, S}, each τ_a is a subset of {r, C, S} − {a}
f_r|τ_r, θ_r|τ_r: the rating prediction task given condition set τ_r and its parameters
f_c|τ_c, θ_c|τ_c: the review generation task given condition set τ_c and its parameters
f_s|τ_s, θ_s|τ_s: the opinion ranking task given condition set τ_s and its parameters
E_u, E_i, E_r, E_t^c: embeddings of the user, the item, the rating, and the word at time step t
s_j^e: textual feature of opinion s_j
v_j^(r|cs), m_j^(r|cs): preference score and importance value of opinion s_j in the f_r|cs module
r_τ_r, δ²_τ_r: predicted rating and estimated variance of the rating given condition set τ_r
g_*: hidden state of GRU
π(r, C, S): full permutation of {r, C, S}
o: candidate opinion for the opinion ranking task: o ∈ S
z_a^o: correlation score of candidate opinion o given known information a
{Q_*^o, b_*^o}: sets of latent factors in the opinion ranking task
λ: discount factor
h_c, r: approximated review representation and rating in adaptive prediction
…r, C, S, respectively, where x_s is the mean pooling of the features of the opinion set [s_1^e, s_2^e, …, s_M^e] and E_r is the rating embedding. At each time t, with the previous hidden state g_(t−1) and E_t^c (i.e., the word embedding…

Table 2 .
Statistics of the Three Datasets

Table 3.
Example Data Format

Table 4.
Network Configurations
kernel size of Γ_u, Γ_i, Γ'_u, and Γ'_i: 3
kernel count of Γ_u, Γ_i, Γ'_u, and Γ'_i: 100
number of heads in Multi-head Attention in f_r|s / f_r|cs: 8
dimension of heads in Multi-head Attention in f_r|s / f_r|cs: 16
hidden layer of MLP in f_r|cs: [50]
hidden layer of MLP in f_r|c / f_r|s: [20]
hidden state size of BiGRU: 50
hidden state size of GRU in f_c|r / f_c|s / f_c|rs: 100

Table 5.
RMSE Comparison on Three Datasets. We repeat each model five times and list the RMSE results in the form of mean ± standard deviation. * indicates p ≤ 0.05 in a paired t-test of TriEOR vs. the best baseline on RMSE. Bold formatting indicates the best performance.

Table 7.
Performance Comparison on the Top-10 Opinion Ranking in Terms of NDCG, Precision (Pre), Recall (Rec), and F1. The best performance values are marked in bold and % is omitted for table clarity. * indicates statistical significance for p ≤ 0.05 via paired t-test.