UCEpic: Unifying Aspect Planning and Lexical Constraints for Generating Explanations in Recommendation

Personalized natural language generation for explainable recommendations plays a key role in justifying why a recommendation might match a user's interests. Existing models usually control the generation process by aspect planning. While promising, these aspect-planning methods struggle to generate specific information correctly, which prevents generated explanations from being convincing. In this paper, we claim that introducing lexical constraints can alleviate the above issues. We propose a model, UCEpic, that generates high-quality personalized explanations for recommendation results by unifying aspect planning and lexical constraints in an insertion-based generation manner. Methodologically, to ensure text generation quality and robustness to various lexical constraints, we pre-train a non-personalized text generator via our proposed robust insertion process. Then, to obtain personalized explanations under this framework of insertion-based generation, we design a method of incorporating aspect planning and personalized references into the insertion process. Hence, UCEpic unifies aspect planning and lexical constraints into one framework and generates explanations for recommendations under different settings. Compared to previous recommendation explanation generators controlled by only aspects, UCEpic incorporates specific information from keyphrases and then largely improves the diversity and informativeness of generated explanations for recommendations on datasets such as RateBeer and Yelp.


INTRODUCTION
Explaining, or justifying, recommendations in natural language has been fast gaining traction in recent years [21-24, 28, 32, 34]: it shows product information in a personalized style and justifies how a recommendation meets a user's needs. That is, given a user-item pair, the system generates an explanation such as "nice TV with 4K display and Dolby Atmos!". To generate high-quality personalized explanations that are coherent, relevant and informative, recent studies introduce aspect planning, i.e., including different aspects [22, 23, 32, 34] in the generation process so that the generated explanations cover those aspects and are thus more relevant to the product and to the user's interests. While promising, existing methods struggle to include accurate and highly specific information in explanations because aspects (e.g., screen for a TV) mostly control the high-level sentiment or semantics of the generated text (e.g., "good screen and audio!"), while many informative product attributes are too specific to be generated accurately (e.g., "nice TV with 4K display and Dolby Atmos!"). Although aspect-planning explanation generators try to harvest expressive and personalized natural-language explanations from users' textual reviews [22, 23, 32, 34], our preliminary experiments show that many informative and specific keyphrases in the training corpus (i.e., user reviews) vanish from generated explanations.

Figure 1: Preliminary experiments on the aspect coverage, phrase coverage, and Distinct-2 of generated explanations from previous models ExpansionNet [34], Ref2Seq [32] and PETER [22] on the RateBeer and Yelp datasets. Details in Appendix A.

As Figure 1 shows, generated explanations from previous methods miss many specific keyphrases and have much lower Distinct (diversity) scores than a human oracle. Hence, with aspects only, existing methods suffer from generating (1) overly general sentences (e.g., "good screen!") that can hardly provide
diverse and informative explanations to users; (2) sentences with inaccurate details (e.g., "2K screen" for a 4K TV), which are not relevant to the product and hurt users' trust. To address these problems, we propose to apply more concrete constraints to recommendation explanations beyond aspects. Specifically, we seek a model unifying lexical constraints and aspect planning. In this model, introducing lexical constraints guarantees the use of given keyphrases (e.g., "Dolby Atmos") and thus includes specific and accurate information. Also, similar to the aspect selection of previous explanation generators [22, 32], such lexical constraints can come from multiple parties: explanation systems can select item attributes with some strategy; vendors can highlight product features; users can manipulate generated explanations by changing the lexical constraints of interest. Hence, the informativeness, relevance and diversity of generated explanations can be significantly improved over previous aspect-planning methods. Meanwhile, aspect planning remains useful when no specific information is given but multiple aspects need to be covered.
To achieve this goal of Unifying aspect-planning and lexical Constraints for generating Explanations in Recommendation, we present UCEpic. Building UCEpic poses several challenges. First, lexical constraints are incompatible with existing explanation generation models (see group (A) in Table 1), because these models are mostly based on auto-regressive generation frameworks [14, 17, 19, 20, 31, 34], whose "left-to-right" generation strategy cannot guarantee that lexical constraints appear at arbitrary positions. Second, although insertion-based generation models (see group (B) in Table 1) naturally contain lexical constraints in generated sentences, we find that personalization or aspects cannot simply be incorporated through an "encoder-decoder" framework for existing insertion-based models: existing tokens are strong signals for the new tokens to be predicted, so the model tends to generate similar sentences and ignore different references from encoders.
For the first challenge, UCEpic employs an insertion-based generation framework and conducts robust insertion pre-training on a bi-directional transformer. During robust pre-training, UCEpic gains the basic ability to generate text and handle various lexical constraints. Specifically, inspired by Masked Language Modeling (MLM) [4], we propose an insertion process that progressively inserts new tokens into sentences at random positions so that UCEpic is robust to arbitrary lexical constraints. For the second challenge, UCEpic uses personalized fine-tuning for personalization and aspect awareness. To tackle the issue of "ignoring references", we propose to view references as part of the inserted tokens, so the generator learns to insert new tokens relevant to the references. For aspect planning, we formulate aspects as a special insertion stage in which aspect-related tokens are generated first as a start for the subsequent generation. Finally, lexical constraints, aspect planning and personalized references are unified in the insertion-based generation framework.
Overall, UCEpic is the first explanation generation model to unify aspect planning and lexical constraints. UCEpic significantly improves the relevance, coherence and informativeness of generated explanations compared to existing methods. The main contributions of this paper are summarized as follows:
• We show the limitations of using only aspect planning in existing explanation generation, and propose to introduce lexical constraints for explanation generation.
• We present UCEpic, which includes robust insertion pre-training and personalized fine-tuning to unify aspect planning, lexical constraints and references in an insertion-based generation framework.
• We conduct extensive experiments on two datasets. Objective metrics and human evaluations show that UCEpic largely improves the diversity, relevance, coherence and informativeness of generated explanations.

RELATED WORK
Explanation Generation for Recommendation. Generating explanations of recommended items has been studied for a long time with various output formats [6, 43, 44] (e.g., item aspects, attributes, similar users). Recently, natural-language explanation generation has drawn great attention [21-24, 28, 32, 34] for producing post-hoc explanations or justifications in a personalized style. For example, Li et al. [24] applied an RNN-based model to generate explanations based on predicted ratings. To better control the generation process, Ni et al. [32] extracted aspects and controlled the semantics of generated explanations conditioned on different aspects, and Li et al. [22] proposed a personalized transformer to generate explanations based on given item features. The review generation area is also highly related, since explanation generation methods usually harvest expressive and informative explanations from user reviews; many controllable review generators [5, 34, 39] have been tailored to explanation generation as baseline models in early experiments. Although previous works continued to increase the controllability of generation, they all build on auto-regressive generation frameworks [14, 17, 19, 20, 31, 34] and thus consider only aspect planning. In our work, UCEpic increases the controllability and informativeness of generated explanations by unifying aspect planning and lexical constraints under an insertion-based generation framework.
Lexically Constrained Text Generation. Lexically constrained generation requires that the generated text contain given lexical constraints (e.g., keywords). Early works usually involve special decoding methods. Hokamp and Liu [12] proposed a grid beam search decoding algorithm to incorporate constraints. Post and Vilar [36] presented an algorithm for lexically constrained decoding with reduced complexity in the number of constraints. Hu et al. [13] further improved decoding with vectorized dynamic beam allocation. Miao et al. [30] introduced a sampling-based conditional decoding method, where the constraints are first placed in a template and words are then decoded under Metropolis-Hastings sampling. Special decoding methods usually incur high running time complexity. Recently, Zhang et al. [45] implemented hard-constrained generation with O(log n) time complexity via language model pre-training and the insertion-based generation [2, 7, 8, 37] used in machine translation. CBART [10] builds on the pre-trained model BART [15], using the encoder to instruct insertion and the decoder to predict masked tokens.

METHODOLOGY
We describe aspect planning and lexical constraints for explanation generation as follows. Given a user persona R_u and an item profile R_i for user u and item i as references, the generation model under aspect planning outputs an explanation E_{u,i} related to an aspect A_{u,i}, but not necessarily including any specific words. For lexical constraints, given several constraints (e.g., phrases or keywords) C_{u,i} = {c_1, c_2, ..., c_m}, the model generates an explanation E_{u,i} = (e_1, e_2, ..., e_n) that must exactly include every given constraint c_j, i.e., c_j = (e_k, ..., e_{k+|c_j|}) for some position k. In a real application, the lexical constraints can come from users, businesses, or item attributes recommended by personalized systems. UCEpic unifies the two kinds of constraints in one model. We study only the explanation generation method and assume aspects and lexical constraints are given. Our notations are summarized in Table 2.

Robust Insertion

Motivation. Previous explanation generation methods [22, 32] generally adopt auto-regressive generation conditioned on personalized inputs (e.g., personalized references and aspects). As shown in Figure 2 (a), the auto-regressive process generates words in a "left-to-right" direction, so lexical constraints are difficult to incorporate during generation. In contrast, for the insertion-based generation in Figure 2 (b), which progressively inserts new tokens based on existing words, lexical constraints are easily contained by treating the constraints as the starting stage of insertion.

Formulation.
Insertion-based generation can be formulated as a progressive sequence of stages S = {s_0, s_1, ..., s_{T-1}, s_T}, where s_0 is the stage of lexical constraints and s_T is the final generated text. For k ∈ {1, ..., T}, s_{k-1} is a sub-sequence of s_k. The generation procedure finishes when UCEpic inserts no new tokens into s_T. In the training process, all sentences are prepared as training pairs. Specifically, pairs of text sequences at adjacent stages (s_{k-1}, s_k) are constructed by reversing the insertion-based generation process. Each explanation E_{u,i} in the training data is broken into a consecutive series of pairs (s_0, s_1), (s_1, s_2), ..., (s_{T-1}, s_T), where the final stage s_T is the explanation text E_{u,i}.
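The stage decomposition above can be sketched as follows. This is a toy illustration under assumed choices (deleting half of the remaining non-constraint tokens per stage, a fixed random seed); it is not UCEpic's exact sampling schedule.

```python
import random

def is_subseq(a, b):
    """True if sequence a is a subsequence of b (order-preserving)."""
    it = iter(b)
    return all(tok in it for tok in a)

def build_stage_chain(s_final, constraint_idx, p=0.5, seed=0):
    """Decompose a finished explanation s_final (token list) into stages
    s_0, s_1, ..., s_T, where s_0 keeps only the lexical-constraint tokens
    (at positions constraint_idx) and each s_{k-1} is a sub-sequence of s_k
    obtained by deleting a fraction p of the remaining deletable tokens."""
    rng = random.Random(seed)
    alive = set(range(len(s_final)))
    deletable = [i for i in alive if i not in constraint_idx]
    stages = [[s_final[i] for i in sorted(alive)]]        # s_T first, for now
    while deletable:
        n_del = max(1, int(len(deletable) * p))
        dropped = set(rng.sample(deletable, n_del))
        alive -= dropped
        deletable = [i for i in deletable if i not in dropped]
        stages.append([s_final[i] for i in sorted(alive)])
    return stages[::-1]                                    # s_0 first
```

Adjacent stages then give the training pairs via `list(zip(stages, stages[1:]))`.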

Data Construction.
Given a stage s_k, we obtain the previous stage s_{k-1} by two operations, masking and deletion. Specifically, we randomly mask the tokens in the sequence with probability p, as in MLM, to get an intermediate sequence s^m_{k-1}. Then, the [MASK] tokens are deleted from s^m_{k-1} to obtain the stage s_{k-1}. The number of deleted [MASK] tokens after each token in s^m_{k-1} is recorded in an insertion-number sequence n_{k-1}. Finally, each training instance contains four sequences (s_{k-1}, s^m_{k-1}, n_{k-1}, s_k). We include a simple example of the data construction process in Table 3. Since we delete l·p tokens from a sequence s_k of length l, the average number of stages T is log_{1/(1-p)} l. Models trained on this data can easily re-use the knowledge from BERT-like models, which have a similar pre-training process of masked word prediction.
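A minimal sketch of this masking-and-deletion construction, assuming stages are plain token lists and using a literal "[MASK]" string (tokenizer details omitted):

```python
import random

MASK = "[MASK]"

def construct_training_instance(s_k, p=0.2, rng=None):
    """Build one training instance from stage s_k (a token list).

    Each token is masked with probability p (as in MLM) to give the
    intermediate sequence; deleting the [MASK] tokens yields the previous
    stage; the insertion-number sequence records how many masks were
    deleted in each slot (slot 0 before the first kept token, slot i after
    kept token i)."""
    rng = rng or random.Random(0)
    intermediate = [MASK if rng.random() < p else tok for tok in s_k]
    s_prev, insert_nums = [], [0]
    for tok in intermediate:
        if tok == MASK:
            insert_nums[-1] += 1       # one more deleted mask in the current slot
        else:
            s_prev.append(tok)
            insert_nums.append(0)      # open the slot after this kept token
    return s_prev, intermediate, insert_nums, list(s_k)
```

By construction, the kept tokens plus the recorded insertion counts always reconstruct the length of s_k.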
Insertion generation (see Algorithm 1) is the inverse of the data construction process. For each insertion step from ŝ_{k-1} to ŝ_k, the model recovers the text sequence by two operations, mask insertion and token prediction. In particular, UCEpic first inserts [MASK] tokens between existing tokens in ŝ_{k-1} to get ŝ^m_{k-1}, according to the insertion-number sequence n̂_{k-1} predicted by an insertion prediction head. Then, UCEpic with a language modeling head predicts the masked tokens in ŝ^m_{k-1}, recovering the [MASK] tokens into words to obtain ŝ_k. Concretely, the transformer hidden states are fed into two heads: the token prediction head H_TP is a multi-layer perceptron (MLP) with GeLU activation [11], and the mask insertion head H_MI is a linear projection layer. Our predictions of the mask insertion numbers and word tokens are computed as

n̂_{k-1} = argmax softmax(P_MI), P_MI = H_MI(h(ŝ_{k-1})), (1)
ŝ_k = argmax softmax(P_TP), P_TP = H_TP(h(ŝ^m_{k-1})), (2)

where P_MI ∈ R^{l_s × n_ins} and P_TP ∈ R^{l_i × n_vocab}; l_s and l_i are the lengths of ŝ_{k-1} and ŝ^m_{k-1} respectively; n_ins is the maximum number of insertions, n_vocab is the vocabulary size, and the argmax is taken row-wise. ŝ^m_{k-1} is obtained by inserting [MASK] tokens into ŝ_{k-1} according to n̂_{k-1}.
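One such generation step can be sketched as below, with the two model heads replaced by callbacks (assumptions standing in for the insertion prediction head and the language modeling head):

```python
MASK = "[MASK]"

def insertion_step(s_prev, predict_insert_nums, predict_masked_tokens):
    """One step of the inverse (generation) process: insert the predicted
    number of [MASK] tokens into every slot of the current stage, then fill
    each mask with a predicted word.

    predict_insert_nums(s_prev)    -> len(s_prev) + 1 slot counts
    predict_masked_tokens(masked)  -> one word per [MASK], left to right
    """
    nums = predict_insert_nums(s_prev)
    assert len(nums) == len(s_prev) + 1
    masked = []
    for i, n in enumerate(nums):
        masked.extend([MASK] * n)                # masks for slot i
        if i < len(s_prev):
            masked.append(s_prev[i])             # then the existing token
    fills = iter(predict_masked_tokens(masked))
    return [next(fills) if t == MASK else t for t in masked]
```

For example, with one existing token and slot counts [1, 2], the step yields a four-token sequence around the kept token.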
Because the random insertion process is more complicated to learn than the traditional autoregressive generation process, we first pre-train UCEpic with our robust insertion method for general text generation without personalization.The pre-trained model can generate sentences from randomly given lexical constraints.

Personalized References and Aspect Planning
Motivation. To incorporate personalized references and aspects, one direct method is to add another text-and-aspect encoder and condition the insertion generation on it, as in sequence-to-sequence models [38]. However, we find that the pre-trained insertion model with such an encoder generates similar sentences across different personalized references and aspects. The reason is that the pre-trained insertion model treats the lexical constraints or existing tokens in the text sequence as a strong signal for determining newly inserted tokens. Even if the encoder provides personalized features, the model tends to overfit to features from existing tokens. Without lexical tokens providing different starting stages, the generated sentences are usually identical.

Formulation.
To better learn personalization, we propose to view references and aspects as special existing tokens during the insertion process. Specifically, we construct a training stage s_k^+ that includes references and aspects:

s_k^+ = (R_{u,i}, A_{u,i}, s_k) = (r_1, ..., r_{|R|}, a_1, ..., a_{|A|}, w_1, ..., w_{|s_k|}),

where R_{u,i} and A_{u,i} denote the personalized references and aspects; r_j, a_j and w_j are the tokens or aspect ids in the references, aspects and insertion stage respectively. Because insertion-based generation relies on token positions to insert new tokens, we restart the Transformer position ids from 0 for each of R_{u,i}, A_{u,i} and s_k, so that the positions of s_k are consistent between pre-training and fine-tuning. Similarly, we obtain the extended insertion-number sequence n_{k-1}^+. As in Equation (1) and Equation (2), we obtain n̂_{k-1}^+ and ŝ_k^+ by the argmax operation. Because personalized references and aspects are viewed as special existing tokens, UCEpic directly incorporates token-level information as generation conditions and hence generates diverse explanations.
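The extended-stage input layout can be sketched as follows; the segment-id scheme here is an assumption for illustration, while the per-segment position restart follows the description above.

```python
def build_personalized_stage(ref_tokens, aspect_ids, stage_tokens):
    """Prepend references R and aspects A to an insertion stage s_k,
    giving the extended stage s_k^+.  Position ids restart from 0 for each
    segment so the positions of s_k match pre-training; segment ids mark
    which part each token belongs to."""
    tokens = list(ref_tokens) + list(aspect_ids) + list(stage_tokens)
    positions = (list(range(len(ref_tokens)))
                 + list(range(len(aspect_ids)))
                 + list(range(len(stage_tokens))))
    segments = ([0] * len(ref_tokens)        # personalized references R
                + [1] * len(aspect_ids)      # aspect ids A
                + [2] * len(stage_tokens))   # insertion stage s_k
    return tokens, positions, segments
```

Because the stage segment's positions start at 0 regardless of the reference length, the insertion heads see s_k at the same positions as during pre-training.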
Recall that existing text sequences are strong signals for token prediction. For better aspect-planning generation, we design two starting stages, s_0^a for aspects and s_0^c for lexical constraints. In particular, we expect aspect-related tokens to be generated at the starting stage (i.e., with no existing tokens) according to the given aspects and personalized references. Hence, the aspect starting stage is

s_0^a = (R_{u,i}, A_{u,i}, ∅), s_0^c = (R_{u,i}, A_pad, C_{u,i}),

where A_pad is a special aspect used for lexical constraints and ∅ denotes an empty insertion stage. During training, we sample s_0^a with a tunable probability to ensure that UCEpic effectively learns aspect-related generation, which is absent in pre-training.

Model Training
The training process of UCEpic learns the inverse of the data construction process. Given stage pairs (s_{k-1}, s_k) from pre-processing (for fine-tuning with personalized references and aspects, we train with stage pairs (s_{k-1}^+, s_k^+)), we optimize the following objective:

L = log p(n_{k-1}^+ | s_{k-1}^+) + log p(s_k^+ | s^{m,+}_{k-1}), where s^{m,+}_{k-1} = MaskInsert(s_{k-1}^+, n_{k-1}^+), (8)

and MaskInsert denotes mask token insertion. We make the reasonable assumption that n_{k-1}^+ is unique given (s_k^+, s_{k-1}^+). This assumption usually holds, except in corner cases where multiple n_{k-1}^+ could be legal (e.g., masking one "moving" in "a moving moving moving van"); s^{m,+}_{k-1} is by definition the intermediate sequence, which is equivalent to the given (n_{k-1}^+, s_{k-1}^+). In Equation (8), we jointly learn (1) the likelihood of the mask insertion numbers and (2) the likelihood of the masked token predictions. Following [4], we optimize only the masked tokens in token prediction; a token selected for masking stays unchanged with probability 0.1 and is replaced by a random token from the vocabulary with probability 0.1. For mask insertion number prediction, most numbers in n^+_{k-1} are 0, because in most cases no tokens are inserted between two existing tokens. To balance the insertion numbers, we randomly mask the zeros in n^+_{k-1} with probability q. Because our mask prediction task is similar to masked language modeling, the pre-trained weights of RoBERTa [26] can naturally initialize UCEpic to provide prior knowledge.
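The zero-balancing step for the insertion-number targets can be sketched as below (a minimal sketch; `None` marking a loss-ignored position is an assumption standing in for the usual ignore-index mechanism):

```python
import random

def mask_zero_insert_targets(insert_nums, q=0.9, rng=None):
    """Most slots receive no insertion, so a fraction q of the zero entries
    in the insertion-number targets is masked out of the loss; None marks a
    position ignored by the loss."""
    rng = rng or random.Random(0)
    return [None if n == 0 and rng.random() < q else n
            for n in insert_nums]
```

Non-zero targets are always kept, so the head still sees every genuine insertion.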

Inference
At inference time, we start from the given aspects A_{u,i} or lexical constraints C_{u,i} to construct the starting stage s_0^a or s_0^c respectively. Then, UCEpic predicts {ŝ_1^+, ..., ŝ_T^+} repeatedly until no additional tokens are generated or the maximum stage number is reached. We obtain the final generated explanation ŝ_T by removing R_{u,i} and A_{u,i} from ŝ_T^+. Without loss of generality, one step from ŝ_{k-1}^+ proceeds as follows: (1) the insertion head predicts the insertion numbers n̂_{k-1}^+ and [MASK] tokens are inserted accordingly; (2) the language modeling head predicts words for the inserted masks, yielding ŝ_k^+; (3) given ŝ_k^+, UCEpic either meets the termination requirements or executes step (1) again. The termination criterion is a maximum iteration number or that UCEpic inserts no new tokens into ŝ_k^+.

EXPERIMENTS

Datasets
For pre-training, we use English Wikipedia (11.6 million sentences) for robust insertion training; for a fair comparison with baselines pre-trained on a general corpus, Wikipedia serves as the common pre-training dataset. For fine-tuning, we use Yelp and RateBeer [29] to evaluate our model (see Table 4). We filter out reviews longer than 64 tokens. For each user, following Ni et al. [32], we randomly hold out two samples from all of their reviews to construct the development and test sets. Following previous works [32, 33], we employ an unsupervised aspect extraction tool [18] to obtain phrases and corresponding aspects for lexical constraints and aspect planning respectively. The number of aspects for each dataset is determined automatically by the tool, and aspects provide the coarse-grained semantics of generated explanations. Note that the number of aspects is typically much smaller than the number of lexical constraints, and aspects are more high-level.

Baselines
We consider two groups of baselines for automatic evaluation. The first group consists of existing text generation models for recommendation with aspect planning.
• ExpansionNet [34] generates reviews conditioned on different aspects extracted from a given review title or summary.
• Ref2Seq [32], a Seq2Seq model that incorporates contextual information from reviews and uses fine-grained aspects to control explanation generation.
• PETER [22], a Transformer-based model that uses user and item IDs and given phrases to predict the words in the target explanation. This baseline can be considered a state-of-the-art model for explainable recommendation.
We compare these baselines under both aspect planning and lexical constraints. Specifically, we feed lexical constraints (i.e., keyphrases) into the models and expect them to copy the keyphrases into the generated text. The second group includes general natural language generation models with lexical constraints:
• NMSTG [40], a tree-based text generation scheme that, starting from the given lexical constraints in prefix-tree form, generates words to the left and right of each node, yielding a binary tree.
• POINTER [45], an insertion-based generation method pre-trained on data constructed via dynamic programming.
• CBART [10], which uses the pre-trained BART [15] and instructs the decoder to insert and replace tokens via the encoder.
The second group of baselines cannot incorporate aspects or personalized information as references; these models are trained and generate text solely from the given lexical constraints. We do not include explanation generation methods such as NRT [3], Att2Seq [5] and ReXPlug [9], non-natural-language explainable recommenders such as EFM [44] and DEAML [6], or the lexically constrained methods CGMH [30] and GBS [12], because PETER and CBART report better performance than these models. We also experimented with an "encoder-decoder" based UCEpic as mentioned in Section 1, but this model generates the same sentence for all user-item pairs, so we do not include it as a baseline. Detailed settings of the baselines can be found in Appendix B.

Evaluation Metrics
We evaluate the generated sentences along two axes: generation quality and diversity. Following Ni et al. [32] and Zhang et al. [45], we use n-gram metrics including BLEU (B-1 and B-2) [35], METEOR (M) [1] and ROUGE-L (R-L) [25], which measure the similarity between the generated text and the human oracle. For generation diversity, we use Distinct (D-1 and D-2) [16]. We also report BERT-score (BS) [42] as a semantic rather than n-gram metric.
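For concreteness, the Distinct-n diversity metric can be computed as below (a minimal whitespace-tokenized sketch; the paper's exact tokenization is not specified here):

```python
def distinct_n(texts, n=2):
    """Distinct-n [16]: the number of unique n-grams divided by the total
    number of n-grams over all generated texts (higher = more diverse)."""
    total, unique = 0, set()
    for text in texts:
        toks = text.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

A corpus that keeps repeating the same phrase (e.g., "good screen") therefore scores low, which is exactly the failure mode Figure 1 highlights.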

Implementation Details
We use RoBERTa-base [26] (#params ≈ 130M; other pre-trained model sizes are listed in Appendix B). In training data construction, we randomly mask p = 0.2 of the tokens in s_k to obtain s^m_{k-1}, and the zeros in n_{k-1} are masked with probability q = 0.9. The tokenizer is byte-level BPE, following RoBERTa. For pre-training, the learning rate is 5e-5, the batch size is 512, and the model is optimized with AdamW [27]. For fine-tuning, the phrases and aspects extracted from the target text are used as the lexical constraint and the aspect for planning respectively.

Automatic Evaluation
Overall Performance. In Table 5, we report the evaluation results for the different generation methods. For aspect-planning generation, UCEpic achieves results comparable to the state-of-the-art model PETER. Specifically, although PETER obtains better B-2 and ROUGE-L than our model, the results from UCEpic are significantly more diverse than PETER's. A possible reason is that auto-regressive generation models such as PETER tend to produce higher n-gram metric results than insertion-based generation models, because auto-regressive models generate a new token solely based on the left tokens, while the insertion-based UCEpic considers tokens in both directions. Despite this intrinsic difference, UCEpic still achieves B-1, METEOR and BERT scores comparable to PETER's. Under lexical constraints, the results of existing explanation generation models fall below their aspect-planning results, which indicates that current explanation generation models struggle to include specific information (i.e., keyphrases) in explanations. Although current lexically constrained generation methods produce text with high diversity, they tend to insert tokens less related to the users and items. Hence, their generated text is less coherent (lower n-gram metric results) than UCEpic's, because these methods cannot incorporate the user personas and item profiles from references that are important for explainable recommendation. In contrast, UCEpic easily includes keyphrases in explanations and learns user-item information from references. Therefore, our model largely outperforms existing explanation generation models and lexically constrained generation models.
Based on this discussion, we argue that UCEpic unifies aspect planning and lexical constraints for explainable recommendation. The gap between UCEpic and CBART grows as the number of keyphrases decreases, since CBART cannot obtain enough information for explanation generation from only a few keywords, while UCEpic mitigates this problem by incorporating user personas and item profiles from references. The results indicate that existing lexically constrained generation models cannot be directly applied to explanation generation with lexical constraints.

Ablation Study.
To validate the effectiveness of our unifying method and the necessity of aspects and references for explanation generation, we conduct an ablation study on the two datasets; the results are shown in Figure 4. We train our model and generate explanations without aspects (w/o A), without references (w/o R), and without both (w/o A&R). From the results, we can see that BLEU-2 and METEOR decrease when no aspects are given to the model, because aspects guide the semantics of explanations.
Without references, the model generates similar sentences that usually contain high-frequency words from the training data. The performance drops markedly when both references and aspects are absent. Therefore, our unifying method for references and aspects is effective and provides user-item information for explanation generation.

Kind of Constraints.
We study the performance of UCEpic with different kinds of constraints on the Yelp dataset; the results are shown in Table 6. The settings of Aspect and L-Extract are consistent with UCEpic under aspect planning and lexical constraints respectively in Table 5. We also study three other kinds of constraints: (1) L-Frequent: we use the most frequent noun phrase of an item as the lexical constraint. (2) L-Random: we randomly sample the lexical constraint from all noun phrases of an item. (3) Aspect & L: this method combines the aspect planning and lexical constraints demonstrated in Table 5, using both kinds of constraints simultaneously. From the results, we can see that (1) L-Extract and Aspect & L have similar results, indicating that lexical constraints strongly restrict the generation process, so aspect planning has little additional controllability over the results; (2) generation with lexical constraints achieves better results than aspect-planning generation; (3) the choice of lexical constraints (i.e., L-Extract, L-Frequent, L-Random) leads to significant differences in generation performance, which motivates further exploration of lexical constraint selection in future work.

Human Evaluation
We conduct a human evaluation on generated explanations. Specifically, we sample 500 ground-truth explanations from the Yelp dataset, then collect the corresponding generated explanations from PETER (aspect), POINTER, CBART and UCEpic. Given the ground-truth explanation, an annotator is asked to select the best explanation along different dimensions (i.e., relevance, coherence and informativeness) among the explanations generated by PETER, POINTER, CBART and UCEpic (see Appendix C for details). We define relevance, coherence and informativeness as follows:
• Relevance: the details in the generated explanation are consistent with and relevant to the ground-truth explanation.
• Coherence: the sentences in the generated explanation are logical and fluent.
• Informativeness: the generated explanation contains specific information instead of only vague descriptions.
The voting results are shown in Figure 5. UCEpic largely outperforms the other methods on all dimensions, especially relevance and informativeness. In particular, the lexically constrained generation methods (UCEpic and CBART) significantly improve explanation quality because specific product information can be included via lexical constraints. Because POINTER is not robust to random keyphrases, its generated explanations do not benefit from lexical constraints.
We can see that Ref2Seq and PETER usually generate general, uninformative sentences because traditional auto-regressive generation struggles to include specific item information. POINTER and CBART can include the given phrases (pepper chicken) in their generations, but they cannot learn information from references and hence generate inaccurate words (pepper sauce chicken, chicken wings) that mislead users. In contrast, UCEpic generates coherent and informative explanations that include the specific item attributes and are highly relevant to the recommended item.

CONCLUSION
In this paper, we propose to introduce lexical constraints into explanation generation, which largely improves the informativeness and diversity of generated explanations by including specific information. To this end, we present UCEpic, an explanation generation model that unifies aspect planning and lexical constraints in an insertion-based generation framework. We conduct comprehensive experiments on the RateBeer and Yelp datasets. Results show that UCEpic significantly outperforms previous explanation generation models and lexically constrained generation models. Human evaluation and a case study indicate that UCEpic generates coherent and informative explanations that are highly relevant to the item.

A MOTIVATING EXPERIMENT DETAILS
In this experiment, we evaluate the diversity and informativeness of explanations. Specifically, we apply phrase coverage, aspect coverage and Distinct-2 to measure both generated explanations and human-written explanations.
For phrase coverage, we extract noun phrases from explanations using spaCy noun chunks. Then we compare the phrases in human-written explanations and generated explanations; if a phrase appears in both, we consider it a phrase covered by the generated explanation. This experiment measures how much specific information is included in the generated explanations.
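The coverage computation can be sketched as below; `extract_phrases` is an assumption standing in for the spaCy noun-chunk extractor, so the logic is decoupled from any particular NLP library.

```python
def phrase_coverage(generated, reference, extract_phrases):
    """Fraction of noun phrases in the human-written reference that also
    appear in the generated explanation."""
    ref_phrases = set(extract_phrases(reference))
    if not ref_phrases:
        return 0.0
    gen_phrases = set(extract_phrases(generated))
    return len(ref_phrases & gen_phrases) / len(ref_phrases)
```

With real spaCy, `extract_phrases` could be something like `lambda t: [c.text.lower() for c in nlp(t).noun_chunks]`.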
For aspect coverage, we use the aspect extraction tool [18] per dataset to construct a table that maps phrases to aspects, then map the phrases in generated explanations to aspects by looking up this phrase-aspect table. For each sample, we calculate how many aspects in the ground-truth explanation are covered by the generated explanation and report the average aspect coverage per dataset.
For Distinct-2, we use the numbers as described in Table 5.

B BASELINE DETAILS
For ExpansionNet, we use the default setting: hidden size 512 for the RNN encoder and decoder, batch size 25 and learning rate 2e-4. For aspect planning in ExpansionNet, we use the set of lexical constraints (as concatenated phrases) to replace the title or summary input as contextual information for training and testing. For Ref2Seq, we use the default setting with hidden size 256, batch size 512 and learning rate 2e-4. For aspect planning, we concatenate the given phrases (historical explanations are also incorporated as references, following the original implementation) as contextual information in training and testing.
For PETER, we use the original setting with embedding size 512, 2048 hidden units, 2 self-attention heads, 2 Transformer layers, and dropout 0.2. We use the training strategy suggested by the authors. Since the original PETER only supports single words as aspects, we adapt PETER to multiple words with a maximum length of 20, and reproduce the original single-word results with our multi-word model. We input our lexical constraints as the multi-word input for PETER training and testing.
For NMSTG, we use the default settings: an LSTM with hidden size 1024 and the uniform oracle. We convert our lexical constraints into a prefix sub-tree as the input of NMSTG, and use the best sampling strategy (i.e., StochasticSampler) in our testing.
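The constraint-to-tree conversion can be illustrated with a plain prefix tree (trie) over tokenized phrases. This is only a sketch of the data structure; NMSTG's actual input encoding of the tree may differ from this nested-dict form.

```python
def build_prefix_tree(constraints):
    """Build a prefix tree over whitespace-tokenized constraint
    phrases: each node maps a token to its child subtree. Phrases
    sharing a prefix (e.g. "hoppy finish" / "hoppy aroma") share
    a path from the root."""
    root = {}
    for phrase in constraints:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
    return root
```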
For POINTER, we use the BERT-large [4] checkpoint (#params ≈ 340M) pre-trained on Wikipedia and fine-tune it for 40 epochs on our downstream datasets. We use all the default settings except batch size, since POINTER requires 16 GPUs for distributed training, which exceeds our computational resources; instead, we train POINTER with the same configuration on 3 GPUs. For testing, we set the maximum turn to 3 with the default greedy decoding strategy. We feed lexical constraints as in the original implementation.
For CBART, we use the checkpoint pre-trained on BERT-large [4] (#params ≈ 340M) with the One-Billion-Words dataset and fine-tune it on our downstream datasets. We use the 'tf-idf' training mode and fine-tune on one GPU. For testing, we select the greedy decoding strategy. We leave the other hyper-parameters at the defaults of the code base.

C HUMAN EVALUATION DETAILS
We conduct human evaluation experiments on the Yelp dataset to evaluate the quality of generated explanations in terms of relevance, coherence, and informativeness.
We submit our task to MTurk and set the reward to $0.02 per question. For each question, we first show the definitions of relevance, coherence, and informativeness; we then shuffle the order of model-generated explanations to eliminate positional bias. Each question is answered by 3 different MTurk workers, who are required to have a greater than 80% HIT Approval Rate to improve the quality of answers. Figure 6 shows an example of our evaluation template. We collect the answers and count the majority votes, where a majority vote means a model receives 2 or more votes (since we have 3 answers per question). We ignore questions without a majority vote. Finally, we collected 1,120 valid votes for 370 questions, in which 275 relevance questions, 281 coherence questions, and 266 informativeness questions have majority votes.
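The majority-vote counting described above amounts to the following small routine (a sketch of the aggregation rule, not the paper's actual analysis script):

```python
from collections import Counter

def majority_vote(answers, threshold=2):
    """Given the 3 worker answers for one question, return the winning
    model if it received at least `threshold` votes; otherwise return
    None, meaning the question is discarded."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner if count >= threshold else None
```

With 3 workers, a question contributes to the results only when at least 2 workers agree.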

Figure 2: Overview of generating explanations for a given user and recommended items using (a) an aspect-planning autoregressive generation model; (b) our UCEpic that unifies aspect planning and lexical constraints.

3.1.4 Modules. UCEpic uses a bi-directional Transformer architecture with two different prediction heads for mask insertion and token prediction. The architecture of the model is closely related to that used in RoBERTa [26]. The bi-directional Transformer D predicts the mask insertion numbers and word tokens with two heads H_MI and H_TP, respectively. H_TP is a multilayer perceptron with an activation function.

Algorithm 1: Insertion in the k-th Stage
procedure Insertion(Ŝ^{k-1})
    Ĵ^{k,k-1} ← predict number of masks from Ŝ^{k-1} via Eq. (1);
    Î^{k,k-1} ← build intermediate sequence from Ĵ^{k,k-1} and Ŝ^{k-1};
    Ŝ^k ← predict masked tokens in Î^{k,k-1} via Eq. (2);
    return predicted sequence Ŝ^k;
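One stage of the insertion procedure in Algorithm 1 can be sketched in plain Python. The two callables below are hypothetical stubs standing in for the model's heads: `predict_mask_counts` plays the role of H_MI (one mask count per gap, including the gaps before the first and after the last token), and `predict_tokens` plays the role of H_TP (filling every [MASK] in the intermediate sequence).

```python
MASK = "[MASK]"

def insertion_stage(sequence, predict_mask_counts, predict_tokens):
    """One insertion stage: predict how many masks to insert in each
    of the len(sequence)+1 gaps, build the intermediate sequence, then
    fill the masks with predicted tokens."""
    counts = predict_mask_counts(sequence)   # role of Eq. (1) / H_MI
    intermediate = []
    for i, c in enumerate(counts):
        intermediate.extend([MASK] * c)      # insert c masks in gap i
        if i < len(sequence):
            intermediate.append(sequence[i])
    return predict_tokens(intermediate)      # role of Eq. (2) / H_TP
```

Repeatedly applying this stage grows a short keyphrase skeleton into a full explanation, which is how the insertion-based framework accommodates hard lexical constraints.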

Figure 3: Performance (i.e., B-2 and Meteor) of lexically constrained generation models on RateBeer data with different numbers of keyphrases.

Figure 4: Ablation study on aspects and references.

Figure 5: Human evaluation on explanation quality.

Figure 6: Our human evaluation example on MTurk.

Table 1: Comparison of previous explanation generators for recommendation in group (A), general lexically constrained generators in group (B), and our UCEpic in group (C).
Notation and descriptions:
R_u, R_i : historical review profiles of user u and item i.
E_{u,i} : generated explanation when item i is recommended to user u.
A_{u,i} : aspects controlling explanation generation for item i and user u.
C_{u,i} : lexical constraints (e.g., keywords) controlling explanation generation for item i and user u.
S^k, Ŝ^k : text sequence of the k-th stage generation; S^k is training data and Ŝ^k is the model prediction.
I^{k,k-1}, Î^{k,k-1} : intermediate sequence between S^{k-1} and S^k (training data and model prediction).
J^{k,k-1}, Ĵ^{k,k-1} : insertion number sequence between S^{k-1} and S^k (training data and model prediction).
D : a bi-directional Transformer for encoding.
H_MI : a linear projection layer for insertion numbers.
H_TP : a multilayer perceptron with an activation function for token prediction.

Table 4: Statistics of datasets. Columns: Dataset, Train, Dev, Test, #Users, #Items, #Aspects.

Table 5: Performance comparison of the explanation generation models (ExpansionNet, Ref2Seq, PETER), lexically constrained generation models (NMSTG, POINTER, CBART), and UCEpic. All values are percentages (%). We underline the highest scores among aspect-planning generation results and bold the highest scores among lexically constrained generation results.

Table 6: UCEpic with different constraints on the Yelp dataset. L denotes lexical constraints.

Table 7: Generated explanations from the Yelp dataset. Lexical constraints (phrases) are highlighted in explanations.