Personalized Showcases: Generating Multi-Modal Explanations for Recommendations

Existing explanation models generate only text for recommendations but still struggle to produce diverse content. In this paper, to further enrich explanations, we propose a new task named personalized showcases, in which we provide both textual and visual information to explain our recommendations. Specifically, we first select a personalized image set that is the most relevant to a user's interest in a recommended item. Then, natural language explanations are generated accordingly given our selected images. For this new task, we collect a large-scale dataset from Google Maps and construct a high-quality subset for generating multi-modal explanations. We propose a personalized multi-modal framework which can generate diverse and visually-aligned explanations via contrastive learning. Experiments show that our framework benefits from different modalities as inputs, and is able to produce more diverse and expressive explanations than previous methods on a variety of evaluation metrics.


INTRODUCTION
Personalized explanation generation models have the potential to increase the transparency and reliability of recommendations. Previous works [1,7,47,51] considered generating textual explanations from users' historical reviews, tips [23] or justifications [27]. However, these methods still struggle to provide diverse explanations, because generated explanations contain a large number of generic sentences (e.g., 'food is very good!') and the text generation models lack grounding information (e.g., images) for their generation process. To further diversify and enrich explanations for recommendations, we propose a new explanation generation task named personalized showcases (shown in Figure 1). In this new task, we explain recommendations via both textual and visual information. Our task aims to provide a set of images that are relevant to a user's interest and to generate textual explanations accordingly. Compared to previous works that generate only text as explanations, our showcases present diverse explanations including images and visually-guided text.
To this end, the first challenge of this task is building a dataset. Existing review datasets (e.g., Amazon [27] and Yelp) are largely unsuitable for this task (we further discuss these datasets in Section 3.2). Thus, we first construct a large-scale multi-modal dataset, namely Gest, which is collected from Google Local restaurants and includes review text and corresponding pictures. Then, to improve the quality of Gest for personalized showcases, we annotate a small subset to find highly matched image-sentence pairs. Based on the annotations, we train a classifier with CLIP [32] to extract visually-aware explanations from the full dataset. The images and text explanations from users are used as the learning target for personalized showcases.
For this new task, we design a new multi-modal explanation framework. To begin with, the framework selects, from the historical photos of the business, the several images that the user is most interested in. Then, the framework takes the displayed images and the user's profile (e.g., historical reviews) as inputs and learns to generate textual explanations with a multi-modal decoder. However, generating expressive, diverse and engaging text that will capture users' interest remains a challenging problem. First, different from previous textual explanation generation, the alignment between multiple images and generated text becomes an important problem for showcases, which poses higher requirements on information extraction and fusion across modalities. Second, a typical encoder-decoder model with a cross-entropy loss and teacher forcing can easily lead to generating repetitive and dull sentences that occur frequently in the training corpus (e.g., "food is great") [16].
To tackle these challenges, we propose a Personalized Cross-Modal Contrastive Learning (PC²L) framework by contrasting input modalities with output sequences. Contrastive learning has drawn attention as a self-supervised representation learning approach [5,29]. However, simply training with negative samples in a mini-batch is suboptimal [19] for many tasks, as the randomly selected embeddings can be easily discriminated in the latent space. Hence, we first design a cross-modal contrastive loss to enforce the alignment between images and output explanations, by constructing hard negative samples with randomly replaced entities in the output. Motivated by the observation that users with similar historical reviews share similar interests, we further design a personalized contrastive loss to reweight the negative samples based on their history similarities. Experimental results on both automatic and human evaluation show that our model is able to generate more expressive, diverse and visually-aligned explanations compared to a variety of baselines.

Overall, our contributions are as follows:
• To generate more informative explanations for recommendations, we present a new task: personalized showcases, which provides both textual and visual explanations for recommendations.
• For this new task, we collect a large-scale multi-modal dataset from Google Local (i.e., maps). To ensure alignment between images and text, we annotate a small dataset and train a classifier to propagate labels on Gest, and construct a high-quality subset for generating textual explanations.
• We propose a novel multi-modal framework for personalized showcases which applies contrastive learning to improve the diversity and visual alignment of generated text. Comprehensive experiments on both automatic and human evaluation indicate that textual explanations from our showcases are more expressive and diverse than those of existing explanation generation methods.

TASK DEFINITION
In the personalized showcases task, we aim to provide both personalized textual and visual explanations for recommendations. Formally, given a user u ∈ U and a business (item) b ∈ B, where U and B are the user set and business set respectively, the personalized showcases task provides textual explanations S = {s_1, s_2, ..., s_m} and visual explanations V = {v_1, v_2, ..., v_n}, where s_i and v_j denote sentences and images in the explanations. S and V are matched with each other and personalized to explain why b is recommended to u.
To better study the relation between textual and visual explanations and provide baselines for future work, in this paper, we decompose the task into two steps as shown in Figure 5: (1) Selecting an image set as a visual explanation that is relevant to a user's interest; (2) Generating textual explanations given the selected images and a user's historical reviews.
For our method, we consider the following aspects:
• Accuracy: We aim to predict the target images (i.e., images associated with the ground-truth review) from the business's candidate images correctly, and the generated text is expected to be relevant to the business.
• Diversity: The selected images should be diverse and cover more information about the business (e.g., including more dishes from a restaurant). Textual explanations should be diverse and expressive.
• Alignment: Unlike previous explanation or review generation tasks which only use historical reviews or aspects as inputs, our visually-aware setting grounds generation in the images. Hence the generated explanations in this new task should aim to accurately describe the content and cover the main objects (e.g., the names of dishes, the environment) in the given set of images.

DATASET

Dataset Statistics
We collected reviews with images from Google Local. Gest-raw in Table 1 shows the statistics of our crawled dataset: Gest-raw contains 1,771,160 reviews from 1,010,511 users and 65,113 businesses. Every review has at least one image, and the raw dataset has 4,435,565 image URLs. We processed our dataset into two subsets: (1) Gest-s1 for personalized image set selection, and (2) Gest-s2 for visually-aware explanation generation. Statistics of our processed dataset are in Table 1, with more processing details in Section 3.3 and Appendix A.

Visual Diversity Analysis
To distinguish Gest from existing review datasets and show the usefulness of personalized showcases, we first define a CLIP-based dissimilarity at three levels to measure the diversity of user-generated images for each business. Then, we compare the visual diversity of our Gest data with that of two representative review datasets, Amazon Reviews [25,27] and Yelp.
First, similar to [32,52], we use the cosine similarity (denoted as sim) from pre-trained CLIP to define the dissimilarity between images i and j as dis(i, j) = 1 − sim(i, j). We then introduce visual diversity at three levels, Intra-Business Div, Inter-User Div and Intra-User Div, which are formally defined in Appendix B; higher scores mean more visual diversity.
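Concretely, the pairwise dissimilarity and its average over an image set (the building block underlying all three diversity levels) can be sketched as follows; this is a minimal NumPy illustration that assumes CLIP image embeddings have already been extracted, and the function names are ours:

```python
import numpy as np

def dissimilarity(u, v):
    """dis(i, j) = 1 - cosine similarity between two image embeddings."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return 1.0 - float(u @ v)

def avg_pairwise_dissimilarity(embs):
    """Mean dis over all unordered pairs of embeddings; aggregating this
    per business, per user pair, or per user yields the three levels."""
    n = len(embs)
    if n < 2:
        return 0.0
    total = sum(dissimilarity(embs[i], embs[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)
```

Identical embeddings give a score of 0, while orthogonal embeddings give 1, matching the intuition that higher scores mean more visual diversity.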
Then, we investigate the visual diversity of our Gest data as well as Amazon Reviews (using all categories, All (A), and the subcategories Beauty (B), Clothing (C) and Electronics (E)) and Yelp. For Amazon, we treat each item page as a "business" because reviews are collected per item. In our calculation, we sample 5,000 items with more than one user-uploaded image. Note that images in the Yelp dataset do not have user information, so we cannot calculate user-level diversity for Yelp. From Figure 3, we have the following observations:
• Diversities within datasets: Figure 3 shows that for Gest and Amazon, Inter-User Div is the highest and Intra-User Div is the lowest. This indicates that even for the same business (item), users focus on and present different visual information.
• Gest vs. Amazon: In Figure 3, all three visual diversities of Amazon are consistently lower than those of Gest by a large margin.
We attribute this to the difference in user behavior on the two platforms. As the example in Figure 4 shows, user-generated images on Amazon usually focus on the purchased item.
Though the information users want to show differs, there is usually a single object in an image (i.e., the purchased item), so visual diversity is limited. For Gest, as the examples in Figure 2 show, reviews of restaurants allow users to share more diverse information covering more varied items, angles and aspects. Compared with Amazon, Gest should therefore yield more informative personalized showcases according to different user profiles.
• Gest vs. Yelp: Yelp images are high-quality (see the example in Figure 4) and similar in nature to images in Gest; their Intra-Business Div is even higher (0.44) than that of Gest (0.39). However, Yelp images do not fit our task due to the lack of user information.

Explanation Distillation
Reviews often contain uninformative text that is irrelevant to the images and cannot be used directly as explanations. Hence, we construct an explanation dataset from Gest-raw: we distill sentences in reviews that align with the content of a given image as valid explanations. Three annotators were asked to label 1,000 reviews (with 9,930 image-sentence pairs) randomly sampled from the full dataset. The task is to decide whether a sentence describes an image. Labeling was performed iteratively, followed by feedback and discussion, until the quality was aligned between the three annotators. The annotated image-sentence pairs are then split into train, validation and test sets with a ratio of 8:1:1.

Figure 5: Illustration of our personalized showcases framework for a given business. We take a user's historical images and textual reviews as inputs. First, we select an image set that is most relevant to the user's interest. Then we generate natural language explanations accordingly with a multi-modal decoder. A cross-modal contrastive loss and a personalized contrastive loss are applied between each input modality and the explanations. Last, the selected images and generated textual explanations are organized as multi-modal explanations to users.
We then train a binary classification model Φ based on these annotated image-sentence pairs and their labels. Specifically, we extract the embedding of each sentence and image via CLIP; the two features are concatenated and fed into a fully connected layer. The classifier achieves an AUC of 0.97 and an F1 score of 0.71 on the test set, similar to the results obtained in [27] for building a text-only explanation dataset. We use this model to extract explanations from all reviews. The statistics of the resulting dataset, Gest-s2, can be found in Table 1.
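The forward pass of such a classifier can be sketched as below. This is illustrative only: the real model is trained on the annotated pairs, and the class name, the 512-dimensional CLIP feature size, and the random initialization are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class PairClassifier:
    """A single fully connected layer + sigmoid over the concatenation of a
    CLIP image embedding and a CLIP sentence embedding (untrained sketch)."""

    def __init__(self, dim=512):
        # One weight per concatenated feature dimension, plus a bias.
        self.W = rng.normal(scale=0.01, size=(2 * dim,))
        self.b = 0.0

    def confidence(self, img_emb, sent_emb):
        """Probability that the sentence describes the image."""
        x = np.concatenate([img_emb, sent_emb])
        return float(1.0 / (1.0 + np.exp(-(x @ self.W + self.b))))
```

In practice the weights would be fit with a binary cross-entropy objective on the 9,930 labeled image-sentence pairs, and the resulting confidence score is what the CLIP-Align metric later reuses.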

METHODOLOGY
In this section, we present our framework for producing personalized showcases. Following the overview in Figure 5, we start with personalized image set selection and the visually-aware explanation generation module, then introduce our personalized cross-modal contrastive learning approach in Section 4.3.

Personalized Image Set Selection
The first step is to select an image set as a visual explanation that is relevant to a user's interests, and is diverse.We formulate this selection step as diverse recommendation with multi-modal inputs.
Multi-Modal Encoder. Generally, these user textual or visual profiles can be effectively encoded with different pre-trained deep neural networks (e.g., ResNet [14], ViT [11], BERT [9]). Here we choose CLIP [31], a state-of-the-art pre-trained cross-modal retrieval model, as both the textual and visual encoder. CLIP encodes raw images as image features, and encodes a user's textual and visual profiles as user profile features.
Image Selection Model. We use a Determinantal Point Process (DPP) method [18] to select the image subset, which has recently been used for various diverse recommendation tasks [2,39]. Compared with algorithms for individual item recommendation, DPP-based models are well suited to selecting multiple images. Given user u and business b, we predict the image set Î_{u,b} ⊆ I_b that maximizes the determinant of the corresponding DPP kernel submatrix, where I_b is the image set belonging to business b. In our design, we calculate user-image relevance using the CLIP-based user profile features and image features. More details of the model are in [39].
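Greedy MAP inference, the standard approximation for DPP subset selection, can be sketched as follows. This is an illustrative reimplementation rather than the exact algorithm of [39]; the quality-times-similarity kernel construction shown in the test below is one common way to combine user-image relevance with image similarity.

```python
import numpy as np

def greedy_dpp(kernel, k):
    """Greedy MAP inference for a DPP: repeatedly add the candidate that
    maximizes the determinant of the selected kernel submatrix, trading
    off relevance (diagonal) against redundancy (off-diagonal)."""
    n = kernel.shape[0]
    selected = []
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            det = np.linalg.det(kernel[np.ix_(idx, idx)])
            if det > best_det:
                best, best_det = i, det
        if best is None:
            break
        selected.append(best)
    return selected
```

With a kernel L[i, j] = q[i] * S[i, j] * q[j], where q holds user-image relevance scores and S pairwise image similarities, the greedy procedure skips a highly relevant image if it is nearly a duplicate of one already chosen.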

Visually-Aware Explanation Generation
After obtaining an image set, we aim to generate personalized explanations given the set of images and a user's historical reviews, using the extracted explanation dataset Gest-s2 from Section 3.3. Specifically, we build a multi-modal encoder-decoder model with GPT-2 [33] as the backbone.
Multi-Modal Encoder. Given a set of user u's selected images and historical reviews, we encode them with CLIP and project the two feature sequences into a shared space via learnable projection matrices W_v (for images) and W_t (for text). Then we use a multi-modal attention (MMA) module with stacked self-attention layers [38] to encode the concatenated input features, where the resulting hidden states aggregate features from the two modalities and [;] denotes concatenation. This flexible design allows for variable lengths of each modality and enables interactions between modalities via co-attention.
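The encoding step can be sketched roughly as follows. This is a single-head NumPy illustration; the dimensions, the random stand-in weights, and the absence of multi-head and stacked layers are simplifications of the actual MMA module.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size of the sketch (the paper's model uses 768)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One single-head self-attention layer over the joint sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))  # each position attends over both modalities
    return A @ V

# Learnable projections into the shared space (random stand-ins here).
W_img = rng.normal(scale=0.02, size=(512, d))
W_txt = rng.normal(scale=0.02, size=(512, d))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

img_feats = rng.normal(size=(3, 512))   # e.g. 3 selected images, CLIP features
txt_feats = rng.normal(size=(5, 512))   # e.g. 5 historical reviews, CLIP features
X = np.concatenate([img_feats @ W_img, txt_feats @ W_txt], axis=0)
H = self_attention(X, Wq, Wk, Wv)       # (8, d) co-attended hidden states
```

Because the two modalities are concatenated along the sequence axis before attention, image positions can attend to review positions and vice versa, which is the co-attention behavior the text describes, and each modality may contribute a variable number of positions.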
Multi-Modal Decoder. Inspired by recent advances in powerful pre-trained language models, we leverage GPT-2 as the decoder for generating explanations. To efficiently adapt the linguistic knowledge in GPT-2, we insert an encoder-decoder attention module into the pre-trained model, with an architecture similar to [4].

Personalized Cross-Modal Contrastive Learning
Unlike image captioning tasks [41,46] where the caption is a short description of an image, our task utilizes multiple images as "prompts" to express personal feelings and opinions about them.
To encourage expressive, diverse and visually-aligned explanations, we propose a Personalized Cross-Modal Contrastive Learning (PC²L) framework. We first project the hidden representations of the images, the historical reviews, and the target sequence into a latent space via projection heads g_v, g_t and g_s, each consisting of two fully connected layers with ReLU activation [26] and average pooling over the hidden states from the last self-attention layers. For vanilla contrastive learning with the InfoNCE loss [5,29], we maximize the similarity between a pair of source modality and target sequence, while minimizing the similarity between negative pairs:

L_CL = −log ( exp(s_{x,y}) / Σ_{y′ ∈ N ∪ {y}} exp(s_{x,y′}) ),

where s_{x,y} = sim(H(x), H(y))/τ, sim is the cosine similarity between two vectors, τ is the temperature parameter, (x) and (y) are two samples in the mini-batch, and N is the set of negative samples for sample (x).
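The InfoNCE objective over a mini-batch can be sketched as below (a NumPy illustration; in the framework the loss operates on the projected hidden states, and in-batch samples serve as negatives):

```python
import numpy as np

def info_nce(src, tgt, tau=0.1):
    """InfoNCE over a mini-batch: row i of `src` and `tgt` form a positive
    pair; all other rows of `tgt` act as in-batch negatives for row i."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = src @ tgt.T / tau            # cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))    # -log p(positive) averaged over batch
```

The loss is small when each source embedding is closest to its own target and large when positives are misaligned, which is exactly the pressure that pulls source modality and target sequence together in the latent space.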
One challenge of this task is that the model is asked to describe multiple objects or contents in a set of images. To ensure visual grounding between multiple image features and the output text, we design a novel cross-modal contrastive loss. Specifically, given a target explanation Y = {y_1, y_2, ..., y_T}, we randomly replace the entities in the text with other entities present in the dataset to construct a hard negative sample Y_ent (e.g., "I like the sushi" becomes "I like the burger"), such that during training the model is exposed to samples with incorrect entities for the images, which are non-trivial to distinguish from the original target sequence. We then add the hidden representation of Y_ent as an additional negative sample to formulate the cross-modal contrastive loss L_CCL. On the other hand, to enhance the personalization of explanation generation, we re-weight negative pairs according to user personalities. The intuition is that users with more distinct personalities are more likely to produce different explanations. Motivated by this, we propose a weighted contrastive loss for personalization, L_PCL, in which negative pairs in a mini-batch are re-weighted by a user similarity function ω. In our framework, user personalities are represented by their historical reviews. Specifically, ω reduces the weights of negative pairs whose users have similar histories and increases those with distinct histories, where γ (> 1) is a hyperparameter that scales the negative samples, sim is the cosine similarity used to compare histories, and R(i) and R(j) are the average features of two users' input historical reviews.
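Both ingredients can be illustrated as below. The entity vocabulary, the string-replacement mechanics, and the concrete form of ω are our own illustrative choices; the description above only requires that ω decrease with history similarity.

```python
import random
import numpy as np

# A toy entity vocabulary (assumed; the real entities come from the dataset).
ENTITIES = ["sushi", "burger", "pasta", "ramen"]

def entity_hard_negative(sentence, rng=random.Random(0)):
    """Swap the first known entity for a different one, e.g.
    'I like the sushi' -> 'I like the burger'."""
    for ent in ENTITIES:
        if ent in sentence:
            repl = rng.choice([e for e in ENTITIES if e != ent])
            return sentence.replace(ent, repl)
    return sentence  # no entity found: leave unchanged

def personalized_weights(hist_feats, gamma=2.0):
    """One plausible instantiation of omega: gamma * (1 - sim(R(i), R(j))).
    Negatives from users with similar histories get weights near 0, while
    users with distinct histories get weights up to ~gamma."""
    R = hist_feats / np.linalg.norm(hist_feats, axis=1, keepdims=True)
    return gamma * (1.0 - R @ R.T)
```

During training, each entry of the weight matrix would multiply the corresponding negative term inside the contrastive denominator, so the model spends its contrastive capacity separating users who genuinely differ.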
Overall, the model is optimized with a mixture of the cross-entropy loss and the two contrastive losses, L = L_CE + λ_1 L_CCL + λ_2 L_PCL, where λ_1 and λ_2 are hyperparameters that weigh the two losses.

A Metric for Visual Grounding
As mentioned in Section 2, we want our model to generate explanations that can accurately describe the content in a given image set.
Typical n-gram evaluation metrics such as BLEU compute scores based on n-gram co-occurrences. They were originally proposed for diagnostic evaluation of machine translation systems and are ill-suited to evaluating text quality, as they are only sensitive to lexical variation and fail to reward semantic or syntactic variation between predictions and references [34,35,48]. To effectively test the alignment between the visual images and the textual explanations, we design an automatic evaluation metric, CLIP-Align, based on [32]. Given a set of images V = {v_1, v_2, ..., v_n} and the set of sentences from the generated text S = {s_1, s_2, ..., s_m}, we first extract the embeddings of all the images and sentences with CLIP, and then aggregate over image-sentence pairs the confidence scores cs_{i,j} produced by the CLIP-based classifier Φ trained on our annotated data. By replacing cs_{i,j} with the cosine similarity of image and sentence embeddings, we obtain another metric, CLIP-Score, similar to [15]. Compared with previous CLIP-based metrics [15,52], CLIP-Align focuses specifically on the accuracy of the alignment between objects in the sentences and the images (e.g., "food is great" and "burger is great" achieve similarly high CLIP-Score values against the same burger image, and a model that repetitively generates "food is great" can reach a high corpus-level CLIP-Score). Moreover, the vanilla CLIP-Score [15] shows poor correlation on captions containing personal feelings, making it less suitable for this task. We show in Section 5, with automatic and human evaluation results, that our metric performs better when evaluating alignment between images and text.
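A sketch of the two metrics follows. The max-over-sentences-then-average aggregation and the classifier interface are our assumptions; Φ is stubbed by any callable that returns a confidence for an image-sentence pair.

```python
import numpy as np

def clip_align(image_embs, sent_embs, classifier):
    """For each image, take the highest classifier confidence that some
    generated sentence describes it, then average over images
    (aggregation scheme assumed; `classifier` plays the role of Phi)."""
    scores = [max(classifier(img, sent) for sent in sent_embs)
              for img in image_embs]
    return float(np.mean(scores))

def clip_score_like(image_embs, sent_embs):
    """CLIP-Score-style variant: raw cosine similarity instead of Phi."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return float(np.mean([max(cos(img, s) for s in sent_embs)
                          for img in image_embs]))
```

Because the classifier was trained to reject sentences naming the wrong entity, a generic sentence like "food is great" scores lower under clip_align than under the cosine-based variant, which is the behavior the metric is designed to capture.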

EXPERIMENTS
In this section, we conduct extensive experiments to evaluate the performance of our personalized showcases framework. Ablation studies show the influence of the different modalities on personalized showcases. Case studies and human evaluation validate that our model presents more diverse and accurate explanations than the baselines.

Experimental Setting
Baselines. To show the effectiveness of our model, we compare it with a number of popular baselines from different tasks, including image captioning, report generation and explanation generation:
• ST [41] is a classic CNN+LSTM model for image captioning.
• R2Gen [6] is a state-of-the-art memory-driven transformer specialized in generating long text from visual inputs.
• Ref2Seq [27] is a popular reference-based seq2seq model for explanation generation in recommendation.
• Peter [21] is a recent transformer-based explanation generation model which uses the user and item IDs to predict the words in the target explanation.
• img and text refer to image and text features respectively.
Evaluation Metrics. For image selection, we report Precision@K, Recall@K and F1@K to measure ranking quality. Due to the nature of our task, we set a small K (K = 3). To evaluate diversity, we introduce the truncated div@K (K = 3), the average dissimilarity over all pairs among the recommended images: given K images {v_1, ..., v_K}, div@K is the mean of dis(v_i, v_j) over all pairs i < j. For textual explanations, we first evaluate the relevance of the generated text to the ground truth with n-gram based metrics: BLEU (n=1,4) [30], METEOR [8] and NIST (n=4) [10]. To evaluate diversity, we report Distinct-1 and Distinct-2, proposed in [20] for text generation models. We then use CLIP and BERT to compute embedding-based metrics: CLIP-Align is our proposed metric from Section 4.4, and CLIP-Score [15] and BERT-Score [48] are two recent embedding-based metrics.
Implementation Details. We use CLIP [31] with ViT-B/32 as the image and text encoder to encode user historical reviews and images. We convert the user profile feature into a 128-dimensional vector with an MLP (1024→512→512→256→128), and convert candidate images with another MLP (512→512→512→256→128), where both models use ReLU activations [26]. We follow [39] to calculate each element of the DPP kernel and optimize the DPP using Adam [24] with an initial learning rate of 1e-3 and batch size 512. For inference, we use greedy decoding to select K = 3 images as the visual explanation.
For training PC²L, we use AdamW [24] as the optimizer with an initial learning rate of 1e-4. The maximum sequence length is set to 64, which covers 95% of the explanations. The maximum numbers of images and historical reviews are set to 5 and 10 respectively. The hidden sizes of both the encoder and decoder are 768 with 12 heads. There are 3 layers in the encoder and 12 layers in the decoder. The batch size for training is 32. We use the GPT-2-small pre-trained weights with 117M parameters. The weighting parameters λ_1 and λ_2 and the temperature τ are set to 0.2, 0.2 and 0.1 respectively. We use a beam size of 2 for decoding to balance generation effectiveness and efficiency.

Framework Performance
We first report model performance on the text evaluation metrics in Table 2, as we found this last step in our framework came with the most challenges and interesting findings, e.g., how to generate human-like explanations, avoid dull text, and evaluate the generated explanations. Here the input images are selected by our model, and the input text consists of historical reviews from users. First, the clear gap between text-input models and image-input models on diversity and CLIP-based metrics validates the importance of incorporating image features: the visually-aware generation models are able to generate accurate explanations with diverse language styles. Second, our PC²L shows substantial improvement on most metrics compared to the LSTM- and transformer-based models, showing that a pretrained language model with contrastive learning is able to generate high-quality explanations. Finally, though the text-based models Ref2Seq and Peter achieve results competitive with our method on some n-gram metrics such as BLEU and METEOR, their performance is much worse on diversity and embedding metrics. Their text quality is also low, with repetitive and non-informative sentences appearing often, which we further validate with human evaluations and case studies.

Component Analysis
We conduct ablation studies to evaluate the effectiveness of each component individually.
Model for image set selection. First, we evaluate the performance of personalized image set selection. For general ranking performance, we compare our model with random selection and with different input modalities. As shown in Table 3, though the truncated diversity of the text-only model is the highest, its ranking performance is significantly worse than that of models with images. This indicates that text input alone is far from sufficient to personalize for users, and its recommendation result is close to that of random selection. Historical images, on the other hand, provide an important visual cue for modeling users' preferences. Overall, a model with both images and text achieves the best ranking performance for image set selection, which validates the importance of our multi-modal setting for personalized showcases.

Effectiveness of Contrastive Learning
We conduct ablation studies on different variations of our contrastive loss to verify the effectiveness of our method. As shown in Table 4, our PC²L achieves the best performance over all baselines on the different metrics. Specifically, CCL contributes more to visual grounding, by forcing the model to distinguish random entities from the correct ones, and improves CLIP-Align compared to the vanilla contrastive framework [5]. PCL improves diversity more, by encouraging the model to focus on users with dissimilar interests.

Table 4: Ablation study on contrastive learning. Baseline is a multi-modal decoder trained without contrastive learning. CL, CCL and PCL are the contrastive losses of Section 4.3.

To further evaluate the generation quality improved by contrastive learning, we analyze the generated explanations from two aspects: the length distributions of generations and keyword coverage. Figure 6 (a) compares the length distributions of generations on the test set to the ground truth. We categorize text lengths into 6 groups (within the range [0, 60] with an interval of 10). The model without PC²L has a sharper distribution, while adding PC²L leads to a distribution closer to the ground truth, demonstrating its effectiveness and ability to generalize to unseen images. Note the ground truth contains more long texts than the model's generations, since we set the maximum length to 64 during training and inference, which causes the discrepancy for text lengths greater than 60.
Figure 6 (b) shows the keyword coverage (i.e., nouns, adjectives and adverbs) in the output sentences. We consider an output as covering a keyword if the word exists in the corresponding ground truth. We compare two models trained with and without PC²L, and see that PC²L improves the coverage of all kinds of keywords, indicating that our contrastive learning method diversifies and personalizes the generated text. Overall, incorporating contrastive learning into multi-modal explanation generation leads to better output quality with more diverse and visually-aligned text.
Can GPT-2 provide linguistic knowledge? Finally, we study whether GPT-2 can provide linguistic knowledge for our generation module. As shown in Table 5, comparing the performance of random and GPT-2 initialization, it is evident that the pretrained weights play a significant role. Finetuning on in-domain data (260k samples from users with one review, excluded from our personalization dataset) further improves the domain-specific knowledge of the decoder and benefits generation performance on the diversity metrics.

Figure 7: Three examples comparing explanations from previous methods (Ref2Seq and Text GPT-2) with our personalized showcases.

Case Study
We study three examples (see Figure 7) and compare our personalized showcases to single-modal explanations from Ref2Seq and Text GPT-2. Overall, our visual explanations are able to recommend images that fit users' interests. This indicates the effectiveness of our image selection module: the selected images can be used as valid visual explanations. More importantly, these images provide grounding information for text generation, making the textual explanations more informative (i.e., mentioning specific dishes), which aligns with our CLIP-Align metric as well as the human evaluations in Section 5.5. As shown in Figure 7, historical review text alone cannot provide correct explanations (see Cases 1 and 2). In contrast, our showcases provide relevant and diverse textual explanations based on the images. In Case 3, our generated text missed some entities in the user's review, since it only correctly describes one of the selected images; generating text from multiple images thus remains a challenging problem for this new task.
As we can observe from the examples, Ref2Seq tends to generate explanations with the same pattern, which also matches the observation in Table 2 that it has low Distinct-1 and Distinct-2 scores.

Human Evaluation
To fully evaluate our model, we conduct a human evaluation on Amazon Mechanical Turk. For each model, we randomly sample 500 examples from the test set. Each example is scored by three human judges using a 5-point Likert scale to reduce variance. We instruct the annotators to consider two perspectives: expressiveness (semantically correct, diverse, no repetition) and visual alignment (the text describes the content of the images). As shown in Table 6, PC²L significantly outperforms Ref2Seq, which is consistent with the automatic evaluation metrics.

RELATED WORK 6.1 Explanation Generation
There has been a line of work studying how to generate explanations for recommendations [42,49]. Some works generate product reviews based on categorical attributes [51], images [37], or aspects [28]. Due to noise in reviews, Li et al. [22] generated 'tips' from the Yelp dataset, which are more concise and informative as explanations in recommendation. To further improve generation quality, Ni et al. [27] proposed to identify justifications by dividing reviews into text segments and classifying the segments to obtain "good" justifications. Li et al. [21] proposed a transformer-based model for recommendation explanation generation by incorporating user and item embeddings and related features. These text generation tasks leverage historical reviews from users or items. Images, on the other hand, provide rich information and grounding for text generation. Moreover, the multi-modal information in our task (i.e., images and text) is more acceptable to users than text-only explanations.
In this paper, we propose a new task for generating multi-modal explanations and present a framework that provides personalized image showcases and visually-aware text explanations for recommendations.

6.2 Multi-Modal Learning
Recent years have witnessed the success of deep learning on multi-modal learning and pretraining [4,31]. These models usually adopt the Transformer [38] architecture to encode visual and textual features during pretraining, which later benefits multi-modal downstream tasks. Among them, CLIP [31] is a powerful model trained on a massive amount of image-caption pairs, and has shown strong zero-shot and transfer learning capability on various vision-and-language tasks, from image classification and image captioning to phrase understanding [36,45]. Several recent studies [15,52] used CLIP embeddings to compute cross-modal similarities between images and text, and adopted CLIP-based scores as evaluation metrics for image captioning and open-ended text generation tasks.
In our work, we use CLIP extensively as the multi-modal encoder of our framework. We also design a new metric based on CLIP to evaluate the visual alignment between the image set and the generated explanations.
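A CLIP-based alignment score of the kind described above can be sketched as follows. This is a simplified illustration over precomputed CLIP features, assuming the score is the average cosine similarity between each showcase image embedding and the embedding of the generated explanation; the paper's exact CLIP-Align definition may differ.

```python
import numpy as np

def clip_align(image_embs, text_emb):
    """Average cosine similarity between each image embedding in the
    showcase and the embedding of the generated explanation.

    image_embs: (n_images, d) array of CLIP image features.
    text_emb:   (d,) CLIP text feature of the generated explanation.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    return float(np.mean(img @ txt))
```

In practice the features would come from a CLIP image/text encoder; here the function only assumes they share the same embedding dimension.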

6.3 Contrastive Learning
The goal of contrastive learning [29] is to learn representations by contrasting positive and negative pairs. It has been investigated in several fields of applied machine learning, including computer vision [5,13], natural language processing [12,17], and recommender systems [40,43,50]. A few recent works showed promising results when applying contrastive learning to conditional text generation, by generating adversarial examples [19], finding hard negatives with pretrained language models [3,44], or bridging image and text representations to augment text generation tasks [53].
Our work differs in that we study contrastive learning for conditional text generation in a cross-modal, personalized setting, and propose a novel contrastive framework for generating personalized multi-modal explanations.
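The standard cross-modal contrastive objective that such frameworks build on can be sketched as a symmetric InfoNCE loss. This is a generic CLIP-style formulation over batched features, not the paper's exact PC²L loss; names and the temperature value are illustrative.

```python
import numpy as np

def info_nce(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text features.

    Matched (i, i) pairs are positives; all other in-batch pairs serve
    as negatives. Features are L2-normalized before computing the
    similarity matrix.
    """
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix

    def nll(m):
        # Numerically stable log-softmax per row; diagonal = positives
        m = m - m.max(axis=1, keepdims=True)
        log_p = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the image-to-text and text-to-image directions
    return (nll(logits) + nll(logits.T)) / 2
```

When positives are much more similar than in-batch negatives, the loss approaches zero; when a positive pair is dissimilar, its row contributes a large penalty.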

CONCLUSION
In this paper, to generate explanations with rich information for recommendations, we introduce a new task, namely personalized showcases, and collect a large-scale dataset, Gest, from Google Local for this task. We design a personalized cross-modal contrastive learning framework to learn visual and textual explanations from user reviews. Experimental results show that showcases provide more informative and diverse explanations than previous text-only explanations. As future work, one promising direction is to develop an end-to-end framework for generating both visual and textual explanations. In addition, visual grounding on multiple images remains challenging for showcases. Another interesting setting is to address cold-start users or reviews written without images. We hope our dataset and framework will benefit the community for future research on multi-modality and recommendation.

A DATA CONSTRUCTION
Our dataset is constructed from Google Local (i.e., Google Maps) using a breadth-first-search algorithm with memoization. After collecting the review data, we filtered out reviews shorter than 5 words, which are less likely to provide useful information; we also removed reviews (2.13%) containing more than 10 images. The details of Gest-s1 construction for personalized image selection are as follows: we remove users with only one review to build a personalized dataset, then filter out reviews whose image URLs have expired. After pre-processing, statistics for the personalized showcase dataset are shown in Table 1, where the number of images per business is 35.63 on average. We then randomly split the dataset by users, with 95,270/11,908/11,908 users for train/val/test.
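The review-level filtering rules above can be sketched as follows. The dict schema (`text`, `images` keys) is illustrative, not the dataset's actual storage format.

```python
def filter_reviews(reviews):
    """Apply the Gest pre-processing rules described above: drop
    reviews shorter than 5 words (unlikely to be informative) and
    reviews with more than 10 attached images.

    `reviews` is a list of dicts with 'text' and 'images' keys.
    """
    kept = []
    for r in reviews:
        if len(r["text"].split()) < 5:
            continue  # too short to provide useful information
        if len(r["images"]) > 10:
            continue  # outlier reviews with excessive images
        kept.append(r)
    return kept
```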

B VISUAL DIVERSITY DEFINITION
We define the visual diversities at three levels as below, where $d(\cdot,\cdot)$ denotes the dis-similarity between two image features:
• Intra-Business Div: measures the average diversity over image pairs at the business level, where $\mathcal{P}_1(b)$ denotes all possible image pairs for business $b$ and $|\mathcal{P}_1(b)|$ is the valid count of dis-similarity calculations (same as below):
$$\mathrm{Div}_{\text{intra-bus}} = \frac{1}{|\mathcal{B}|}\sum_{b\in\mathcal{B}}\frac{1}{|\mathcal{P}_1(b)|}\sum_{(i,j)\in\mathcal{P}_1(b)} d(i,j).$$
• Inter-User Div: measures the average diversity over image pairs from different users for the same business, where $\mathcal{P}_2(b)$ denotes all possible image pairs for business $b$ that come from different users:
$$\mathrm{Div}_{\text{inter-user}} = \frac{1}{|\mathcal{B}|}\sum_{b\in\mathcal{B}}\frac{1}{|\mathcal{P}_2(b)|}\sum_{(i,j)\in\mathcal{P}_2(b)} d(i,j).$$
• Intra-User Div: measures the average diversity at the (business, user) level, where $\mathcal{P}_3(b,u)$ denotes all possible image pairs from user $u$ for business $b$:
$$\mathrm{Div}_{\text{intra-user}} = \frac{1}{|\mathcal{D}|}\sum_{(b,u)\in\mathcal{D}}\frac{1}{|\mathcal{P}_3(b,u)|}\sum_{(i,j)\in\mathcal{P}_3(b,u)} d(i,j),$$
where $\mathcal{D}$ is the set of observed (business, user) pairs.
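The per-group average appearing in each diversity level can be sketched as follows, assuming the dis-similarity $d$ is one minus the cosine similarity of image features (an assumption; the definition of $d$ is not restated here).

```python
import numpy as np
from itertools import combinations

def avg_pairwise_diversity(feats):
    """Average pairwise dis-similarity (1 - cosine similarity) over
    all image pairs in one group, matching the inner averaging term
    of the diversity definitions above.

    feats: (n_images, d) array of image features.
    Returns 0.0 when fewer than two images are present.
    """
    pairs = list(combinations(range(len(feats)), 2))
    if not pairs:
        return 0.0  # no valid dis-similarity calculations
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return float(np.mean([1.0 - f[i] @ f[j] for i, j in pairs]))
```

The three levels differ only in how pairs are grouped (per business, per business across users, or per (business, user)), so the same helper would be averaged over the corresponding groups.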

Figure 1: Illustration of previous text-only explanations and our personalized showcases for recommendations. Given a recommended item or business: (1) text-only explanation models only use historical textual reviews from the user and item sides to generate textual explanations; (2) our proposed personalized showcases task enriches personalized explanations with multi-modal (visual and textual) information, which largely improves the informativeness and diversity of generated explanations.

Figure 2: Example of business and user reviews in Gest. For a business (e.g., an Italian restaurant), Gest contains historical reviews and images from different users.

Figure 3: Visual diversity comparison. A, B, C, E in Amazon denote different categories of the Amazon review datasets, uniformly sampled from All, Beauty, Clothing and Electronics, respectively. Intra-/Inter-User Diversity for the Yelp dataset is unavailable since Yelp images lack user information.

Figure 4: Examples of user-generated images from an Amazon item page and from a Yelp business. Amazon images mainly focus on a single item, while Yelp images for a business are diverse (yet the current public Yelp dataset has no user-image interactions).

Figure 6: (a) The length distributions of generated texts on the test set. (b) The coverage of ground-truth nouns (Noun), adjectives (ADJ) and adverbs (ADV) in generated explanations.

Figure 7: Comparison between text-only explanations (i.e., Ref2Seq and Text GPT-2) and our showcases. User reviews are processed following Section 3.3.

Table 1: Data statistics for Gest. Avg. R. Len. denotes average review length and #Bus. denotes the number of businesses. -raw denotes raw Gest; -s1 denotes Gest data for the first step, and -s2 for the second step of our proposed personalized showcases framework. Table columns: Dataset | #Image | #Review | #User | #Bus. | Avg. R. Len.

Table 2: Results on personalized showcases with different models and different input modalities. Results are reported in percentage (%). GT is the ground truth.

Table 3: Ablation study for personalized image selection. Results are reported in percentage (%).

Table 5: Ablation study on different initializations of the decoder. Random randomly initializes model weights. Text GPT-2 and Img GPT-2 are initialized with weights from [33]. Img GPT-2 + FT finetunes the model on a corpus similar to our training text data. Results are in percentage (%).

Table 6: Human evaluation results on two models. We present the workers with reference text and images, and ask them to give scores on different aspects. Results are statistically significant via sign test (p<0.01).