Generating Product Insights from Community Q&A

On e-commerce sites, customer questions on the product detail page express customers' information needs about the product. The answers to these questions often provide the necessary information. In this work, we present and address the novel task of generating product insights from community questions and answers (Q&A). These insights can be presented to customers to assist them in their shopping journey. Our method first generates concise, self-contained sentences based on the information in the Q&A. Then insights are selected based on the prominence of their associated questions. Empirical evaluation attests to the effectiveness of our approach in generating well-formed, objective, and helpful insights that are often not available in the product description or in summaries of customer reviews.


INTRODUCTION
Product descriptions on e-commerce sites such as amazon.com or ebay.com have been shown to play an important role during customers' shopping journey [10,13,21]. While the information provided in the description is rich and useful, many products have a short description or no description at all [15]. Moreover, product descriptions often suffer from inherent bias, as they present the seller's point of view of the product and reflect the seller's incentive to sell. Another source of important information that does not suffer from the seller's bias is the customer reviews section. In fact, for popular products, the reviews section is so large that potential buyers cannot read through it and rely instead on smart ranking of the reviews [4,27] or on automatic summaries [2,5,16]. However, reviews tend to focus on subjective information, sometimes leaving out important objective product characteristics. For this reason, some websites have a Questions and Answers (Q&A) section, in which customers can post questions of interest to the community. Customer questions express information needs about the product, and the answers to these questions often provide the necessary information. We observed that customers often ask for information that is missing from the product description or not prominent in customer reviews. In other cases, customers wish to verify the seller-provided information.
Motivated by this observation, we develop a method to distill useful product information from community Q&A. One way to present this information is to identify prominent Q&A and simply surface them, as is, to the customer. However, as questions are often lengthy and may receive several answers, this may require the customer to read through large amounts of information. Another approach is to generate concise snippets using the information present in the Q&A, and surface those to the customers. This approach presents the required information to the customer in a summarized and easy-to-read format. In this work, we focus on the latter approach, which can contribute to a wide range of use cases, such as enriching product descriptions (Fig. 1), voice interface experiences, and side-by-side product comparison.
We define an insight as a concise, self-contained sentence that is likely helpful to many customers during their shopping journey. As the source of information for insights, we focus on yes-no questions, which constitute 54% of the Q&A on Amazon.com according to the Amazon-PQA dataset [26].
The proposed process for insight generation from community Q&A consists of two stages. First, an insight is generated from a yes-no question and its answers, reflecting the information provided in them. In the second stage, we select a subset of the generated insights to be presented to the customers, based on the prominence of the associated questions and the diversity of the final insight set. Our main contributions are: (1) presenting a novel task of generating product insights from community Q&A; (2) an end-to-end pipeline consisting of insight generation and selection stages; and (3) a new annotated dataset of 20K yes-no questions, their answers, and the generated insights.

RELATED WORK
Much work has been done on extracting useful insights from product reviews. Some works [9,14,30] focus on aspect-based summarization: extraction of prominent product aspects, and aggregation of review sentences by aspect and sentiment into a concise representation. Other summarization tasks include identifying helpful sentences [5], generating tips from reviews [8], and generating entire descriptions based on reviews [15]. Other works aim to generate product insights based on matching reviews to questions [12], or using reviews of similar products [20]. In contrast, community Q&A are an untapped source of information with great potential to yield valuable insights that differ from review-based insights, as we show in our analysis (Section 6). A question is usually more focused on a single aspect of the product than a review, which may contain multiple aspects. Therefore, the challenge in summarizing Q&A is not aspect extraction, but rather an abstractive transformation of a question and its answer into a standalone, readable sentence.
Previous works have studied the task of generating a full, standalone answer based on a question and a factoid short answer [17], or a question and a paragraph [1]. In [17], a Q&A dataset based on Wikipedia was used to create training data for this task, and a web-based, manually curated dataset was used in [1]. However, full answer generation in the product Q&A domain differs from those domains: the answer is not always contained in the product description or the customer answers, so they cannot be used as a paragraph as in [1]; customer answers may be multiple and contradictory; and both customer questions and answers are noisy and contain grammar and spelling errors. Additionally, in this work we focus on yes-no questions, which are different from open-ended [1] or factoid questions [17], as demonstrated in §4.1.
Finally, our work presents a possible approach for Q&A summarization. We note that previous works exist on summarization or processing of multiple answers to a single question [18,28]. However, to the best of our knowledge, there are currently no works on summarization of the information presented in multiple product questions and their answers.

MINING INSIGHTS FROM Q&A
In this section, we describe our proposed method for generating helpful insights from a Q&A collection. The method consists of multiple processing steps. At a high level, the process can be divided into two stages. The first, illustrated in Figure 2, is an insight generation stage from a yes-no question and its answers, reflecting the information provided in them. The second is an insight selection stage, based on the prominence of the associated questions and the diversity of the final insight set.

Product insight generation
The input to our method is a Q&A collection of a specific product. Since information is divided between the question and the answer, and usually neither of them constitutes a standalone sentence, extractive methods are not suitable for our use case. Therefore, we first transform each question and answer into a standalone sentence (insight).
Processing questions and answers. In this work we focus on yes-no questions, which constitute more than 50% of the community Q&A on Amazon according to the Amazon-PQA dataset [26]. Therefore, the first step is to identify and retain only the yes-no questions. Next, as community questions typically have free-text answers, a second required step is to map each answer to a yes, no, or neutral (unknown) label. For example, the answer "of course it does" is mapped to "yes", "No USB support" is mapped to "no", and "I'm not sure" is mapped to "neutral". Finally, community questions are often answered independently by multiple users (51% of yes-no questions in the Amazon-PQA dataset [26] have multiple answers), and the answers may not be unanimous. Therefore, a third required step is aggregating the diverse answers and determining a final "yes", "no", or "neutral" answer, which is used for generating the insight. We note that the multiple answers (e.g., two "yes" answers, one "neutral", and one "no") can be used for estimating the confidence in each of the insights. This information could be used for insight selection, and may even be explicitly exposed to the customers to increase their confidence in the generated insights.
To perform the three steps mentioned above, we adopt the solutions proposed in [26]. Namely, identifying yes-no questions is performed using the heuristics proposed in [7]. Mapping the free-text answers to yes/neutral/no answers is performed using a RoBERTa-based classifier trained for this task. Finally, in case of multiple answers, these are aggregated using a simple heuristic: when an answer is provided by a verified seller, it is considered the final label; otherwise, the majority vote answer is assigned as the final yes/no label, or neutral in case of a tie. This process was applied to generate the Amazon-PQA dataset that we utilize in this work, which contains an aggregated answer for each question.
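The aggregation heuristic can be illustrated with a minimal sketch. The function name `aggregate_answers` and the `(label, is_verified_seller)` input format are hypothetical, and two details are assumptions not stated in the text: the first verified-seller answer wins when there are several, and the majority vote is taken as a plurality over all three labels, with any tie for first place resolved to "neutral".

```python
from collections import Counter
from typing import Iterable, Tuple


def aggregate_answers(answers: Iterable[Tuple[str, bool]]) -> str:
    """Aggregate (label, is_verified_seller) pairs into one final label.

    Labels are expected to be "yes", "no", or "neutral".
    """
    answers = list(answers)

    # A verified seller's answer overrides the community vote.
    # Assumption: if several sellers answered, the first one is used.
    for label, is_verified_seller in answers:
        if is_verified_seller:
            return label

    counts = Counter(label for label, _ in answers)
    if not counts:
        return "neutral"  # assumption: no answers means unknown

    ranked = counts.most_common()
    # A tie for the top count yields "neutral".
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "neutral"
    return ranked[0][0]
```

With the example from the text (two "yes" answers, one "neutral", and one "no", none from a verified seller), this sketch returns "yes"; the raw label counts could additionally serve as a simple confidence signal.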
Sentence generation. In this step, given a yes-no question and its aggregated answer, the goal is to generate a concise standalone sentence (insight) representing that information. We framed the insight generation task as a neural machine translation task, where the inputs are the yes-no question and the short answer, and the output is the concise standalone sentence. We leveraged a transformer-based model and experimented with several training sets: a small-scale dataset of human-generated insights; a large-scale out-of-domain dataset; transfer learning between the two; and a few-shot setting.

Product insight selection
The Q&A section often contains more than one question, and for popular products the number of questions can become quite large. In the Amazon-PQA dataset, over 13% of the products have at least 5 answered yes-no questions. Since the available space for displaying product insights is often limited, a selection mechanism that promotes the most helpful insights is needed. We rely on the following hypothesis (verified in experiments in §4.2): a helpful insight is based on a prominent question that expresses an information need common to many customers.
In order to estimate question prominence, we can simply search for the most popular product questions posted in the customer Q&A widget. However, based on our analysis, almost 90% of the questions are asked only once. Another way customers can seek information on the detail page is to use the search widget (located above the Q&A section) to type a query. We found that for 50% of the products, the number of search queries on the detail page was at least 4 times greater than the number of questions available. This makes the search history log a valuable resource for understanding customers' information needs. Therefore, we consider two techniques for estimating question prominence:
Log Popularity (LogPop). In the Q&A search widget, customers typically type very short queries to express their information needs. In fact, more than 90% of queries have at most 2 words. In order to match the product questions to the queries, we leverage the existing production algorithm that retrieves a set of existing questions in response to the customer query. We rank questions by the number of queries they were retrieved for, as a proxy for question popularity.
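The LogPop ranking can be sketched as a simple counting procedure. The production retrieval algorithm is not described in the text, so the sketch takes it as a hypothetical callable `retrieve`; the keyword-overlap `toy_retrieve` and the sample questions and queries below are purely illustrative stand-ins.

```python
from collections import Counter
from typing import Callable, Iterable, List


def rank_by_log_popularity(
    queries: Iterable[str],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Rank question ids by the number of logged queries that retrieved them."""
    hits = Counter()
    for query in queries:
        # Count each question at most once per query.
        for question_id in set(retrieve(query)):
            hits[question_id] += 1
    # Descending by count; ties broken by id so the order is deterministic.
    return sorted(hits, key=lambda qid: (-hits[qid], qid))


# Illustrative data only: a toy retriever based on word overlap.
QUESTIONS = {
    "q1": "is it waterproof",
    "q2": "does it support usb",
    "q3": "is the battery replaceable",
}


def toy_retrieve(query: str) -> List[str]:
    terms = set(query.lower().split())
    return [qid for qid, text in QUESTIONS.items() if terms & set(text.split())]


ranking = rank_by_log_popularity(
    ["waterproof", "usb", "waterproof case", "battery"], toy_retrieve
)
```

In this toy log, `q1` is retrieved for two queries and ranks first. Questions that are never retrieved do not appear in the ranking at all, which is one reason a fallback such as CatPop is useful for products with little search history.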
Category Popularity (CatPop). While there are almost no duplicate questions for the same product, we can rank them by their popularity in similar products. For each product, we rank the questions (and their corresponding insights) in descending order by the number of similar questions we find within the product category. Two questions are considered similar if the cosine similarity between their embeddings is above a predefined threshold of 0.87. To embed the questions, we use a Sentence-Transformers model [24], pretrained to find similar questions on the Quora dataset.
Diversification. To avoid duplicate questions and select a final diverse set of insights for a given product, we apply a greedy diversity mechanism, iterating over the ranked questions in descending order and selecting a question only if its cosine similarity with previously selected questions is below 0.5. We used the same embeddings as in the selection step to represent the questions. The process ends when a predefined number of questions are selected.
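The CatPop ranking and the greedy diversification step can be sketched together as follows. The sketch assumes question embeddings are already computed and passed in as plain float sequences; the function names, the dictionary-based input format, and the tie-breaking by question id are hypothetical choices, while the thresholds 0.87 and 0.5 come from the text.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_category_popularity(
    product_questions: Dict[str, Vector],
    category_questions: List[Vector],
    threshold: float = 0.87,
) -> List[str]:
    """Rank a product's questions by how many category questions are similar."""
    counts = {
        qid: sum(cosine(vec, other) > threshold for other in category_questions)
        for qid, vec in product_questions.items()
    }
    return sorted(counts, key=lambda qid: (-counts[qid], qid))


def select_diverse(
    ranked_ids: List[str],
    embeddings: Dict[str, Vector],
    k: int,
    max_similarity: float = 0.5,
) -> List[str]:
    """Greedily keep a question only if it is dissimilar to all kept so far."""
    selected: List[str] = []
    for qid in ranked_ids:
        if len(selected) == k:
            break
        if all(cosine(embeddings[qid], embeddings[s]) < max_similarity for s in selected):
            selected.append(qid)
    return selected


# Illustrative 2-d "embeddings"; real ones come from a sentence encoder.
product = {"a": (1.0, 0.0), "b": (0.99, 0.1), "c": (0.0, 1.0)}
category = [(1.0, 0.05), (0.9, 0.1), (0.0, 1.0), (0.1, 1.0), (1.0, 0.0)]
ranked = rank_by_category_popularity(product, category)
top = select_diverse(ranked, product, k=2)
```

Here `a` and `b` are near-duplicates, so only the higher-ranked `a` is kept and the second slot goes to the dissimilar `c`. Whether the category pool should exclude the product's own questions is not specified in the text and is left to the caller.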

EXPERIMENTS

4.1 Insight generation
4.1.1 Datasets. As source data for our experiments, we used Amazon-PQA [26], a publicly available dataset of product Q&A. This dataset contains over 9M questions for over 1.4M unique products, divided into 100 narrow categories. 54% of the questions are yes-no questions, and for each such question an aggregated yes-no answer is provided (36% yes, 16% no, 48% neutral). We filtered out questions where the aggregated answer is neutral, since they represent uncertainty in the answer.
In-Domain dataset (ID): a small-scale, manually generated dataset. 50K yes-no questions were randomly sampled from Amazon-PQA. These questions and their aggregated yes-no answers were presented to annotators via the Appen annotation platform, who transformed each (question, answer) pair into a standalone sentence reflecting the information provided in the pair. The annotations underwent grammar correction [25] and cleaning steps, resulting in 19,470 triplets (e.g., <Is it waterproof?, Yes, The product is waterproof.>). We split the dataset into train (70%), validation (15%), and test (15%) sets.
Out-Of-Domain dataset (OOD): a large-scale dataset of 315K triplets of question, factoid answer, and full answer. The dataset was created [17] using SQuAD [23], a Wikipedia-based Q&A dataset, by matching each question and factoid short answer with the original sentence containing the answer.
Table 1: Experimental results of the generation models (all numbers are percentages). Markers denote statistically significant differences (using a two-tailed paired t-test with p-value ≤ 0.05) with T5-ID and T5-transfer, respectively. Boldface: the largest value in a column.
4.1.2 Experimental details. We tested several approaches for insight generation, involving supervised direct and transfer learning over the aforementioned datasets, as well as a few-shot setting. For the supervised sequence-to-sequence generation task, we used a T5 model [22], initialized with the pretrained 'T5-base' checkpoint (https://huggingface.co/t5-base), which has 220 million parameters. For each dataset, the first two entries (question and short answer) in the triplet were concatenated and fed to the model with the prefix "summarize:", and the last entry (standalone sentence) was used as the target. We used an AdamW optimizer with a batch size of 8 and a learning rate of 1e-4, greedy decoding, and a maximal length of 80 tokens. We used a repetition penalty of 2.5 to avoid repetition of a phrase in the generated sentence.
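As a rough illustration of how a triplet might be turned into a sequence-to-sequence training example, the sketch below builds the source and target strings. The helper name `to_seq2seq_example` and the single-space separator between the question and the answer are assumptions (the text only says the two entries are concatenated); the settings dictionary merely restates the hyperparameters given above under hypothetical key names.

```python
from typing import Tuple

# Hyperparameters as stated in the text; the key names are illustrative only.
TRAINING_SETTINGS = {
    "optimizer": "AdamW",
    "batch_size": 8,
    "learning_rate": 1e-4,
    "decoding": "greedy",
    "max_length_tokens": 80,
    "repetition_penalty": 2.5,
}


def to_seq2seq_example(
    question: str,
    answer: str,
    insight: str,
    prefix: str = "summarize:",
) -> Tuple[str, str]:
    """Return (source, target) strings for one (question, answer, insight) triplet."""
    # Assumption: question and short answer are joined by a single space.
    source = f"{prefix} {question.strip()} {answer.strip()}"
    return source, insight.strip()


source, target = to_seq2seq_example(
    "Is it waterproof?", "Yes", "The product is waterproof."
)
```

For the example triplet from the ID dataset, `source` becomes "summarize: Is it waterproof? Yes" and `target` is the annotated insight.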
In our experiments, we either trained the T5 model directly on one of the datasets for 5 epochs, or performed transfer learning. In the latter approach, we first trained the model on the OOD dataset for 10 epochs, and then further fine-tuned it on the smaller ID dataset for 5 epochs. In both setups, we applied early stopping according to the loss on the validation set to choose the best model checkpoint.
In the few-shot setting, we used the open-source GPT-J [29] model, which has 6 billion parameters and was trained on the Pile dataset [6]. GPT-J has shown impressive performance compared to the 6-billion-parameter GPT-3 [3] model on various zero-shot NLP tasks. At inference time, the model gets a prompt that consists of the task instruction ("Generate a factually correct statement from the following question and answer pair"), followed by a representative set of 13 examples and one new example (a question and answer pair) for which we want the model to generate the insight (see Figure 3). The model predicts the most probable next tokens, which are used as the generated insight. We use a greedy decoding strategy, as it yielded the best results in our experiments. The maximum length of the generated text was set to 25 tokens.
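The structure of such a prompt can be sketched as below. The instruction string is taken from the text, but the "Question:/Answer:/Statement:" field labels, the blank-line separators, and the helper name `build_prompt` are assumptions; the actual layout is the one shown in Figure 3.

```python
from typing import Iterable, Tuple

INSTRUCTION = (
    "Generate a factually correct statement from the following "
    "question and answer pair"
)


def build_prompt(
    examples: Iterable[Tuple[str, str, str]],
    question: str,
    answer: str,
) -> str:
    """Assemble instruction, solved examples, and one unsolved example."""
    blocks = [INSTRUCTION]
    for ex_question, ex_answer, ex_statement in examples:
        blocks.append(
            f"Question: {ex_question}\nAnswer: {ex_answer}\nStatement: {ex_statement}"
        )
    # The final block is left open so the model continues with the insight.
    blocks.append(f"Question: {question}\nAnswer: {answer}\nStatement:")
    return "\n\n".join(blocks)


demo_examples = [
    ("Is it waterproof?", "Yes", "The product is waterproof."),
    ("does it support usb", "No", "The product does not support USB."),
]
prompt = build_prompt(demo_examples, "Is the battery replaceable?", "No")
```

In the setup described above, the list would hold the 13 representative examples, and the model's continuation (capped at 25 tokens) would be taken as the insight.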

Evaluation and results
We compared the generation performance of the model in various training schemes over the test portion of the ID dataset (Table 1). We measured the performance via automatic scores (ROUGE [11] and BLEU [19]) and manual evaluation (two rightmost columns). In the automatic evaluation, we observed that a supervised T5 model trained or fine-tuned on the target dataset outperformed both the OOD T5 model and the GPT-J model. Additionally, transfer learning from OOD (3rd row) led to a [...]
We further compared the methods via manual evaluation of insights generated from 300 Q&A pairs (we excluded the worst-performing method from this evaluation). Four in-house expert annotators were asked whether each generated sentence was: (i) readable, i.e., clear and with no significant grammar mistakes; (ii) consistent with the yes-no answer; and (iii) hallucinating any information that was not present in the (question, answer) pair. Readability and consistency rates are shown in Table 1. We used 200 sentences to calculate agreement between annotator pairs. Cohen's Kappa scores were 0.59, 0.89, and 0.49 for tasks (i), (ii), and (iii), respectively (ranging from fair to excellent agreement). We found that 99.9% of the generated sentences (via all methods) did not hallucinate irrelevant information. For T5, the transfer learning setting reached higher readability and consistency scores than direct training. The best-performing model was GPT-J, suggesting that while the insights generated by GPT-J differed from the ground truth insights, this model is better at generating text that is readable and accurate for humans. An analysis of the error cases shows that the major cause of inconsistency was a wrong model or item name (e.g., "chevy tahoe" in the question was replaced with "chevy tie" in the generated insight). We additionally observe that unreadable insights are mostly (65%) due to minor grammar or spelling errors, and only 35% of them are completely unclear.
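For reference, pairwise annotator agreement of the kind reported above is commonly computed as Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below is a generic implementation with a hypothetical function name and made-up labels, not the authors' evaluation code.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up binary "readable" judgments from two annotators on four sentences.
kappa = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])
```

In this toy case the annotators agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, giving a kappa of 0.5.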
Henceforth, we utilize the T5 model for insight generation, as its complexity is an order of magnitude smaller than that of GPT-J. Moreover, despite slightly lower readability and consistency scores, T5 outperforms GPT-J in terms of similarity to human-written insights.

Insight selection - evaluation and results
We evaluate the insight selection procedure on a set of 551 products with the highest number of detail page views in the headphones, laptops, mobile phones, and televisions categories. As a baseline, we ranked the insights completely at random and compared the results to our popularity-based methods. For each product, the top 5 insights retrieved by each selection model were annotated. Ten in-house annotators labeled the insights as helpful or non-helpful. A helpful insight should contain valuable information that helps the customer make a purchase or usage decision.
The random baseline reached 73% helpfulness, supporting the assumption at the basis of our work that Q&A contain valuable product information. The results also showed that CatPop and LogPop achieved helpfulness scores of 93.6% and 92.2%, respectively. Further analysis of the top 5 ranked insights revealed that 70% of these insights were consistent across both methods, which may explain the similar results we observed. These results attest to the merits of selecting insights by their corresponding question prominence.
We next analyzed the key reasons the annotators marked insights as unhelpful. The leading cause (31% of errors) was lack of clarity, usually as a result of grammar mistakes or lack of context in the originating question (e.g., "This phone has 6.0", based on the question "does the phone have 6.0?"). Other errors were over-specificity (20%), i.e., insights about a feature that is too niche (e.g., "These headphones block out really loud snoring."), and overly general insights (15%), e.g., "This TV is really worth buying."
For the remainder of the paper, LogPop will be used as the primary ranking method because it is derived from customers' actual searches for the specific product rather than for products in the same category. We do, however, view CatPop as an important alternative, as it better handles the cold start issue, i.e., products for which search information has not yet been accumulated.

ONLINE EXPERIMENT
In order to evaluate the helpfulness of the insights, we set up an experiment on the "Alexa's Insights" widget on the retail website. The widget is placed on the product detail page and aims to help customers with their shopping decisions by providing a snapshot view of customer review aspects and product information, as presented in Figure 4 (a).
We conducted a user study to learn more about how customers perceive the helpfulness of Alexa's Insights. Ten customers were asked to imagine they were shopping for a pair of wireless headphones, and that they go to Amazon and find the new Apple AirPods Pro. The customers had a full-length detail page prototype with the Alexa's Insights widget included. Testers were asked how helpful they would find Q&A insights on a 5-point rating scale. The majority of testers thought that the insights were helpful and would save them the time of looking through all of the Q&A ("Very Helpful": 5 votes, "Helpful": 4 votes, "Neither": 1 vote).
Encouraged by the user study results, we conducted an online A/B test to measure the potential impact of our insights. When customers viewed the detail page of a supported product (in the US) and Q&A insights were present, they were allocated to one of the following groups: (1) Control: the existing widget with Review Aspects and product information; (2) Treatment: the Q&A insights are shown between Review Aspects and product information (see Figure 4 (b)).
During the experiment, we measured a LongTermBenefit metric, which is an estimate of long-term customer activity on the e-commerce platform. The estimate is derived from customers' purchases and other activities, such as searches, performed on the platform. Based on a 28-day analysis of the experiment, we observed a statistically significant improvement of 0.14% in LongTermBenefit in the treatment group compared to the control group. This positive result demonstrates that the new insights provide useful information to customers that helps them make more informed and confident purchasing decisions.

DATA ANALYSIS
So far, our results show that our method successfully generates helpful, well-formed, and relevant insights. In this section, we examine whether the generated insights add new information beyond customer reviews and beyond the product description provided by the seller.
Comparison to customer reviews. As mentioned in Section 1, a key hypothesis motivating our work is that product reviews

Figure 1: Product insights generated from community Q&A by our method.

Figure 2: First stage of our pipeline: insight generation.

Figure 3: Few-Shot prompt for GPT-J; typos and grammatical errors are introduced to teach the model how to handle them.

Figure 4: Alexa's Insights widget on the desktop. (a) Control: existing widget in the product detail page; (b) Treatment: our insights are added.