SE-PQA: Personalized Community Question Answering

Personalization in Information Retrieval is a topic studied for a long time. Nevertheless, there is still a lack of high-quality, real-world datasets to conduct large-scale experiments and evaluate models for personalized search. This paper contributes to filling this gap by introducing SE-PQA (StackExchange - Personalized Question Answering), a new curated resource to design and evaluate personalized models related to the task of community Question Answering (cQA). The contributed dataset includes more than 1 million queries and 2 million answers, annotated with a rich set of features modeling the social interactions among the users of a popular cQA platform. We describe the characteristics of SE-PQA and detail the features associated with questions and answers. We also provide reproducible baseline methods for the cQA task based on the resource, including deep learning models and personalization approaches. The results of the preliminary experiments conducted show the appropriateness of SE-PQA to train effective cQA models; they also show that personalization remarkably improves the effectiveness of all the methods tested. Furthermore, we show the benefits in terms of robustness and generalization of combining data from multiple communities for personalization purposes.


INTRODUCTION
Personalization has long been studied in Information Retrieval (IR) [7-9, 13, 17, 26] and Natural Language Processing (NLP) [8]. Personalized search aims to tailor the search outcome to a specific user (or group of users) based on knowledge of her/his interests and online behaviour. Given the ability of Deep Neural Network (DNN) models to tackle many different tasks by extracting relevant features from both texts and structured sources [19], they are expected to have huge potential also in Personalized IR (PIR) and Recommender Systems (RS). However, the lack of publicly available, large-scale datasets that include user-related information is one of the biggest obstacles to the training and evaluation of DNN-based personalized models.
Some real-world datasets are commonly used in the literature to design and assess personalization models. These include the AOL query log [21], the Yandex query log, and the CIKM Cup 2016 dataset. Moreover, synthetically enriched datasets have also been used, such as PERSON [27], the Amazon product search dataset [2], and a dataset based on the Microsoft Academic Knowledge Graph [5]. However, all of them have some issues. For example, using the AOL query log raises ethical and privacy issues [3], while the anonymization performed on the Yandex query log prevents its use for training or fine-tuning natural language models.
This paper aims to fill this gap by contributing SE-PQA (StackExchange - Personalized Question Answering), a large dataset rich in user-level features that can be exploited for training and evaluating personalized models addressing the community Question Answering (cQA) task. SE-PQA is based on StackExchange, a popular cQA platform with a network of 178 open forums. A dump of the StackExchange user-contributed content is publicly available under a CC BY-SA 4.0 license. With great care, we have preprocessed the original dump to build SE-PQA, a curated dataset with about one million questions and two million associated answers annotated with a rich set of features modeling the social interactions of the user community. The features include, for example, the positive or negative votes received by a question or an answer, the number of views, the number of users that selected a given question as a favorite, the tags from a controlled folksonomy describing the topic dealt with, and the comments that other users might have written under a question or an answer. To favor the design and evaluation of personalized models, the users in SE-PQA are associated with their past questions and answers, their social autobiography, their reputation score, and the number of views received by their profile.
The cQA task can be addressed on SE-PQA with different methodologies exploiting the textual description of questions and answers, the folksonomy, the features modeling the social interactions, or a combination of these information sources. In this paper, we focus on IR approaches to cQA. Thus, we adapt the cQA task to an ad-hoc retrieval task where the question is seen as a query, and the answers are retrieved from the pool of past answers indexed for the purpose. In this setting, the system aims to retrieve a (small) ranked set of documents that contain the correct answers to the user question. There can be multiple correct answers to a given question, so personalization can be used to understand the user's context and background and to rank higher the answers that are more relevant to the specific user.
In summary, the novel contributions of this paper are the following:
• We contribute the SE-PQA dataset, a novel public resource consisting of a comprehensive corpus including more than one million questions and two million answers by about 600 thousand users. The richness and variety of features provided with the dataset enable its use for the design and evaluation of both classical and personalized cQA.
• We provide a detailed analysis of the resource made available in SE-PQA compared to those previously available and used in the research community.
• We report a preliminary comparison of the performance of different methods for cQA applied to the questions, answers, and users in SE-PQA. The results show that models based on deep learning are more effective than traditional retrieval models, and that exploiting personalization features yields a significant performance boost.
The paper is organized as follows. Section 2 introduces the SE-PQA dataset and reports some statistics about its content. Furthermore, the section details the personalized cQA tasks addressed in this paper by using SE-PQA, and it compares SE-PQA with other publicly available resources in the field. Section 3 presents a preliminary comparison of traditional and personalized models for cQA applied to SE-PQA. In Section 4 we discuss the utility and the practical implications of the new resource. Finally, Section 5 concludes the work and outlines some future lines of investigation.

THE SE-PQA DATASET
The textual posts in StackExchange forums are associated with rich social metadata. When users ask a question to the community, they assign some tags specifying the topic to make the question searchable and visible to the users interested in it. The questions are up-voted or down-voted by the community based on their interest and adherence to the community guidelines 7. In many cases, the community suggests to the question author how to improve the question if it is poorly expressed or formatted. Similar treatment is given to the answers, which can be up-voted or down-voted by the community; moreover, the user who asked the question can also choose the answer he/she deems the best, which may differ from the one that received the most up-votes from others. We note, however, that 87.6% of questions and answers are assigned a score given by the difference between the number of up-votes and down-votes.
A positive score thus indicates that the post has more up-votes than down-votes, while a negative score indicates that more users down-voted it.
StackExchange is quite well known in the IR community: for example, it has been used for training a language model for sentence similarity [14]. To the best of our knowledge, however, the usage of StackExchange for Q&A tasks has been limited to selecting similar sentence training pairs, without exploiting user-level/social features for personalized information retrieval tasks. Another study uses StackExchange for duplicate question retrieval [12]. The dataset built for this task consists of 12 separate communities, and the authors do not address the de-duplication task across community boundaries.
7 https://meta.stackexchange.com/help/how-to-answer
With SE-PQA we overcome the previous limitations and provide a complete, curated dataset of textual questions and answers belonging to different, heterogeneous forums. In SE-PQA a user can belong to multiple communities; if we take into consideration users that wrote at least 2 questions, about 50% of them have asked questions in multiple communities. As we increase the minimum number of questions, the percentage of users using multiple communities also increases. For instance, if we take only the users that wrote at least 5 documents (either questions or answers) and consider both the questions and the answers written by the user, we note that of the resulting 62k users only 23k (37%) wrote either a question or an answer in only a single community, while 40k (63%) wrote documents in at least two different communities, 26k (42%) in at least three and 18k (28%) in more than three communities.
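The multi-community statistics above can be recomputed with a short script. The following is a minimal sketch, not the authors' code: it assumes each post is available as a hypothetical (user_id, community) pair, and the function names are illustrative.

```python
from collections import defaultdict


def community_counts(posts, min_docs=5):
    """Map each sufficiently active user to the number of distinct
    communities in which they wrote a question or an answer.

    `posts` is an iterable of (user_id, community) pairs, one per document
    (this input layout is an assumption made for illustration).
    """
    docs_per_user = defaultdict(int)
    communities_per_user = defaultdict(set)
    for user_id, community in posts:
        docs_per_user[user_id] += 1
        communities_per_user[user_id].add(community)
    # Keep only users with at least `min_docs` documents.
    return {
        user: len(communities_per_user[user])
        for user, n_docs in docs_per_user.items()
        if n_docs >= min_docs
    }


def share_with_at_least(counts, k):
    """Fraction of the retained users who are active in at least k communities."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c >= k) / len(counts)
```

Calling `share_with_at_least(counts, 2)` on the full dataset would give the share of users active in at least two communities under these assumptions.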
We claim that personalization is particularly useful for multi-domain collections, where we can exploit information about users' interests in multiple topics of different domains; when the data is instead derived from a single domain or a specific topic of a domain, personalization may become less important. We provide evidence of this assertion in Section 3.2, where we report on experiments applying the same personalization approach both to the complete SE-PQA dataset and to the data sampled from separate communities: the results show that personalization on the multi-domain dataset yields larger improvements than on single communities.
To increase diversity, in SE-PQA we thus combine data from multiple networks that can be categorized under the large umbrella of humanistic communities. These communities focus on different topics, but the language used is not too diverse among them. In particular, we choose the following 50 communities: writers, workplace, woodworking, vegetarianism, travel, sustainability, sports, sound, skeptics, scifi, rpg, politics, philosophy, pets, parenting, outdoors, opensource, musicfans, music, movies, money, martialarts, literature, linguistics, lifehacks, law, judaism, islam, interpersonal, hsm, history, hinduism, hermeneutics, health, genealogy, gardening, gaming, freelancing, fitness, expatriates, english, diy, cooking, christianity, buddhism, boardgames, bicycles, apple, anime, academia.
The training, validation, and test splits are defined temporally to avoid any kind of data leakage. The training set includes all questions written from 2008-09-10 to 2019-12-31 (included), the validation set is formed by questions asked between 2019-12-31 and 2020-12-31 (included), while the test set contains the questions from 2020-12-31 till 2022-09-25 (included).
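A temporal split of this kind can be sketched as below. This is an illustrative sketch rather than the released preprocessing code: it assumes the three periods are disjoint, with validation ending on 2020-12-31 and the test period starting right after it, and it treats each upper bound as inclusive. The function name and the day-level granularity are assumptions.

```python
from datetime import date

# Upper bounds of each period (inclusive); boundary handling is an assumption.
TRAIN_END = date(2019, 12, 31)
VAL_END = date(2020, 12, 31)
TEST_END = date(2022, 9, 25)


def assign_split(creation_date):
    """Return the split a question belongs to, based only on its creation date."""
    if creation_date <= TRAIN_END:
        return "train"
    if creation_date <= VAL_END:
        return "val"
    if creation_date <= TEST_END:
        return "test"
    return None  # outside the period covered by the dump
```

Because the split depends only on the question date, no question in the validation or test period can appear in the training set.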
There are a total of 1,125,407 questions in the dataset, 1,001,706 of which have at least one answer (89% of all questions) and 525,030 of which have a response that the questioner has selected as the best one (47% of all questions). After the temporal splits, we are left with 822,974 training questions, 78,854 validation questions, and 99,878 test questions. There are 2,173,139 answers and 588,688 users. Many users register just to ask a question and then never use their accounts again. In fact, the dataset has a median of 1 user-generated document (either a question or an answer) per user, with about 80% of users having no more than 2 documents. The text in the dataset is preprocessed by removing the HTML tags present in the original documents. In Table 1 we report the basic statistics for the dataset. Specifically: document length, measured in number of words; document score, which is the difference between the number of up-votes and down-votes assigned by the community; answers' count, the number of answers given to a question; comments' count, the number of user comments on a given question or answer; favorite count, the number of users that flagged the question as their favorite, showing their interest in that topic; and tags count, the number of tags associated with the question by the asking user. From Table 1 and Figure 2 we can notice that, as expected, most documents are short, with answers generally longer than questions. The dataset is available on Zenodo. The code to reproduce the dataset and the baselines is publicly available.

Task Definition
Even though SE-PQA can be used for many IR tasks (e.g., duplicate and related question retrieval or expert finding), we address here the cQA task only, illustrating how it can be tackled by using the resources in SE-PQA. The addressed cQA task focuses on satisfying the information needs expressed in user questions by retrieving relevant documents from a collection of historical answers posted by the community members. We infer the relevance of an answer $a$ to a question from the number of up-votes given by community members. Concerning the experiments involving personalized cQA models, we only consider relevant the single answer that is explicitly labeled as the best answer by the user who submitted the question.
More formally, we provide the following definition. Let A be a set of answers $\{a_1, \ldots, a_n\}$ posted by the members of the community and let q be a question asked by the user u. The objective of cQA is to retrieve a ranked list of $k$ answers $\{a_{q,1}, \ldots, a_{q,k}\}$ from A based on their relevance to q. In our experiments with SE-PQA we infer the relevance of an answer to a question from the up-votes given by community members. With regard to the experiments involving personalized cQA models, we consider instead only the answer $a_{q,u} \in A$ that is the most relevant for the question q and for the specific user u. This can be assessed on SE-PQA by considering the single answer that is explicitly labelled as the best answer by the user who submitted the question.
In order to address the above-defined cQA task, we preprocess the collection of answers of SE-PQA. Specifically, we discard answers with negative scores since they are assumed to be of low quality and not relevant to the cQA task. This cleaning step affects about 100k answers. As a result, 2,073,370 answers are left in the dataset. Moreover, we discard all the questions that have not received an answer.
To create the set of relevant answers for the questions, i.e., the gold standard, we consider the answers given to each question. A total of 525,030 questions out of 1,001,706 have an answer selected as the best by the user who asked the question. We are sure that this answer received a positive score from the community since we have removed all the answers with negative scores. Thus, the answer given to a question q of user u and selected as best can be considered as relevant both to the question q (positive score from the community) and to the user u (selection of the best answer). By using this information, we define two versions of this dataset: the base version, where we consider as relevant for a question all the answers having a positive score, and the personalized (pers) version, which, instead, considers relevant for both the user and the question only the single answer that the user selected as the best answer. We note that both versions of the dataset, base and pers, can be used for personalized cQA, with the following difference: in the pers version each query is potentially personalizable, while the base version also includes queries that cannot always be used to train personalized models, since the answer preferred by the user is not always available.
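The construction of the two relevance sets can be illustrated as follows. This is a hedged sketch, not the released code: the dictionary keys (`id`, `question_id`, `score`, `accepted_answer_id`) are hypothetical field names, and treating a score of at least 1 as "positive" for the base version is an assumption, since the text does not say how answers with a score of exactly zero are handled.

```python
def build_qrels(questions, answers, min_relevant_score=1):
    """Build the base and pers relevance judgments.

    questions: list of dicts with 'id' and 'accepted_answer_id' (or None).
    answers:   list of dicts with 'id', 'question_id' and 'score'.
    Returns two dicts mapping question id -> set of relevant answer ids.
    """
    # Cleaning step: answers with a negative score are discarded.
    kept = {a["id"]: a for a in answers if a["score"] >= 0}

    base, pers = {}, {}
    # base: every remaining answer with a positive score is relevant
    # (the threshold for "positive" is an assumption).
    for answer in kept.values():
        if answer["score"] >= min_relevant_score:
            base.setdefault(answer["question_id"], set()).add(answer["id"])
    # pers: only the answer selected as best by the asker is relevant.
    for question in questions:
        best = question.get("accepted_answer_id")
        if best in kept:
            pers[question["id"]] = {best}
    return base, pers
```

Questions without a selected best answer appear only in the base judgments, which mirrors the difference between the two versions described above.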
A variety of user-generated information from the training set can be used in the personalization phase. For each question, we include all the user posts (questions and answers of the user asking the question) that were written prior to the question being asked. This is done to avoid any data leakage for query-wise training, but the user data is not limited to these documents; in fact, one can also consider the social interactions between users, the tags assigned by the users to their previous questions along with their meaning, and the badges earned by users. Furthermore, the dataset includes the biographic text (about me) with which each user introduces her/himself, a rich set of numeric features (e.g., user reputation score, number of up-votes and down-votes of each post, number of views), and a set of temporal information (e.g., user creation date, last access date, post creation timestamp).
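The leakage-free user history described above amounts to a strict timestamp filter. The sketch below is illustrative only: the field names `user_id` and `created` are assumptions, and timestamps can be any comparable values (e.g., `datetime` objects or integers).

```python
def user_history(question, posts):
    """Return the asker's own posts written strictly before the question,
    oldest first, so that no later information leaks into the user profile.

    question: dict with 'user_id' and 'created'.
    posts:    iterable of dicts with 'user_id' and 'created'
              (questions and answers alike; field names are hypothetical).
    """
    earlier = (
        p for p in posts
        if p["user_id"] == question["user_id"] and p["created"] < question["created"]
    )
    return sorted(earlier, key=lambda p: p["created"])
```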

Comparison with available datasets
We survey other works contributing datasets for personalized IR and discuss their limitations. These datasets cover a wide variety of tasks, ranging from web search to product search and academic search.
In Table 2 we summarize the basic statistics of the main datasets used in the literature for personalized IR tasks. The AOL query log was released in 2006 and, even after the harsh criticism it received due to privacy-related issues, it remains to this date a widely employed resource for a variety of tasks, especially personalized ad-hoc retrieval. This dataset includes about 20 million Web queries issued by more than 657,000 users over three months (from 03/01/2006 to 05/31/2006). For each query, the dataset provides the user ID, the URL, and the rank of the web page clicked, if any. A major limitation of this dataset is that the web pages in the corpus are represented by their URLs and the text is not provided. To cope with this issue, researchers use a version of the corpus collected in 2017 [1] by scraping the text content of the web pages in the corpus. This additional dataset comes, however, with another problem: the content of web pages can change over time, and many documents might have changed from 2006 to 2017, thus making the dataset less reliable. Recently, a new version of the dataset was proposed, which used the Internet Archive to retrieve documents as they were in 2006 [18]. The AOLIA dataset [18] is a derivative of the original dataset (AOL Query Log [21]); it has been cleaned to generate a higher-quality query set. First, queries with no clicks (and consequently no relevant documents) are removed. Then, all the queries with domain references (.com, .org, etc.) and queries pointing to adult or illegal websites are eliminated. Furthermore, all queries with fewer than three characters are discarded as well. Finally, queries from users with fewer than twenty associated queries are removed in order to have enough user-related data to perform personalization. As a result, the AOLIA dataset contains about 1.3M documents and around 30k different users.
To tackle the issues that come with real-world datasets, some synthetic datasets and associated evaluation frameworks have been proposed in recent years: PERSON [27], Amazon product search [2], and MAG, a dataset based on the Microsoft Academic Knowledge Graph [5]. Tabrizi et al. [27] proposed PERSON as a synthetic personalized evaluation framework for IR based on citation networks. The authors base the dataset on the ArnetMiner citation network [28]. The idea is that, from the authors' perspective, the papers referenced in a document are somehow related to the document they are cited in. From an IR point of view, the document content (title or abstract) is considered as the user query, while the cited documents are assumed relevant to the query. The dataset based on the Microsoft Academic Graph (MAG) [5] follows a procedure similar to the one used by Tabrizi et al.; it uses a much larger citation graph to derive four different datasets, one for each of the following subjects: Political Science, Psychology, Computer Science, and Physics. Similarly to the PERSON framework, queries come from paper titles, and the previous papers of a user are used to build her personal profile. Paper titles from users authoring fewer than twenty previous papers are removed from the dataset, so that the user-related information necessary to define appropriate user models is available. As explained in Tabrizi et al. [27], such a dataset can be employed to develop and compare various user models, but cannot be used to assess personalized search effectiveness due to the strong assumptions made to determine relevant documents in this framework.
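The cleaning steps reported above for AOLIA can be approximated with a few filters. The sketch below is not the AOLIA pipeline: the record fields (`user_id`, `query`, `clicked_url`) and the short list of top-level domains are assumptions made for illustration, and the filter on adult or illegal websites is omitted because it would require an external blocklist.

```python
import re
from collections import Counter

# Illustrative list of domain suffixes; the source only mentions ".com, .org, etc."
DOMAIN_PATTERN = re.compile(r"\.(com|org|net|edu|gov)\b", re.IGNORECASE)


def clean_query_log(records, min_user_queries=20):
    """Apply click, domain-reference, length and user-activity filters in order."""
    kept = [
        r for r in records
        if r["clicked_url"]                          # drop queries with no click
        and not DOMAIN_PATTERN.search(r["query"])    # drop domain references
        and len(r["query"].strip()) >= 3             # drop very short queries
    ]
    # Finally, drop users left with too few queries to build a profile.
    per_user = Counter(r["user_id"] for r in kept)
    return [r for r in kept if per_user[r["user_id"]] >= min_user_queries]
```

Counting queries per user only after the other filters follows the order in which the steps are listed; whether the original pipeline counts before or after filtering is not stated.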
The Amazon product search dataset [2] is based on the Amazon Review dataset [10]. The dataset is created in a very synthetic manner, using item categories and properties to generate user queries: the terms from each product's category are concatenated following their hierarchy order to create a topic string. Stopwords and duplicate terms are removed from the topic string, which is then used as a query for the associated item. When removing the duplicated words, the terms from a lower category are preserved, e.g., Camera → Photo → Digital Camera Lenses is converted to "photo digital camera lenses". Given an item i purchased by a user u, the item is considered relevant for u, and the synthetic query is generated as explained above. This process comes with some drawbacks: it generates a low number of unique queries that do not resemble real-world queries, as a user rarely writes down all the categories of a product in hierarchical order to search for it.
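The query-generation procedure described above can be sketched in a few lines. This is an illustrative reimplementation, not the code of [2]: the stopword list is a tiny placeholder, and the category paths used in the usage note are assumed examples chosen so that the output matches the query string quoted in the text.

```python
# Tiny illustrative stopword list (an assumption, not the one used in [2]).
STOPWORDS = {"and", "&", "the", "of", "for"}


def category_query(category_path):
    """Turn a category hierarchy (top level first) into a synthetic query.

    Terms are concatenated in hierarchy order, stopwords are dropped, and for
    duplicated terms only the occurrence from the lowest category is kept.
    """
    terms = [
        term
        for category in category_path
        for term in category.lower().split()
        if term not in STOPWORDS
    ]
    # Keep a term only if it does not occur again later in the sequence.
    return " ".join(
        term for i, term in enumerate(terms) if term not in terms[i + 1:]
    )
```

For instance, `category_query(["Camera", "Photo", "Digital Camera Lenses"])` drops the first "camera" and keeps the one from the lowest category, giving "photo digital camera lenses".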
Table 2 shows that the proposed dataset is, in terms of corpus volume, very similar to the other datasets, and it has comparable values also for the other statistics. Actually, it is the largest one in terms of number of queries provided. In terms of relevance assessment, it is the only one that has been explicitly annotated by the users: by a single user for the best answer and by various community users for relevance, who up-voted an answer if they considered it relevant and down-voted it otherwise.

PRELIMINARY EXPERIMENTS WITH SE-PQA
In this section, we briefly describe the experimental setup and introduce the methods employed to showcase SE-PQA on the cQA task defined in Section 2.1. Finally, we report and discuss the results of the preliminary experiments conducted.

Experimental settings
We adopt a two-stage ranking architecture aimed at trading off effectiveness and efficiency by applying two increasingly accurate and computationally expensive ranking models. The first stage is inexpensive and recall-oriented. It aims at selecting for each query a set of candidate documents that are then re-ranked by the second, precision-oriented ranker. The first stage is based on Elasticsearch and uses BM25 as a fast ranker. To increase the recall in the set of candidate documents retrieved by the first stage, we optimize the BM25 parameters by performing a grid search driven by Recall at 100 on a subset of 5000 queries randomly sampled from the validation set. The optimal values found for b and k1 are 1 and 1.75, respectively. For the second, precision-oriented stage, we rely on a linear combination of the scores computed by BM25, a neural re-ranker based on a pre-trained language model, and, when used, a personalization model exploiting user history, represented by the tags used by the users. In all the experiments the second stage re-ranks the top 100 results retrieved with BM25.
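The second-stage score combination can be illustrated with the following minimal sketch. It is not the authors' implementation: the candidate layout, the score names (`bm25`, `neural`, `tag`) and the weight values are hypothetical, and the sketch assumes the component scores are already on comparable scales, since no normalization scheme is described here.

```python
def rerank(candidates, weights, top_k=100):
    """Re-rank the first-stage top-k by a weighted sum of component scores.

    candidates: list of dicts {"doc_id": ..., "scores": {"bm25": ..., ...}},
                assumed to be sorted by the first-stage BM25 ranking.
    weights:    dict mapping a score name to its combination weight;
                a score missing for a candidate counts as 0.
    """
    pool = candidates[:top_k]  # only the first-stage top-k is re-ranked

    def combined(candidate):
        return sum(
            weight * candidate["scores"].get(name, 0.0)
            for name, weight in weights.items()
        )

    return sorted(pool, key=combined, reverse=True)
```

Leaving a component out of `weights` reproduces the non-personalized setting, which makes it easy to compare runs with and without the personalization score.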
Neural models. We use the following three neural models in the second stage:
• The first model is MiniLM 10. This model was trained and tuned using billions of training pairs; given the presence of StackExchange pairs in its training data, we use the model as is, without any fine-tuning.
• The second one is DistilBERT 11. In this case, we fine-tune the model using all the training queries of SE-PQA. For each query, one positive document and two negative documents are randomly sampled: one negative from the list retrieved by BM25, and one in-batch random negative [11]. We fine-tune DistilBERT for 10 epochs, with a batch size of 16 and a learning rate of $10^{-6}$, by using Triplet Margin Loss [25] with a margin of 0.5.
• The third one is MonoT5 [20]. It is based on a T5 [24] re-ranker, which is fine-tuned on the MS MARCO passage dataset 12,13. We train two different versions of MonoT5: small and base. To further fine-tune MonoT5-small, we follow the same setting proposed in [20], i.e., a batch size of 128 and a learning rate of $10^{-3}$. For each query, we sample one positive document and one negative document from the list retrieved by BM25. To fine-tune MonoT5-base, we reduce the batch size to 64 due to hardware constraints. Also in this case the models are trained for a total of 10 epochs. Instead of fine-tuning the whole model, we rely on Adapter modules [22,23], composed of two feed-forward layers: the first one is a down projection of the input vector into an intermediate dimension, which is followed by a non-linear activation function; the second one is an up projection to the dimension of the input vector. Following Karimi et al. [15], the intermediate dimension is set to 48.
For DistilBERT and the two T5 models, we rely on AdamW [16] as the optimizer. For reproducibility purposes, we set the random seed to 42 for training DistilBERT and 0 for training the T5 models.
10 https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
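For readers unfamiliar with the training objective mentioned for DistilBERT, the triplet margin loss for a single (query, positive, negative) triplet can be written as a short function. This is a generic sketch of the standard loss, assuming Euclidean distance between embeddings (the distance function is not specified here); the margin default of 0.5 is the value reported above.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def triplet_margin_loss(query, positive, negative, margin=0.5):
    """max(d(query, positive) - d(query, negative) + margin, 0).

    The loss is zero once the positive is closer to the query than the
    negative by at least `margin`; the Euclidean distance is an assumption.
    """
    return max(euclidean(query, positive) - euclidean(query, negative) + margin, 0.0)
```

With two negatives per query, as in the sampling scheme described above, one such term would be computed per negative; how the terms are aggregated over a batch is not detailed here.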
Personalized TAG model for cQA. For a given answer a produced in response to a query q formulated by a user u, a personalization score is computed as explained below. As previously explained, this score is linearly combined with the BM25 score and with the score produced by the neural re-ranker. Given a question q, asked by user u at time t, let $T_{u,t}$ be the set of tags assigned by u to all her/his questions posted before t (including q). $T_{u,t}$ thus represents the interests of u as expressed in her/his previous interactions. The authors of the answers to query q do not have the possibility of
11 https://huggingface.co/distilbert-base-uncased
12 https://huggingface.co/castorini/monot5-small-msmarco-10k
13 https://huggingface.co/castorini/monot5-base-msmarco-10k
the T5 models due to their long computation time. For a fair comparison, we performed for each community the optimization of the $\lambda$ weights on single-domain validation data. Differently from the multi-domain results shown in Table 3, we notice that the contribution of the TAG model is lower, and in some cases missing. Specifically, for 25 out of 50 communities, personalization does not lead to any improvement, i.e., $\lambda_{TAG} = 0$. On the other 13 communities, we do not observe statistically significant improvements for P@1 over the non-personalized methods. Since statistical significance is also affected by the size of the sample, we also computed the performance metrics averaged over all the runs with single-domain data. As expected, the absolute metrics are slightly higher for single-domain tests due to the higher recall in the first-stage retrieval. In fact, by considering the data of a single community at a time, we drastically reduce the size of the indexed collection, allowing the first-stage ranker to perform better. However, in terms of the absolute performance boost due to the TAG model, we achieve a 2% improvement on P@1 when using all communities together, while the boost decreases to 1.1% when considering the communities separately. The results of these experiments are reported in full on the SE-PQA Zenodo and GitHub pages. Here, in Table 5, we report the results for the 12 communities for which personalization achieves statistically significant improvements.

UTILITY AND PREDICTED IMPACT
The SE-PQA resource we make available to the research community is a step forward toward a fair and robust evaluation of personalization approaches in Information Retrieval. The features provided with the dataset include explicit signals to create relevance judgments and a large amount of historical user-level information that allows designing and testing classical and novel personalization methods. We expect the SE-PQA dataset to be useful for many researchers and practitioners working in personalized IR and in the application of machine/deep learning techniques for personalization. In recent years, the IR community has devoted considerable effort to studying personalization. However, a comprehensive dataset for evaluating and comparing different approaches is still missing. Researchers mainly rely on synthetic datasets or use non-public data, which makes the comparison between different methods less reliable or, worse, not possible at all. The SE-PQA dataset advances this research area by filling this gap with a large-scale dataset covering the activity of StackExchange users over a period of 14 years. For this reason, we expect that the dataset will impact the research community working on personalized IR, as it provides a single common ground of evaluation built on questions and answers from real users socially interacting via a community-oriented web platform.

CONCLUSION AND FUTURE WORKS
This paper discussed the characteristics of SE-PQA (StackExchange - Personalized Question Answering), a large real-world dataset including about 1 million questions and 2 million associated answers contributed by the users of StackExchange communities. The data comes with a rich set of user-level features modeling the interactions among the members of the online communities, e.g., the positive or negative votes received by questions and answers, the tags associated with questions, the comments that other users might have written under a question or an answer, the users' autobiographies, their reputation score, and the number of views received by their profile.
We detailed all the information available in the dataset and discussed how it can be exploited for training and evaluating classical and personalized models addressing the cQA task. As exemplifying methodologies, we focused on IR approaches for this task based on a two-stage architecture where the second, re-ranking stage exploits a combination of the scores computed by BM25, DistilBERT/MiniLM/T5, and TAG models. The results of the preliminary experiments conducted show that personalization works effectively on this dataset, improving by a statistically significant margin, in most cases, over state-of-the-art methods based on pre-trained large language models.
The analysis conducted and the peculiarities of the SE-PQA resource suggest several lines of future investigation. For example, in this work we employed a relatively simple user model for personalization; we leave to future work the development of more complex personalized models that could exploit the user features of SE-PQA not used in the proposed models.

Figure 2: Word length distribution for questions and answers in SE-PQA.

Table 1: Basic feature statistics for questions and answers.

Table 2: Comparison between SE-PQA and other text-based datasets for personalized IR. All the datasets are in English.