SE-PEF: a Resource for Personalized Expert Finding

The problem of personalization in Information Retrieval has been studied for a long time. A well-known issue in this area is the lack of publicly available datasets that can support a comparative evaluation of personalized search systems. To contribute in this respect, this paper introduces SE-PEF (StackExchange - Personalized Expert Finding), a resource for designing and evaluating personalized models for the Expert Finding (EF) task. The contributed dataset includes more than 250k queries and 565k answers from 3,306 experts, annotated with a rich set of features modeling the social interactions among the users of a popular cQA platform. The results of the preliminary experiments show the suitability of SE-PEF for training and evaluating effective EF models.


INTRODUCTION
Expert finding (EF) is a well-studied problem in community question answering (cQA). The aim of EF in a cQA scenario is to identify users, namely the experts, who might be able to correctly answer a given question on a specific topic. This task is important for many applications, e.g., crowd-sourcing, and for cQA platforms that wish to increase user engagement by precisely identifying the experts to whom to propose the questions about a given topic.
Personalization is gaining traction in many IR [3,4,7,15] and NLP [6] tasks, but it is not largely adopted in EF due to the lack of publicly available, large-scale datasets containing user-related information. In this research paper, building upon our previous work [10], which presented a dataset for personalized community question answering, we introduce SE-PEF (StackExchange - Personalized Expert Finding), a large dataset rich in user-level features that can be leveraged for training, evaluating, and comparing both personalized and non-personalized models for the Expert Finding task. SE-PEF comprises around 250k questions and 560k associated answers provided by 3,306 experts, and it inherits a rich set of features modeling the social interactions within the user community. To train personalized models, we keep the user-related data as they are provided in the original dataset: users' past questions and answers, their social autobiography, their reputation score, and the number of profile views they have received.
In the case of EF, personalization can improve the perceived service quality in different ways. For example, when the requesting user is interested in multiple topics, identifying an expert by also considering the requesting user's interests can improve the trust in the answer received. A similar effect can be obtained by preferring experts that are closer to the requesting user based on past interactions or follower/followee dynamics. In summary, the contributions of this paper are the following:
• We provide and make available the SE-PEF dataset, a public resource consisting of a comprehensive corpus including around 255k questions and 560k answers provided by 3,306 expert users. The richness and variety of features provided with the dataset enable its use for the design and evaluation of personalized EF methods.
• We report a preliminary comparison of the performance of different EF methods applied to the questions, answers, and users in SE-PEF. The results confirm that models based on deep learning outperform traditional retrieval models in effectiveness, and that exploiting personalization features yields a significant performance boost.
The rest of the paper is organized as follows. Section 2 introduces the SE-PEF dataset, reports some statistics about its content, and details the EF task addressed in this paper. Section 3 compares SE-PEF with other publicly available resources in the field. Section 4 presents a preliminary comparison of traditional and personalized models for EF applied to SE-PEF. In Section 5 we discuss the utility and the practical implications of the new resource. Finally, Section 6 concludes the work and draws future lines of investigation. In [10], the authors show that personalization is more effective when multiple communities are combined in the dataset rather than a single one. Meanwhile, previous works using StackExchange for EF focus only on a single community, or a portion of one, thus neglecting the domain diversity characterizing the questions and the various experts [8,13,17].

Accessing the SE-PEF dataset
The SE-PEF dataset is made publicly available on Zenodo 2 under the conditions detailed in the included CC BY-SA 4.0 license agreement, and the code used for data creation, training, hyperparameter optimization, and testing is available on GitHub 3 .

SE-PEF Definition
In the following, we introduce the specific instance of the EF task in which we are interested and illustrate how to address it by using the resources in SE-PEF.
1 https://stackexchange.com
2 https://doi.org/10.5281/zenodo.8332747 [18]
3 https://github.com/pkasela/SE-PEF
Our EF task shares the same goal as the question-answering task: satisfying users' needs in a cQA forum in the most effective way. In a cQA forum, a user may ask a question that does not have any related answer in the answer collection. Since not receiving any answer can create a sense of frustration in the user posting a question, it is important for the community and the platform to identify and, eventually, notify domain experts who may be able to answer the question correctly. Finding good matches between unanswered questions and expert users can remarkably improve engagement with the community. In fact, on the one hand, users posting a question can receive correct answers from the alerted experts in a short time; on the other hand, expert users can dedicate their time to answering questions specifically related to their expertise rather than searching for questions they can answer.
Formally, let $E$ be a set of expert users $\{e_1, \dots, e_m\}$. Given a question $q$ asked by user $u$, the EF task consists in retrieving from $E$ a list of $k$ experts $\{e_{u,1}, \dots, e_{u,k}\}$, ordered by their likelihood of answering question $q$ correctly.
StackExchange data has been used in several EF papers, e.g., in [8,13,17]. These works, however, mostly focus on solving the expert finding task for a single community. SE-PEF instead incorporates information from multiple communities, providing a dataset that can also be used to investigate models for generalist cQA forums that may not have separate channels for the discussed topics.
To create the dataset, we define the best answer for a given question as the answer selected as the best one by the user who asked the question, if available; otherwise, we take as best answer the one with the highest score, provided it received a score greater than a fixed threshold $\tau_s$ 4 . We note that this assumption (treating the most up-voted answer as the best one when no answer has been flagged as best by the asker) is used only for the expert detection procedure, explained below, and not as a relevance judgement for the test data. In the test set we only consider the best answer, i.e., the answer explicitly labeled as such by the user asking the question. Exploiting high-scored answers as best answers allows us to increase the number of questions considered as successfully answered. This choice is justified by the observation that 87.6% of the answers selected as best by the asker are also the most up-voted ones. Moreover, we observed that many users, once their information need is satisfied by a good answer, do not bother to mark it as the best one.
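The best-answer rule described above can be sketched as follows; field names and the threshold value are illustrative assumptions, not the paper's actual code:

```python
def best_answer(answers, accepted_id=None, score_threshold=5):
    """Pick the best answer for a question following the rule above:
    prefer the answer explicitly accepted by the asker; otherwise fall
    back to the highest-scored answer, but only if its score exceeds a
    fixed threshold. `answers` is a list of (answer_id, score) pairs;
    the threshold value 5 is a hypothetical setting."""
    if accepted_id is not None:
        return accepted_id
    if not answers:
        return None
    top_id, top_score = max(answers, key=lambda a: a[1])
    return top_id if top_score > score_threshold else None
```

The fallback branch is what lets more questions count as successfully answered, at the cost of the mild assumption discussed above.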
At this point, to identify the set of experts $E$, we follow the procedure indicated by Dargahi et al. [8] for their StackOverflow dataset:
• For each community $c$, let $U_c$ be the set of users and $B_c$ the set of best answers, computed as explained above, in community $c$. For each user $u \in U_c$, let $A_{u,c} = \{a_{u,1}, a_{u,2}, \dots, a_{u,n}\}$ be the set of answers given by $u$ in $c$;
• Remove all users who do not have at least $\tau_b$ answers selected as best answers, i.e., define $E'_c = \{u \in U_c : |A_{u,c} \cap B_c| \ge \tau_b\}$;
• Compute the acceptance rate of each user $e \in E'_c$, given by the ratio between the number of accepted answers and the total number of answers of the user in that community: $r_{e,c} = |A_{e,c} \cap B_c| / |A_{e,c}|$;
• Compute the average acceptance rate $\bar{r}_c$ over the users in $E'_c$ and select as experts only those users whose acceptance rate is above the community average: $E_c = \{e \in E'_c : r_{e,c} > \bar{r}_c\}$.
The final set of experts $E$ is defined as the union of the sets of experts found for each community. The above process ensures that the selected experts have a high level of engagement and write high-quality answers with a high acceptance rate.
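The per-community steps above can be sketched as a small function; the input shapes and the `min_best` parameter name are assumptions for illustration:

```python
def select_experts(answers, best_ids, min_best=10):
    """Sketch of the expert-detection procedure described above, run on a
    single community. `answers` maps user id -> list of answer ids posted
    by that user; `best_ids` is the set of best-answer ids."""
    # How many of each user's answers were selected as best answers.
    accepted = {u: sum(1 for a in ans if a in best_ids)
                for u, ans in answers.items()}
    # Keep only users with at least `min_best` accepted answers.
    candidates = {u for u in answers if accepted[u] >= min_best}
    if not candidates:
        return set()
    # Acceptance rate = accepted answers / total answers, per candidate.
    rate = {u: accepted[u] / len(answers[u]) for u in candidates}
    # Experts: candidates strictly above the community-average rate.
    avg = sum(rate.values()) / len(rate)
    return {u for u in candidates if rate[u] > avg}
```

Running this per community and taking the union of the returned sets yields the final expert set $E$.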
In Figure 2 we show the basic structure of the JSON records provided for training, validation, and test. The user_questions and user_answers fields contain the identifiers (ids) of the questions and answers, written before the current question timestamp, of the user asking the question. The expert_questions and expert_answers fields contain the ids of the questions and answers of the expert who gave the best answer. The data also comes with a collection of questions and a collection of answers: two simple JSON files whose keys are the ids of the questions and answers, respectively, and whose values are the corresponding texts. The data is also provided with multiple data frames, curated from the original data found on archive.org, which can be used to add more features. These features are described on the Stack Exchange website. 5
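As a minimal sketch, the records described above can be parsed and joined against the answer collection like this (function names are illustrative; field names follow Figure 2):

```python
import json

def parse_queries(lines):
    """Parse SE-PEF query records, one JSON object per line, from any
    iterable of lines (e.g., an open JSONL file)."""
    return [json.loads(line) for line in lines if line.strip()]

def asker_history(query, answer_texts):
    """Resolve the asker's past answers by joining the ids in the
    `user_answers` field against the answer collection, a dict mapping
    answer id -> answer text."""
    return [answer_texts[a] for a in query["user_answers"] if a in answer_texts]
```

The same id-based join works for `user_questions`, `expert_questions`, and `expert_answers` against the respective collections.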

COMPARISON WITH AVAILABLE DATASETS
Concerning the EF task, there are plenty of datasets available [12], some of them based on data from cQA websites. For example, StackExchange is used to create a pre-trained BERT model for the EF task in [13]. However, that work focuses only on designing an EF pre-training framework based on a specific augmented masked language model able to learn the question-expert matching task. Other EF datasets derived from cQA forums come from: StackOverflow [8,20], Yahoo! Answers [9,21], Wondir [14], and Quora [22]. Recently, a domain-specific expert finding task was tackled using Avvo [1], a legal cQA website, but in this case personalization is not possible because users are anonymous. In Table 1 we report the basic statistics of some of the datasets commonly used in EF for comparison.
A common issue with the existing datasets is that the experts are, in many cases, not well defined, and determining what makes a user an expert is not trivial. Furthermore, most of the works cited above either rely on a private dataset, or refer to a specific domain and make very strong simplifying assumptions about the task addressed. Conversely, SE-PEF will be made publicly available and has a well-defined notion of expert, inspired by reasonable hypotheses common to other works [11,13,17]. Furthermore, it provides a rich set of social features usable for personalization and combines data from multiple communities, which, as already stated, increases dataset diversity and opens the possibility of exploiting cross-domain user information for EF.
To build SE-PEF for EF we followed the procedure detailed in Section 2.2, setting the best-answer score threshold to 5 and the minimum number of best answers per expert to 10. Finally, we also remove from the training dataset the questions answered by experts who previously posted fewer than 5 answers, to avoid the cold-start problem for expert modeling. Using this procedure, starting from the dataset presented in [10], we obtain SE-PEF, which includes 81,252 users, 3,306 experts, 252,501 queries (218,647 for training, 16,710 for validation, and 19,995 for testing), and 564,690 answers.

PRELIMINARY EXPERIMENTS WITH SE-PEF
This section provides a concise overview of the experimental setup and introduces the methods employed to showcase the capabilities of SE-PEF in the EF task defined in Section 2.2. Finally, we report and discuss the results of the conducted experiments.

Experimental settings
For our EF task, we use a retrieval-based approach [16], and simply cast the EF task to a cQA task where we use the similarity scores of the retrieved documents as experts' scores. We explain this in detail in the following paragraphs.
We adopt a two-stage ranking architecture that prioritizes efficiency and recall in the first stage. The primary objective of this first stage is to select, for each query, a set of candidate documents that are then re-ranked in a second stage by a precision-oriented ranker. The first stage is based on Elastic Search 6 and uses BM25 as a fast ranker. We use the same BM25 hyperparameters as indicated in [10]: 1 and 1.75 for b and k1, respectively. In the second, precision-oriented stage, the retrieved documents are re-ranked using a linear combination of the available scores: the BM25 score, the similarity score computed by a neural re-ranker, and, when used, the score computed by a personalization model exploiting the user history. In all the experiments the second stage re-ranks the top-100 results retrieved with BM25.
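The second-stage combination can be sketched as follows; this is a minimal illustration of a weighted sum over normalized score lists, not the paper's actual implementation, and the min-max normalization scheme is an assumption:

```python
def min_max(scores):
    """Min-max normalize a dict of scores to [0, 1]; a constant or empty
    dict maps everything to 0 (normalization scheme is an assumption)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}

def rerank(bm25, neural, tag, w_bm25, w_neural, w_tag):
    """Second-stage sketch: weighted sum of normalized BM25, neural, and
    (optional) personalization scores over the candidate documents.
    Each dict maps doc id -> score; a missing score contributes 0."""
    docs = set(bm25) | set(neural) | set(tag)
    n_b, n_n, n_t = min_max(bm25), min_max(neural), min_max(tag)
    combined = {d: w_bm25 * n_b.get(d, 0.0)
                   + w_neural * n_n.get(d, 0.0)
                   + w_tag * n_t.get(d, 0.0) for d in docs}
    return sorted(combined, key=combined.get, reverse=True)
```

In the paper's setting the candidate set is the BM25 top-100, and the weights are tuned on the validation set as described below.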
{
  "id": "academia_49906",
  "text": "Including teaching statement in RA position application package [...]",
  "timestamp": 1438693784,
  "user_id": 3422261,
  "user_questions": ["academia_28238", "academia_28240", ...],
  "user_answers": ["genealogy_4058", "academia_37538", ...],
  "tags": ["application"],
  "expert_ids": [339125],
  "expert_questions": [[]],
  "expert_answers": [["expatriates_2520", "academia_18991", ...]]
}
Non-personalized models. As neural re-ranker in the second stage we use the following two models, also used in [10]:
• DistilBERT. This model is obtained by fine-tuning the pre-trained distilbert-base-uncased model 7 for the task of answer retrieval tackled in [10]. We use the same training data and experimental settings used in [10].
• MiniLM, based on MiniLM-L6-H384-uncased 8 . This model is used as is, without any fine-tuning.
Personalized model for EF. For building the personalized EF model we exploit the folksonomy arising from tags, very similarly to [10]. This model, which we call TAG from here on, aims at capturing the similarity between the topics addressed by the asker in their current and previous questions and those of the questions a considered expert answered in the past. Given a question $q$, asked by user $u$ at time $t$, let $T_{u,t}$ be the set of tags assigned by $u$ to all their questions posted before $t$ (including $q$). $T_{u,t}$ thus represents the interests of $u$ as expressed in their previous interactions. The authors of the answers to query $q$ cannot explicitly tag their answers, so for each answer we consider the tags associated with the answered question.
The way we represent the expert user is slightly different: the expertise, in this case, is based on a pre-computed, static representation $T_e$ of each expert $e$ in SE-PEF. This representation considers the tags $T'_e$ of all the questions answered by $e$ that are included in the training set. To build $T_e$ from $T'_e$ we perform an additional step consisting in discarding the tags with a frequency lower than the median tag frequency in $T'_e$. This tag pruning step reduces the noise due to non-relevant tags that may have appeared in a few answered questions even though the expert is not actually knowledgeable on those topics. As in the previous task, the EF TAG score $s_{q,e}$ for expert $e$ is finally computed by matching the tags in $T_{u,t}$ against the expert representation $T_e$.
Score computation and combination. Given the list of answers $R$ retrieved and re-ranked with the above models, we observe that several answers in $R$ may come from the same expert. This is potentially an important feature for characterizing their expertise. Therefore, to obtain the expert-level score, we sum up the scores assigned to all the answers in $R$ coming from the same expert. Moreover, since the TAG model returns a score for all experts in the dataset, even those without an answer in $R$, we assume that these experts receive a score contribution equal to 0 from BM25 and the non-personalized models. Finally, the scores from BM25 and the personalized and non-personalized models are combined by computing the weighted sum of the normalized scores, using weights $\lambda_{25}$, $\lambda_{D/M}$, and $\lambda_{TAG}$, with $\sum \lambda = 1$. The $\lambda$ values are optimized on the validation set by performing a grid search in the interval [0, 1] with step 0.1.
7 https://huggingface.co/distilbert-base-uncased
8 https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
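The expert-level aggregation and the weight grid search described above can be sketched as follows; the function names and the `evaluate` callback are assumptions for illustration:

```python
from collections import defaultdict
from itertools import product

def expert_scores(doc_scores, author_of):
    """Aggregate answer-level scores into expert-level scores by summing
    the scores of all retrieved answers written by the same expert."""
    totals = defaultdict(float)
    for doc, score in doc_scores.items():
        totals[author_of[doc]] += score
    return dict(totals)

def grid_search(evaluate, step=0.1):
    """Enumerate weight triples on a [0, 1] grid that sum to 1 and keep
    the one maximizing a validation metric. `evaluate(weights)` is assumed
    to return a validation score (e.g., MRR@5) for a weight assignment."""
    grid = [round(i * step, 2) for i in range(int(1 / step) + 1)]
    best_w, best_score = None, float("-inf")
    for w1, w2 in product(grid, repeat=2):
        w3 = round(1.0 - w1 - w2, 2)
        if w3 < 0:  # enforce the sum-to-one constraint
            continue
        score = evaluate((w1, w2, w3))
        if score > best_score:
            best_w, best_score = (w1, w2, w3), score
    return best_w, best_score
```

With step 0.1 and three weights constrained to sum to 1, only 66 combinations need to be evaluated, so the search is cheap.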
Evaluation Metrics. For the expert finding task we adopt the following evaluation metrics: Precision at 1 (P@1), Recall at 3 (R@3), Recall at 5 (R@5), and Mean Reciprocal Rank at 5 (MRR@5). The cut-offs are set low as we prioritize identifying experts at the top of the ranked lists. All the metrics are computed using the ranx library [2,5].
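For reference, the per-query quantities behind these metrics can be written in a few lines of plain Python (the paper computes them with ranx; this stdlib sketch is only illustrative):

```python
def reciprocal_rank(ranked, relevant, k=5):
    """MRR@k component for one query: 1/rank of the first relevant expert
    within the top k of the ranked list, or 0 if none appears."""
    for rank, expert in enumerate(ranked[:k], start=1):
        if expert in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked, relevant, k):
    """R@k for one query: fraction of relevant experts in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)
```

Averaging these values over all test queries gives MRR@5 and R@k; P@1 is the special case of checking whether the top-ranked expert is relevant.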

Experimental Results
The results are reported in Table 2. The symbol * indicates a statistically significant improvement over the respective non-personalized method, i.e., the one not using any contribution from the TAG model. Statistical significance is assessed with a Bonferroni-corrected two-sided paired Student's t-test with 99% confidence. The column labeled $\lambda$ reports the optimized weights, found using the validation set, used for combining the scores computed by the BM25, DistilBERT / MiniLM, and TAG models. In the cases in which the optimal weight for the BM25 score is equal to 0, i.e., BM25 does not contribute to re-ranking, we omit BM25 from the name of the model and $\lambda_{25} = 0$ from the weights column.
Differently from the cQA task tackled in [10], we observe that on EF the performance gap between DistilBERT and MiniLM is sensibly reduced. The best-performing model among those tested is in fact DistilBERT + TAG, which significantly outperforms both DistilBERT and MiniLM. Analogously to the cQA task, personalization is very effective for EF. The contribution of the TAG model significantly improves all the non-personalized methods, with a performance boost exceeding three points in MRR@5 for the DistilBERT model. Looking at the optimized $\lambda$ weights reported in Table 2, we see that the TAG model contribution is much higher for the EF task ($\lambda_{TAG} \ge 0.5$) than for the cQA task in [10] ($\lambda_{TAG} \le 0.3$).

UTILITY AND PREDICTED IMPACT
The SE-PEF resource we make available to the research community is a step toward a fair and robust evaluation of personalization approaches in Expert Finding. The features inherited from [10] include explicit signals to create relevance judgments and a large amount of historical user-level information to design and test classical and novel personalization methods. We expect the SE-PEF dataset to be useful for many researchers and practitioners working on personalized IR and on the application of machine/deep learning techniques for personalization. In recent years, significant efforts have been dedicated to the study of personalization techniques. However, there is still a lack of a comprehensive dataset for evaluating and comparing different approaches, which makes the comparison between methods less reliable or, worse, not possible at all.
For this reason, we expect that the proposed dataset will impact the research community working on personalized EF as it provides a common ground of evaluation built on questions, answers, and experts from real users socially interacting via a community-oriented web platform.
In our proposal, experts can have different domain backgrounds and share interests and knowledge in various communities. We also expect that training on such rich and diverse data as SE-PEF should produce more robust and generalizable models.

CONCLUSION AND FUTURE WORK
SE-PEF (StackExchange - Personalized Expert Finding) is an extension of a previous work [10], which presented a large real-world dataset for personalized cQA. The data inherits a rich set of user-level features modeling the interactions among the members of the online communities.
Our study provided a detailed description of the data creation and training process. Furthermore, we illustrated the methodologies adopted, focusing explicitly on IR techniques. We discussed how the computed similarity scores can be aggregated and combined to target the EF task. For retrieval, we adopted a two-stage architecture, where the second stage re-ranks the candidates using an optimized combination of the scores generated by the BM25, DistilBERT/MiniLM, and TAG models.
The preliminary experiments conducted proved the effectiveness of personalization on this dataset, surpassing methods relying on pre-trained and fine-tuned large language models by a statistically significant margin. We expect other researchers to develop more complex strategies to improve results on the SE-PEF resource. We leave such research as future work for us and for the IR community working on personalized IR.

Figure 2: Example of a line of the JSONL file provided.

Table 1: Comparison between SE-PEF and other cQA datasets for EF. When a specific definition of an expert is provided, we distinguish normal users from experts.

Table 2: Results for the SE-PEF EF task.