SciMine: An Efficient Systematic Prioritization Model Based on Richer Semantic Information

Systematic review is a crucial method that has been widely used by scholars from different research domains. However, screening for relevant scientific literature from paper candidates remains an extremely time-consuming process, so the task of screening prioritization has been established to reduce the human workload. Various human-in-the-loop methods have been proposed to solve this task using lexical features. These methods, even though they achieve better performance than more sophisticated feature-based models such as BERT, omit rich and essential semantic information and therefore suffer from feature bias. In this study, we propose a novel framework, SciMine, to accelerate this screening process by capturing semantic feature representations from both background knowledge and the corpus. In particular, based on contextual representations learned from pre-trained language models, our approach utilizes an autoencoder-based classifier and a feature-dependent classification module to extract general document-level and phrase-level information. A ranking ensemble strategy is then used to combine these two complementary pieces of information. Experiments on five real-world datasets demonstrate that SciMine achieves state-of-the-art performance, and comprehensive analysis further shows the efficacy of SciMine in mitigating feature bias.


INTRODUCTION
Systematic review aims to use systematic and explicit methods to collect, identify, and critically appraise relevant studies about one research theme [12,54]. It is an important research method that scholars from different domains such as Medicine, Agriculture, and Biology have widely used. After querying databases of scientific literature, scholars need to screen the retrieved unordered set of paper candidates to ensure the comprehensiveness and correctness of the systematic review. Since the screening process can be highly expensive and time-consuming, the task of screening prioritization has been established to reduce the human workload. Formally, screening prioritization refers to the task of searching for relevant documents given an unordered set of paper candidates. Various automatic systems have been developed to learn user needs from their screening records and return the relevant papers to them.
Recent studies have demonstrated the effectiveness of building an active learner to solve this task [10,13,44,54,56,67]. As shown in Figure 1, in each human-in-the-loop iteration, a classifier is trained on the set of labeled documents and predicts on the set of unlabeled candidates to find the one that is most likely to be relevant. The user then screens this document and feeds the label back to the learner for incremental training. Most applications build the classifier on lexical features [13,44,54,56]. The current state-of-the-art model ASReview [54], which uses TF-IDF for feature extraction with a Naive Bayes classifier, can outperform models that use more sophisticated feature extractors such as Doc2vec and BERT. Due to the sparsity of scientific phrases in the corpus, however, these models may drift toward spurious patterns [19,30,32,53]. Intuitively, models should not neglect the semantic information from both the corpus and background knowledge.
Making use of richer semantic information, though appealing, poses its own challenges. The first challenge is how to infer background knowledge across different scientific domains. A pre-trained language model (PLM) seems a good fit, but recent studies [58,64] show that PLM-based neural rankers cannot guarantee better performance than lexical methods. Second, the classifier may suffer from feature bias. Because relevant documents are far fewer than irrelevant ones, the classifier may learn from frequent but theme-irrelevant phrases. We further analyze this phenomenon in Section 4.2. The third challenge comes from the difficulty of capturing the minimal difference between relevant and irrelevant documents. For example, in a study whose research theme is nudging healthcare professionals, the representations of "Just-in-time evidence-based e-mail reminders in home health care: impact on nurse practices" and "Just-in-time evidence-based e-mail reminders in home health care: impact on patient outcomes" are very close in the latent space since they have many phrases in common. However, a domain expert can tell that only the first one is relevant because "nurse" is more closely related to "healthcare professionals" than "patient" is.
To address the above challenges, we consider a framework that mimics the behavior of a domain expert by categorizing documents using scientific background knowledge and semantically relevant features. We name the framework SciMine, which consists of three main components, as shown in Figure 2. First, to obtain domain knowledge, we adopt SPECTER [11], a Transformer language model pre-trained on the citation network of scientific literature, to obtain document embeddings, and a pre-trained masked language model (MLM) to learn phrase embeddings. Second, to cope with feature bias, we adopt a variational autoencoder model to rank the candidate documents while preserving semantic information from the pre-trained representations. Third, to further integrate phrase-based rationale, we use a community detection algorithm to find phrase-level features and train a classifier on these features to provide a second ranking of the candidates. Finally, the two rankings are merged using an ensemble approach. We adopt the standard query strategy and experimental settings of previous work [16,54]. Experiments on five standard benchmarks show that SciMine outperforms existing methods significantly, achieving the best reported results in the literature. In addition, for a detailed understanding of the underlying mechanisms, we conducted a human study with ecological experts, resulting in a new dataset, AgriDiv, which consists of 1,505 documents, 129 of which are relevant to the research theme "investigating the agriculture diversification in rice production".
In summary, we build SciMine, a human-in-the-loop framework for efficient screening prioritization. To our knowledge, we are the first to show the efficacy of adopting contextual representations from pre-trained language models in this task. SciMine saves more than 10% more workload than the current SOTA. Furthermore, to better understand intrinsic user screening habits in systematic review, we work with ecological scientists to create a novel dataset, AgriDiv, which includes research papers in the ecological domain. We perform a user study on this dataset and gain some valuable observations for future work. We open-source the codebase and the dataset.

RELATED WORK
Screening Prioritization refers to the task of searching for relevant documents among an unordered set of paper candidates. A series of applications involving machine learning [1,13,16,18,44,46,49,54,56,59,67,72] have been proposed for screening prioritization. These models can be mainly categorized into "one-off" learning [58] and iterative learning. For "one-off" learning, seed information such as the user's search query [1,49,62], the theme of the research [50], or a set of prior knowledge [25,57,59] is used to directly rank the document candidates. For human-in-the-loop iterative learning, the model iteratively accumulates user feedback, which makes screening more efficient and provides an intuitive user experience. Models of this kind vary in model input, query strategy, retraining strategy, and stopping criteria. For example, FASTREAD [67] utilizes uncertainty-based sampling to query documents for labeling and retrains every ten iterations. Rayyan [44] takes user-provided words and citations as input and stops when the model can no longer be improved. CAL [13,14,18] and ASReview [54] both take a set of labeled documents as input, but the former designs a "knee" method to automatically stop the iterations while the latter lets the user decide when to stop. One common point of these human-in-the-loop works is that they all classify over lexical features such as TF-IDF. Another recent work [64] tests fine-tuning the BERT model in every iteration and concludes that this method underperforms the lexical method when the corpus has very different textual characteristics. Thus, while studies [16,20] have demonstrated the power of lexical features via comprehensive experiments, the power of representations generated from pre-trained language models for this task remains unstudied. In this work, we not only analyze why the advanced contextual representation can beat lexical features, but also propose a model that can utilize document-level and phrase-level information.
Active Learning is a type of machine learning that allows the model to choose the training samples it would like to learn from. It has been widely used in text classification [3,15,45,51,66,69] by concentrating the human annotation effort on the most informative data points, which can boost model performance significantly [30,35]. The problem setting for AL in text classification is to let the model query and train on a certain number of samples from the training set in each iteration, then test the performance of the model on a different test set. Though our model is one kind of active learning model, there are mainly two differences between  [2,34,43]. Some work on paper recommendation [6,20,48] utilizes additional author or citation information to recommend papers, while we study how to recommend purely based on the paper's title and abstract. By using Transformers [55], these models usually pre-train on large domain-specific data [21,26,47] or data from multiple domains [5,11,32]. Besides learning from the text of the paper, they often employ citation features to capture inter-document relations [11,42]. Although Transformer-based models dominate many tasks related to scientific literature, previous attempts to adapt PLMs to screening prioritization [58] fall short of classic lexical methods. Therefore, how the representation learned by a PLM can boost the screening process remains an unsolved problem. In this work, we utilize PLMs to generate both document embeddings and phrase embeddings and demonstrate that this contextualized information can outperform lexical features.
Text Classification with less training data has been studied for a long time, and several lines of work have been proposed. Semi-supervised classification models [9,40,60,63] generate augmented instances by creating real text segments or hidden states of the model. Zero-shot text classification [28,70] generalizes the knowledge learned from seen classes and transfers it to unseen ones. Even though these lines of work require less training data than traditional classification models, they still ask for more human annotations than ours to get started. Weakly-supervised classification [36,37,38,39,61] tries to categorize documents based on word-level descriptions by using seed information such as category-related words. However, the practical user need in our task is that the model should train fast in each iteration, while it usually takes a long time to train a weakly-supervised model. In addition, these models rely on the correlation between words and topics, but in our task, due to the complexity of the scientific literature, it is difficult to define a topic. In SciMine, we design a phrase-level feature classification module to help the document-level classifier by detecting important phrase-level features from the corpus.

TASK DEFINITION
Formally, a systematic review corpus $D$ is about one particular research theme and is collected by scholars querying databases of scientific literature. A candidate document $d$ in this corpus is either relevant to the scholars' research theme ($d \in R$) or irrelevant ($d \in I$). To help scholars find relevant documents for their research, an active learner learns and finds relevant documents iteratively. As shown in Figure 1, a complete human-in-the-loop iteration $t$ contains the following steps: (1) a classification model is (re)trained on the set of user-labeled documents $D_L$ and predicts on the remaining unlabeled document set $D_U$; (2) the active learner ranks the documents in $D_U$ and returns the top-ranking document to the user; and (3) the user reads this document and decides whether it is relevant. The user decision is used as a label that is fed back to the learner, which then moves the labeled document from $D_U$ to $D_L$ for incremental training. In real use cases, this iteration repeats until the user feels there are few relevant documents left in $D_U$ and decides to stop. In our experiments, we follow existing work [16,54], set a target percentage $p$ of the relevant documents, and study how to minimize the total number of iterations $T$ needed to reach this target.
In the task of efficient screening prioritization, we are given (1) a systematic review corpus $D$, where each document $d \in D$ is the concatenation of a research paper's title and abstract; (2) a seed set of user-labeled documents $D_L$, where the label is binary, indicating whether a document is relevant (1) or not (0); and (3) a remaining set of unlabeled documents $D_U$. We aim to find the target percentage $p$ of the relevant documents $R$ while minimizing the total number of iterations $T$.
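The iteration described above can be sketched as a short simulation loop. This is a minimal illustration rather than the SciMine implementation: `screen`, `overlap_score`, and the toy word-set documents are hypothetical names introduced here, and the scorer is a deliberately simple word-overlap stand-in for a trained classifier.

```python
def overlap_score(docs, labeled, unlabeled):
    """Toy scorer (assumption): reward words seen in relevant labeled docs,
    penalize words seen in irrelevant ones."""
    pos = set().union(*[docs[i] for i, y in labeled.items() if y == 1])
    neg = set().union(*[docs[i] for i, y in labeled.items() if y == 0])
    return {i: len(docs[i] & pos) - len(docs[i] & neg) for i in unlabeled}


def screen(docs, labels, seed_ids, score_fn, target_recall=0.95):
    """Simulate human-in-the-loop screening; return the number of iterations T."""
    labeled = {i: labels[i] for i in seed_ids}            # labeled set with binary labels
    unlabeled = [i for i in range(len(docs)) if i not in labeled]
    total_relevant = sum(labels)
    found = sum(labeled.values())
    iterations = 0
    while unlabeled and found < target_recall * total_relevant:
        # Steps (1)-(2): rescore the unlabeled set and take the top-ranked document.
        scores = score_fn(docs, labeled, unlabeled)
        best = max(unlabeled, key=lambda i: scores[i])
        # Step (3): the gold label plays the role of the user's decision.
        labeled[best] = labels[best]
        unlabeled.remove(best)
        found += labels[best]
        iterations += 1
    return iterations


docs = [{"rice", "crop"}, {"rice", "yield"}, {"dog", "virus"},
        {"crop", "yield"}, {"cat", "virus"}, {"rice", "farm"}]
labels = [1, 1, 0, 1, 0, 1]
T = screen(docs, labels, seed_ids=[0, 2], score_fn=overlap_score)
print(T)
```

Any scorer with the same signature can be swapped in, which keeps the loop independent of the classifier being studied.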

METHOD
We propose a novel active learner, SciMine, to address the problem of screening prioritization. As shown in Figure 2, it has four steps: representation learning, VAE-based document-level classification, phrase-level feature classification, and ranking ensemble. In this section, we first introduce how we learn both document-level and phrase-level representations in Section 4.1, then describe the two modules of our active learner in Sections 4.2 and 4.3, and finally the ranking ensemble module in Section 4.4.

Representation Learning
We first learn both document embeddings and phrase embeddings using pre-trained language models. For document-level representation, we apply the Transformer language model SPECTER [11] to generate document embeddings. SPECTER is pre-trained on a large scientific literature corpus and captures the relatedness between documents via a structure called the "citation graph". In our case, we feed the concatenation of a paper's title and abstract into SPECTER and take the final representation of the [CLS] token as the embedding of the paper:

$$\mathbf{e}_d = \mathrm{SPECTER}(d)_{[\mathrm{CLS}]}.$$

For phrase-level representation, we first obtain quality phrases in the corpus using a phrase mining tool called AutoPhrase [29,52], then learn their MLM-based embeddings. For each phrase, we compute an MLM-based embedding that captures both content and context features simultaneously. Suppose that a phrase $p$ appears $n_p$ times in the corpus. Then, for each of its mentions $p_i$, $i \in \{1, 2, \ldots, n_p\}$, we obtain its content feature $\mathbf{x}_{p_i}$ by feeding the original sentence into a pre-trained MLM and taking the average of the generated embedding vectors corresponding to the tokens of $p$. To get the context feature $\mathbf{y}_{p_i}$ of this mention, we first replace the entire phrase $p$ with a single [MASK] token, feed the new sentence into the same language model, and then use the embedding vector of this [MASK] token as the context feature. Finally, to get a phrase embedding that captures both content and context features, we concatenate the two feature vectors for each mention and take the average of the resulting mention vectors:

$$\mathbf{e}_p = \frac{1}{n_p} \sum_{i=1}^{n_p} \left[\mathbf{x}_{p_i} ; \mathbf{y}_{p_i}\right].$$

Next, we introduce how we utilize this representation information in SciMine.
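The last step, averaging the concatenated content and context vectors over all mentions of a phrase, can be sketched as follows. The function name `phrase_embedding` is hypothetical, and the per-mention vectors are assumed to have already been produced by a masked language model (that part is not shown).

```python
def phrase_embedding(content_vecs, context_vecs):
    """Average the concatenation [x_i ; y_i] over all mentions of one phrase.

    content_vecs[i] and context_vecs[i] are the content and context feature
    vectors of the i-th mention (assumed to be precomputed by an MLM).
    """
    if not content_vecs or len(content_vecs) != len(context_vecs):
        raise ValueError("need one content and one context vector per mention")
    mentions = [list(x) + list(y) for x, y in zip(content_vecs, context_vecs)]
    n = len(mentions)
    # Column-wise mean over the mention vectors.
    return [sum(column) / n for column in zip(*mentions)]


# Two mentions with 2-dimensional content and context features.
example = phrase_embedding([[1, 2], [3, 4]], [[0, 0], [2, 2]])
print(example)
```

The resulting vector has twice the dimensionality of a single MLM embedding, since content and context features are concatenated before averaging.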

VAE-based Document-level Classification
Our neural model can be mainly divided into a feature extractor and a classifier, and we use cross-entropy loss to fit the model. We first demonstrate the feature bias that arises from the cross-entropy loss.
Proposition 1: Minimal cross-entropy (CE) loss does not imply that all possible features of a task can be learned in the feature extractor.
Let us assume, for contradiction, that all features of a class are learned whenever the CE loss is minimal. We use a toy example from the Virus dataset, whose research theme is related to common livestock, to show that this assumption is incorrect. Assume that in the training data, all the positive documents contain the word "piglet" and all the negative ones contain the word "dog". We further assume that the feature extractor learns only two significant features in the feature vector by chance, one for the word "piglet" and one for the word "dog". When a study with the word "piglet" is presented to the feature extractor, the first feature has the value 1 and all the other features have the value 0, and for a sentence with the word "dog", all the features have the value 0. Under this feature setting, we can easily achieve a cross-entropy value of 0 by adjusting the weight values. This shows that a model can rely on incomplete and spurious patterns to fit the CE loss during training, but can fail to generalize during testing.
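The toy argument can be checked numerically. The sketch below assumes a single-feature logistic classifier (a simplification introduced here, not the model used in the paper): the only feature is 1 for "piglet" documents and 0 for "dog" documents, and scaling the weight and bias drives the cross-entropy toward 0 even though no other feature is ever used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    """Binary cross-entropy for one example with predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def mean_loss(w, b):
    # (feature value, label): "piglet" documents -> (1, relevant),
    # "dog" documents -> (0, irrelevant).
    data = [(1.0, 1), (0.0, 0)]
    return sum(bce(sigmoid(w * x + b), y) for x, y in data) / len(data)


for w, b in [(2, -1), (8, -4), (20, -10)]:
    print(w, b, mean_loss(w, b))
```

The loss shrinks monotonically as the weights grow, so a near-zero training loss says nothing about whether theme-relevant features such as other livestock terms were learned.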
Proposition 2: Feature bias can easily occur in the task of iterative screening prioritization.
Among candidate documents for screening, the proportion of relevant documents is much smaller than that of irrelevant documents. As a result, the feature extractor learns only a subset of the features because of the limited number of relevant documents. Following the example above, the limited positive data may contain frequent but theme-irrelevant phrases such as "stool" and "fecal", which makes the classifier learn from these phrases while overlooking the features of less frequent but theme-relevant phrases such as "poultry". This leads to feature bias. An empirical t-SNE visualization of features from an MLP feature extractor for classification is shown in Appendix A.1.
The above issues suggest that preserving feature information is crucial for the task. To this end, an autoencoder can serve as a tool, since it enforces that the same input can be reconstructed from a representation [8,31,33]. Here we adopt one such method, the variational autoencoder (VAE), as our classification model. It uses the reconstruction loss to preserve the original semantic information and adds Gaussian noise to generate semantically meaningful representations with better isotropy [27,68,71].
Training. Suppose that the data $x$ are sampled from a distribution parameterized by the ground-truth generative factors $z$. The VAE aims to maximize the probability of $x$ on average over all possible samples of the latent factors, corresponding to:

$$\max_{\Phi, \theta} \; \mathbb{E}_{q_\Phi(z|x)}\left[\log p_\theta(x|z)\right],$$

where $\Phi$ and $\theta$ are the parameters of the encoder and the decoder of the VAE model. The objective is equivalent to minimizing:

$$\mathcal{L}_{\mathrm{VAE}} = -\mathbb{E}_{q_\Phi(z|x)}\left[\log p_\theta(x|z)\right] + \beta \, \mathrm{KL}\left(q_\Phi(z|x) \,\|\, p(z)\right),$$

where $\beta$ is the hyper-parameter that characterizes the pressure for the posterior $q_\Phi(z|x)$ to match the Gaussian prior $p(z)$. The first term is the expectation of the negative log-likelihood of instance $x$; it is the reconstruction loss and leads to the preservation of the semantic information. The second term is a regularizer based on the Kullback-Leibler divergence $\mathrm{KL}(\cdot)$ between the posterior distribution $q_\Phi(z|x)$ and the prior distribution $p(z)$. The prior is typically set to the isotropic unit Gaussian distribution $\mathcal{N}(0, 1)$. We use multilayer perceptron (MLP) models (mainly containing two linear layers for dimension reduction and reconstruction) as the encoder and the decoder of the VAE model. Because the sampled value $z$ reduces the topology information of the training data and latent variable collapse can occur [27], we adopt the feature extractor from the second-to-last layer of the encoder for classification, with parameters $\theta'$. The binary cross-entropy loss is used as the training objective for classifying the relevance of a scientific document $d$ with representation $\mathbf{e}_d$ and label $y$:

$$\mathcal{L}_{\mathrm{CE}} = -\left[\, y \log p_{\theta'}(y = 1 \mid \mathbf{e}_d) + (1 - y) \log\left(1 - p_{\theta'}(y = 1 \mid \mathbf{e}_d)\right) \right].$$

Overall, the training objective of the VAE-based document-level classification is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{VAE}} + \mathcal{L}_{\mathrm{CE}}.$$

Ranking with Trained Classifier. We use the trained classifier $\theta'$ to identify the relevance of unlabeled data. For each document candidate $d$ with document representation $\mathbf{e}_d$, we calculate the relevance score as:

$$s(d) = p_{\theta'}(y = 1 \mid \mathbf{e}_d),$$

where $y = 1$ refers to the label relevant. The first ranking list $[r^1_d]$ is obtained by sorting the array of relevance scores $[s(d)]$ in descending order.
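As a small worked illustration of the VAE term in the training objective, the sketch below evaluates a reconstruction loss plus a weighted KL regularizer for a diagonal Gaussian posterior. Two things are assumptions made here for illustration: squared error stands in for the negative log-likelihood (a Gaussian decoder), and the encoder is taken to output a mean and log-variance per latent dimension. The default weight 0.1 follows the pressure value reported in the implementation details.

```python
import math


def kl_to_unit_gaussian(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def beta_vae_loss(x, x_recon, mu, logvar, beta=0.1):
    """Reconstruction term plus beta-weighted KL regularizer."""
    # Assumption: squared error is used as the reconstruction loss.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + beta * kl_to_unit_gaussian(mu, logvar)


# A posterior equal to the prior and a perfect reconstruction give zero loss.
print(beta_vae_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0]))
# Shifting the posterior mean by 1 in one dimension adds beta * 0.5.
print(beta_vae_loss([1.0], [0.0], [1.0], [0.0]))
```

A smaller weight on the KL term lets the latent code stay closer to the pre-trained embedding, at the cost of a less regular latent space.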

Phrase-level Feature Classification
Our pilot user study suggests that a frequent clue for rejecting irrelevant documents is key phrase matching, which motivates our integration of phrase-level semantic features.
Phrase Selection. We first want to select phrases that are more related to the relevant documents. Hence, we define the following two measures to select phrases:
Indicative: Ideally, a phrase that is indicative of relevant documents should be frequent in relevant documents. Therefore, we design our relevant-indicative measure based on $f_{p,l}$, the number of labeled documents with relevance label $l$ in which phrase $p$ appears.
Unusual: Since the relevant documents only take up a small proportion of all document candidates, we want phrases that are unusual. To incorporate this, we design a measure of inverse document frequency.
Inspired by [36], we use the geometric mean to combine these two measures, which provides a score for each phrase in the labeled documents. We rank all available phrases by this score and select only the top 30% for the following steps.
Phrase Clustering & Feature Selection. To capture the semantic relation between phrases, we construct a graph in which each node represents a phrase. Edges are built according to the semantic similarity of the two nodes, which we calculate as the cosine similarity between their pre-trained MLM-based embeddings:

$$\mathrm{sim}(p_i, p_j) = \cos(\mathbf{e}_{p_i}, \mathbf{e}_{p_j}) = \frac{\mathbf{e}_{p_i} \cdot \mathbf{e}_{p_j}}{\|\mathbf{e}_{p_i}\| \, \|\mathbf{e}_{p_j}\|}.$$

After constructing the phrase graph, we utilize an unsupervised community detection algorithm, Louvain clustering [7], to generate non-overlapping communities in this graph. We choose Louvain over other clustering methods because it does not require the number of clusters to be given ahead of time and can be used flexibly on corpora from very different scientific domains.
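The combination and selection step can be sketched as below. The two per-phrase measures are taken as given inputs, since their exact definitions are not restated here; `select_phrases` is a hypothetical helper, and rounding the 30% cut-off up to a whole phrase is an assumption.

```python
import math


def select_phrases(indicative, unusual, keep_ratio=0.3):
    """Rank phrases by the geometric mean of two scores and keep the top fraction.

    indicative, unusual: dicts mapping each phrase to a non-negative score.
    """
    scores = {p: math.sqrt(indicative[p] * unusual[p]) for p in indicative}
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumption: round the cut-off up and always keep at least one phrase.
    n_keep = max(1, math.ceil(keep_ratio * len(ranked)))
    return ranked[:n_keep]


indicative = {"piglet": 0.9, "stool": 0.1, "poultry": 0.5, "study": 0.4}
unusual = {"piglet": 1.0, "stool": 4.0, "poultry": 2.0, "study": 0.1}
selected = select_phrases(indicative, unusual)
print(selected)
```

The geometric mean rewards phrases that score reasonably on both measures, so a phrase that is very frequent everywhere or rare everywhere is pushed down the ranking.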
After putting phrases with similar semantics into a cluster, we choose phrase-level features from these clusters based on the assumption that a phrase-level feature should have a strong correlation with relevant documents. So for each cluster $c_k$, we count the number of positive documents it is related to ($n_{c_k}$), and a cluster is selected as a phrase-level feature if this count is larger than a fraction $\gamma$ of the positively labeled documents $D_L^{+}$:

$$n_{c_k} > \gamma \cdot |D_L^{+}|.$$

Pseudo label generation. As the number of irrelevant documents is much larger than the number of relevant ones in a systematic review corpus, we can generate pseudo labels from the unlabeled documents ($D_U$) by using our trained VAE classifier. For each $d_i \in D_U$, we calculate its probability of being relevant. Then we rank $D_U$ by this probability and select the documents in the lowest 30% as pseudo-negative samples. This pseudo data is used together with the labeled documents $D_L$ to train our phrase-level feature classifier.
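Both selection rules in this paragraph are simple thresholds and can be sketched directly. The helper names and the dictionary-based inputs are hypothetical; the default threshold 0.5 and the 30% ratio follow the values stated in the text and the implementation details.

```python
def select_feature_clusters(cluster_pos_counts, n_positive, gamma=0.5):
    """Keep clusters related to more than gamma * (number of positive labeled docs)."""
    return [c for c, n in cluster_pos_counts.items() if n > gamma * n_positive]


def pseudo_negatives(relevance_prob, ratio=0.3):
    """Return the unlabeled documents in the lowest `ratio` by predicted relevance."""
    ranked = sorted(relevance_prob, key=relevance_prob.get)  # ascending probability
    return ranked[: int(round(ratio * len(ranked)))]


# Cluster -> number of positive labeled documents it is related to (6 positives in total).
kept = select_feature_clusters({"c1": 4, "c2": 2, "c3": 3}, n_positive=6)

# Unlabeled document -> probability of relevance from the document-level classifier.
probs = {"d1": 0.9, "d2": 0.1, "d3": 0.4, "d4": 0.2, "d5": 0.7,
         "d6": 0.05, "d7": 0.6, "d8": 0.3, "d9": 0.8, "d10": 0.5}
negatives = pseudo_negatives(probs)
print(kept, negatives)
```

Note that the comparison is strict, so a cluster related to exactly half of the positive documents is not selected under the default threshold.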

Phrase-level Feature Classification
To train this classifier, we need to calculate the value of each phrase-level feature for each training document. For a phrase $p$ mentioned in a training document $d$, we first calculate the cosine similarity between the phrase embedding and the centroid $\boldsymbol{\mu}_{c_k}$ of its feature cluster $c_k$:

$$\mathrm{sim}(p, c_k) = \cos(\mathbf{e}_p, \boldsymbol{\mu}_{c_k}).$$

Then, for each phrase-level feature, the largest value within the cluster is set as its value in the document:

$$v_{d,k} = \max_{p \in d,\; p \in c_k} \mathrm{sim}(p, c_k).$$

With the phrase-level feature values $v$, the labeled documents $D_L$, and the pseudo-labeled documents, we train a Random Forest model to learn which phrase-level features matter for relevant documents. Finally, we use this trained model to predict on and rerank the top $K$ documents from the first ranking list $[r^1_d]$ to obtain the second ranking list $[r^2_d]$, and then perform the ranking ensemble.
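The reranking of the top documents and the subsequent ensemble can be sketched as follows, taking the final score to be the sum of reciprocal ranks over the two lists, as described in the Ranking Ensemble section. The function names are hypothetical, and leaving documents outside the top group in their first-list order is an assumption made for this sketch.

```python
def ranks(order):
    """Map each document id to its 1-based position in a ranking list."""
    return {d: i + 1 for i, d in enumerate(order)}


def rerank_top_k(first_order, second_scores, k=50):
    """Reorder the top-k documents of the first list by the second classifier's scores."""
    head = sorted(first_order[:k], key=lambda d: second_scores[d], reverse=True)
    return head + first_order[k:]  # assumption: the tail keeps its first-list order


def ensemble(first_order, second_order):
    """Final ranking by the sum of reciprocal ranks in both lists."""
    r1, r2 = ranks(first_order), ranks(second_order)
    final = {d: 1.0 / r1[d] + 1.0 / r2[d] for d in first_order}
    return sorted(final, key=final.get, reverse=True)


first = ["a", "b", "c", "d"]                  # document-level ranking
phrase_scores = {"a": 0.2, "b": 0.9, "c": 0.5}  # phrase-level scores for the top 3
second = rerank_top_k(first, phrase_scores, k=3)
merged = ensemble(first, second)
print(second, merged)
```

In this toy case document "a" drops from first place because the phrase-level classifier scores it low, which is the correction effect the ensemble is meant to provide.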

Ranking Ensemble
A relevant document should be ranked high in both the document-level ranking list $[r^1_d]$ and the phrase-level ranking list $[r^2_d]$. Therefore, we apply a ranking ensemble to the two lists so that irrelevant documents ranked highly in the first list can be corrected by our phrase-level feature classifier. For each unlabeled document candidate $d$, we calculate its final score by summing its reciprocal rank scores over the ranking lists:

$$S(d) = \sum_{j=1}^{2} \frac{1}{r^j_d},$$

where $r^j_d$ is its rank in ranking list $j$. SciMine sorts the final scores in descending order and returns the first document for user screening, following the certainty-based query strategy. After the scholar reads and labels the document, it is moved from $D_U$ to $D_L$. The scholar can then decide whether to stop SciMine or start the next iteration.

Dataset
• AgriDiv: To understand how scholars conduct a systematic review, we collaborate with experts in the ecological domain to create this dataset for their research. The theme of this research is to investigate the effect of agriculture diversification on rice production. We collect 1,505 documents by searching the Web of Science and Scopus. Two domain experts were then invited to label the corpus, and 129 studies were confirmed as relevant.
Compared Methods. We compare the following methods, whose inputs cover lexical-level, sentence-level, and document-level features.
• TF-IDF+NB: We test this machine learning model with TF-IDF for feature extraction and Naive Bayes as the classifier. According to [16,54], this method is able to outperform models with more sophisticated feature information.
• D2V+SVM: We use doc2vec [24] for feature extraction and a Support Vector Machine as the classifier.
• HierTrans: We use the pre-trained model SimCSE [17] to learn sentence embeddings in each document and utilize a hierarchical transformer as the classifier. By learning different weights for the sentences in each training sample, this hierarchical model [36,65] can achieve good performance in classification tasks with less training data.
• SPECTER-Once: This method uses the same document embeddings that we obtained during the preprocessing step. It trains an SVM classifier with the initial seed set and predicts on the unlabeled studies to get a one-time retrieval result.
• SPECTER+SVM: This method uses the same document embeddings that we obtained during the preprocessing step and an SVM as the classifier.
• SPECTER+MLP: This method uses the same document embeddings that we obtained during the preprocessing step and a 1-layer multilayer perceptron as the classifier.
• SciMine-NoPFC: An ablation of our framework that removes the phrase-level feature classification module.
• SciMine: Our proposed framework, which captures both document-level and phrase-level information.
2 https://github.com/asreview/systematic-review-datasets
Implementation Details. For testing purposes, instead of having domain experts screen the datasets, we simulate the screening process by comparing the newly retrieved document to the gold label in each human-in-the-loop iteration. The simulation starts with a seed set of 5 relevant and 5 irrelevant studies, and the classification model is retrained at the end of each iteration. The simulation terminates once it has reached the target recall of relevant documents, which we set to 0.95. The initial seed set is picked randomly from the corpus. To avoid bias from the initial seed set, we create 5 seed sets for each dataset by randomly picking documents from the corpus. We test the baseline methods and our proposed models on these seed sets, and every simulation is run 10 times for each seed set. We utilize Adam with a weight decay rate of 1e-4 to optimize our model. Except for TF-IDF+SVM and SPECTER, all baselines as well as SciMine are trained for 200 rounds in each human-in-the-loop iteration. The learning rate is set to 1e-4, the pressure $\beta$ of the VAE is 0.1, the batch size is 40, the threshold $\gamma$ for selecting feature clusters is 0.5, and the number $K$ of top documents reranked for the ranking ensemble is 50. We use the certainty-based query strategy, which retrieves the document with the highest predicted probability of being relevant.
Evaluation Metrics. We follow previous studies and evaluate our results using Work Saved over Sampling (WSS) and Relevant References Found (RRF). Given a level of recall, WSS calculates the reduction in the number of documents that need to be screened. For instance, WSS@95 measures the percentage of records that can be saved when 95% of the relevant documents have been identified by the user. Meanwhile, RRF@10 evaluates how many relevant documents can be identified when 10% of the unlabeled documents have been screened. It is used as a quick overview of the relevant documents.
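Both metrics can be computed from the sequence of gold labels in the order the documents were screened. The sketch below assumes the commonly used definition of WSS (the fraction of records left unscreened when the target recall is reached, minus the fraction a random order would be expected to leave unfound) and rounds cut-offs up to whole documents; the function names are hypothetical, and the input is assumed to contain at least one relevant document.

```python
import math


def wss(screen_order_labels, recall=0.95):
    """Work Saved over Sampling at the given recall level."""
    n, total = len(screen_order_labels), sum(screen_order_labels)
    needed = math.ceil(recall * total)  # relevant documents required to hit the recall
    found = 0
    for screened, y in enumerate(screen_order_labels, start=1):
        found += y
        if found >= needed:
            return (n - screened) / n - (1 - recall)
    raise ValueError("target recall is never reached")


def rrf(screen_order_labels, percent=10):
    """Share of relevant documents found after screening `percent`% of the records."""
    n, total = len(screen_order_labels), sum(screen_order_labels)
    cutoff = math.ceil(n * percent / 100)
    return sum(screen_order_labels[:cutoff]) / total


# 10 records, 3 relevant; all relevant ones are found within the first 4 screened.
order = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(wss(order, 0.95), rrf(order, 10))
```

In the toy run, 6 of 10 records never need to be screened, which gives a WSS@95 of 0.55, and one of the three relevant documents is found in the first 10% of records.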

Results
Table 2 shows the main results. In terms of WSS, the classic TF-IDF+NB model has very stable performance across datasets. As a method that captures lexical features, it beats the D2V+SVM model, which learns more sophisticated word embeddings, on three datasets. The Hierarchical Transformer model also does not perform very well on most of the datasets when compared to TF-IDF+NB and D2V+SVM, possibly because the training data in our problem setting is too limited. By using document embeddings tailored to scientific literature, SPECTER+SVM and SPECTER+MLP outperform TF-IDF+NB, which demonstrates the advantage of richer semantic features over lexical features. It also shows that an advanced pre-trained language model can be applied to the task of screening prioritization to boost the screening process. Not surprisingly, SPECTER-Once performs poorly, which indicates that even though SPECTER produces good document-level embeddings for scientific literature, the initial seed set alone is not sufficient and a human-in-the-loop session is necessary. SciMine outperforms all baseline methods on most datasets by a large margin. It is consistently better than SciMine-NoPFC as well, which verifies that the document-level and phrase-level feature information are complementary. SciMine-NoPFC outperforms SPECTER+MLP on all datasets except Nudging, which demonstrates the necessity of an autoencoder to preserve the feature information. We further analyze the reason in Section 4.3.1.

Table 2: Evaluation results on five real-world datasets in terms of Work Saved over Sampling at 85 percent and 95 percent of relevant documents (WSS@85 & WSS@95) and Relevant References Found within the first 10% of iterations (RRF@10). For each dataset, models are tested on 5 randomly sampled seed sets to avoid bias.

Figure 3: Visualization of the relevant-document-finding process on Nudging and Virus. The X-axis represents the number of retrieved relevant documents and the Y-axis is the number of iterations.
For RRF@10, we can see that TF-IDF+NB outperforms several PLM-based models on Calcium and Depression. This indicates that lexical features are good at finding relevant documents during the early stage of the screening process.
We also visualize the process of finding the relevant documents on Nudging and Virus in Figure 3. We can see that SciMine leads throughout the whole process by finding more relevant documents in fewer human-in-the-loop iterations. Even though the lexical feature model performs well initially, the gap between it and the other methods becomes obvious as the iterations go on. In a real use case, where the user decides when to stop the model, this low efficiency of the lexical feature model may hinder the user from finding more relevant documents.

Further Analysis
To examine the density of relevant documents in the representation space, we search the $k$ nearest neighbors of each positive instance $d_i$ using the Euclidean distance between the representations from encoding models such as TF-IDF and SPECTER. Then we count the number of positive instances around $d_i$, formulated as:

$$\rho(d_i) = \sum_{j \in \mathcal{N}_k(d_i)} \mathbb{1}\left[d_j \in R\right],$$

where $j$ indexes the $k$ nearest neighbors $\mathcal{N}_k(d_i)$ of $d_i$. The density results are shown in Figure 4 as a kernel density estimation (KDE) plot. First, in the density distributions of the relevant documents under TF-IDF representations, the peaks of the distribution curves are at about 10 and 11 for Nudging and Virus, respectively. In comparison, under SPECTER representations, the density peaks shift to larger values, 12 for Nudging and 17 for Virus, which implies that the features of relevant documents become denser. This demonstrates that the pre-trained language model SPECTER can obtain more informative features of scientific documents than traditional methods such as TF-IDF, and as a result, the relevant documents can be classified more easily.
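The neighbor count used in this analysis can be sketched with a brute-force nearest-neighbor search. The function name `positive_density` and the toy 2-dimensional embeddings are hypothetical; in practice the vectors would be TF-IDF or SPECTER representations.

```python
import math


def positive_density(embeddings, labels, k):
    """For each positive document, count the positives among its k nearest neighbors."""
    density = {}
    for i, (x, y) in enumerate(zip(embeddings, labels)):
        if y != 1:
            continue
        others = [j for j in range(len(embeddings)) if j != i]
        # Sort the other documents by Euclidean distance to document i.
        others.sort(key=lambda j: math.dist(x, embeddings[j]))
        density[i] = sum(labels[j] for j in others[:k])
    return density


embeddings = [[0, 0], [0, 1], [2, 0], [5, 5], [5, 6]]
labels = [1, 1, 0, 1, 0]
density = positive_density(embeddings, labels, k=2)
print(density)
```

Here the isolated positive document at index 3 has no positive neighbors, which is the kind of low-density relevant document the analysis identifies as hard for a classifier affected by feature bias.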
In SPECTER embeddings, the density distributions of relevant documents in Virus (the distribution of AgriDiv, and Calcium are shown in the Appendix) show that the representations of most relevant documents have significant feature similarity.The continuous distribution curves have peak shifts to large densities.It implies that most representations of the relevant documents have typical features for the CE-trained classifier (with only an MLP feature extractor) to distinguish, but there exist some documents having few similar features, difficult for such a classifier due to feature bias.The phenomenon also aligns with proposition 1 in Section 4.3.But for Nudging, the density distribution is relatively uniform compared with other ones, which implies the feature bias is less significant and a classifier with the MLP feature extractor can also achieve great performance.The distribution explains the reason that SPECTER+MLP achieves stronger WSS@95 compared with SciMine-NoPFC on Nudging, but fails on others.

Parameter Studies. We experiment to understand how varying the number of documents in the initial seed set influences the performance of our model. For each dataset, we create a seed set by randomly sampling n relevant and n irrelevant documents. We vary n from 1 to 15 and plot the results in Figure 5. We can see that on the Nudging dataset, the performance of our model improves significantly when n < 10 and gradually saturates when n ≥ 10. A similar trend can be observed on the Virus dataset, as shown in Figure 5. This verifies that our model only needs around 10 documents in total to achieve reasonable performance, which is affordable for most scholars. It is also interesting to notice that TF-IDF+NB achieves results comparable to SciMine on the Virus dataset when n is 1, but the gap between the two methods becomes obvious as n increases.
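The seed-set construction used in this study can be illustrated as follows. This is a minimal sketch, assuming binary labels (1 for relevant, 0 for irrelevant) are available for the whole dataset, as they are in a simulation; the function name `sample_seed_set` and the `seed` parameter are hypothetical.

```python
import random

def sample_seed_set(labels, n, seed=0):
    """Randomly pick n relevant and n irrelevant document indices."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    relevant = [i for i, y in enumerate(labels) if y == 1]
    irrelevant = [i for i, y in enumerate(labels) if y == 0]
    if n > len(relevant) or n > len(irrelevant):
        raise ValueError("not enough documents of one class for this n")
    # Sampling without replacement within each class.
    return rng.sample(relevant, n), rng.sample(irrelevant, n)
```

Repeating the call for n = 1, ..., 15 and rerunning the screening loop from each seed set would give one curve of the kind reported in the parameter study.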

5.3.3 The Influence of Query Strategy. The query strategy decides how the active learner retrieves a document from the predictions for human annotation. The most widely used query strategy for our task is certainty-based, which selects the document with the highest probability of being relevant. The other common strategy is uncertainty-based, which selects the "hard relevant" document. Recently, one kind of uncertainty-based query strategy called Contrastive Active Learning (CAL) has become popular in the active learning area [35,66]. This strategy tries to pick the most contrastive example, that is, the one whose predicted probability has the largest Kullback-Leibler divergence from those of its neighbors. We test how the query strategy influences the performance of our model.
As shown in Figure 6, SciMine with the certainty-based strategy still performs best. We can also observe that even though the two uncertainty-based models have some difficulty finding relevant documents in the early stage, they become more efficient during the second half. Regarding the final WSS@95, the margin between the certainty-based model and the two uncertainty-based models is not large, which shows that our ranking model is robust to different query strategies.
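The three query strategies compared here can be sketched as selection rules over the predicted relevance probabilities. The sketch below is illustrative only and rests on several assumptions: the function names are hypothetical, the uncertainty rule is taken to be "closest to 0.5" (the text only says it selects the "hard relevant" document), and the contrastive rule averages a Bernoulli KL divergence from each neighbor to the candidate, a direction and neighborhood definition that may differ from the CAL variant actually used.

```python
import numpy as np

def certainty_query(p_relevant):
    """Pick the document most likely to be relevant."""
    return int(np.argmax(p_relevant))

def uncertainty_query(p_relevant):
    """Pick the document whose prediction is closest to 0.5 (assumed rule)."""
    p = np.asarray(p_relevant, dtype=float)
    return int(np.argmin(np.abs(p - 0.5)))

def contrastive_query(p_relevant, neighbor_idx, eps=1e-12):
    """Pick the document whose prediction diverges most from its neighbors.

    neighbor_idx has shape (n_docs, k) and holds neighbor indices,
    assumed to come from a kNN search in embedding space.
    """
    p = np.clip(np.asarray(p_relevant, dtype=float), eps, 1.0 - eps)
    q = p[np.asarray(neighbor_idx)]   # neighbor probabilities, shape (n, k)
    c = p[:, None]                    # candidate probabilities, shape (n, 1)
    # KL(neighbor || candidate) for two Bernoulli distributions.
    kl = q * np.log(q / c) + (1.0 - q) * np.log((1.0 - q) / (1.0 - c))
    return int(np.argmax(kl.mean(axis=1)))
```

In each human-in-the-loop iteration, the chosen index would be shown to the annotator, labeled, and removed from the unlabeled pool before the model is updated.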

Case Study. Figure 7 shows one irrelevant document from the Virus dataset, whose research theme is "performing viral Metagenomic Next-Generation Sequencing (mNGS) in common livestock". We also list the phrase-level features that SciMine discovered and the lexical features that the TF-IDF model relies on. It can be seen that TF-IDF recognizes some important phrases like "piglet", "virus" and "metagenomic". However, it also puts weight on some spurious phrases like "human" and "identified". These two phrases may appear more often in the labeled relevant documents but do not actually imply relevance. In contrast, SciMine detects three clusters of phrase-level features, which are related to livestock, sequencing approach, and virus, respectively. These clusters are also in accord with the research theme. Therefore, when the candidate document does not mention anything related to common livestock, SciMine ranks it lower. We also apply the SCD method [22] to highlight the features that the VAE model in SciMine discovers. The VAE puts weight on important features like "next generation sequencing" and "virus" while skipping spurious patterns like "fecal sample".

User Study. To understand how real end users experience SciMine, we perform a user study. We design a user interface for screening prioritization models. Six Ph.D. students in the ecological domain were invited to join the study. They were divided into two groups: three students used SciMine while the other three used the TF-IDF+SVM model. Before the test began, they were asked to fully understand the research theme of AgriDiv and were instructed on how to label relevant and irrelevant documents. Furthermore, they were informed that they could stop the model whenever they felt there were no relevant documents left. As a result, the average recall for SciMine and TF-IDF+SVM is 91.3% and 83.7%, respectively, which demonstrates the effectiveness of SciMine in a real screening scenario. However, both scores are lower than 95%, which suggests that the traditional WSS score may not truly reflect the performance of the model. Users tend to stop earlier when the model constantly recommends irrelevant documents to them.
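To make the contrast between a fixed-recall metric and user-chosen stopping concrete, the sketch below computes both from a ranked list of ground-truth labels. It assumes the commonly used definition of work saved over sampling, WSS@R = (N - s) / N - (1 - R), where s is the number of documents screened when recall R is first reached; this section does not restate the definition, so treat it as an assumption. The function names are hypothetical.

```python
import math

def wss_at_recall(ranked_labels, recall=0.95):
    """Work saved over sampling at a target recall level.

    ranked_labels: 0/1 relevance labels in the order the model ranks them.
    """
    n = len(ranked_labels)
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("no relevant documents in the list")
    target = math.ceil(recall * total_relevant)
    found = 0
    for screened, y in enumerate(ranked_labels, start=1):
        found += y
        if found >= target:
            # Fraction of documents left unscreened, minus the recall slack.
            return (n - screened) / n - (1.0 - recall)

def recall_at_stop(ranked_labels, stop):
    """Recall actually obtained if the user stops after `stop` documents."""
    return sum(ranked_labels[:stop]) / sum(ranked_labels)
```

WSS@95 presumes that screening continues until 95% recall is reached, whereas `recall_at_stop` reflects where a user really stopped, which is the quantity measured in the user study.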

CONCLUSION
We proposed SciMine, a novel human-in-the-loop framework for efficient screening prioritization. Different from previous methods that rely solely on lexical information, we study how to apply the contextual information from pre-trained language models to this task. SciMine captures two types of information from the corpus, document-level and phrase-level, and uses a rank ensemble to finalize the prediction. To understand how scholars work in a systematic review, we contribute a dataset, AgriDiv, in the ecological domain. Experiments on five real-world datasets show that richer semantic features are useful for the screening prioritization task: the SciMine framework allows rich pre-trained knowledge to outperform discrete token features and achieves state-of-the-art results across the five benchmarks. We also provide an analysis using different feature extraction methods. We conclude that: (1) the classic lexical-based methods may result in feature bias; (2) feature bias can easily occur in the task of iterative screening prioritization; (3) contextualized document-level and phrase-level information are complementary in solving feature bias for this task. In the future, we plan to extend our framework by allowing models to incorporate more user-provided information. For example, human rationales expressed as text patterns could be used to teach the model in each iteration, or the user could provide a sentence describing their research theme as another seed of information.

A.1 Empirical Study for Feature Bias
To demonstrate the feature bias of active learning, we visualize the representations of the Virus dataset from the trained feature extractor using t-SNE. Figure 9 shows the representations when the number of labeled relevant documents is 20, 60, and 100, respectively. In the sub-figures for 60 and 100, the labeled documents clearly cluster together because of feature bias, while other semantic information is overlooked and the unlabeled irrelevant data are located far away from the labeled relevant data.

A.2 Density Distribution of Relevant Documents
We also show the density distributions of relevant documents through SPECTER embeddings, computed using Eq. (1), in Figure 8. The density distributions are similar to that of Virus, in that the peak shifts to a large density. This indicates that there exists a small proportion of relevant documents having few features similar to the majority, which are difficult for a classifier with an MLP feature extractor trained by minimizing the CE loss.

Figure 1: Pipeline of the human-in-the-loop iteration performed by SciMine.

5.3.1 Representation Analysis. We propose an intuitive method based on k nearest neighbors (kNN) to show the feature similarity

Figure 4: Density distribution of the relevant documents in Nudging and Virus through the feature extraction methods TF-IDF and SPECTER.
Figure 5: Parameter Study

Figure 7: One example of an irrelevant document with the document-level features discovered by SciMine (boxed in red), the phrase-level features discovered by SciMine (colored in orange), and TF-IDF's lexical features (underlined in green).

Figure 8: Density distribution of the relevant documents in AgriDiv and Calcium through the feature extraction method SPECTER.

Figure 9: Visualization of document representations obtained from the MLP feature extractor. We use t-SNE to project the feature space into a two-dimensional space.
We conduct our experiments on four previously published datasets and one newly created dataset. These datasets are from different research domains, and the percentage of relevant documents ranges from 4.6% to 23.0%. Table 1 summarizes their statistics.
• Calcium [12]: This dataset was released in research on how to use citation classification to accelerate systematic review. Its theme is the study of calcium channel blockers, and it is in the medicine domain.
• Nudging [41]: This dataset is about a systematic review in the social science domain. The theme of this research is nudging healthcare professionals into evidence-based medicine.
• Depression [4]: This dataset is in the animal science domain and comprehensively includes published preclinical non-human animal literature on depression.
• Virus [23]: This dataset is from the medicine domain and is about performing viral Metagenomic Next-Generation Sequencing (mNGS) in common livestock.