BioSift: A Dataset for Filtering Biomedical Abstracts for Drug Repurposing and Clinical Meta-Analysis

This work presents a new, original document classification dataset, BioSift, to expedite the initial selection and labeling of studies for drug repurposing. The dataset consists of 10,000 human-annotated abstracts from scientific articles in PubMed. Each abstract is labeled with up to eight attributes necessary to perform meta-analysis utilizing the popular patient-intervention-comparator-outcome (PICO) method: has human subjects, is clinical trial/cohort, has population size, has target disease, has study drug, has comparator group, has a quantitative outcome, and an "aggregate" label. Each abstract was annotated by 3 different annotators (i.e., biomedical students), and randomly sampled abstracts were reviewed by senior annotators to ensure quality. Data statistics such as reviewer agreement, label co-occurrence, and confidence are shown. Robust benchmark results illustrate that neither PubMed advanced filters nor state-of-the-art document classification schemes (e.g., active learning, weak supervision, full supervision) can efficiently replace human annotation. In short, BioSift poses a pivotal but challenging document classification task to expedite drug repurposing. The full annotated dataset is publicly available and enables research development of algorithms for document classification that enhance drug repurposing.


INTRODUCTION: DRUG REPURPOSING VIA NATURAL LANGUAGE PROCESSING
The development of clinical drugs is an expensive process requiring billions of dollars in research and development to bring a new drug to market [7,46,55]. Drug repurposing seeks to reduce the cost of discovering new treatments by identifying currently approved drugs with therapeutic value for other diseases [2]. Doing so relies on aggregating clinical studies and data to identify therapeutic combinations of the highest value [3].
Drug repurposing (sometimes called drug repositioning) is the use of an existing drug for a different disease or indication other than the one for which it was initially developed or marketed [39]. Drug repurposing is a safe and cost-effective way to expedite treatment discovery. It is particularly effective for novel, rare, or intractable diseases where current standard-of-care treatments are inadequate. For example, repurposed drugs were critical during the initial onset of the SARS-CoV-2 (COVID-19) pandemic [40]. Even if a repurposed drug may not fully ameliorate a new disease, it could be a powerful adjuvant therapy that enhances the efficacy of existing standard-of-care treatments or decreases adverse events or side effects. Drug repurposing may be done by evaluating molecular similarities; comparing shared biochemical targets; examining associations with adverse event profiles; examining the effect of popular therapeutics for common antecedent diseases or co-morbidities; or other forms of measured association between a drug and a specific patient attribute. Once a repurposed drug candidate is identified, it can undergo expedited clinical testing due to the existing safety profiles. If the repurposed drug candidate is deemed successful, it may undergo standard regulatory approval for the new indication or be prescribed off-label if the new indication is too rare for a standard clinical trial.
Searching, filtering, reviewing, and analyzing large volumes of scientific literature is critical to the drug repurposing process. On average, 75+ clinical trials are published each day [6]. Traditional efforts to synthesize data from the literature for drug repurposing, systematic review, or meta-analysis primarily rely upon PubMed advanced search filters to index and retrieve candidate documents. Unfortunately, neither standard nor advanced PubMed search filters enable efficient filtering of critical attributes for meta-analysis. Typically only a very small proportion of retrieved PubMed documents meet inclusion criteria [12,38] for meta-analysis. Document filtering remains a pivotal bottleneck in drug repurposing meta-analysis [1]. Improved automatic document filtering is needed to remove irrelevant documents and improve downstream processes for curating data necessary for drug repurposing.
To this end, we construct and release an extensive annotated dataset, BioSift, that enables improved filtering based on attributes utilized for meta-analysis in drug repurposing. Namely, most meta-analyses employ the patient-intervention-comparator-outcome (PICO) method when determining if a document has the elements necessary for study inclusion. Experiments demonstrate that our dataset enables more nuanced document inclusion/exclusion than is available in PubMed advanced search alone. BioSift enables users to screen out 70+% of returned articles not containing relevant data. Thus, BioSift significantly decreases the research time required for filtering articles for biomedical evidence synthesis. Our results illustrate that current active learning, weak supervision, and full supervision algorithms are not able to fully automate the filtering process for drug repurposing. However, BioSift is an extremely valuable open resource for continued machine learning development of improved document filtering algorithms for drug repurposing. This paper makes the following contributions:

• We develop a protocol for filtering documents relevant to drug discovery using defined attributes that better emulate the PICO review process utilized by clinical scientists.

• We present a human-annotated dataset of 10,000 PubMed abstracts with eight unique filtering attributes or labels that indicate an article's likely utility for inclusion in a clinical meta-analysis.

• We present three low-resource baselines and one fully-supervised baseline to compare different automated strategies for biomedical abstract filtering in the absence of annotation resources.

DATASET
We present BioSift, a collection of 10,000 documents labeled with multiple criteria to filter clinical studies containing relevant information for drug repurposing. Inclusion criteria were chosen based on collaboration with epidemiological experts to retain only abstracts containing sufficient information to be used in a meta-analysis on drug repurposing potential. Inclusion criteria and other document statistics are shown in Table 1. Three or more curators annotated each document, with expert curators checking a sample of disagreeing labels during a quality control phase. A depiction of the end-to-end document selection, filtering, and annotation process is shown in Figure 1, and the relative co-occurrence of the seven labels in BioSift is shown in Figure 2.
Candidate documents were retrieved with PubMed queries containing the names and/or Medical Subject Headings (MeSH) titles of cancer types and drugs. The objective of this query was to gather clinical evidence of whether drugs used to treat comorbidities or antecedent diseases had a positive or negative effect on cancer outcomes. The pool of documents was taken as the union of results for these queries for 8 different types of cancer and 94 non-cancer drugs. Following the PubMed search, abstracts were further filtered by removing those that did not have any chemical entities in their MeSH terms or had 5 or fewer words in the text of the abstract. The final post-filtering pool of documents contained 58,720 unique abstracts, from which we randomly selected 10,000 for annotation.
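As an illustration, the two post-filtering rules are straightforward to apply programmatically. The following is a minimal Python sketch, assuming each retrieved record exposes its abstract text and chemical MeSH terms under hypothetical field names (`abstract`, `chemical_mesh_terms`); it is not the exact pipeline used to build BioSift.

```python
def keep_abstract(record: dict) -> bool:
    # Drop abstracts with no chemical entities among their MeSH terms,
    # or with five or fewer words of abstract text.
    has_chemical_mesh = bool(record.get("chemical_mesh_terms"))
    long_enough = len(record.get("abstract", "").split()) > 5
    return has_chemical_mesh and long_enough

# retrieved_records is an assumed list of dicts produced by the PubMed queries.
filtered_pool = [r for r in retrieved_records if keep_abstract(r)]
```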

Annotator Selection and Training
The dataset was annotated by a cohort of 58 university undergraduate students selected from biology, computer science, neuroscience, and biomedical engineering majors. Additionally, 10 students with prior annotation training and experience were recruited as quality control managers. The BioSift student annotation program was similar to our previous award-winning undergraduate biocuration program [41].
The annotator recruiting process consisted of two rounds of screening. First, a graded assessment was used to evaluate the candidates' untrained "annotation aptitude" using a simplified schema similar to the present study. Candidates who achieved a satisfactory score were interviewed in small groups (fewer than 6 students). Candidates were asked a series of questions regarding their interest in the project and their problem-solving strategies. Of the 83 candidates who applied for the position, 58 were ultimately recruited as BioSift annotators.
Annotator training was conducted over a 6-week period. First, students participated in live lectures designed to introduce them to the annotation schema, annotation software, relevant vocabulary, and context surrounding the project goals. Next, students were given formal annotation training, including annotation guides and worked examples that defined the labeling schema, live guidance in labeling practice abstracts, self-paced practice annotation problems, and graded practice annotation assessments.
Prior to annotating BioSift, a 2-week beta test was performed to assess the developed schema and the success of the annotator training. At the conclusion of the beta test, annotators were surveyed for feedback regarding the study label schema and annotation platform. Beta test results were used to refine the training resources and the final BioSift labeling schema to reduce error and improve inter-annotator agreement.
During all stages (training, beta test, and final annotation of BioSift) the students were given tools to openly communicate directly with each other, the quality control managers, and research coordinators via an electronic communication platform and live virtual discussions.

Final Annotation and Data Quality Control
Each abstract in BioSift was annotated by 3+ different students using LightTag [45]. The annotators were encouraged to submit comments with challenging or confusing abstracts to proactively prevent errors due to semantic or lexical misunderstandings. All curated abstracts without inter-annotator disagreement and without comments were accepted without manager-level quality control. If there was inter-annotator disagreement, the abstract was reviewed by a separate quality control manager to correct the abstract's annotations.
Quality control (QC) for BioSift data was conducted by a team of 10 student managers with both formal annotation training and at least 6 months of previous annotation experience. The quality control team was directly involved in training the student annotators and creating annotation resources for the project. The managers received additional quality control training from the research study coordinator. The quality control protocol required the managers to: 1) validate and/or fix potential annotation errors; and 2) review and resolve inter-annotator disagreement to discern a final "ground truth" annotation for each abstract.
The final round of quality control involved ranking the articles in descending order of disagreement among the three annotators across the seven classes. The articles with the highest disagreement levels were assigned a final round of quality control with two annotators for each article. First, the confidence level of each annotator was ranked based on agreement with the ground truth labels for a gold set of 25 articles. QC annotations with complete agreement were taken as ground truth. For QC annotations with disagreement, the final label was determined as the annotation of the annotator with the higher confidence score.
The statistics and results in this paper pre-date this final round of quality control, which affects < 1% of annotations. The data incorporating this quality control will be available in the GitHub repository.

Dataset and Annotation Statistics
For the 10,000 annotated abstracts in BioSift, we evaluate the positive annotation ratio for each label class, inter-annotator agreement, and co-occurrence between positive labels. Figure 3 shows the proportion of inter-annotator agreement for each class. It demonstrates that, for all labels except Comparator Group, more than 50% of abstracts are annotated positive by all three annotators.
Figure 4 shows the distribution of the number of labels with complete agreement among annotators. It shows that 4 or more labels are in complete agreement in most abstracts.
We define the annotation ratio as

$$\text{annotation ratio} = \frac{\text{number of positive annotations}}{\text{total number of annotations}}$$

and assign each category a positive binary label when the annotation ratio exceeds 0.6. The aggregate label for an abstract is positive when all category labels are positive. Figure 5 shows the Pearson correlation coefficient between the binary labels, including the aggregate label. It highlights that some labels are strongly correlated, like Population Size with Quantitative Outcome, Human Subjects, and Cohort Study/Clinical Trial. It also shows that Quantitative Outcome and Comparator Group have the most significant effect on the aggregate label.
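For illustration, this binarization rule can be expressed in a few lines of NumPy; the array layout below is an assumption made for the sketch, not the released data format.

```python
import numpy as np

def binarize_labels(votes: np.ndarray, threshold: float = 0.6):
    """votes: (n_abstracts, n_annotators, 7) array of 0/1 annotations."""
    ratio = votes.mean(axis=1)                   # positive annotations / total annotations
    labels = (ratio > threshold).astype(int)     # per-category binary labels
    aggregate = labels.all(axis=1).astype(int)   # positive only if every category is positive
    return labels, aggregate
```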
We additionally observe that positive labels have higher inter-annotator agreement than negative labels, as pictured in Figure 6.

METHODS
The document filtering/classification task presented in BioSift is one that has normally been solved by carefully crafted queries (e.g., the Cochrane Highly Sensitive Search [11]), supplemented with post-filtering based on rules, heuristics, and machine learning models [1,37,38,53]. Since manual curation resources are often very limited due to the high cost of obtaining reviewers with sufficient medical expertise, previous work has primarily relied upon machine learning methods that generalize well with little to no labeled data. We accordingly test a slate of models taken from the active learning, weak supervision, and prompt-based zero-shot learning domains and compare them to fully-supervised transformer models fine-tuned on our data. We additionally compare these models with results from carefully crafted PubMed advanced search queries. Results illustrate that document filtering for drug repurposing meta-analysis is a difficult task and that utilization of BioSift data meaningfully improves document filtering.

Problem Formulation
We formulate the document filtering problem in BioSift as a multi-label classification task with 7 independent labels plus a binary aggregate label, as described in Section 2. For each class, we report the precision, recall, and F1-score of each evaluated model, defined as

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$

where $TP$, $FP$, and $FN$ are the counts of true positives, false positives, and false negatives, respectively.

Weakly Supervised Learning
Weak supervision is the use of programmatic labeling to obtain noisy estimates of labels on data points. Programmatic labeling functions (LFs) generally take the form of heuristics, expert-defined rules, lookups in dictionaries/databases, or outputs of other models used to approximate labels for a given task. Since weak supervision does not rely on ground truth labels, labeling functions can be applied to both labeled and unlabeled documents to create a larger pool of training documents than would otherwise be possible.
For our document filtering task, we develop LFs comprising keyword rules, regular expressions, and NER models to identify evidence of each inclusion criterion. Rules were written with the software package Snorkel [47], with LF outputs defined as ABSTAIN = −1, EXCLUDE = 0, or INCLUDE = 1. For categories where it is difficult to craft rules that can precisely exclude documents (e.g., Has comparator group, Has population size), ABSTAIN outputs were treated as EXCLUDE, as done in [12], to avoid excessive LF imbalance. We created a total of 32 LFs, which collectively matched 99.1% of the instances in our dataset. A comprehensive list of LFs grouped by inclusion criterion can be found in Table 8.
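For illustration, the following is a minimal Snorkel-style sketch of two hypothetical LFs and their application to a document pool; the regular expression and keywords are illustrative stand-ins, not the LFs listed in Table 8, and `df_train` is an assumed pandas DataFrame with an "abstract" column.

```python
import re
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, EXCLUDE, INCLUDE = -1, 0, 1

@labeling_function()
def lf_population_size_regex(x):
    # Hypothetical rule: an explicit cohort size such as "n = 1,250" suggests a reported population size.
    return INCLUDE if re.search(r"\b[nN]\s*=\s*[\d,]+", x.abstract) else ABSTAIN

@labeling_function()
def lf_human_subjects_keywords(x):
    # Hypothetical keyword rule for evidence of human subjects.
    keywords = ("patients", "participants", "volunteers")
    return INCLUDE if any(k in x.abstract.lower() for k in keywords) else ABSTAIN

lfs = [lf_population_size_regex, lf_human_subjects_keywords]
L_train = PandasLFApplier(lfs=lfs).apply(df_train)  # (n_docs, n_lfs) matrix of LF votes
```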
The LFs were used to generate weak labels for the entire labeled BioSift corpus as well as the remaining 46,720 unlabeled documents. For each inclusion criterion, LF outputs were aggregated by majority voting (MV) to form a higher-confidence weak label for the document. We also tried aggregating weak labels with the generative label model described in [47] but found that it produced inferior results to MV. Aggregated weak labels were used to fine-tune a pre-trained biomedical language model to allow prediction on documents unmatched by some or all LFs. The model was fine-tuned using a masked binary cross entropy (BCE) loss,

$$\mathcal{L} = -\frac{1}{\sum_{i,c} m_{i,c}} \sum_{i} \sum_{c} m_{i,c} \left[ y_{i,c} \log \hat{y}_{i,c} + (1 - y_{i,c}) \log (1 - \hat{y}_{i,c}) \right],$$

where the mask $m_{i,c} \in \{0,1\}$ prevents the loss from being computed on categories for which an instance is not labeled. Once trained, the model was evaluated by picking the threshold that maximizes the F1 score on the validation set for each label and using these thresholds to predict labels for the seven classes. An aggregate label of 1 was assigned when all predicted classes were positive and 0 otherwise.
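A brief sketch of the aggregation and loss described above, reusing the `L_train` matrix from the previous sketch for a single criterion and assuming a model that emits one logit per category; this is an illustrative reading of the setup, not the exact training code.

```python
import torch
import torch.nn.functional as F
from snorkel.labeling.model import MajorityLabelVoter

# Aggregate LF votes (-1 = abstain) into one weak label per document for this criterion.
weak_labels = MajorityLabelVoter().predict(L=L_train)

def masked_bce_loss(logits: torch.Tensor, targets: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Binary cross entropy over the 7 categories; targets and mask are float tensors,
    and mask == 0 marks document/category pairs for which no weak label exists."""
    per_entry = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (per_entry * mask).sum() / mask.sum().clamp(min=1)
```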

Zero-Shot Filtering
Zero-shot classification methods enable document filtering without requiring significant computational resources for model training or data labeling. Prior works have used natural language inference (NLI) based methods for zero-shot text classification by modeling it as a textual entailment task. Such models are trained to determine if one statement naturally follows from another.
For each label, we created a set of hypothesis templates, which are text statements indicating that an abstract did or did not meet the given inclusion criterion. Classification was performed by concatenating a document with the positive and negative hypothesis templates, passing it through the pre-trained model, and comparing the relative entailment probabilities of the positive and negative hypotheses. We experimented with multiple templates for each class, and the best-performing templates are given in Table 4.
The training data was used only to determine the optimal probability threshold for classifying an input as the positive class. This threshold is selected by computing the precision-recall curve and choosing the threshold where precision equals recall on the training data. This threshold is then fixed for evaluation on the test data. Predictions were made separately for each label. An aggregate label of 1 was assigned when all class-wise labels were 1.
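A minimal sketch of this entailment-based scoring, assuming an off-the-shelf NLI checkpoint such as facebook/bart-large-mnli (not necessarily the model evaluated here) and illustrative hypothesis templates rather than those in Table 4.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "facebook/bart-large-mnli"  # assumption: any NLI-fine-tuned checkpoint can be swapped in
tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)

def entailment_prob(premise: str, hypothesis: str) -> float:
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = nli_model(**inputs).logits[0]
    # For this checkpoint the label order is (contradiction, neutral, entailment).
    return torch.softmax(logits, dim=-1)[2].item()

def predict_label(abstract: str, pos_template: str, neg_template: str, threshold: float) -> int:
    p_pos = entailment_prob(abstract, pos_template)   # e.g. "This study reports a comparator group."
    p_neg = entailment_prob(abstract, neg_template)   # e.g. "This study has no comparator group."
    score = p_pos / (p_pos + p_neg)                   # relative entailment probability
    return int(score >= threshold)
```

The per-label threshold would then be chosen on the training data at the point where precision equals recall, as described above.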

Active Learning
Labeling documents for drug repurposing is a complex task requiring a certain level of medical expertise, making documents more difficult and expensive to label. Active learning (AL) iteratively selects the most informative unlabeled instances for human labeling based on a mathematical query strategy. Newly labeled data is then used to update the model, and the process repeats until a stopping criterion is met. This process aims to maximize model performance given a limited annotation budget. In theory, it allows a smaller volume of data to be annotated while achieving a similar level of predictive quality.
For our study, we used AL to fine-tune PubMedBERT [25] and compared three well-known query strategies described in a recent review by Schroeder et al. [51], along with a random sampling baseline. Query strategies used a pool-based approach, where a batch of k samples is selected for annotation at each iteration. All query strategies used implementations from the small-text AL library [52] with batches of k = 20 samples.
For our query strategies, we denote instances by $x_1, x_2, \ldots, x_n$, with the label for instance $x_i$ denoted $y_i$, where $\forall i,\ y_i \in \{0, 1\}$. The predicted class distribution is denoted by $P(y_i \mid x_i)$.
Our query strategies are as follows:

1. Random Sampling (RS) selects samples uniformly from the unlabeled data pool. This is the most commonly used baseline against which other query strategies are compared.

2. Prediction Entropy (PE) selects the samples whose predicted class distribution has the highest entropy, $x^*_{PE} = \operatorname{argmax}_x \left( -\sum_k P(k \mid x) \log P(k \mid x) \right)$.

3. Least Confidence (LC) [9] picks the sample whose top prediction $k^*$ from the current model has the least confidence, $x^*_{LC} = \operatorname{argmin}_x P(k^* \mid x)$, where $k^* = \operatorname{argmax}_k P(k \mid x)$.

4. Breaking Ties (BT) [36,49] takes the samples with the minimum gap between the top two most likely probabilities, $x^*_{BT} = \operatorname{argmin}_x \left[ P(k_1^* \mid x) - P(k_2^* \mid x) \right]$, where $k_1^*$ is the most likely label and $k_2^*$ is the second most likely label.
We evaluated each of the above query strategies for the seven labels separately and classified the aggregate label as 1 if all seven labels were 1, and 0 otherwise.
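For intuition, each uncertainty-based strategy reduces to ranking pool instances by a score computed from the predicted class distribution. A schematic NumPy version, independent of the small-text implementations actually used, is shown below.

```python
import numpy as np

def select_batch(probs: np.ndarray, k: int = 20, strategy: str = "breaking_ties",
                 rng: np.random.Generator = np.random.default_rng(0)) -> np.ndarray:
    """probs: (n_pool, n_classes) predicted class probabilities for the unlabeled pool."""
    if strategy == "random":
        return rng.choice(len(probs), size=k, replace=False)
    sorted_probs = np.sort(probs, axis=1)
    if strategy == "least_confidence":
        scores = sorted_probs[:, -1]                          # low top-class confidence first
    elif strategy == "prediction_entropy":
        scores = (probs * np.log(probs + 1e-12)).sum(axis=1)  # negative entropy: high entropy first
    else:  # breaking ties
        scores = sorted_probs[:, -1] - sorted_probs[:, -2]    # small margin between top two first
    return np.argsort(scores)[:k]                             # indices of the k most informative instances
```

In practice, the selected indices are sent to annotators, the model is retrained on the enlarged labeled set, and the loop repeats until the annotation budget is exhausted.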

Supervised Learning
Given the performance of large, transformer-based language models on document classification, we fine-tuned a diverse collection of biomedical language models on BioSift. All models were fine-tuned for 5 epochs with a batch size of 16 and weight decay of 0.01. The model from the best-performing epoch (as determined by the validation set) was evaluated on the test set at the end of training. Models included are PubMedBERT [25], BioBERT [31], RoBERTa [35], KRISSBERT [58], SapBERT [34], BART [33], BigBird [57], and BioELECTRA [28].
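A sketch of this fine-tuning setup using the Hugging Face Trainer, with PubMedBERT as an example checkpoint; the dataset objects, the exact checkpoint name, and the label encoding (7-dimensional float multi-hot vectors) are assumptions for illustration.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed PubMedBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=7,
    problem_type="multi_label_classification",  # BCE loss over the seven labels
)

args = TrainingArguments(
    output_dir="biosift-pubmedbert",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # keep the best-performing epoch on the validation set
)

# train_dataset and val_dataset are assumed pre-tokenized datasets with float "labels".
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()
```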

Overall Results
The results of all tested models on predicting the multi-label classes of BioSift are shown in Table 5.
Fully supervised transformer models outperform all low-resource strategies for predicting each individual label and the aggregate document label.
Weakly supervised models have high recall but low precision. This result is likely due to the high propensity of LFs to label positive, which exaggerates the class imbalance beyond what is actually present in the dataset. Thus, weak supervision tends to under-filter documents for drug repurposing.
AL methods generally have lower recall than methods that learn from more samples. Here, the AL methods are often more precise than other low-resource methods but are more likely to miss documents with positive labels that should be included for drug repurposing.
PubMed filters tend to be more precise than other filtering methods, sometimes even exceeding fully-supervised precision. However, PubMed often excludes a larger proportion of documents that should be included for drug repurposing.
Our overall results illustrate that document filtering for drug repurposing is a very challenging task. Despite being widely known for inefficiently filtering abstracts for drug repurposing, carefully crafted PubMed queries often outperform the filtering ability of state-of-the-art low-resource machine learning algorithms. Our results highlight the need for new algorithms to improve the accuracy of document filtering tasks for drug repurposing.

Comparison with PubMed Filtering
PubMed advanced search filtering is the primary method biomedical researchers use to identify and select relevant abstracts for a particular research area. For each category annotated in our dataset, we used multiple advanced queries to replicate the results in our annotated dataset. Table 6 shows the PubMed filtering arguments that produced the best F1 score for each category. While some PubMed filters can be quite precise, they often omit large numbers of documents that would otherwise be desirable to include in a meta-analysis. Notably, each PubMed filter can discard up to 40% of results with the corresponding desirable property, and these losses compound when filters are aggregated. Moreover, PubMed does not provide any means of filtering for drug/disease-focused studies beyond the MeSH terms included in our initial query.
Table 7 gives examples of documents that were incorrectly included. Here, keyword-based PubMed searches fail to filter out abstracts that do not meet inclusion criteria. Similarly, Table 8 shows documents incorrectly excluded by PubMed filtering. Here, clear examples of clinical trials with carefully delineated comparator groups and quantified results that should have been included were instead removed.

Weakly Supervised Learning Results
Weak supervision has the potential to make learning significantly more efficient by reducing the need for annotators to label abstracts individually. We evaluate the extent to which weak supervision can label each class by post hoc computation of coverage, precision, recall, and other metrics on the train set of BioSift. These results are summarized in Table 9.
LF evaluation shows substantial disparities in coverage between classes, with Cohort Study/Clinical Trial and Comparator Group having the lowest coverage, and Study Drug, Target Disease, and Human Subjects having the highest coverage. We also see that majority voting consistently outperforms the Snorkel label model by a small margin. This may be due to the large class imbalance present in the LF outputs due to the difficulty of creating exclusion rules.

Utility of Active Learning
Due to the relatively high cost of annotating examples in the biomedical domain, we evaluate whether active learning can be used to annotate a smaller pool of abstracts while achieving comparable accuracy. The active learning section of Table 5 shows that the best AL method with 50 query batches (1,000 total samples) has better precision than weak supervision but lags behind all other models in recall.
We also evaluated how much each AL model continues to improve as the total number of samples increases. Figure 7 shows accuracy vs. number of samples for prediction entropy, the query strategy with the highest F1 score. This figure illustrates that model performance rapidly improves near the beginning of training but slows considerably for most classes between 200 and 400 samples.

NLP Drug Repurposing & Meta-Analysis
Natural language processing has recently shown strong potential for synthesizing evidence for systematic reviews of biomedical literature [1,38]. However, these reviews rely upon PubMed filtering to select articles to be included in such reviews [4,53]. This results in systems that are either highly restrictive in the types of evidence that can be included or that require further manual curation or rule-based filtering [12,38]. While some published works construct filtering datasets for specific diseases such as cancer [4], the developed datasets are proprietary and not accessible for use by the general research community. BioSift makes this task more accessible by open-sourcing such data for public, unrestricted use. A few recent datasets seek to enable the extraction of PICO elements from clinical trials to facilitate evidence-based medicine. Nye et al. [43] use crowd workers to provide detailed annotations of patients, interventions, and outcomes in a corpus of clinical trials. Similarly, Zlabinger et al. develop a PICO annotation protocol that leads to improved annotation outcomes and use this to present an additional corpus with token-level PICO tags. Thomas et al. [54] develop a machine learning model for classifying whether or not a clinical study is a randomized controlled trial. BioSift complements these projects in enabling researchers to filter based on additional inclusion criteria to facilitate the automation of medical evidence synthesis.

Weakly Supervised Learning
Dua et al. [12] build a weakly supervised pipeline to filter documents for repurposing non-cancer drugs for cancer treatment. The authors develop a set of labeling functions targeted at excluding abstracts that are about cancer-related genes, cancer prevention, and premalignant patients. Similar to our weak supervision sources, they also create LFs using SciSpacy to determine if relevant diseases and drugs are present in documents. However, BioSift presents LFs aimed at a more general goal and provides an open-source resource for the development and evaluation of weak supervision for drug repurposing, which Dua et al. do not. Dhrangadhariya and Müller develop a weak supervision pipeline for recognizing token-level PICO elements in text using expert-defined heuristics and alias matching to biomedical ontologies. BioSift differs from their work by presenting a new dataset and focusing on document filtering instead of token classification.

Zero-Shot Filtering
Yin et al. [56] first propose approaching zero-shot text classification as a textual entailment problem. They train a BERT [10] model on mainstream entailment datasets to learn the relationships between premises and hypotheses. For zero-shot classification, they convert labels into hypotheses and then use the previously pre-trained model to get an entailment decision.
In the biomedical domain, Barker et al. [5] propose a hybrid architecture that pairs a supervised text classification model with an NLI reranker to improve classification performance when training data is abundant for some classes but scarce or even nonexistent for others. Koutsomitropoulos [29] also suggests validating the quality of ontology-based annotations of biomedical resources using NLI models such as BART [33] and XLM-R [8] to overcome training barriers posed by large label sets and scarcity of data.

Active Learning
Active learning was first introduced by Lewis and Gale [32], who applied uncertainty sampling to text classification. They iteratively sample low-confidence examples for labeling until a target accuracy is reached. In the biomedical domain, Guo et al. [26] used SVM-based active learning to annotate biomedical articles and achieved 82% accuracy with 2% of the examples used to train a similar fully supervised model. Active learning is frequently used in annotation pipelines to accelerate the work of human labelers [42] and is a common component of many commercial annotation platforms [44,45].

CONCLUSION
This paper presents a new, original document classification dataset, BioSift, consisting of 10,000 human-annotated abstracts to expedite the initial selection and labeling of studies for drug repurposing. Each abstract is annotated by at least three human annotators and undergoes subsequent quality control. Robust benchmark results on the dataset illustrate that neither PubMed advanced filters nor state-of-the-art document classification algorithms can efficiently replace human annotation. Thus, the publicly available dataset, BioSift, facilitates the future development of improved algorithms for document filtering aimed at drug repurposing.

CCS CONCEPTS
• Applied computing → Life and medical sciences; • Information systems → Information retrieval; Document filtering.


Figure 3: Inter-Annotator Agreement for Each Class

Figure 7: Classification Accuracy vs. Number of AL Samples

Table 1: Dataset Statistics

Table 2: Examples of included drugs in pre-filtering PubMed queries for cancer drug repurposing

Table 9: Label Model and Majority Voter Performance
