Large-scale Dataset and Effective Model for Variant-Disease Associations Extraction

Extracting variant-disease associations (VDAs) from the biomedical literature is a critical task in biomedical and genomics research, as it provides valuable insights into the genetic basis of diseases and facilitates the development of precision medicine. The biomedical literature is a vast and growing source of information containing a wealth of knowledge on genetic variants and their associations with diseases. However, the manual extraction of VDAs from the literature is a time-consuming and labor-intensive process, making it challenging to keep up with the rapidly expanding literature. Therefore, there is a pressing need to develop computational methods for effectively extracting and curating VDAs from the biomedical literature, and to build a comprehensive dataset for this significant task. In this paper, we present a large-scale, semi-automatically annotated dataset for VDA extraction from the biomedical literature (called VDAL) based on the DisGeNet platform, which contains one of the largest publicly available collections of genes and variants associated with human diseases. To the best of our knowledge, VDAL is one of the largest datasets for VDA extraction, containing 9,362 related PubMed documents in the biomedical domain. In addition, we propose a novel and simple yet effective model, called VDANet, which incorporates the corresponding gene embeddings of the variants into the model to better explore the associations between genetic variants and human diseases. Extensive experiments on the constructed dataset show that VDANet significantly outperforms the state-of-the-art baseline methods, thus establishing a new benchmark for VDA extraction. For reproducibility, our code and data are available at https://github.com/JasonCLEI/VDANet.


INTRODUCTION
The study of genetic variants and their associations with diseases is of paramount importance in understanding the underlying mechanisms of human diseases and in the development of precision medicine [1,5]. A vast amount of knowledge on variant-disease associations (VDAs) is embedded in the ever-growing biomedical literature, which serves as a valuable source of information for genomics research. For instance, the PubMed website [15,19] hosts over 35 million citations from the biomedical literature, including MEDLINE, life science journals, and online books, offering an abundance of resources and evidence for extracting and elucidating various aspects of VDAs. However, the manual extraction of VDAs from the literature is a labor-intensive and time-consuming task, making it difficult to keep up with the rapid expansion of the biomedical literature. This highlights the need for efficient computational methods to extract and curate VDAs from the available literature and to build a comprehensive dataset for this important task.
In recent years, there has been growing interest in developing automated methods for extracting various biomedical entities and their relations, such as gene/variant-disease, chemical-disease, and drug-disease associations, from the biomedical literature [5, 9-11, 16, 20]. These methods rely on natural language processing techniques, machine learning algorithms, and knowledge-based approaches to identify and extract relevant information from the text [2,3,6,8,21]. However, the extraction of more fine-grained VDAs has received comparatively less attention than other biomedical entity relations. One of the main challenges is the lack of large-scale, high-quality annotated VDA datasets available for training and evaluation. Also, a more effective model is required to better explore the associations between genetic variants and human diseases.

[Example document (PMID: 7545869): "The L206W mutation of the cystic fibrosis gene, relatively frequent in French Canadians, is associated with atypical presentations of cystic fibrosis. Cystic fibrosis is caused by mutations in the cystic fibrosis transmembrane conductance regulator (CFTR) gene. Over 400 mutations have been reported at this locus. Although severe forms of cystic fibrosis are usually associated with pancreatic insufficiency, pulmonary dysfunction, and elevated sweat chloride, there is a wide range of phenotypes, including congenital absence of the vas deferens, observed with some of the milder mutations. The L206W mutation, which was first identified in patients from South France, is relatively frequent in French Canadians from Quebec. In this report, we document the atypical form of cystic fibrosis associated with this mutation in a cohort of 7 French Canadian probands."]

In this paper, we address these challenges by presenting a large-scale, semi-automatically annotated dataset for VDA extraction from the biomedical literature, called VDAL. Our dataset is based on the DisGeNet platform [12-14], which contains one of the largest publicly available collections of genes and variants associated with human diseases. In VDAL, each variant-disease pair is substantiated by DisGeNet, with sufficient evidence and corresponding PubMed publications, establishing it as a reliable dataset for VDA extraction. To the best of our knowledge, VDAL is one of the largest datasets for VDA extraction, containing 9,362 related PubMed documents in the biomedical domain. The construction of such a large-scale dataset enables the development and evaluation of more advanced computational methods for VDA extraction.
In addition to the dataset, we propose a novel and simple yet effective model, called VDANet, which incorporates the corresponding gene embeddings of the variants into the model to better explore the associations between genetic variants and human diseases. The incorporation of gene embeddings provides additional biological knowledge that helps improve the model's performance in identifying VDAs in the biomedical literature.
We conducted extensive experiments on the constructed VDAL dataset to evaluate the performance of VDANet and compare it with state-of-the-art baseline methods, including BioBERT [8], SciBERT [2], PubMedBERT [6] and ATLOP [21]. Our results show that VDANet significantly outperforms the baseline methods, thus establishing a new benchmark for VDA extraction. We believe that our dataset and the proposed model will serve as valuable resources for the research community working on VDA extraction from the biomedical literature and contribute to the advancement of genomics research and precision medicine.

DATASET CONSTRUCTION

Data Collection
The data used to generate VDAL was originally collected from the DisGeNet platform, which offers one of the largest publicly available collections of genes and variants associated with human diseases. DisGeNet comprises two types of source databases for VDAs: curated and literature-based. The curated data comprises VDAs from expert-curated resources, such as UniProt, ClinVar, the GWAS Catalog, and GWASdb, while the literature data encompasses VDAs extracted from the biomedical literature through text-mining techniques, such as BeFree [3]. Specifically, we obtained the original VDA data from DisGeNet (v7.0), including both types of source databases, consisting of 369,554 variant-disease associations between 194,515 variants and 14,155 diseases, traits, and phenotypes. To access the data, we employed the Browse function provided in the DisGeNet web interface and retrieved VDAs with evidence from PubMed publications supporting the association, thereby ensuring a high level of confidence in the collected data.

Data Cleaning
After collecting the original VDA dataset, we performed a data cleaning process to ensure the quality and reliability of the VDA data. This process involved representing all genetic variants by their unique dbSNP rsIDs and all diseases by their unique MeSH IDs, as well as storing the corresponding NCBI Entrez gene IDs of the variants to facilitate a more comprehensive exploration of the associations between genetic variants and human diseases. To guarantee the reliability of the VDA data, we prioritized retaining data from multiple sources, with at least one source being expert-curated, and required clear evidence from PubMed publications supporting the association, i.e., the publication must mention both the variant and the disease. During the data cleaning process, we identified and filtered out duplicate VDAs to avoid redundancy in the dataset, and prioritized expert-curated VDAs over those from literature sources when multiple sources indicated the same VDAs. These data-cleaning steps yielded a reliable, high-quality VDA dataset that serves as the foundation for our subsequent analyses.
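The cleaning rules above can be sketched as follows. This is an illustrative sketch only; the record fields (`rsid`, `mesh_id`, `source`, `pmids`) and all IDs are hypothetical, not the actual DisGeNet schema.

```python
# Illustrative sketch of the cleaning rules described above. Field names
# (rsid, mesh_id, source, pmids) are hypothetical, not the actual schema.
def clean_vdas(records):
    """Keep one record per (variant, disease) pair, requiring PubMed
    evidence and preferring expert-curated sources over literature ones."""
    best = {}
    for rec in records:
        if not rec["pmids"]:                     # require PubMed evidence
            continue
        key = (rec["rsid"], rec["mesh_id"])      # dbSNP rsID + MeSH ID
        prev = best.get(key)
        if prev is None or (rec["source"] == "curated"
                            and prev["source"] != "curated"):
            best[key] = rec                      # curated wins duplicates
    return list(best.values())

# Toy records (hypothetical IDs)
records = [
    {"rsid": "rs111", "mesh_id": "D001", "source": "literature", "pmids": [123]},
    {"rsid": "rs111", "mesh_id": "D001", "source": "curated", "pmids": [123]},
    {"rsid": "rs222", "mesh_id": "D002", "source": "curated", "pmids": []},
]
cleaned = clean_vdas(records)   # one record survives: the curated duplicate
```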

Dataset Generation
After completing the data-cleaning process, we obtained a set of reliable variant-disease associations along with their corresponding PubMed publications. For the named entity recognition (NER) step, we utilized the widely recognized tool PubTator [18], which allowed us to accurately identify the exact locations of each variant and disease mentioned in the PubMed publications and to assign unique IDs to these mentions. For the main relation extraction (RE) task, we treated the variant-disease pairs in our semi-automatically annotated VDA set as positive instances, while all other variant-disease pairs were considered negative instances, following previous studies [20,21].
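The positive/negative instance construction described above can be illustrated with a minimal sketch (all IDs are toy placeholders):

```python
from itertools import product

# Annotated VDA pairs in a document are positives; every other
# variant-disease co-occurrence in the same document is a negative.
def build_instances(variants, diseases, positive_pairs):
    return [(v, d, 1 if (v, d) in positive_pairs else 0)
            for v, d in product(sorted(variants), sorted(diseases))]

instances = build_instances({"rs111", "rs222"}, {"D001", "D002"},
                            {("rs111", "D001")})
num_pos = sum(label for _, _, label in instances)   # 1 positive, 3 negatives
```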
Ultimately, we compiled a final VDAL dataset comprising 9,362 related PubMed publications, containing 12,149 positive variant-disease pairs with 7,343 unique variants and 2,401 unique diseases, and we selected 8,200/600/562 of them as the training, validation, and test sets, respectively. In particular, we took precautions to prevent data leakage by ensuring that the positive VDA pairs appearing in the training set did not appear in the validation and test sets. This approach helped maintain the integrity of the evaluation process. The detailed dataset statistics of VDAL are reported in Table 1. Additionally, we report the top-10 most frequent variants, diseases, and variant-disease pairs in VDAL in Figure 2.
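The leakage precaution described above amounts to requiring that the sets of positive pairs are disjoint across splits; a minimal sketch with hypothetical IDs:

```python
# Verify that no positive variant-disease pair from the training set
# reappears in the validation or test set (toy pairs, hypothetical IDs).
def leakage_free(train_pairs, val_pairs, test_pairs):
    train = set(train_pairs)
    return train.isdisjoint(val_pairs) and train.isdisjoint(test_pairs)

train = [("rs111", "D001"), ("rs222", "D002")]
val = [("rs333", "D003")]
leaked_test = [("rs111", "D001")]     # overlaps with the training set
clean_test = [("rs444", "D004")]
```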

METHODOLOGY

Problem Definition
The objective of VDA extraction is to accurately identify the associations between the variant and disease entities present in the biomedical literature [5,10]. Formally, given a PubMed document $d$ containing a set of biomedical entities $\{e_i\}_{i=1}^{n}$, which includes both variant and disease entities, the aim of VDA extraction is to predict the true relations from the set $\mathcal{R} \cup \{\mathrm{NA}\}$ between head (i.e., variant) and tail (i.e., disease) entity pairs $(e_h, e_t)$, $h, t \in \{1, \dots, n\}$. Here, $\mathcal{R}$ represents a pre-defined set of relation types, and $\mathrm{NA}$ stands for No Relation. The entities $e_h$ and $e_t$ correspond to the variant and disease entities, respectively, and $n$ denotes the total number of entities. Note that an entity $e_i$ may appear multiple times in the document $d$ through its entity mentions $\{m_j^i\}_{j=1}^{N_{e_i}}$, where $N_{e_i}$ signifies the number of entity mentions. A relation is considered to exist between a variant-disease pair $(e_h, e_t)$ if it is expressed by any pair of their mentions.

The Overall Architecture
In this paper, we propose VDANet, a novel and simple yet effective model for VDA extraction, which incorporates the corresponding gene embeddings of the variants into the model to better explore the associations between genetic variants and human diseases. Figure 1 illustrates an overview of the VDANet framework, which comprises three key components: an embedding layer, an encoder, and a relation classifier. It should be noted that VDANet is model-agnostic and can be directly applied to arbitrary models (e.g., BioBERT [8], SciBERT [2], PubMedBERT [6] and ATLOP [21]). Next, we describe each part of VDANet in detail.

Embedding Layer
In the embedding layer of our approach, the PubMed document is represented through a series of embeddings, which serve as the original input representations. Following BERT [4], we first add the token embeddings, position embeddings, and segment embeddings together: the token embeddings (denoted $E_{tok}$) encapsulate the semantic meaning of each word in the document, the position embeddings (denoted $E_{pos}$) convey the positional information of each token, and the segment embeddings (denoted $E_{seg}$) indicate the token types of the input. Additionally, we incorporate the corresponding gene embeddings (denoted $E_{gene}$) of the variants into the embedding layer, which serve as a bridge for exploring the associations between genetic variants and human diseases more effectively. Specifically, if the token at a given position is a variant, the corresponding gene embedding of the variant, which is learned during model training, is added to the input representation; otherwise nothing is added. This allows us to include the corresponding gene information in each variant representation.
In summary, the overall input representations $X$ are obtained by adding together the token embeddings $E_{tok}$, the position embeddings $E_{pos}$, the segment embeddings $E_{seg}$, and the corresponding gene embeddings $E_{gene}$, which is formulated as:

$$X = E_{tok} + E_{pos} + E_{seg} + E_{gene}$$
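As a minimal illustration of this step (pure Python on toy two-dimensional vectors; all names, values, and shapes are assumptions, not the actual implementation), the gene embedding is added only at variant positions:

```python
# Toy sketch of the embedding layer: token + position + segment
# embeddings everywhere, plus a gene embedding at variant positions only.
def embed(tokens, tok_emb, pos_emb, seg_emb, gene_emb, variant_gene):
    """variant_gene maps a token position to its gene ID when that
    token is a variant mention; other positions get no gene embedding."""
    reps = []
    for i, tok in enumerate(tokens):
        vec = [t + p + s for t, p, s
               in zip(tok_emb[tok], pos_emb[i], seg_emb[0])]
        gene = variant_gene.get(i)
        if gene is not None:                 # variant token
            vec = [v + g for v, g in zip(vec, gene_emb[gene])]
        reps.append(vec)
    return reps

# Toy 2-d embeddings (hypothetical values and IDs)
tok_emb = {"the": [0.1, 0.1], "rs111": [0.2, 0.2]}
pos_emb = [[0.0, 0.0], [0.0, 0.0]]
seg_emb = [[0.0, 0.0]]
gene_emb = {"GENE_A": [1.0, 1.0]}
reps = embed(["the", "rs111"], tok_emb, pos_emb, seg_emb, gene_emb,
             {1: "GENE_A"})
```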

Encoder
Upon acquiring the comprehensive input representations from the embedding layer, the entire embedded PubMed document is subsequently modeled using an encoder (e.g., the BioBERT [8] encoder) to derive contextualized entity representations (i.e., variant and disease representations) and variant-disease pair representations. Formally, the PubMed document $d$ with length $l$ can be denoted as $d = [x_i]_{i=1}^{l}$, and the input representations are obtained as $X = [X_i]_{i=1}^{l}$. Additionally, a special token "*" is inserted at the beginning and end positions of each entity mention to identify entities (including variant and disease entities), following previous research [17,21]. The encoder is then employed to acquire the contextualized representations $H$ of document $d$:

$$H = [h_1, \dots, h_l] = \mathrm{Encoder}([X_1, \dots, X_l])$$

The representation of the special token "*" at the starting position of an entity mention is taken as the mention embedding, denoted $h_{m_j^i}$. For each entity $e_i$ with entity mentions $\{m_j^i\}_{j=1}^{N_{e_i}}$, its contextualized entity representation $h_{e_i}$ is computed using the smoother logsumexp pooling [7], which offers a more refined approach than the max pooling operation:

$$h_{e_i} = \log \sum_{j=1}^{N_{e_i}} \exp\left(h_{m_j^i}\right)$$

Subsequently, the final variant-disease pair representation $x_{h,t}$ for each variant and disease entity pair (i.e., $h_{e_h}$ and $h_{e_t}$) is attained through feature combination via group bilinear pooling, following [21], which divides the entity representations into $k$ equal-sized groups (i.e., $h_{e_h} = [z_h^1; \dots; z_h^k]$ and $h_{e_t} = [z_t^1; \dots; z_t^k]$) and applies bilinear pooling within these groups:

$$x_{h,t} = \sum_{i=1}^{k} z_h^{i\top} W_i z_t^i + b$$

where $W_i$ represents learnable parameters for $i = 1, \dots, k$, and $b$ is a bias term.
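The two pooling steps above can be sketched in pure Python on toy vectors (illustrative only, for a single output dimension of the bilinear form; the actual model operates on encoder hidden states):

```python
import math

# Logsumexp pooling over mention embeddings -> entity representation.
def logsumexp_pool(mention_vecs):
    dim = len(mention_vecs[0])
    return [math.log(sum(math.exp(m[d]) for m in mention_vecs))
            for d in range(dim)]

# Group bilinear pooling: split entity vectors into k groups and
# accumulate z_h^T W_i z_t within each group, plus a bias.
def group_bilinear(z_h, z_t, W, b, k):
    g = len(z_h) // k
    score = b
    for i in range(k):
        h_i, t_i, W_i = z_h[i*g:(i+1)*g], z_t[i*g:(i+1)*g], W[i]
        score += sum(h_i[a] * W_i[a][c] * t_i[c]
                     for a in range(g) for c in range(g))
    return score

entity = logsumexp_pool([[0.0, 1.0]])          # single mention -> itself
I2 = [[1.0, 0.0], [0.0, 1.0]]                  # toy identity weights
score = group_bilinear([1.0, 0.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0, 1.0], [I2, I2], 0.0, 2)
```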

Relation Classifier
Finally, a relation classifier employing a feed-forward neural network (FFN) is utilized to predict the relation labels $y$ of all variant-disease pairs $(e_h, e_t)$, based on the variant-disease pair representations, formulated as:

$$P(y \mid e_h, e_t) = \mathrm{softmax}(W x_{h,t} + b)$$

where $W$ denotes a learnable weight matrix, $b$ represents a bias term, and $x_{h,t}$ signifies the final variant-disease pair representation for each respective variant and disease entity. Notably, the standard cross-entropy loss is employed to optimize the entire VDANet framework for BERT-based encoders, such as BioBERT [8], SciBERT [2], and PubMedBERT [6]. The adaptive thresholding loss [21] is utilized for the ATLOP-based [21] encoder, considering its exceptional performance in diminishing decision errors in relation classification.
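A minimal sketch of the classifier head and the standard cross-entropy loss (pure Python, toy sizes; the weights, label set, and values are hypothetical):

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, gold):
    return -math.log(probs[gold])

# Toy pair representation -> logits over {No Relation, Associated}
x_ht = [0.5, -0.2]
W = [[1.0, 0.0], [0.0, 1.0]]             # hypothetical weight matrix
b = [0.0, 0.0]
logits = [sum(w * x for w, x in zip(row, x_ht)) + bi
          for row, bi in zip(W, b)]
probs = softmax(logits)
loss = cross_entropy(probs, gold=1)      # gold label: Associated
```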

EXPERIMENTS

Experimental Data
To assess the performance of state-of-the-art baseline models and the effectiveness of our proposed VDANet model for VDA extraction, we conduct experiments on the constructed VDAL dataset. Note that all models share the same data preprocessing procedure.

Baseline Methods and Evaluation Metrics
We compared VDANet with several state-of-the-art baselines for VDA extraction, including BioBERT [8], SciBERT [2], PubMedBERT [6] and ATLOP [21]. Among them, BioBERT, SciBERT, and PubMedBERT are BERT-based pre-trained language models tailored for biomedical and scientific scenarios using various pre-training corpora, while ATLOP achieves improved contextualized entity representations and training objectives by introducing a localized context pooling strategy and an adaptive thresholding loss. Note that our proposed VDANet is model-agnostic and can be directly applied to arbitrary models for VDA extraction. Furthermore, we employed three widely used classification evaluation metrics to assess VDA extraction performance: Precision, Recall, and F1 score [20].
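Computed over sets of predicted and gold positive pairs, the three metrics can be sketched as follows (toy, hypothetical IDs):

```python
# Precision, Recall, and F1 over sets of predicted and gold
# positive variant-disease pairs.
def precision_recall_f1(pred, gold):
    tp = len(pred & gold)                       # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = precision_recall_f1({("rs111", "D001"), ("rs222", "D002")},
                              {("rs111", "D001"), ("rs333", "D003")})
```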

Experimental Results and Analysis
Table 2 reports the overall performance of VDANet and its baseline counterparts on the constructed VDAL validation and test sets. From Table 2, we can draw the following conclusions. First, among the BERT-based baselines (i.e., BioBERT, SciBERT, and PubMedBERT) fine-tuned for VDA extraction, BioBERT and PubMedBERT achieved competitive results on the VDAL dataset and significantly outperformed SciBERT, which can be attributed to the fact that both BioBERT and PubMedBERT are primarily pre-trained on biomedical text, whereas SciBERT is pre-trained on general scientific texts. Second, ATLOP, equipped with a localized context pooling strategy and an adaptive thresholding loss, outperformed all BERT-based baselines due to its enhanced learning of contextualized entity representations. Third, our proposed VDANet consistently exceeds the performance of all compared models on the VDAL dataset, achieving improvements of 0.61, 0.98, 0.88, and 0.81 in F1 score on the VDAL test set over the respective vanilla baselines. This establishes a new benchmark for VDA extraction and demonstrates the benefit of integrating the corresponding gene embeddings of variants into the model to better investigate the associations between genetic variants and human diseases.

CONCLUSION
In this paper, we presented VDAL, a large-scale, semi-automatically annotated dataset for VDA extraction from the biomedical literature based on the DisGeNet platform. VDAL is one of the largest datasets for VDA extraction, containing 9,362 related PubMed documents in the biomedical domain. In addition, we proposed VDANet, a novel and simple yet effective model that incorporates the corresponding gene embeddings of the variants to better explore the associations between genetic variants and human diseases. Extensive experiments on the constructed VDAL dataset show that VDANet significantly outperforms the compared methods, thus establishing a new benchmark for VDA extraction.

Figure 1 :
Figure 1: The architecture of the VDANet framework. VDANet is mainly composed of three fundamental components: an embedding layer, an encoder, and a relation classifier.

Table 1 :
Dataset statistics (after preprocessing). Note that "#D", "Avg. #E" and "Avg. #R" are short for the total number of documents, and the average numbers of entities and relations per document, respectively.

Table 2 :
Overall results on the constructed VDAL dataset in terms of Precision, Recall and F1 score.