Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement

Mentions of new concepts appear regularly in texts and require automated approaches to harvest and place them into Knowledge Bases (KBs), e.g., ontologies and taxonomies. Existing datasets suffer from three issues: (i) they mostly assume that a new concept is pre-discovered and thus cannot support out-of-KB mention discovery; (ii) they use only the concept label as input along with the KB and thus lack the contexts of a concept label; and (iii) they mostly focus on concept placement w.r.t. a taxonomy of atomic concepts, instead of complex concepts, i.e., concepts with logical operators. To address these issues, we propose a new benchmark, adapting the MedMentions dataset (PubMed abstracts) with SNOMED CT versions from 2014 and 2017 under the Disease sub-category and the broader categories of Clinical finding, Procedure, and Pharmaceutical / biologic product. We illustrate the use of the dataset for evaluating out-of-KB mention discovery and concept placement, adapting recent Large Language Model based methods.


INTRODUCTION
Identifying new concepts in texts, such as the vast number of publications, and placing them into Knowledge Bases (KBs, e.g., ontologies and taxonomies) is a key application of KB construction and AI for scientific discovery [13]. Emerging concepts are particularly common in the biomedical domain, and KBs can easily become outdated. For example, new variants of SARS-CoV-2 have kept emerging since 2020; "Curry-Jones syndrome" was not added to the SNOMED CT ontology [8] until 2017.
Existing datasets on using texts to enrich ontologies are relevant to several tasks, but each of them only reflects a part of the whole picture. In Taxonomy Completion [20,29], a pre-specified out-of-KB (a.k.a. NIL) concept is used to enrich a taxonomy. In Ontology Extension [11], this is extended to description logic based ontologies, which include complex concepts; such concepts can be considered in principle [6,15] but are not yet covered by the existing datasets [11]. In Concept Post-coordination [5,22], an out-of-KB concept is defined with several existing concepts and attributes, i.e., placed under a complex concept. Datasets from all three tasks above assume that the input concept term is already specified and non-contextual (e.g., without contexts in a corpus), which does not reflect the real-world situation. In Out-of-KB Mention and Entity Discovery [7,16,17], out-of-KB mentions and their clusters are discovered from texts, but their placement in KBs has not been fully investigated.
In this study, we propose a new benchmark for new entity discovery and placement, supporting two sequential tasks: (i) Out-of-KB Mention and Entity Discovery: identifying new mentions of concepts from texts which are not included in a KB; (ii) Concept Placement: given a new entity expressed as a mention in the text, placing it into a KB, either an ontology with complex concepts or a taxonomy with only atomic concepts. Our new dataset and task setting differ from previous work in the following characteristics:
• Out-of-KB or NIL discovery: inclusion of out-of-KB mentions from texts to support their concept discovery and placement.
• Contextual terms: inclusion of contexts for mentions, distinct from only using concept labels as in the previous work.
• Complex concepts: placement of concepts under logic-equipped complex concepts, instead of atomic concepts alone.
More specifically, the study uses a SNOMED CT subset as the ontology, and the time difference of two versions (in 2014 and 2017) to synthesise new entities, then uses the MedMentions Entity Linking dataset [25] (from PubMed abstracts to UMLS) to construct in-KB and out-of-KB mentions. The study further introduces the usage of the data with evaluation for out-of-KB mention discovery and concept placement. We provide benchmarking results adapting rule-based and BERT-based Entity Linking [7,32] and prompting with GPT-3.5-turbo. Results show that the dataset clearly differentiates the performance of rule-based and BERT-based methods, and that Pre-trained and Large Language Model (LLM) based methods have yet to achieve satisfactory results.

Table 1: Comparison of relevant datasets and tasks on KB (e.g., ontology and taxonomy) enrichment from texts. NIL Discovery denotes whether the task can support discovering out-of-KB mentions (cf. in-KB mentions). Contextual Term denotes whether the input term has a context window in a text corpus. Concept Placement denotes whether the task finally places (or can be used to place) the term in the KB. Complex Concepts denotes whether the placement position in the KB includes complex concepts. The asterisk (*) denotes that only data construction scripts are available instead of the dataset itself.

RELATED WORK
The representative datasets are summarised in Table 1 based on the four tasks introduced in Section 1. We only list the public and accessible datasets. We next discuss the related work of each task.
Taxonomy Completion and Ontology Extension. Studies in taxonomy completion [20,29,31,33,34] and ontology extension [11] aim to enrich a KB using the concept labels and the concept graph structure. However, the studies usually assume that the new term (or concept label) is pre-discovered, which is not the case in the real-world scenario, where new mentions of concepts need to be discovered from corpora. Also, from the perspective of OWL (Web Ontology Language) [2,12], most of these studies focus only on atomic concepts and do not place the new concept under a complex concept, e.g., with existential restrictions as used in SNOMED CT (e.g., [24] focuses on placement under only atomic concepts in SNOMED CT). Finally, datasets in both tasks use concept terms as input and do not consider contexts.
Concept Post-coordination. The studies aim to place a new concept by describing it with existing concepts and attributes in the ontology [5,22]. Dataset construction steps in both works [5,22] assume that the new concepts or terms are pre-discovered and come without context windows from a corpus.
Out-of-KB Mention and Entity Discovery. The studies aim to discover new mentions from texts w.r.t. a KB [7,30] and group them into entity clusters [17,18,21,27]. The number of datasets in this area has grown recently, constructed through manual labelling, KB pruning, and/or KB versioning [7]. The studies, however, do not place the newly discovered entities into a KB.
In this work, we present dataset construction for Concept Discovery and Placement to support a comprehensive set of characteristics (Table 1), with usage for benchmarking, e.g., with Pre-trained and Large Language Models.

PROBLEM DEFINITION
The task of Concept Discovery and Placement takes as input contextual in-KB and out-of-KB mentions in a corpus, together with a KB (more formally, an OWL ontology [2,12]), and outputs an enriched KB where each out-of-KB mention is inserted into a directed edge, i.e., <parent, child>, of the KB, such that the out-of-KB mention is the child of the parent and the parent of the child. The child is considered NULL when the mention corresponds to a leaf concept. The parent can be a complex concept.
Several key definitions are as follows. Formally, an OWL ontology is a Description Logic KB that contains a set of axioms [2,12]. We focus on the TBox (or the terminology part) of an ontology, which mainly consists of General Concept Inclusion axioms of the form C ⊑ D, where C and D are either atomic or complex concepts [1]. A TBox can be reduced to a taxonomy (or a subsumption hierarchy) after the classification process [1]. Borrowing the definition of taxonomy in [19,29], we use a simple definition of an ontology as a set of concepts and directed edges, where the concepts can be atomic or complex. Directed edges are edges in an ontology (or a taxonomy [31]) which contain a direct parent and a direct child. Complex edges are edges which have a complex concept as the parent. Complex concepts are concepts that involve at least one logical operator, e.g., negation (¬), conjunction (⊓), disjunction (⊔), existential restriction (∃r.C), universal restriction (∀r.C), etc. [1].
An ideal dataset for Concept Discovery and Placement requires a real-world text corpus and a large OWL ontology (reducible to a taxonomy), with gold-standard directed edges (possibly complex) for each out-of-KB concept linked to the mentions in the corpus.
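The definitions above can be illustrated with a small data model. The following is a minimal sketch, not the representation used in the released dataset: the class names (Atomic, Exists, And, Edge) and the example concept and role names are hypothetical, and only two constructors (conjunction and existential restriction) are modelled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Atomic:
    """A named (atomic) concept."""
    name: str


@dataclass(frozen=True)
class Exists:
    """Existential restriction: ∃role.filler."""
    role: str
    filler: "Concept"


@dataclass(frozen=True)
class And:
    """Conjunction of two or more concepts."""
    operands: Tuple["Concept", ...]


Concept = Union[Atomic, Exists, And]


@dataclass(frozen=True)
class Edge:
    """Directed edge <parent, child>; child=None stands for NULL (leaf placement)."""
    parent: Concept
    child: Optional[Concept]


def is_complex(concept: Concept) -> bool:
    """A concept is complex if it involves at least one logical operator."""
    return not isinstance(concept, Atomic)


def is_complex_edge(edge: Edge) -> bool:
    """A complex edge has a complex concept as its parent."""
    return is_complex(edge.parent)


# Illustrative (made-up) parent: Disease ⊓ ∃finding_site.Skull_structure
complex_parent = And((Atomic("Disease"), Exists("finding_site", Atomic("Skull structure"))))
leaf_edge = Edge(parent=complex_parent, child=None)
atomic_edge = Edge(parent=Atomic("Disease"), child=Atomic("Syndrome"))
```

Here leaf_edge is a complex edge with a NULL child, while atomic_edge is an ordinary taxonomy edge between two atomic concepts.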

DATASET CONSTRUCTION
Step 0: KB and Subset Selection. We consider SNOMED CT [8], one of the most important OWL ontologies in the biomedical domain, and choose a subset by selecting categories of concepts. We focus on the second-level category, Disease (disorder), and the first-level categories, Clinical finding, Pharmaceutical / biologic product, and Procedure, abbreviated to CPP after the initials of the categories. CPP categories have the most important types of complex edges for placement (or post-coordination), according to Kate [22].
Step 1: KB Versioning.We follow a KB versioning strategy [7,17] to synthesise out-of-KB entities for the older KB.The concept gap between the two versions of SNOMED CT subsets (ver 20140901 and 20170301) is considered.The numbers of concepts in the older and the newer sub-KB are 64,900 and 72,595, resp., for the Disease sub-category and are 175,895 and 188,988, resp., for CPP categories.
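The versioning step can be pictured as a set difference over concept identifiers. The sketch below is a minimal illustration with made-up identifiers; the function name is hypothetical, and the actual pipeline may apply further filtering (e.g., by category) that is not shown here.

```python
def out_of_kb_concepts(older_ids, newer_ids):
    """Concepts present in the newer KB version but absent from the older one."""
    return set(newer_ids) - set(older_ids)


# Toy identifiers (not real SNOMED CT IDs).
older = {"c1", "c2", "c3"}
newer = {"c1", "c2", "c3", "c4", "c5"}

new_concepts = out_of_kb_concepts(older, newer)
```

With respect to the older version, the concepts in new_concepts are treated as out-of-KB, while all concepts in the older set remain in-KB.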
Step 2: Edge Extraction. We extract the directed edges in the older sub-KB for both in-KB and out-of-KB entities. For in-KB entities, this is achieved by querying all the direct parents and children in the older KB. For out-of-KB entities, this is achieved by querying, from the newer sub-KB, all the most direct parents and children that are in-KB (i.e., in the older KB), given that edges for the out-of-KB entities are not available in the older KB. If the entity is a leaf node, we set the direct child as NULL, as in Zhang et al. [34]. The querying process is based on the ontology processing module in DeepOnto [14].
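As an illustration of the out-of-KB case, the sketch below walks a toy taxonomy of the newer KB upwards and downwards from an out-of-KB entity until it reaches concepts that exist in the older KB. It is a simplified sketch and does not use DeepOnto: the function names and the dictionary-based graph are assumptions, parents and children are paired by a plain cross-product, and redundant ancestors reached via different branches are not pruned.

```python
from collections import deque


def nearest_in_kb(start, neighbours, in_kb):
    """Breadth-first walk from `start`; each branch stops at its first in-KB concept."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt in in_kb:
                found.add(nxt)      # most direct in-KB concept on this branch
            else:
                queue.append(nxt)   # still out-of-KB: keep walking
    return found


def edges_for(entity, parents_of, children_of, in_kb):
    """Pair nearest in-KB parents with nearest in-KB children (None = NULL for leaves)."""
    parents = nearest_in_kb(entity, parents_of, in_kb)
    children = nearest_in_kb(entity, children_of, in_kb) or {None}
    return sorted(((p, c) for p in parents for c in children), key=str)


# Toy newer KB: A -> N1 -> N2 -> B and A -> N3, where only A and B are in the older KB.
parents_of = {"N1": ["A"], "N2": ["N1"], "B": ["N2"], "N3": ["A"]}
children_of = {"A": ["N1", "N3"], "N1": ["N2"], "N2": ["B"]}
in_kb = {"A", "B"}

edges_n2 = edges_for("N2", parents_of, children_of, in_kb)
edges_n3 = edges_for("N3", parents_of, children_of, in_kb)
```

In this toy example, N2 is placed on the older-KB edge between A and B even though its direct parent N1 is itself out-of-KB, and the leaf N3 is placed under A with a NULL child.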
We thus create mention-to-edge datasets with the information for each mention, each rendered in a JSON format. The information includes the left and the right contexts of the mention (ctxt_l and ctxt_r), the mention or concept itself and its SNOMED CT ID, the IDs of the parent and child in the older SNOMED CT version (with expressions for complex concepts), and their labels. We use DeepOnto's verbaliser [14,15] to form labels of the complex concepts.
Regarding data splitting for the benchmark, for out-of-KB mention and concept discovery, the dataset follows the original splits of training, validation, and testing sets from MedMentions; for concept placement, the setting is unsupervised for out-of-KB mentions, i.e., training (and validating) with in-KB mentions but testing on out-of-KB mentions (and in-KB mentions). We provide two formats of the data: mention-level, with edges grouped for each mention; and mention-edge-pair-level, where each mention-edge pair occupies a row and mentions are repeated if there are multiple edges. Statistics of the datasets are in Table 2.
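The relation between the two formats can be sketched as a simple flattening step. The field names below (mention, ctxt_l, ctxt_r, edges, parent, child) and the example values are hypothetical and chosen only for illustration; they are not guaranteed to match the keys in the released JSON files.

```python
def to_mention_edge_pairs(mention_rows):
    """Flatten mention-level rows into one row per mention-edge pair."""
    pairs = []
    for row in mention_rows:
        # Copy everything except the grouped edges, then repeat it per edge.
        base = {key: value for key, value in row.items() if key != "edges"}
        for parent, child in row["edges"]:
            pairs.append({**base, "parent": parent, "child": child})
    return pairs


# One made-up mention with two gold edges; None stands for a NULL child.
mention_level = [
    {
        "mention": "example syndrome",
        "ctxt_l": "patients diagnosed with",
        "ctxt_r": "were followed up",
        "edges": [("parent A", "child B"), ("parent C", None)],
    }
]

pair_level = to_mention_edge_pairs(mention_level)
```

The mention and its contexts are repeated in every resulting row, which matches the description of the mention-edge-pair-level format above.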

Evaluation with the Data
Metrics for Out-of-KB Mention Discovery. The dataset supports the metrics in [7], including overall accuracy for all in-KB and out-of-KB mentions; out-of-KB precision, recall, and F1 score to measure how well out-of-KB mentions are detected; and in-KB precision, recall, and F1 score.

Table 2: Statistics for datasets for Concept Discovery and Placement, for SNOMED CT (ver 20140901, "S14") under different categories: "Disease" and "CPP", i.e., Clinical finding, Procedure, and Pharmaceutical / biologic product. A mention-edge pair denotes a mention (in a corpus) and one of its directed edges in the KB. Mentions are from the MedMentions dataset ("MM"). * The numbers of edges are those having one hop (including leaf nodes to NULL) and two hops from any paths in the ontology.
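A minimal sketch of how such metrics could be computed is given below. It assumes that each mention has one gold label and one predicted label, that out-of-KB mentions carry the label "NIL", and that a prediction counts as correct only when it equals the gold label; these conventions and the function names are our assumptions, and the exact definitions should be checked against [7].

```python
NIL = "NIL"  # assumed label for out-of-KB mentions


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def discovery_metrics(gold, pred):
    """Overall accuracy plus (precision, recall, F1) for out-of-KB and in-KB mentions."""
    assert len(gold) == len(pred)
    correct = [g == p for g, p in zip(gold, pred)]
    accuracy = sum(correct) / len(gold) if gold else 0.0

    def prf(is_target):
        n_pred = sum(1 for p in pred if is_target(p))
        n_gold = sum(1 for g in gold if is_target(g))
        n_hit = sum(1 for ok, g in zip(correct, gold) if ok and is_target(g))
        precision = n_hit / n_pred if n_pred else 0.0
        recall = n_hit / n_gold if n_gold else 0.0
        return precision, recall, f1_score(precision, recall)

    return {
        "accuracy": accuracy,
        "out_of_kb": prf(lambda label: label == NIL),
        "in_kb": prf(lambda label: label != NIL),
    }


# Toy example with made-up concept IDs.
gold = ["C1", NIL, NIL, "C2"]
pred = ["C1", NIL, "C3", NIL]
metrics = discovery_metrics(gold, pred)
```

In the toy example, one of the two gold NIL mentions is detected and one of the two NIL predictions is correct, so out-of-KB precision and recall are both 0.5.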

Metrics for Concept Placement
The dataset supports the metrics used in taxonomy completion, to evaluate the ranking of edges for a given mention [19,29,34]. The metrics mainly include Precision at k (P@k), Recall at k (R@k), F1 score at k (F1@k), Mean Rank (MR), and Mean Reciprocal Rank (MRR). We report P@k and R@k for different top-k values.
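The ranking metrics can be sketched as follows. This is one possible (micro-averaged) reading and not necessarily the exact formulation in [19,29,34]: precision at k divides the number of gold edges ranked within the top k by k times the number of mentions, recall at k divides it by the total number of gold edges, MRR counts a gold edge missing from the candidate list as 0, and MR averages only over the gold edges that were retrieved. The function name is hypothetical.

```python
def placement_metrics(ranked, gold, k):
    """P@k, R@k, F1@k, MR and MRR over ranked candidate edges for a list of mentions."""
    hits = 0
    total_gold = 0
    ranks = []             # ranks of gold edges that appear in the candidate lists
    reciprocal_sum = 0.0   # missing gold edges contribute 0 to MRR
    for candidates, gold_edges in zip(ranked, gold):
        position = {edge: i + 1 for i, edge in enumerate(candidates)}
        total_gold += len(gold_edges)
        for edge in gold_edges:
            rank = position.get(edge)
            if rank is None:
                continue
            ranks.append(rank)
            reciprocal_sum += 1.0 / rank
            if rank <= k:
                hits += 1
    precision = hits / (k * len(ranked)) if ranked else 0.0
    recall = hits / total_gold if total_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    mean_rank = sum(ranks) / len(ranks) if ranks else float("inf")
    mrr = reciprocal_sum / total_gold if total_gold else 0.0
    return {"P@k": precision, "R@k": recall, "F1@k": f1, "MR": mean_rank, "MRR": mrr}


# Two toy mentions with ranked candidate edges (parent, child) and gold edges.
ranked = [
    [("a", "b"), ("c", "d"), ("e", None)],
    [("f", "g"), ("h", None), ("i", "j")],
]
gold = [{("c", "d")}, {("f", "g"), ("i", "j")}]
scores = placement_metrics(ranked, gold, k=2)
```

In the toy example, two of the three gold edges fall within the top 2, so P@2 is 2/4 and R@2 is 2/3.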

Experimenting with the Data
We experiment with the data w.r.t. the two tasks using a rule-based method and recent LLM-based methods.
For Out-of-KB Mention Discovery. Existing methods are supervised, i.e., they require a certain amount of NIL mentions in the training data. Thus, we split the "out-of-KB" mentions in Table 2 based on the NIL mentions in the original MedMentions data split. The numbers of training, validation, and testing NIL mentions are 568, 260, and 172, resp., for CPP (in total 1,000 mentions); and 329, 161, and 115, resp., for the Disease sub-category (in total 605 mentions). For the rule-based method, we use a Sieve-based approach, which uses rules designed for biomedical texts and predicts a mention as out-of-KB if it cannot be linked to any in-KB entity [9]. For the LLM-based method, we follow BLINKout [7] to detect out-of-KB mentions from texts, adapting a two-step BERT-based approach [32]: candidate generation with a bi-encoder and candidate selection with a cross-encoder. Out-of-KB mentions are discovered through NIL entity representation and classification in the cross-encoder [7]. We used default parameters, with the top-k value set to 50 and a domain-specific model, SapBERT [23].
Results on Out-of-KB Mention Discovery. Table 3 shows that BLINKout performs much better than the Sieve-based approach in terms of the overall accuracy and out-of-KB F1 scores. However, it is still challenging to achieve satisfactory performance in identifying out-of-KB mentions (with out-of-KB F1 between 15% and 30%).
For Concept Placement. We use the mention-edge pairs (see Table 2) to train and validate a model to match an in-KB mention to its gold-standard directed edges in a KB and then test on out-of-KB mentions, following the unsupervised setting. The model architecture includes edge candidate generation with an optional step of edge selection. For edge candidate generation, we adapt the bi-encoder [32], with the input of a contextual mention and an edge (i.e., an edge-bi-encoder), to match a contextual mention to a directed edge in an ontology using their concept names. The top-k ranked edges are selected after this step. For the optional edge selection among the top-k, we test the capability of zero-shot prompting of an LLM, GPT-3.5 ("gpt-3.5-turbo"), where k is set to 50. The prompt includes a header, the mention with contexts, and the top-k candidate edges, to query the LLM to select the correct edges.
Results on Concept Placement. Table 4 suggests that concept placement as edge prediction is very challenging. Also, using GPT-3.5 with the prompts to select the top-1 from the top-50 edge candidates does not improve the results, or improves them only marginally. This may suggest a limitation of the state-of-the-art LLM in interacting with formal, domain-specific knowledge using zero-shot prompting.
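The two-stage idea (retrieve the top-k edges, then ask an LLM to select among them) can be sketched as below. This is an illustrative sketch only: the toy vectors stand in for embeddings that a trained bi-encoder would produce, similarity is assumed to be a dot product, the prompt wording is hypothetical and is not the prompt used in the experiments, and no LLM is actually called.

```python
def top_k_edges(mention_vec, edge_vecs, k):
    """Rank candidate edges by dot product with the mention vector; keep the top k."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    scored = sorted(edge_vecs.items(), key=lambda item: dot(mention_vec, item[1]), reverse=True)
    return [edge for edge, _ in scored[:k]]


def build_prompt(ctxt_l, mention, ctxt_r, candidates):
    """Assemble a header, the mention in context, and the numbered candidate edges."""
    lines = [
        "Select the edges (parent, child) between which the marked mention should be placed.",
        f"Text: {ctxt_l} [{mention}] {ctxt_r}",
        "Candidate edges:",
    ]
    for i, (parent, child) in enumerate(candidates, start=1):
        child_label = child if child is not None else "NULL"
        lines.append(f"{i}. parent: {parent} | child: {child_label}")
    return "\n".join(lines)


# Toy 2-dimensional "embeddings" for one mention and three candidate edges.
mention_vec = [1.0, 0.0]
edge_vecs = {
    ("A", "B"): [0.9, 0.1],
    ("A", None): [0.2, 0.8],
    ("C", "D"): [0.5, 0.5],
}

candidates = top_k_edges(mention_vec, edge_vecs, k=2)
prompt = build_prompt("patients with", "example syndrome", "were studied", candidates)
```

The resulting prompt string would then be sent to the LLM, whose answer needs to be parsed back into edge indices; that step depends on the chosen API and is omitted here.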

CONCLUSION AND FUTURE STUDIES
This work introduced a new benchmark for Ontology Enrichment from Texts by Concept Discovery and Placement. The dataset focuses on enriching OWL ontologies as formal KBs, which are reducible to and thus compatible with taxonomies. Compared to the prior art, the dataset supports a more comprehensive set of characteristics, including NIL Discovery, Contextual Term, Concept Placement, and Complex Concepts. We propose a pipeline to construct this resource and release a dataset using the MedMentions corpus (PubMed abstracts) and the UMLS and SNOMED CT ontologies. We illustrate the use of the data by evaluating recent LLM-based methods.
The data construction method can be applied to other KBs in the biomedical domain and KBs in various domains.
The baseline LLM-based methods have yet to achieve satisfactory performance on the benchmark. Further methods are encouraged to address this challenge.

Figure 1: Data construction pipeline: KB and Subset Selection, KB Versioning (to synthesise out-of-KB entities), Edge Extraction, and Mention-Edge Data Creation.

Table 3: Results on out-of-KB mention discovery

Table 4: Results on out-of-KB concept placement