Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking

Discovering entity mentions that are out of a Knowledge Base (KB) from texts plays a critical role in KB maintenance, but has not yet been fully explored. Current methods are mostly limited to simple threshold-based approaches and feature-based classification, and datasets for evaluation are relatively rare. We propose BLINKout, a new BERT-based Entity Linking (EL) method which can identify mentions that do not have corresponding KB entities by matching them to a special NIL entity. To better utilize BERT, we propose new techniques including NIL entity representation and classification, with synonym enhancement. We also apply KB Pruning and Versioning strategies to automatically construct out-of-KB datasets from common in-KB EL datasets. Results on five datasets of clinical notes, biomedical publications, and Wikipedia articles in various domains show the advantages of BLINKout over existing methods in identifying out-of-KB mentions for the medical ontologies UMLS and SNOMED CT and for the general KB WikiData.


INTRODUCTION
Knowledge Bases (KBs) are widely used for representing entities and facts about the world, with reasoning supported. KBs are inherently incomplete. New entities are constantly emerging; for example, from late 2020 to early 2022, a new variant of SARS-CoV-2 emerged every few months [44]. Existing KBs may thus also inevitably miss entities; for example, "Curry-Jones syndrome" [42] was not added to SNOMED CT [9] until 2017. Delays in incorporating these entities into the KB may result in failures in cataloguing, searching, and reasoning with them. Automated discovery of mentions of new entities from common resources such as texts can support the maintenance of KBs.
Updating KBs with entities from texts is highly relevant to Entity Linking (EL), which is to match mentions in texts to entities in a KB [34,40]. Current works on EL, however, often assume that all the target mentions have corresponding entities in the KB and ignore mentions that have no corresponding entities [3,45]. The latter are sometimes called out-of-KB mentions and are matched with a NIL entity. To maintain a KB, it is necessary to discover out-of-KB mentions, which can be further processed into new KB entities. As NIL is not described in the KB, it is hard to obtain its lexical or embedding representation, and consequently, it is hard to predict NIL directly. One idea is to set a threshold on the mention-to-entity matching score: a mention is regarded as NIL if its scores to all the KB entities are below the threshold [6]. Another idea is to classify a mention as NIL or in-KB based on a set of features regarding the mention and its top-k entity candidates [46]. There are also some other studies attempting to represent the NIL entity with key phrases or features based on external corpora [21] or manual effort [22].
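As a minimal sketch of the threshold-based idea (the function name and threshold value are our own illustration, not from [6]):

```python
import numpy as np

def link_or_nil(entity_scores: np.ndarray, threshold: float = 0.5):
    """Return the index of the best-matching KB entity, or None (i.e., NIL)
    if no entity's matching score reaches the threshold."""
    best = int(np.argmax(entity_scores))
    if entity_scores[best] < threshold:
        return None  # out-of-KB mention: no KB entity is similar enough
    return best
```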
Recently, neural EL has shown good performance by applying pre-trained language models (LMs), e.g., BERT [8], to represent texts and entities [45]. In contrast, such LM-based methods have rarely been applied to out-of-KB mentions, as reviewed in [38,39]. Traditional, threshold-based approaches in combination with deep learning are still the state of the art [7,25]. There is a lack of studies that seamlessly integrate out-of-KB mention discovery with pre-trained LM-based approaches like BLINK [45]. Also, entities are likely to have various surface forms or synonyms [7,11]. This entity variety suggests using synonyms to enhance EL, which may also help differentiate between in-KB and out-of-KB entities.
Besides the shortage of methodology research, benchmarking datasets considering out-of-KB mention discovery are also relatively rare compared to in-KB EL. Most EL datasets assume that the KB is complete and do not include out-of-KB entities, e.g., the MedMentions dataset [32] that links mentions to concepts in UMLS [5]. The most recent large-scale dataset is NILK [23], which synthesizes NIL entities from the entity gap between the newer version of WikiData (2021) and the older version (2017) to enrich the older KB. In the biomedical domain, the main out-of-KB EL dataset is ShARe/CLEF 2013 [41], with NIL mentions manually identified for training and evaluation. The only two other datasets, from other domains, also rely on manual annotations, e.g., historical newspapers [12] and news in microposts [37], which require substantial effort from domain experts. Also, all previous studies focus on only a single strategy (either manual effort or KB versioning) to construct EL datasets with NIL labels.
In this work, (i) we define the out-of-KB mention discovery problem and propose a method named BLINKout, based on the BERT-based EL method BLINK [45], where new techniques including NIL entity representation & classification and synonym enhancement are developed and applied; (ii) we summarize and apply strategies to automatically construct out-of-KB mention discovery benchmarks from an in-KB EL dataset by pruning or using older versions of the linked ontology, besides manual labelling. Five out-of-KB mention discovery datasets are selected and constructed using three data strategies (i.e., manual labelling, KB pruning, KB versioning), with texts of clinical notes, biomedical publications, and Wikipedia articles, two medical ontologies (UMLS and SNOMED CT), and one general-purpose KB (WikiData) covering various domains.
Experimental results on the datasets show the advantage of the BLINKout approach for out-of-KB mention discovery, in comparison with rule-based, threshold-based, and feature-based baselines and through ablation studies, with up to nearly 40% improvement of the out-of-KB $F_1$ score over the second-best baseline.

PRELIMINARIES

2.1 Problem Definition
Given a KB containing a set of entities $\mathcal{E}$ and a text corpus in which a set of mentions $\mathcal{M}$ are identified in advance, Entity Linking (EL, a.k.a. Entity Disambiguation or Entity Normalisation) is to map each mention $m$ in $\mathcal{M}$ to its corresponding entity $e$ in $\mathcal{E}$ [40]. Due to the limited knowledge coverage of the KB, it is possible that there is no entity in $\mathcal{E}$ that can be matched with a given mention. Thus we have the following more general problem of out-of-KB mention discovery, formulated as an extended EL task that focuses on identifying out-of-KB mentions along with linking in-KB mentions.
Given a text corpus with a set of identified mentions (each in a context window in a document) $\mathcal{M}$ and a KB with entities $\mathcal{E}$, out-of-KB mention discovery is to develop a function $f$ such that each mention $m$ in $\mathcal{M}$ is mapped to an item $f(m) \in \mathcal{E} \cup \{\mathrm{NIL}\}$, where NIL is a special entity denoting that there is no entity in $\mathcal{E}$ that can be matched with $m$. In this study, we focus on both ontologies (e.g., SNOMED CT) and general KBs (e.g., WikiData). Ontologies are a common type of KB, often defined as a shared, explicit specification of a conceptualisation of a domain [15], and we consider an ontology's classes (or concepts) as its entities for the linking. Note that each entity may have definitions and synonyms that can be utilized for out-of-KB mention discovery.

2.2 BERT-based Entity Linking
We summarize BERT-based Entity Linking methods (e.g., BLINK [45]) below. They usually have two stages: candidate creation and candidate ranking [25,34,45]. In candidate creation, the approaches aim to narrow down the vast number of entities to a manageable subset (e.g., tens or hundreds); in candidate ranking, the approaches aim to rank the candidate entities of each mention according to the probability that they match the given mention.
Candidate Creation with Bi-encoder. The bi-encoder fine-tunes two BERT models $T_m$ and $T_e$ to embed mentions and entities, resp., into a dense embedding space, so that given a new mention, the nearest candidates can be easily retrieved. For a mention $m$ and an entity $e$, their embeddings can be computed as

$$\mathbf{y}_m = \mathrm{red}(T_m(\tau_m)), \qquad \mathbf{y}_e = \mathrm{red}(T_e(\tau_e)),$$

where $\tau_m$ and $\tau_e$ are the input token sequences of the mention and the entity, and $\mathrm{red}(\cdot)$ denotes a function for extracting the vector representation of a textual sequence from a BERT model using its last-layer representation of the [CLS] token.
Mentions and entities can be fed into the BERT models in different ways. In the classic work [45], a mention is fed into $T_m$ as

[CLS] ctxt_l [Ms] mention [Me] ctxt_r [SEP],

where ctxt_l and ctxt_r are the left and right contexts of the mention in the document, resp.; and an entity (with name and definition) is fed into $T_e$ as

[CLS] name [ENT] definition [SEP],

where [Ms], [Me], and [ENT] are special tokens for separation.
With these embeddings, the score which measures the similarity between a mention and an entity can be calculated, e.g., by their dot product, defined as $s(m, e) = \mathbf{y}_m \cdot \mathbf{y}_e$. The scores are further used to generate the top-$k$ candidates for each mention.
The bi-encoder can be trained with a loss function that makes each mention close to its matched entity in the embedding space, but far away from the other entities within the same batch. This can be realised with the max-margin triplet loss [3,35] described below, where $\lambda$ is a margin of small value (e.g., 0.2), $[x]_+$ denotes $\max(x, 0)$, and $e_i$ is the gold entity of mention $m_i$ (the $i$-th) in a batch of size $B$:

$$\mathcal{L}_{\mathrm{bi}} = \sum_{j=1, j \neq i}^{B} \left[\lambda - s(m_i, e_i) + s(m_i, e_j)\right]_+ .$$
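A minimal PyTorch-style sketch of this in-batch triplet loss (tensor shapes, names, and the batching scheme are our assumptions):

```python
import torch

def max_margin_triplet_loss(mention_emb: torch.Tensor,
                            entity_emb: torch.Tensor,
                            margin: float = 0.2) -> torch.Tensor:
    """mention_emb, entity_emb: (B, d) tensors where row i of each forms a
    gold (mention, entity) pair; every other entity in the batch is a negative."""
    scores = mention_emb @ entity_emb.T            # (B, B) dot-product scores
    gold = scores.diagonal().unsqueeze(1)          # s(m_i, e_i), shape (B, 1)
    hinge = (margin - gold + scores).clamp(min=0)  # [margin - gold + negative]_+
    off_diag = ~torch.eye(scores.size(0), dtype=torch.bool,
                          device=scores.device)    # exclude the gold pair itself
    return (hinge * off_diag).sum(dim=1).mean()
```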
Candidate Ranking with Cross-encoder. Given that the bi-encoder can only coarsely identify the top-$k$ candidates but may not correctly rank them among a large number of KB entities [45], we use the cross-encoder for candidate ranking. The cross-encoder is fed with information of both a mention and each of its top-$k$ entities from the bi-encoder, and performs a multi-class classification, i.e., an entity that is more likely to match the mention is predicted with a higher score.
Each mention $m$ is concatenated with each of its top-$k$ entities following the same input format as in the bi-encoder. The concatenation of the input for the mention and the entity $e_j$ (without the [SEP] and [CLS] tokens in between) is denoted as $\tau_{m,e_j}$, i.e., [CLS] ctxt_l [Ms] mention [Me] ctxt_r name [ENT] definition [SEP].

Figure 1: BLINKout architecture for out-of-KB mention discovery, adapting BERT-based Entity Linking [45]: the bi-encoder separately encodes the mention $m$ in a context and the entities (each synonym as an entity) into a dense embedding space; the cross-encoder classifies the most relevant entity candidate (with synonyms concatenated), with NIL Entity Representation & Classification that appends a [NIL] special token to replace the last candidate (if NIL was not predicted by the bi-encoder), jointly learned with $p_{\mathrm{NIL}}$ to classify whether the mention is out-of-KB.
The input $\tau_{m,e_j}$ is fed into the cross-encoder, which is composed of a BERT model and a linear layer, to output a score $s_{m,j} = \mathbf{w}^{\top} \mathrm{red}(T_{\mathrm{cross}}(\tau_{m,e_j}))$. The vector $\mathbf{w}$ constitutes the parameters to be learned in the model.
The cross-encoder is learned with the following cross-entropy loss with softmax activation, where $C_m$ denotes the indices of the top-$k$ candidate entities of $m$, and $g$ is the index of the gold entity:

$$\mathcal{L}_{\mathrm{cross}} = -s_{m,g} + \log \sum_{j \in C_m} \exp(s_{m,j}).$$
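A sketch of the cross-encoder scoring and this loss (module and variable names are ours; the sketch assumes one mention whose $k$ candidate concatenations are already tokenized):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel

class CrossEncoder(torch.nn.Module):
    """BERT plus a linear layer, scoring each mention-candidate concatenation."""
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.w = torch.nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # input_ids: (k, L), one row per candidate concatenated with the mention
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] representation per pair
        return self.w(cls).squeeze(-1)      # (k,) candidate scores

def ranking_loss(scores: torch.Tensor, gold_index: int) -> torch.Tensor:
    """Softmax cross-entropy over the k candidates of one mention."""
    return F.cross_entropy(scores.unsqueeze(0),
                           torch.tensor([gold_index], device=scores.device))
```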

BLINKOUT
The BERT-based EL [45] described previously in Section 2.2 does not consider the NIL entity. We present NIL entity representation & classification, an approach for out-of-KB mention discovery that adapts the cross-encoder, applicable either with or without training, and discuss the techniques to represent the NIL entity with LMs. Then, we propose methods to enhance both the bi-encoder and the cross-encoder with synonyms or variants, which are prevalent for entities.

3.1 NIL Entity Classification
For the classification approach, we ensure that the NIL entity is within the top-$k$ candidates by replacing the last candidate with NIL. This gives the cross-encoder a chance to classify NIL as the top entity for the mention, as inspired by [34]. The textual representation of NIL, encoded with BERT-like LMs, will affect the performance of classification; we discuss different NIL entity representations in Section 3.2.
Joint Training for NIL Classification. We add a joint loss to learn to verify whether a mention is NIL or in-KB, conditioned on the mention representation, similar to [16]. The sigmoid function $\sigma$ with a linear layer is used to form a probability score, $p_{\mathrm{NIL}} = \sigma(\mathbf{w}_{\mathrm{NIL}}^{\top} \mathbf{y}_m)$, given the mention representation $\mathbf{y}_m$. The vector $\mathbf{w}_{\mathrm{NIL}}$ constitutes the model parameters, learned with the binary cross-entropy loss below, where $y$ is 1 if the mention is NIL and 0 otherwise:

$$\mathcal{L}_{\mathrm{NIL}} = -\left( y \log p_{\mathrm{NIL}} + (1 - y) \log (1 - p_{\mathrm{NIL}}) \right).$$
The overall loss with joint training for the cross-encoder is

$$\mathcal{L} = \mathcal{L}_{\mathrm{cross}} + \lambda_{\mathrm{NIL}} \mathcal{L}_{\mathrm{NIL}}.$$

The best value of $\lambda_{\mathrm{NIL}}$ varied from 0.01 to 0.25 across the datasets, based on parameter tuning with the validation set.
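A sketch of this joint objective under our notation (variable names are assumptions; `is_nil` is a 0/1 target for the mention):

```python
import torch
import torch.nn.functional as F

def joint_loss(candidate_scores: torch.Tensor,  # (k,) cross-encoder scores
               gold_index: int,                 # index of the gold candidate
               mention_repr: torch.Tensor,      # (d,) mention [CLS] vector
               w_nil: torch.Tensor,             # (d,) linear layer for NIL
               is_nil: torch.Tensor,            # scalar 0./1. label
               lam: float = 0.25) -> torch.Tensor:
    """L = L_cross + lambda_NIL * L_NIL; lam is tuned per dataset (0.01-0.25)."""
    l_cross = F.cross_entropy(candidate_scores.unsqueeze(0),
                              torch.tensor([gold_index]))
    p_nil = torch.sigmoid(mention_repr @ w_nil)    # NIL probability p_NIL
    l_nil = F.binary_cross_entropy(p_nil, is_nil)  # binary cross-entropy term
    return l_cross + lam * l_nil
```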

3.2 NIL Entity Representation
The NIL entity does not have a native textual representation in the KB. A better representation of NIL will help the LM to discriminate NIL from in-KB entities. This representation can be either static and fixed (unsupervised) or dynamic and fine-tuned (supervised with NIL mentions) in the LM.
We represent NIL as a special token [NIL] by taking advantage of the tokenizer in a BERT-like LM. This assigns special semantics to the NIL entity so that it is not confused with the names of other entities in the KB. Also, the continuous representation of [NIL] can be further fine-tuned with the LM. A more naive representation is "NIL", with the definition "It is a NIL option.", used in a previous study [16]; we also replace NIL with [NIL] in the definition. Similar to in-KB entities, we add the [ENT] special token between the name and the definition. A list of NIL entity representations and their results are presented in Table 3. Our final approach uses the dynamic, fine-tuned [NIL] representation that leverages the labelled out-of-KB mentions in the training data.
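For example, with the HuggingFace transformers tokenizer (a sketch; the base model and the exact set of added tokens are our assumptions):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Register [NIL] (and the other separators used in entity inputs) as special
# tokens so the wordpiece tokenizer never splits them.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[NIL]", "[ENT]", "[SYN]"]})
model.resize_token_embeddings(len(tokenizer))  # new rows are fine-tunable

# The NIL entity input: "[NIL]" as the name, with [NIL] also in the definition.
nil_input = "[NIL] [ENT] It is a [NIL] option."
print(tokenizer.tokenize(nil_input))  # [NIL] and [ENT] stay single tokens
```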

3.3 Synonym Enhancement
The original BLINK [45] focuses on Wikipedia texts and does not consider synonyms. Synonyms are prevalent for real-world entities (see data statistics in Table 1); e.g., entity C0428977 in the UMLS has the name "Bradycardia" and synonyms such as "Slow heart beat" and "Heart rate slow". When in-KB entities are better represented with synonyms, out-of-KB mentions can be more precisely identified by the LM. We thus enhance the bi-encoder and the cross-encoder with synonyms, with two different approaches, resp.: (i) augmentation of each synonym as an entity in the bi-encoder, for its thorough training, and (ii) concatenation with the [SYN] special token in the cross-encoder, for its efficient training.
Synonym Augmentation in Bi-encoder. We use synonyms for data augmentation to enhance the training. Each synonym is treated as a separate entity to be matched to the mention. This can significantly augment the size of the training data. After the scoring, we aggregate the entities and synonyms into top-$k$ unique entity candidates by setting the score of an entity as its highest score among all its variations.
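A sketch of this aggregation step (the data structures are our own illustration):

```python
from collections import defaultdict

def top_k_entities(scored_variants, k: int = 10):
    """scored_variants: iterable of (entity_id, score) pairs, where every
    synonym of an entity was scored as if it were a separate entity.
    An entity's final score is the max over all its name variants."""
    best = defaultdict(lambda: float("-inf"))
    for entity_id, score in scored_variants:
        best[entity_id] = max(best[entity_id], score)
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]  # top-k unique entity candidates
```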
Synonym Concatenation in Cross-encoder. It is inefficient and infeasible to use synonym augmentation in the cross-encoder, as the number of classes can be significantly increased (e.g., by around 3-4 times for the UMLS and SNOMED CT subsets) and becomes unstable or non-fixed. Thus, we model each entity candidate with the concatenation of its synonyms, separated by the [SYN] special token, i.e., [CLS] name [ENT] synonym_1 [SYN] synonym_2 ... synonym_n [SYN] definition. This keeps the number of entities to classify at $k$ instead of treating each synonym as an entity, and thus significantly reduces the computation while still making full use of the synonyms.
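A sketch of this candidate serialisation (the helper name and example definition are ours; [CLS]/[SEP] are added later by the tokenizer):

```python
def serialise_candidate(name: str, synonyms: list[str], definition: str) -> str:
    """Build: name [ENT] synonym_1 [SYN] ... synonym_n [SYN] definition."""
    syn_part = "".join(f"{s} [SYN] " for s in synonyms)
    return f"{name} [ENT] {syn_part}{definition}"

# e.g., serialise_candidate("Bradycardia",
#                           ["Slow heart beat", "Heart rate slow"],
#                           "Slowness of the heart rate.")  # illustrative only
```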
Finally, we use BLINKout to refer to the approach that integrates BERT-based EL with synonym enhancement and fine-tuned NIL entity representation for classification.

DATASET CONSTRUCTION
There are different strategies to construct NIL-enhanced (or NIL-labelled) EL datasets, which contain out-of-KB mentions labelled with NIL (i.e., each mention is linked to either an entity in $\mathcal{E}$ or NIL). One straightforward way is Manual Labelling. We adopt one main manually NIL-labelled dataset in the medical domain, ShARe/CLEF 2013, which consists of clinical notes (discharge summaries and electrocardiogram, echocardiogram, and radiology reports) in version 2.5 of the MIMIC-II dataset [41]. The target ontology is SNOMED CT (represented with the mapped UMLS CUIs), refined with the Disorder semantic group of 10 semantic types in the UMLS, as defined in the annotation guideline [13]. We use UMLS 2012AB following Ji et al. [25]. Around 30% of the mentions are out-of-KB. Manual labelling costs much labour. We thus apply two automatic strategies, KB Pruning and KB Versioning, to synthesize out-of-KB mentions within normal EL datasets.
KB Pruning. We randomly sample a portion (e.g., 10% or 20%) of the entities in the target KB and remove them from the KB. To preserve the hierarchies (formed by subsumption relations, a.k.a. "subclass of" or "isA" relations) in an ontology, we link the parents and children of each removed entity as in He et al. [18,19], and this forms a new ontology. Mentions that are originally linked to the removed entities are labelled with NIL. A sketch of this procedure is given below.
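A simplified sketch of the pruning step (our own illustration; for brevity it bridges only one level, while chains of adjacent removed entities would need transitive bridging):

```python
import random

def prune_kb(entities: set, parents: dict, children: dict,
             ratio: float = 0.1, seed: int = 0) -> set:
    """Remove a random sample of entities and reconnect each removed entity's
    parents to its children to preserve the subsumption hierarchy.
    parents/children map every entity id to a set of entity ids."""
    rng = random.Random(seed)
    removed = set(rng.sample(sorted(entities), int(len(entities) * ratio)))
    for e in removed:
        for p in parents[e] - removed:
            children[p].discard(e)
            children[p] |= children[e] - removed  # bridge parent -> grandchildren
        for c in children[e] - removed:
            parents[c].discard(e)
            parents[c] |= parents[e] - removed
    entities -= removed
    return removed  # mentions linked to these entities are relabelled as NIL
```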
KB Versioning. We also consider replacing the current version of the KB with an older version, so that mentions linked to entities which are in the newer version but not in the older version become out-of-KB. Using the MedMentions dataset [32] as an example (see Figure 2), UMLS 2017AA was used to exhaustively annotate entities in publication titles and abstracts in 2016; we chose its older version, UMLS 2014AB (released in Nov 2014). We narrowed the UMLS semantic type to T047, Disease or Syndrome, and selected the source as "SNOMEDCT_US" (similar to ShARe/CLEF 2013). We gathered the entities that are in UMLS 2017AA but not in the earlier version. Mentions labelled with these new entities become out-of-KB.

We also adapted a recent general-domain dataset constructed using the KB versioning approach. The recent NILK dataset [23] is created on the general, multi-domain KB, WikiData [43]. The WikiData_2021 version is used to supplement entities in WikiData_2017. In-text hyperlinks (each linking to another Wikipedia article) in the Wikipedia 2017 dump are used as the source of mentions; the hyperlinked Wikipedia articles are mapped to WikiData IDs to generate in-KB (in WikiData_2017) and out-of-KB (not in WikiData_2017 but in WikiData_2021) entities for the mentions [23]. We benchmark with a proportional random sample of 0.1% of the mentions for each of the original data splits, given that the whole NILK dataset is huge (over 107M mentions, each to be linked to one of 14.6M English-described entities in the full WikiData dump). We also select entities in the WikiData KB which have at least one mention linked after the random sampling of the mentions. This results in data sample and entity sizes comparable to the other four datasets and keeps the same proportion of NIL mentions as the original data: around 1.5%, much lower than the other datasets.

The advantage of KB Pruning is that it allows us to simulate the creation of out-of-KB entities and control their percentage, while KB Versioning creates datasets that are closer to real-life situations where new concepts emerge over time.
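The relabelling step of KB Versioning described above reduces to a membership test against the older KB version (a sketch with hypothetical data structures):

```python
def versioned_nil_labels(mention_links: dict, old_kb_entities: set) -> dict:
    """mention_links: mention id -> gold entity id in the newer KB version.
    Mentions whose entity is absent from the older version become NIL."""
    return {m: (e if e in old_kb_entities else "NIL")
            for m, e in mention_links.items()}
```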
Statistics of the datasets are shown in Table 1, including the counts of mentions and documents, and the percentages of out-of-KB mentions, zero-shot mentions, and entities with synonyms. The KB Versioning approach resulted in a much lower percentage of out-of-KB mentions. Also, NILK-sample is a zero-shot EL dataset (from its original construction in [23]), i.e., there are no overlapping entities (except NIL) among the training, validation, and testing sets, while in the other datasets most of the entities are seen in training. We also note that only about a quarter (25.5%) of the sampled WikiData entities are associated with at least one synonym, compared to nearly all (>99.0%) entities in the subsets of UMLS and SNOMED CT having synonym(s). The average number of synonyms per entity (having at least one synonym) in the NILK-sample dataset is also lower, compared to the other datasets.

EVALUATION

5.1 Comparison Methods
The main baseline methods for out-of-KB mention discovery are: (i) Rule-based approach (or Sieve-based): using carefully crafted rules (each as a "sieve") designed for biomedical texts, where NIL is detected when no in-KB entity can be linked [11]. (ii) Threshold-based approach: setting a threshold on the EL system's prediction score for each candidate; if the score for the closest in-KB candidate is still below the threshold, the mention is out-of-KB [7,24]. This is applied with BM25 for candidate generation and a BERT-based cross-encoder for candidate ranking with synonym enhancement ("Th-based BM25+BERT+syn") [25], where we tuned the threshold (as 0.85) based on the validation set, together with a domain-specific setting using SapBERT [27] ("Th-based BM25+SapBERT+syn"). (iii) Feature-based NIL entity classification ("Ft-based Classifier"): including string-matching features [10,31,34], an entity contextual feature space, and an embedding-based feature space [46]; we also provide a feature-based baseline that uses the entity candidates and dynamic features. For model comparison, the default number of top-$k$ was set as 10 for the baselines using BM25, BLINK, and BLINKout, following Ji et al. [25]. In the "$k$ and NIL rep tuned" setting, both $k$ and the NIL entity representation were tuned based on the validation set to optimise $F_{1,out}$ (see Section 5.4 and Appendix B for details of parameter tuning, with the optimal settings in Appendix A).
Adapting BLINK without Training. We also applied both the threshold-based and the NIL entity representation approach to a BLINK model trained on all in-KB entities ("Th-based BLINK" and "NIL-rep-based BLINK"), either with or without synonyms ("+syn").

5.2 Evaluation Metrics
In the conventional EL setting, which does not discriminate between in-KB and out-of-KB mentions, the overall (or in-KB + out-of-KB) accuracy ($A$), precision ($P$), recall ($R$), and $F_1$ scores are the same, i.e., $A = P = R = F_1 = \frac{|M_c|}{|M|}$, where $|M_c|$ and $|M|$ denote the number of correctly linked mentions and of all the target mentions, resp. [38,40].
In this work, we propose to use out-of-KB precision ($P_{out}$), recall ($R_{out}$), and $F_1$ ($F_{1,out}$) scores to measure how well out-of-KB mentions are detected. Analogously, we can calculate the in-KB precision ($P_{in}$), recall ($R_{in}$), and $F_{1,in}$ scores:
$$P_{out} = \frac{TP_{out}}{TP_{out} + FP_{out}}, \qquad R_{out} = \frac{TP_{out}}{TP_{out} + FN_{out}},$$

where $TP_{out}$, $FP_{out}$, and $FN_{out}$ (resp., $TP_{in}$, $FP_{in}$, and $FN_{in}$) are the numbers of true positive, false positive, and false negative out-of-KB (resp., in-KB) mentions. $TP_{in}$ refers to the number of mentions that are predicted with the correct entities in the KB, instead of mentions that are simply predicted as in-KB. $F_1$ is the harmonic mean of $P$ and $R$.
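A sketch computing these out-of-KB metrics from aligned gold and predicted labels (names are ours):

```python
def out_of_kb_metrics(gold: list, pred: list, NIL: str = "NIL"):
    """Out-of-KB precision, recall, and F1; gold/pred are aligned lists of
    entity ids in which NIL marks an out-of-KB mention."""
    tp = sum(g == NIL and p == NIL for g, p in zip(gold, pred))
    fp = sum(g != NIL and p == NIL for g, p in zip(gold, pred))
    fn = sum(g == NIL and p != NIL for g, p in zip(gold, pred))
    p_out = tp / (tp + fp) if tp + fp else 0.0
    r_out = tp / (tp + fn) if tp + fn else 0.0
    f1_out = 2 * p_out * r_out / (p_out + r_out) if p_out + r_out else 0.0
    return p_out, r_out, f1_out
```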

5.3 Results
Out-of-KB Metrics. Table 2 shows the main results for out-of-KB mention discovery. BLINKout (with SapBERT, and with BERT for NILK-sample) achieved the top out-of-KB $F_1$ ($F_{1,out}$) on all datasets.
The rule-based method (Sieve-based) [11] was effective on the biomedical datasets, given the regular syntactic structure of the biomedical texts and concepts, but less effective on the general domain (NILK-sample). The rule-based approach resulted in high out-of-KB recall scores, but still lacked out-of-KB precision and $F_{1,out}$ compared to BLINKout; this is because the former tends to produce a large number of out-of-KB mentions using the "NIL as no candidate" rule.
Feature-based methods, facing the challenge of imbalanced binary classification, also tend to predict a larger number of NILs and result in a high out-of-KB recall, but low precision and $F_1$ scores.
Threshold-based approaches provide higher out-of-KB precision than feature-based ones, but lower metrics than the rule-based approach. The setting using synonyms with the bi-encoder ("BLINK+syn") instead of BM25 ("BM25+BERT+syn") greatly improved the in-KB results and thus the threshold-based out-of-KB results. NIL-rep-based approaches with fixed NIL entity representations performed better than or on par with threshold-based approaches. The special token [NIL], even when fixed (unsupervised), can discriminate NIL from in-KB entities using the weights in the pre-trained LM.
BLINKout outperformed all the baselines on all the datasets. A large margin was achieved compared to the rule-based approach (the second best for the biomedical datasets, i.e., all except NILK-sample): for MM-2014AB and NILK-sample, by about a 30-40% improvement of $F_{1,out}$.
For NILK-sample, the improvement with synonyms was less obvious. This is due to the zero-shot nature of the dataset and the inadequacy of its synonyms: first, NILK is a zero-shot entity linking dataset, while synonyms most benefit seen entities rather than unseen entities; second, the percentage of entities having a synonym is much lower in NILK-sample (25%) than in the other datasets (99%), and the average number of synonyms per entity is lower.
Overall Accuracy. For the overall accuracy on both in-KB and out-of-KB mentions ($A$), the proposed models perform the best or competitively on all datasets. This shows that the approaches to discover out-of-KB mentions do not compromise the performance on in-KB entities. In-KB EL results are in Table 5 in Appendix C.

5.4 Sensitive Settings
The results of BLINKout are sensitive to the settings below.
The Number of Top-$k$ Candidates. Out-of-KB metric scores generally have a positive correlation with $k$ when it is below a limit. A higher top-$k$ may make it harder to train a model to discriminate out-of-KB mentions from the remaining classes using the cross-encoder; this especially decreases the performance when there are limited out-of-KB samples for training (e.g., in MM-2014AB). A lower top-$k$ is likely to decrease the in-KB EL results. Details are in Appendix B.
NIL Entity Representation. The NIL representation sets a prior anchor point in the embedding space for the bi-encoder and is used in the cross-encoder for candidate ranking. Results on ShARe/CLEF 2013 and NILK-sample are displayed in Table 3: the [NIL] representation performed the best in most settings, either fixed or fine-tuned.
Domain-Specific LMs. In-domain and knowledge-enhanced LMs, SapBERT (and also PubMedBERT [17]), obtained better in- and out-of-KB results for datasets in the biomedical domain.

QUALITATIVE ANALYSIS
We selected samples from the test sets of the datasets regarding erroneous predictions for out-of-KB mentions, displayed in Table 4. The NILK-sample dataset has many text sequences rendered from tables (see rows 5-6), which provide a different context compared to the other datasets, and fine-grained entities such as names of cities in other languages, asteroids, and animated series (see rows 5-7).
BLINKout models can correctly identify NIL mentions where false but similar in-KB entities were predicted by the Th- and/or NIL-rep-based methods (see rows 3, 5-6). BLINKout with joint learning affects the prediction score for NIL, which can result in both true and false positive NIL predictions (cf. rows 1 and 2). The NIL-rep-based approach can sensitively identify NIL mentions with a high rank (if not predicted as the top, in rows 1, 3-7), without training with NIL-labelled mentions. The threshold-based approach is likely to produce false negative predictions for out-of-KB mentions when an incorrect but similar entity is predicted with a very high score near 100% (as in rows 1, 4, 6-7). Finally, there are challenging out-of-KB entities which are very similar to in-KB entities in the KB-pruning-based dataset, MM-pruned-0.1, and fine-grained entities in NILK-sample, such that all models yielded a wrong prediction, but with a high rank of NIL as the 2nd or 3rd (in rows 4, 7).

DISCUSSION
Results w.r.t. Dataset Construction Strategies. The out-of-KB performance of BLINKout was the highest on the Manual Labelling data, ShARe/CLEF 2013, and lowest on the KB Versioning data, NILK-sample. This is partially related to the percentage of out-of-KB mentions: the lower the percentage in the training data, the more challenging their detection. On the more realistic KB Versioning dataset, BLINKout outperforms the rule-based approach by a larger gap compared to the datasets created with the other strategies.
NIL Entity Representation. The results show that NIL entity representation based methods perform better than or on par with the threshold-based method, and also rank NIL highly among in-KB entities for out-of-KB mentions. BLINKout with the fine-tuned [NIL] representation yielded a large margin of improvement. It is safer to use non-natural-language tokens, i.e., [NIL], rather than "NIL", to avoid confusion with other vocabularies.

Joint Learning. Using $\mathcal{L}_{\mathrm{NIL}}$ allows penalising the situation where a NIL mention is wrongly classified as an in-KB entity, or vice versa. This may increase the out-of-KB recall $R_{out}$, but harm the precision $P_{out}$ and $F_{1,out}$ scores (except for NILK-sample), as shown in Table 2. Joint learning obtained the best overall accuracy on ShARe/CLEF 2013 and the best out-of-KB $F_1$ for NILK-sample. More effective use of $\mathcal{L}_{\mathrm{NIL}}$ warrants further studies.
Synonym Enhancement. Synonyms help address the entity variant problem and further differentiate in-KB entities from NIL after fine-tuning. Our synonym enhancement on the bi-encoder and cross-encoder greatly improved the out-of-KB $F_1$ scores over BLINK across most datasets. In the zero-shot EL scenario (e.g., the NILK-sample data and the zero-shot test sets of other data in Appendix D), synonyms may be less useful given that synonyms of unseen entities are not presented during training. Making use of synonyms in the zero-shot setting warrants further studies.

RELATED WORK
Entity Linking (EL) for KB Construction. EL has been extensively studied; see reviews in Sevgili et al. [38] and Shen et al. [39,40]. The work in Rao et al. [34] focuses on a KB-centric view of EL, i.e., EL as a key step in KB construction and maintenance. Most relevant to KB maintenance is the identification of out-of-KB or NIL mentions so that they can be placed into the KB [34]. Many recent studies in EL only considered in-KB entities [3,4,45]. Out-of-KB mention discovery is also distinct from zero-shot EL [36,45], as the latter targets in-KB entities.
Recent studies form NIL clusters for the resolution of out-of-KB mentions into potential concepts or entities [1,20,26]. The NIL clusters are formed either with a threshold-based approach or as a post-processing of the NIL mentions after their discovery, as summarized in Ji et al. [24]. Our methods on NIL entity representation & classification are independent of clustering methods, and we leave the combination and comparison with them for a future study.
Out-of-KB EL Benchmarking. We summarized most of the NIL-labelled EL datasets in the introduction, including ShARe/CLEF 2013 [41], NILK [23], the NEEL 2015-2016 challenges [37], and CLEF HIPE 2020 [12]. The earliest dataset is the TAC 2011 Knowledge Base Population Track [24], which has NIL mentions to enrich a KB derived from Wikipedia infoboxes; however, it is not freely available and the data size is small (around 1,000 mentions in training and testing, resp.). Another recent clinical note dataset is 2019-n2c2-MCN [29], which is not yet openly available for new users beyond the previous registration for the challenge. Another recent general-domain dataset is the EDIN benchmark [26], which enriches older versions of Wikipedia using text mentions from news articles. The EDIN benchmark requires an adaptation set of documents, which may not always be available. We summarized and applied strategies (manual labelling, KB pruning, KB versioning) for NIL-enhanced EL data creation. As far as we know, all previous studies have considered only one of the strategies, and we are the first study encompassing all three strategies for dataset construction and benchmarking.
Regarding evaluation, overall accuracy [7,16,25] is commonly used, which does not reflect the full picture. As far as we know, our study is the first to apply out-of-KB Precision, Recall, and $F_1$ scores.
The research endeavour that most resembles ours is the ongoing work of Möller [33] on EL for KB enrichment. This study targets news events to enrich Wikipedia. Similar to the plans in Möller [33], our future work will canonicalise the new mentions by, for example, grouping and naming them, and placing them in the KB.

CONCLUSION
We introduced the task of out-of-KB mention discovery from texts and proposed BLINKout, which utilizes a dynamic NIL representation & classification approach, enhanced with synonyms, founded on BERT-based Entity Linking. We also provided strategies, KB Pruning and KB Versioning, to construct out-of-KB datasets. The approach has been tested on datasets from various domains to enrich medical ontologies and WikiData. The method in this work, while extending the BERT-based Entity Linking (BLINK) approach [45], also has the potential to be applied to recent end-to-end Entity Linking methods [2]. Future studies will focus on the canonicalisation and placement of out-of-KB entities in a KB.

A PARAMETER SETTINGS AND TUNING

We followed Wu et al. [45] to use a maximum of 32 tokens for a mention in context and 128 for an entity candidate (including its synonyms; 32 for NILK-sample), a learning rate of 3e-05, a bi-encoder training batch size of 16, and a cross-encoder training batch size of 1. We used the "large" version of the LM (e.g., BERT-large, 680M parameters over the two bi-encoder models) in the bi-encoder and the "base" version of the LM (e.g., BERT-base, 110M parameters) in the cross-encoder. For SapBERT and PubMedBERT, we used the "base" version for the bi-encoder as no "large" version is available.
The parameter in joint learning, $\lambda_{\mathrm{NIL}}$, was tuned to 0.25 (as in [16]) for the ShARe/CLEF 2013 and MM-pruned-0.2 datasets, to 0.2 for MM-pruned-0.1, 0.05 for MM-2014AB, and 0.01 for NILK-sample, based on the validation set. We find that the tuned value of $\lambda_{\mathrm{NIL}}$ is close to the percentage of out-of-KB mentions in Table 1 for each dataset.
We optimised the bi-encoder and cross-encoder using AdamW [28], with 3 and 4 epochs, resp. (except for NILK-sample, 1 epoch each due to the large data size). We trained all models with fixed random seeds to obtain reproducible results; the actual variance with random seeds was low, less than 1% (e.g., 0.85% on MM-2014AB with BLINKout, $k$=10, over three runs).

Time Used in Training and Inference. The training of the bi-encoder with SapBERT on ShARe/CLEF 2013 took approx. 0.63 hours; the cross-encoder with SapBERT took approx. 0.88 hours when $k$=10 and approx. 8.67 hours when $k$=150. The inference stage is very fast compared to the training: inference for a new mention took approx. 0.06 and 0.57 seconds for $k$=10 and $k$=150, resp.

C RESULTS FOR IN-KB ENTITIES
In-KB EL results on the four datasets are presented in Table 5.

D ZERO-SHOT EL AND SYNONYMS
Zero-shot entity linking results w.r.t. the use of synonyms to enhance BLINKout are presented in Table 6. Results show that in the zero-shot setting of the testing sets, i.e., after removing testing mentions whose entities overlap with the training set, the performance of out-of-KB mention discovery (e.g., $F_{1,out}$) with BLINKout is less sensitive to synonym enhancement. The inference of out-of-KB entities may be affected when ranking the unseen entities, which have abundant synonyms not presented during training.

"Figure 2 :
Figure 2: Strategies for NIL-enhanced EL dataset construction: Manual Labelling, KB Pruning, and KB Versioning. The specified UMLS ontologies are the subset linked to SNOMED CT under the semantic type T047, "Disease or Syndrome".

Figure 3: Out-of-KB $F_1$, in-KB $F_1$, and overall accuracy w.r.t. the number of top-$k$ candidates using BLINKout (SapBERT) on the MM-2014AB validation set.

Figure 3 displays the change of $F_{1,out}$, $F_{1,in}$, and the overall accuracy $A$ with respect to the number of top-$k$ candidates on MM-2014AB with the BLINKout model (SapBERT).

Table 1: Statistics of out-of-KB linking datasets. Slashes separate the statistics of the training, validation, and testing sets. "MM" denotes MedMentions; pruned-0.1 and pruned-0.2 denote the percentage of pruned concepts in the ontology; "2014AB" refers to the older version of the UMLS applied; "NILK-sample" denotes the random sample of the NILK dataset that enriches WikiData.

Table 3: Comparison results among NIL entity representations ("rep"), either fixed or fine-tuned.

Table 4: Examples of out-of-KB mention discovery from clinical notes (ShARe/CLEF 2013), biomedical publications (MedMentions), and Wikipedia articles (NILK) with the top prediction from four BERT-based Entity Linking models: Threshold-based BLINK, NIL representation based BLINK, BLINKout (tuned), and BLINKout (tuned) with joint learning. Normalised prediction scores (after softmax) and the rank (if not 1st) are displayed after the predicted entity. Wrong predictions are marked in red. For the three datasets, ShARe/CLEF 2013, MM-pruned-0.1, and NILK-sample, the threshold for Th-based BLINK was 0.95, 0.80, and 0.45, and the tuned number of top-$k$ for BLINKout was 150, 50, and 50, resp.

Table 5: Comparison results for in-KB Entity Linking (Precision, Recall, and $F_1$ scores for in-KB entities). *The parameters $k$ and NIL rep are tuned using BLINKout (SapBERT) for ShARe/CLEF 2013 and the MM datasets, and BLINKout (BERT) for NILK-sample.

Table 6: Comparison results on zero-shot (ZS) testing sets of BLINKout w.r.t. synonym enhancement. Their differences ("Diff") from the original data setting are based on Tables 2 and 5.