Automated Ontology Evaluation: Evaluating Coverage and Correctness using a Domain Corpus

Ontologies conceptualize domains and are a crucial part of web semantics and information systems. However, re-using an existing ontology for a new task requires a detailed evaluation of the candidate ontology as it may cover only a subset of the domain concepts, contain information that is redundant or misleading, and have inaccurate relations and hierarchies between concepts. Manual evaluation of large and complex ontologies is a tedious task. Thus, a few approaches have been proposed for automated evaluation, ranging from concept coverage to ontology generation from a corpus. Existing approaches, however, are limited by their dependence on external structured knowledge sources, such as a thesaurus, as well as by their inability to evaluate semantic relationships. In this paper, we propose a novel framework to automatically evaluate the domain coverage and semantic correctness of existing ontologies based on domain information derived from text. The approach uses a domain-tuned named-entity-recognition model to extract phrasal concepts. The extracted concepts are then used as a representation of the domain against which we evaluate the candidate ontology’s concepts. We further employ a domain-tuned language model to determine the semantic correctness of the candidate ontology’s relations. We demonstrate our automated approach on several large ontologies from the oceanographic domain and show its agreement with a manual evaluation by domain experts and its superiority over the state-of-the-art.


INTRODUCTION
An ontology is a collection of concepts and relations. Each concept is unique and is often characterized by multiple attributes. Ontologies usually describe a single domain and are used to abstract and formally define the semantic meaning of concepts in a domain and the relations between them [21]. In data integration, ontologies can play an important role by unifying and relating data elements under concepts despite their having different schemas [11]. For example, the concepts of address and residence can be placed under the concept of location, i.e., an address/residence is a location. Then, during the integration of the two data sources, the fields address and residence are mapped to a common concept, location. Another example from the domain of oceanography is Nutrients. Nutrients refer to the amount of dissolved inorganic macronutrients in seawater, such as Silicate or Phosphate. As in the location example, both concepts are grouped under a common ancestor, Nutrients. Thus, ontology-based data integration/access (OBDI/OBDA) [10, 49] requires the existence of an ontology that encompasses the knowledge domains of the datasets being integrated.
The construction of an ontology remains a challenge, and it is often painstakingly done manually by domain experts [2]. One of the main challenges with manually constructed ontologies is their inability to adapt to other tasks. Not only do they contain subjective knowledge that may be incompatible with the task at hand, but they may also lack important concepts and relationships required for the specific data integration task, or even just contain errors [20, 45]. Therefore, reusing manually constructed ontologies requires an evaluation step: more precisely, an evaluation of the relevance and coverage of the set of concepts contained in the ontology and of the semantic correctness of the relations between these concepts with respect to the domain.
Computer-generated ontologies are potentially far more robust in terms of size and scope, as they are based upon a comprehensive review of the domain and require repeated evidence for each proposed concept and relation [35]. However, these auto-constructed ontologies tend to lack nuanced information (such as definitions, constraints, or axioms) and are limited in the types of relations they generate. To take advantage of both the utility and nuance of manually constructed ontologies and the robustness of automated methods, one requires an automated method to evaluate and correct ontologies. In this paper, we suggest an automated ontology evaluation framework using a representative domain model to address this need.
Existing approaches to automated ontology evaluation with respect to a domain [6, 14] are limited in two respects. The first is concept extraction from the domain, a required step that generates the pool of concepts to which the ontology under evaluation is compared. Existing methodologies utilize Part-of-Speech (PoS) models that can only determine single-word terms. Multi-word phrases, such as Air Temperature, will be split into separate concepts (Air, Temperature). Moreover, when evaluating ontological relationships, existing approaches only consider their semantic relations with respect to external sources such as thesauri, dictionaries, and vocabularies, all of which are general-purpose and are not representative of the domain.
To address these gaps, we propose a novel automated evaluation framework able to determine both the semantic correctness of relationships between concepts and the completeness of an ontology with respect to a particular domain. To achieve this, the framework utilizes a language-model-based representation of the domain to serve as an authoritative source of truth. Our approach utilizes a Named-Entity-Recognition (NER) model, which can not only pick up multi-word phrases but also label their types. We employ a pre-trained bi-directional transformer-based language model, BERT [13], which serves as an auto-generated representation of the domain. We demonstrate our method over the oceanographic domain and show how the evaluation generates useful and actionable insights that can be used to improve the evaluated ontology. We further evaluate the language model and show it to be a good representation of the domain, comparable to human experts.
The remainder of this paper is organized as follows. Section 2 provides preliminary definitions, and Section 3 reviews previous work. In Section 4, our proposed automated evaluation method is described in detail. In Section 5, we demonstrate our evaluation method over three ontologies, and in Section 6, we perform a meta-evaluation of the method. Finally, Section 7 presents our conclusions and directions for future work.

BACKGROUND AND PRELIMINARIES
As defined by Gruber [21], "An ontology is an explicit specification of a conceptualization". The representation is made through a collection of concepts and relations between them. Formally:

Definition 2.1 (Ontology). Let $C$ be a set of concepts, let $R$ be a set of relation types, and let $A$ be a set of relation associations such that $A \subseteq \{(c_1, r, c_2) \mid c_1, c_2 \in C, r \in R\}$. Then an ontology is a triple $O := \langle C, R, A \rangle$.

Ontologies often describe a single domain and are used as the definitive source for the semantics of concepts in that domain. Ontologies are a crucial part of web semantics and information systems as they capture representations of the domain such that machines can interpret them. Such interpretations are mostly required in tasks such as information retrieval [37, 46, 50], data integration [17, 27, 52], and knowledge alignment [9, 24]. It is important to note that ontologies may also encompass additional knowledge, such as constraints, axioms, instances, and properties [21], which was omitted from the definition for simplicity. When evaluating an ontology, we use the term concept family, comprised of a parent concept and a set of direct child concepts. Formally:

Definition 2.2 (Concept Family). Let $O = \langle C, R, A \rangle$ be an ontology and let $r \in R$ be one of its relation types. Then $CF = \langle p, S \rangle$ is a concept family if and only if $p \in C$, $S \subseteq C$, and $(c, r, p) \in A$ for every $c \in S$.

Ontology construction by domain experts is a labor-intensive task. Thus, several (semi-)automated methods were suggested, using rule-based approaches [15, 25, 29, 33, 43], later advancing to techniques based upon Formal Concept Analysis (FCA) [12, 19, 47, 48] and Natural Language Processing (NLP) [2, 8, 16, 41].
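To make the two definitions concrete, the following minimal sketch encodes an ontology as a triple and derives a concept family from it; the data layout and names are illustrative choices, not an implementation from this paper.

```python
# Minimal sketch of Definitions 2.1 and 2.2; structure and names are
# illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Ontology:
    concepts: frozenset          # C: the set of concepts
    relations: frozenset         # R: the set of relation types, e.g., "is-a"
    associations: frozenset = field(default_factory=frozenset)  # A: (child, relation, parent) triples

def concept_family(onto: Ontology, parent: str, relation: str = "is-a"):
    """Return the concept family <parent, children> for a given relation (Def. 2.2)."""
    children = {c for (c, r, p) in onto.associations if r == relation and p == parent}
    return parent, children

nutrients = Ontology(
    concepts=frozenset({"Nutrients", "Silicate", "Phosphate"}),
    relations=frozenset({"is-a"}),
    associations=frozenset({("Silicate", "is-a", "Nutrients"),
                            ("Phosphate", "is-a", "Nutrients")}),
)
print(concept_family(nutrients, "Nutrients"))  # ('Nutrients', {'Silicate', 'Phosphate'})
```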
Both manually constructed and automatically constructed ontologies require evaluation before they can be reused for a new task. Throughout this paper, we will use the term candidate ontology to refer to the ontology being evaluated. Raad et al. [39] reviewed ontology evaluation methods and identified seven evaluation criteria, defined as follows.
• Accuracy refers to concept definition correctness.
• Completeness determines an ontology's coverage of the domain.
• Conciseness identifies irrelevant concepts in the ontology.
• Adaptability measures how well an ontology is suitable for its intended task.
• Clarity assesses how well the intended meaning of the ontology is being projected, i.e., concepts should be independent of the context.
• Computational efficiency measures the usage cost of the ontology in terms of performance.
• Consistency serves as a measure of contradictions within the ontology.
These criteria can be used in different evaluation methods, which were classified into the following four categories based on the artifact used as a basis of comparison to evaluate the candidate ontology.
(1) Gold standard-based methods compare an ontology with a previously (typically manually) created ontology.
(2) Corpus-based methods extract terms from a corpus and use them to determine the evaluated ontology's fit to the domain represented by the corpus. These methods focus on the accuracy, completeness, and conciseness criteria.
(3) Task-based methods evaluate the ontology by its fit to solve a specific set of tasks that the ontology is designed for, focusing on the adaptability criterion.
(4) Criteria-based methods evaluate the ontology by computing scores based on a set of rules and constraints. This evaluation is centered upon the structure of the ontology and often addresses the clarity criterion.
In this work, we propose a novel corpus-based method that covers four evaluation criteria, namely, accuracy, completeness, conciseness, and consistency.

RELATED WORK
Brewster et al. [6] first proposed to extract terms from a document corpus to assess an ontology's completeness. They further suggested using WordNet [34] to expand the list of extracted concepts, although it remains limited to single-word concepts. Additionally, they perform a vector-space similarity comparison between the text corpus and the ontology to assess accuracy. We extend this approach by supporting phrasal (more than one word) concepts, utilizing a concept extraction method tuned to the domain at hand, and addressing additional coverage-based criteria such as conciseness. In contrast with our work, their method does not address consistency, whereas we evaluate the correctness of relations within the candidate ontology. Furthermore, the accuracy and utility of the extracted concept set are suspect, as the authors employ the general-purpose WordNet thesaurus and a PoS tagger for this purpose. In this work, we create an accurate representation of the domain by utilizing a large language model extensively trained over a large representative set of documents from the domain.
DiGiuseppe et al. [14] proposed another corpus-based approach in which an ontology is generated from the corpus and compared to the candidate ontology. In their approach, concepts are extracted using PoS tagging and mapped via vocabularies to determine their synonyms and synonym symmetry. The synonym information is used to derive the concepts' hierarchy. The process results in a corpus-based ontology, which is then compared to the candidate ontology. The coverage analysis outputs scores for classes, class equivalence, hierarchies, and breadth. The approach is both a corpus-based and a criteria-based method. Again, only single-word nouns are considered, a limitation of PoS tagging. Furthermore, the external dictionaries used to determine the synonyms are general-purpose English dictionaries that do not reflect the true relations in the domain. In this work, we support multi-word concepts and utilize a domain-tuned language model to evaluate the candidate ontology's relations.
OOPS [38] is a web-based evaluation tool for OWL ontologies. Its evaluation is mostly based on lexical and structural patterns highlighting ontology pitfalls. This evaluation can be categorized as criteria-based since it employs rules and patterns. Although it can determine whether an ontology is aligned with common standards, it cannot assess the fitness of the ontology to a particular domain as our work does.
Ontologies are sometimes mentioned in relation to linked data [5]. However, while ontologies focus on the conceptual description of a domain, linked data refers to large sets of related entities representing instances of these concepts and relation types. Work on the evaluation of Linked Data (LD) has been proposed [18, 28, 40, 51], in which a rule-based approach is taken to find inconsistencies among data instances within an LD data source. In this work, we focus on evaluating ontologies rather than instances and records as in LD evaluation.
In a rare example of using large language models (LLMs) in the context of ontologies, Liu et al. [32] present an approach for placing a set of new concepts within an existing ontology. The authors utilize the BERT language model [13], specifically its next-sentence-prediction capabilities, to determine whether a hierarchical relationship between two concepts exists. They do so by pre-training BERT on corpus text from the domain, then fine-tuning it using a set of pairs of concepts that exhibit a taxonomic relationship (i.e., "IS-A"), taken from the SNOMED biomedical ontology. They then test the model over concepts from the latest version of the ontology that were not present in the previous version used as training data. The trained model yields an average of 95% recall and 85% precision. This suggests that a language model, such as BERT, can learn the semantic meaning of concepts and provide accurate relationship predictions even for unseen concepts. However, Liu et al. [32] do not attempt to evaluate an ontology; they only demonstrate the ability of an LLM to learn the semantics of the domain and the relations between its concepts. In this work, we utilize this ability to evaluate the completeness, accuracy, conciseness, and consistency of a domain ontology.

AUTOMATED ONTOLOGY EVALUATION
In this section, we describe our automated approach for evaluating an ontology with respect to a domain of interest. Our method allows the evaluation of completeness (coverage) and correctness (semantic relation coherence). Furthermore, we can use the evaluation's results to identify specific concepts missing from the ontology as well as misaligned semantic relations between its existing concepts.
Fig. 1 illustrates our proposed evaluation method in chronological order. In order to evaluate the candidate ontology, we must generate an accurate representation of the domain. This representation takes two forms. The first is a domain-trained language model (Domain BERT), used to judge relations between concepts using the semantics of the domain rather than their general-purpose use in English. The second is a collection of phrasal concepts (Domain concepts) extracted from the domain text corpus using a specialized named entity recognition (NER) model. The candidate ontology's concepts can then be compared to this concept collection. In the following sections, we detail each step in the proposed evaluation pipeline. We start by describing how a collection of documents, or text corpus (Section 4.1), is created, followed by our method for training a specialized NER model (Section 4.2) and pre-training a language model (Section 4.5). Using the NER model, domain concepts are extracted from the text (Section 4.3) and are then matched with the concepts in the candidate ontology (Section 4.4), from which a subset of this ontology (Sub-ontology) is derived. Next, using the pre-trained language model, an evaluation (Section 4.6) takes place, generating a set of scores that reflect the correctness and completeness of the candidate ontology with respect to the domain.

Document Collection
Document collection is a crucial step since it serves as the core of this pipeline. It is assumed that the corpus encompasses the knowledge required to represent the domain, since both the NER and language models are trained on it. The use of large collections of text to represent a domain is not new. Large text collections are routinely used in a variety of tasks, from training domain-specific language models [22] to guiding automated literature surveys [3]. We consider the use of peer-reviewed representations of domain knowledge created by thousands of experts to be a robust representation of the domain. Moreover, such ontologies often form the basis for ontology-based data access and data integration systems, e.g., [10], and since scientific datasets are often described by research papers, we expect the same concepts to describe these datasets. In this work, 10,000 papers from oceanographic journals were used, collected using a web crawler and the Crossref API based on previous research [44]. The papers were converted to raw text with ScienceParse, including only the title, abstract, and content.
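As an illustration of the collection step, the short sketch below queries the public Crossref REST API for work metadata; the query term and parameters are hypothetical, and the actual crawler from [44] may differ.

```python
# Hypothetical sketch of paper collection via the public Crossref REST API;
# the crawler and selection criteria used in the paper ([44]) may differ.
import requests

def crossref_search(query: str, rows: int = 100) -> list:
    """Fetch work metadata (titles, DOIs, etc.) matching a free-text query."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query": query, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["message"]["items"]

items = crossref_search("oceanography", rows=50)
dois = [item["DOI"] for item in items]  # DOIs to hand off to a full-text crawler
```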

Domain Specifc Named Entity Recognition
A typical NER model is capable of identifying phrases representing named entities in text, such as people (Marie Curie), places (Warsaw), or organizations (United Nations). But in order to be able to extract phrases representing (not necessarily named) concepts relevant to the domain, such as temperature, one must train a custom NER model [31]. Using the collected text corpus (Section 4.1), a domain-specific NER model is trained. NER models are often created through supervised or semi-supervised approaches, requiring manual annotation of a sample of the corpus by domain experts. This annotated sample can then be used by existing NER architectures (e.g., [1], which is used here) to train a domain-specific model.
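The sketch below illustrates this training loop with spaCy as a stand-in architecture; the label name and annotated example are hypothetical, and the architecture actually used in this work is the one cited as [1].

```python
# Illustrative sketch of training a domain NER model from expert annotations,
# using spaCy as a stand-in for the architecture cited in the paper ([1]).
# The label and training example below are hypothetical.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("MEASURED_VARIABLE")  # hypothetical domain entity type

# Expert-annotated sample: (text, character-offset entity spans).
TRAIN = [("Silicate was measured at several depths.",
          {"entities": [(0, 8, "MEASURED_VARIABLE")]})]

optimizer = nlp.initialize()
for epoch in range(10):
    for text, annotations in TRAIN:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)  # one gradient step per example
```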

Concept Extraction
Here, we use the previously described NER model to extract a set of concepts (hereafter, domain concepts). The NER model can detect multi-word phrases as well as semantically label them into classes, such as Organization or Measured Variable. After extracting the concepts, a threshold is applied to remove concepts with a small number of occurrences, assuming these are not representative of the entire domain but perhaps only of a small subset of it. The remaining concepts are considered to be the domain concepts (gray and red dots in Figure 2). We filtered and kept only concepts that appeared in at least ten different papers.
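A minimal sketch of this step, assuming a spaCy-style pipeline whose recognized entities are the extracted phrases; the document-frequency threshold of ten follows the text above.

```python
# Sketch of concept extraction with the trained NER model, pruning concepts
# that appear in fewer than ten distinct papers (as described above).
from collections import defaultdict

def extract_domain_concepts(nlp, papers, min_doc_freq=10):
    doc_freq = defaultdict(set)                  # concept -> papers it appears in
    for paper_id, text in enumerate(papers):
        for ent in nlp(text).ents:               # multi-word, typed spans
            doc_freq[ent.text.lower()].add(paper_id)
    return {c for c, ids in doc_freq.items() if len(ids) >= min_doc_freq}
```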

Ontology Subset Construction
Here, we construct a domain-relevant subset of the candidate ontology. It is assumed that concepts excluded from this subset are not relevant to the domain, an assumption we evaluate in Section 6. Fig. 3 depicts the process. We first match all concepts from the candidate ontology to the previously extracted domain concepts. We begin by standardizing the textual representation of both by lower-casing and lemmatization. For example, concepts containing words like Raining, Temperatures, and Solids become rain, temperature, and solid, respectively. After applying this process to both the domain concepts and the candidate ontology's concepts, we perform an exact-match search for overlapping concepts. We refer to these overlapping concepts as Shared Concepts (marked red in Figs. 2 and 3). Using the shared concepts, we traverse the candidate ontology's hierarchy such that every ancestor in the hierarchy of a shared concept (yellow) is included, as well as every direct child of that concept (green).
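A sketch of the derivation, assuming an NLTK-style lemmatizer and hypothetical `parents_of`/`children_of` maps encoding the ontology's is-a hierarchy:

```python
# Sketch of sub-ontology construction: normalize, intersect, then expand
# along the is-a hierarchy. `parents_of` maps a concept to its parent and
# `children_of` to its direct children; both are hypothetical helpers.
def normalize(term, lemmatizer):
    return " ".join(lemmatizer.lemmatize(w) for w in term.lower().split())

def sub_ontology(onto_concepts, domain_concepts, parents_of, children_of, lemmatizer):
    norm_domain = {normalize(c, lemmatizer) for c in domain_concepts}
    shared = {c for c in onto_concepts if normalize(c, lemmatizer) in norm_domain}
    keep = set(shared)
    for concept in shared:
        node = concept
        while node in parents_of:                 # every ancestor (yellow)
            node = parents_of[node]
            keep.add(node)
        keep |= set(children_of.get(concept, ())) # direct children (green)
    return shared, keep
```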

Language Model Pre-training
Using the previously mentioned text corpus (Section 4.1), we pre-train a BERT [13] language model such that it adjusts to the domain. This process entails feeding the model pairs of sentences from the corpus, some of which follow each other and some of which do not, letting the model learn to predict the most probable next sentence while also masking some of the words so the model learns to predict them. Previous research [23] has shown pre-training to increase the performance of downstream tasks utilizing such models. The final output of this phase is a pre-trained language model that is adapted to the domain. In the following section, we utilize this model's embedding layer to encode the concepts into a vector space for similarity evaluation.
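The recipe below is a minimal sketch of such continued pre-training (masked language modeling plus next-sentence prediction) with the Hugging Face transformers library; the corpus file name and hyperparameters are illustrative, not the paper's.

```python
# Minimal sketch of domain pre-training (MLM + next-sentence prediction);
# corpus path, block size, and training arguments are illustrative.
from transformers import (BertTokenizer, BertForPreTraining, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling,
                          TextDatasetForNextSentencePrediction)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# One sentence per line, blank lines between documents (hypothetical file).
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer, file_path="ocean_corpus.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-bert", num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()  # yields the domain-adapted "Domain BERT"
```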

Evaluation Measures
To determine the completeness of the candidate ontology, we compute three metrics: Ontology Relevance, Sub-Ontology Relevance, and Domain Coverage, based upon three quantifications of the concept overlap between the candidate ontology and the concept set extracted from the text corpus, representing the domain (see Venn diagram in Figure 2). $O$ denotes the number of concepts in the candidate ontology, $D$ the number of concepts extracted from the domain corpus after pruning (Section 4.3), $S$ the number of shared concepts found between the candidate ontology and the domain concepts extracted from the corpus, and $H$ the number of concepts in the subset of the candidate ontology constructed by taking the shared concepts and expanding them using the ontology's hierarchical relations (Section 4.4). The measures are defined as follows:

$$\text{Domain Coverage} = \frac{S}{D} \quad (1)$$

$$\text{Ontology Relevance} = \frac{S}{O} \quad (2)$$

$$\text{Sub-Ontology Relevance} = \frac{S}{H} \quad (3)$$
Revisiting the terminology introduced by Raad and Cruz [39], Eq. 1 represents a completeness measure, evaluating the completeness of the candidate ontology concept set with respect to the domain.Equations 2 and 3 measure conciseness, or the extent to which the candidate ontology (or its subset) is relevant to the domain.
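The three measures reduce to simple set arithmetic; the sketch below computes them from the ontology concept set O, the domain concept set D, and the expanded sub-ontology H (names follow the Venn diagram in Figure 2).

```python
# Sketch of the coverage/relevance measures (Eqs. 1-3) over the concept sets
# O (ontology), D (domain), and H (expanded sub-ontology, Section 4.4).
def coverage_measures(O, D, H):
    S = O & D                                        # shared concepts
    return {
        "domain_coverage": len(S) / len(D),          # Eq. 1, completeness
        "ontology_relevance": len(S) / len(O),       # Eq. 2, conciseness
        "sub_ontology_relevance": len(S) / len(H),   # Eq. 3, conciseness
    }
```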
Semantically similar concepts are expected to share properties [30]. Thus, defining measures that estimate this similarity is important. Therefore, to measure the correctness of the semantic relationships between concepts within the ontology, we define the following measures. All of the proposed measures rely on the cosine similarity between concept pairs in a vector space where each concept is represented by a high-dimensional vector. This representation is obtained by encoding the concepts using the domain-adapted BERT language model. Since the model is domain-adapted, the similarity of the concept vectors is derived from the similarity of their contextual environment in the document corpus. Thus, terms used in the same grammatical role in similar sentences will be similar in the vector space.
We now define three measures intended to evaluate a single concept family (hereafter CF, Definition 2.2). The first (CSS) represents an accuracy measure, as it evaluates the correctness of the CF as constructed. The final two measure consistency, as they measure the extent to which the same relations (is-a) within a CF agree with each other.
(1) Child Similarity Score (CSS) is the mean cosine similarity between every pair of siblings in a CF, where $n$ is the number of child concepts $c_1, \dots, c_n$:

$$CSS = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \cos(c_i, c_j) \quad (4)$$

(2) Parent Similarity Score (PSS) is the mean cosine similarity between the parent concept $p$ and each of its $n$ direct child concepts:

$$PSS = \frac{1}{n} \sum_{i=1}^{n} \cos(p, c_i) \quad (5)$$

(3) Parent Difference Agreement (PDA) makes use of the standard deviation $\sigma$ of the similarity between the parent concept and its direct children. We can interpret this value as the amount of agreement among the siblings towards the parent with respect to similarity. It is defined as:

$$PDA = 1 - \sigma\left(\{\cos(p, c_i)\}_{i=1}^{n}\right) \quad (6)$$

Using the defined measures, we iterate over all concept families within the ontology with two or more child concepts and compute the mean of CSS, PSS, and PDA. All of the values are within the range $[0, 1]$. Thus, having computed the measures, we determine the accuracy, completeness, conciseness, and semantic consistency using CSS, domain coverage, ontology relevance, and PDA, respectively.
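Assuming an `embed` function that encodes a concept string with the domain-adapted BERT embedding layer, the three per-family measures can be sketched as follows (the $1-\sigma$ form of PDA follows the reconstruction in Eq. 6):

```python
# Sketch of the per-family measures (Eqs. 4-6); `embed` is a stand-in for
# encoding a concept with the domain-adapted BERT model.
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def family_scores(parent, children, embed):
    p, cs = embed(parent), [embed(c) for c in children]
    sibling_sims = [cos(cs[i], cs[j])
                    for i in range(len(cs)) for j in range(i + 1, len(cs))]
    parent_sims = [cos(p, c) for c in cs]
    return {
        "CSS": float(np.mean(sibling_sims)),          # Eq. 4
        "PSS": float(np.mean(parent_sims)),           # Eq. 5
        "PDA": 1.0 - float(np.std(parent_sims)),      # Eq. 6 (as reconstructed)
    }
```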

EVALUATION
Here, we demonstrate our approach by performing an automated evaluation of three ontologies with respect to the oceanographic domain. We begin by describing the domain and candidate ontologies (Section 5.1), followed by the results and a comparison with previous work (Section 5.2). We then demonstrate how the measures can be used to improve an ontology (Section 5.3) and conclude with a discussion of the results (Section 5.4).

Domain and Candidate Ontologies
Following the previously described method (Fig. 1), we use a pre-existing corpus of 10,000 academic papers collected in the oceanographic domain (Section 4.1) and a NER model that was trained on it [4] (Section 4.2). Using the NER model and a general-purpose PoS tagger [36], we extract the domain concepts from the text corpus (Section 4.3). This phase generated 455,051 unique concepts. After applying additional constraints and filters such as frequency and term length, 17,516 concepts remained. We then pre-trained the BERT [13] language model on the corpus (Section 4.5), resulting in an oceanography-domain BERT model. Our candidate ontologies (Table 1) are ENVO [7], OMIT [26], and SWEET [42]. While both ENVO and SWEET are environmental ontologies, OMIT is a microRNA ontology. However, due to its relatively large size, it substantially overlaps the oceanographic domain. We match each candidate ontology to the set of domain concepts extracted from the document corpus (Section 4.4), allowing us to perform the evaluation (Section 4.6).

Evaluation Results
The results are presented in Table 2. In terms of relevance (conciseness measures, Eqs. 2 and 3) and domain coverage (Eq. 1), SWEET achieved the best results, with 34%, 48%, and 11%, respectively. In terms of consistency, OMIT achieved the highest PDA score of 92%, indicating a high level of similarity agreement among the children and the parent concepts (Eq. 6). Lastly, all ontologies received a CSS (Eq. 4) value between 71% and 72%, indicating an average level of accuracy. The source code and datasets used are available online. We now compare our method to the following recreation of Brewster et al. [6] (Table 3). The text corpus was fed into the LSA (Latent Semantic Analysis) algorithm with 20 clusters. From each cluster, the 15,000 most dominant words were fetched, resulting in a set of 33,754 unique words. Next, the WordNet expansion was applied, in which two levels of hypernyms were fetched for each word. This expansion resulted in a larger set of 42,603 unique terms. From here, coverage and relevance (recall and precision in the original work) were computed. Results show that LSA consistently assigns lower coverage and relevance figures than our method. We discuss this and the previous results in Section 5.4.

Improving an Ontology using CSS and PDA
Here, we demonstrate how one can utilize our method to improve an ontology. CSS and PDA are defined (Eqs. 4, 6) for a single concept family (Def. 2.2). Thus, to gain better insight into the type of problems the model has identified, or into specific relations that may be inconsistent and flagged for future repair, one can use detailed similarity matrices (Fig. 4) for a family that received low scores. The matrix presents the cosine similarity between every pair of child concepts in the family; CSS is the mean over this matrix's upper triangle. Fig. 4 presents such a matrix for all child concepts in the concept family of Environmental Condition. As highlighted by the colors, most child concepts are highly similar (>0.97 cosine similarity, colored red). However, the concepts environmental variability and altitudinal condition received a relatively low similarity score. Indeed (in this domain), these two concepts have a different relationship with the parent concept.
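Such a matrix is straightforward to compute from the same embeddings; in the sketch below, rows or columns with low values flag children that fit the family poorly.

```python
# Sketch of the pairwise child-similarity matrix visualized in Fig. 4;
# reuses the hypothetical `embed` function from the previous sketch.
import numpy as np

def child_similarity_matrix(children, embed):
    vecs = np.stack([embed(c) for c in children])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    return vecs @ vecs.T            # cosine similarity of every sibling pair
```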
To demonstrate how one can use these results to improve the ontology, we create an interim concept separating the set of concepts (temperate, tropical, subtropical, subpolar, polar, arid) from their original parent concept Environmental Condition. We consulted with domain experts, who suggested a few possible candidates. Out of the proposed candidates, the Climate Model concept achieved the highest value of PDA when introduced into the concept family (Fig. 5). As can be seen in the figure, introducing the new concept markedly increases the CSS scores.
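This selection step can be automated as a small search over the expert-suggested candidates; in the sketch below, the candidate names other than Climate Model are hypothetical, and `family_scores` is the earlier sketch.

```python
# Sketch of picking the intermediate parent that maximizes PDA for the
# regrouped children; candidate names other than Climate Model are
# hypothetical examples.
def best_intermediate_parent(candidates, children, embed):
    scores = {cand: family_scores(cand, children, embed)["PDA"]
              for cand in candidates}
    return max(scores, key=scores.get), scores

children = ["temperate", "tropical", "subtropical", "subpolar", "polar", "arid"]
best, scores = best_intermediate_parent(
    ["Climate Model", "Climate Zone", "Climate Regime"], children, embed)
```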

Discussion
Discussing the results obtained by our evaluation method over the different ontologies with domain experts yielded some interesting observations. The low overall relevance and domain coverage of OMIT, an mRNA ontology, was expected. However, the fact that we could extract a large and relevant sub-ontology from it using our method could form the basis for a future automated ontology construction method that obtains significant portions of partially relevant ontologies and pieces them together into a comprehensive domain ontology. The fact that the SWEET ontology, which purports to cover the entire earth science domain (including oceanography), scored so low on coverage was surprising. It prompted us to perform a meta-evaluation of our method to ensure we were looking for actual domain concepts and not irrelevant concepts indiscriminately collected from the text. The results of this meta-evaluation are presented in the following section. When comparing to the current state of the art [6], we obtain better coverage and relevance scores. This was expected, as the limitations of the LSA method cause it to miss phrasal concepts and many domain-specific concepts. We validate the assumption that the method indeed misses more of the domain concepts in the following section as well.
Introducing new intermediate concepts into an ontology is normally a manual and time-consuming process. However, as demonstrated here, using measures such as CSS and PDA, one can automate this process by finding the most suitable candidate concepts and testing which one best maximizes the measures.

META-EVALUATION
Evaluating an evaluation method requires special care as it must be based upon sound assumptions of what is considered a good result.
Here, we present a meta-evaluation of our proposed pipeline in two respects. We begin by measuring the external agreement of our method with our intended target audience, oceanographic researchers. We then perform a statistical analysis to see how the different evaluation measures agree with each other and provide different perspectives on the candidate ontologies.

External Agreement -Coverage
To validate the coverage values obtained for the ontologies, we collected domain-specific concepts from two oceanographic researchers, one from the marine biology sub-domain and the other from computational oceanography. We received 43 concepts the experts had suggested upon reviewing their latest publications. We then compared these to the domain concepts collected as described in Section 4.3 and to the three ontologies, ENVO, OMIT, and SWEET. If, indeed, our evaluation method is sound, the results should reflect a high agreement between the domain concepts and the experts' concepts. We found 88% of the experts' concepts in the domain concepts that were extracted from the text by our method, which, as expected, represents good domain coverage. ENVO, OMIT, and SWEET covered 23.2%, 13.9%, and 34.8% of our experts' concepts, respectively. The SWEET result is perfectly in line with our coverage score. The ENVO and OMIT results are higher, but this reflects an inherent bias in this meta-evaluation that over-represents marine biology concepts, which are more prevalent in these two ontologies than in the domain at large. To validate our assumption that the lower LSA method scores in Table 3 are due to its poor coverage of the domain concepts, we tested it here as well and found its coverage to be low, as expected, at 28%.

External Agreement -Accuracy and Consistency
Here, we evaluate whether our CSS and PDA measures indeed capture the accuracy and consistency of the ontology. We compare the ruling of two domain experts over parent-child concept pairs to the effect on our measures of including these pairs in their concept family. For pairs that our experts believe have a parent-child relationship, we expect their inclusion in the same concept family to increase the CSS and PDA scores; the reverse is also true. We randomly sampled 300 concept pairs with a hierarchical (is-a) relationship between them from the ENVO ontology, alongside 300 auto-generated pairs that do not have a hierarchical relationship. A few examples are displayed in Table 4.
Of the 600 pairs, only 326 were used, due to the experts' lack of familiarity with the others. Out of the 326 labeled entries, 144 were labeled true, and the remaining 182 were labeled false. We measured a Kappa agreement score of 0.75 between the two experts over their overlapping pairs, which can be interpreted as a substantial level of agreement. We iterate over each of the concept pairs and compute the following score before and after their inclusion in a concept family.
$$Score(CF) = CSS(CF) \times 0.9 + PDA(CF) \times 0.1 \quad (7)$$

In consultation with the domain experts, CSS was given a higher weight than PDA due to the nature of the task, which is to determine whether a concept belongs to the concept family. This decision, according to our domain experts, is substantially more impacted by the similarity to existing siblings than by the extent of difference from the parent. The final results of the model evaluation are as follows. Of 326 concept pairs, 127 are true positives, 139 are true negatives, 43 are false positives, and 17 are false negatives. Some example concept pairs are presented in Table 6. Thus, the model achieved an accuracy of 81%, a precision of 74%, a recall of 88%, and an F1 score of 81%, which strengthens the claim that the model's ability to correctly identify inconsistencies and inaccuracies is on par with that of domain experts.
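A sketch of this decision procedure, assuming a pair is predicted as a true parent-child relation when its inclusion does not lower the combined score; the exact acceptance rule is our assumption, and `family_scores` is the earlier sketch.

```python
# Sketch of the weighted score (Eq. 7) and a hypothetical accept/reject rule
# for judging a candidate parent-child pair against a concept family.
def inclusion_score(parent, children, embed):
    s = family_scores(parent, children, embed)
    return s["CSS"] * 0.9 + s["PDA"] * 0.1   # weights from expert consultation

def predict_is_a(parent, siblings, candidate, embed):
    before = inclusion_score(parent, siblings, embed)
    after = inclusion_score(parent, siblings + [candidate], embed)
    return after >= before                   # hypothetical decision rule
```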

Statistical Analysis
Here, we perform a correlation analysis between our consistency measures (Eqs. 4, 5, 6) using Spearman's correlation (the measures are not normally distributed). The measures are defined over a concept family (CF, Def. 2.2), and we wish to ensure that they measure different aspects of the CF. Results (Table 7) over the SWEET ontology (657 concept families) show a low to moderate correlation between the measures, confirming our assumption that they capture somewhat different aspects of the CF. An analysis of the other ontologies returned similar results, which are omitted for brevity.
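The analysis itself amounts to pairwise Spearman correlations over the per-family scores; a sketch, reusing `family_scores` from above:

```python
# Sketch of the correlation analysis: Spearman's rho between CSS, PSS, and
# PDA across all concept families of an ontology.
from itertools import combinations
from scipy.stats import spearmanr

def measure_correlations(families, embed):
    rows = [family_scores(parent, list(children), embed)
            for parent, children in families]
    return {(a, b): spearmanr([r[a] for r in rows], [r[b] for r in rows])
            for a, b in combinations(("CSS", "PSS", "PDA"), 2)}
```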

CONCLUSIONS AND FUTURE WORK
In this work, we showcase a novel approach for the automated evaluation of ontologies with respect to a domain. We do so by pre-training a bi-directional transformer-based language model in an unsupervised fashion on a text corpus from the domain. We define measures that make use of the language model to assess the accuracy and consistency of the ontology. Additionally, we use a NER model and PoS tagger to extract key concepts from the corpus, with which we create a concept set to evaluate the completeness and conciseness of an ontology. We validate the applicability of our approach by comparing the output of the model to that of domain experts. The results further strengthen the notion that language models such as BERT can adapt to and encapsulate domain knowledge that can be utilized for a variety of tasks. Additionally, we showcase the potential applicability of our tools in both detecting a problem in the ontology and solving it. In this work, only hierarchical relations were considered, due to the limitations of publicly available ontologies as well as computer-generated ontologies, which are simple and lack other kinds of relations. However, the method can be extended to other kinds of relationships, as we intend to do in future work.

A SUPPLEMENTARY MATERIAL
In order to verify the applicability of the pre-trained model, we showcase a few examples of the fill-mask task, in which the model is given a sentence with one of the tokens masked and must fill it in. The suggestions of both the original and the pre-trained model are presented in Table 8.
Table 9 showcases a set of concepts and concept families with respect to the relevant metrics. The Relevance column presents concepts from the different ontologies that do not belong to the domain of interest, whereas the Coverage column showcases concepts that exist in our domain concept set but do not appear in the ontology.
Finally, Table 10 presents a comparison of expert-provided terms and their presence in our domain concepts dataset as well as the ontologies we examined.

Figure (1) Ontology evaluation pipeline. Details of the steps can be found in the corresponding sections.

Figure (2) Concept extraction from text and the determination of shared concepts between the domain concepts and the ontology concepts.

Figure (3) Ontology subset derivation phase, in which shared concepts (red) are first identified among the candidate ontology's concepts. Next, the ontology's is-a hierarchy is used to add their ancestors (yellow). Finally, the children (green) of the shared concepts are added, and the remaining unconsidered concepts (blue) are removed.

Figure (5) Manual refinement of the Environmental Condition concept family. A new intermediate concept, Climate Model, is selected from a set of expert-suggested replacements using its PDA score, grouping similar concepts together. The numerical values represent the similarity to the direct parent (CSS).

Table (2) Automated evaluation results of three ontologies.

Table (4) Examples of real and fake concept pairs.

Table (6) Examples of concept pairs and how they were labeled by experts and predicted by the model.

Table (8) Comparison between BERT models before and after pre-training on the domain corpus on the fill-mask task. Example sentence: "the transition [MASK] between the warmer mixed water at the surface and the cooler deep water below."

Table (9) Examples of different concepts with respect to different metrics: concepts that are part of the ontology but have no relevance to the domain, concepts that are part of the domain but missing from the ontology, and concept families with high and low child similarity, parent similarity, and parent difference scores.