An Experiment in Retrofitting Competency Questions for Existing Ontologies

Competency Questions (CQs) are a form of ontology functional requirements expressed as natural language questions. Inspecting CQs together with the axioms in an ontology provides critical insights into the intended scope and applicability of the ontology. CQs also underpin a number of tasks in the development of ontologies e.g. ontology reuse, ontology testing, requirement specification, and the definition of patterns that implement such requirements. Although CQs are integral to the majority of ontology engineering methodologies, the practice of publishing CQs alongside the ontological artefacts is not widely observed by the community. In this context, we present an experiment in retrofitting CQs from existing ontologies. We propose RETROFIT-CQs, a method to extract candidate CQs directly from ontologies using Generative AI. In the paper we present the pipeline that facilitates the extraction of CQs by leveraging Large Language Models (LLMs) and we discuss its application to a number of existing ontologies.


INTRODUCTION
Competency Questions (CQs) [10] are a cornerstone of many ontology engineering methodologies [17,21], as they capture the tasks that arise in enterprise engineering and the consequent requirements that must be satisfied by the resulting ontology. CQs are natural language questions, together with their related answers, that an ontology-based application must be able to answer correctly, thereby ensuring that the resulting system has the relevant knowledge to achieve its intended purpose. They not only serve as a litmus test for the development and evaluation of an ontology [17], but also provide a common solution for modelling functional requirements [9] in most traditional and agile ontology engineering methodologies, such as Pay as you go [25], NeOn [27], eXtreme Design [22], Ontology Development 101 [17], On-To-Knowledge [26] and LOT [21]. Furthermore, they support the verification and evaluation [6,12] of the ontological artefact being built, as their "answerability" becomes a functional requirement [24].
Although several guidelines and methodologies (e.g. MIRO [15] or LOT [21]) recommend that CQs are made available alongside an ontology artefact, they are often not published as part of the ontology documentation [8,28], despite this being identified as a limitation both in the literature [7,12,21] and by the practitioners themselves [2]. This can be problematic for third party developers who need to assess the correctness and reusability of an ontology. In such cases, they often have to resort to manually inspecting the classes and properties of an ontology to decide whether it should be reused; this is a major limitation [2], as the outcome can be highly subjective and depends on their level of expertise in ontology development. Some approaches have emerged that assess ontologies for their reusability with respect to a set of requirements [1,5]; however, they rely on the availability of CQs in addition to the ontology itself. Thus, a mechanism by which CQs could be elicited from a corresponding ontology would greatly facilitate its reusability.
In this paper, we explore the feasibility of retrofitting CQs from existing ontologies, by using the class and property labels in order to generate usable questions that should capture the intended scope of the ontology. To achieve this, we exploit recent advances in Large Language Models (LLMs) using tools such as ChatGPT and LLaMA 2. Thus, our contribution addresses the following research question: Q1: To what extent is it possible to generate "usable" CQs from existing ontologies that are representative of the scope and tasks for which the ontology was designed? Here, by "usable" we mean CQs that could be used in place of the original ones written by the ontology engineers (henceforth called design CQs) for new tasks, e.g. assessing reusability with respect to new requirements. We present a pipeline that parses an ontology to extract relevant information and uses it to instantiate a prompt for the automatic generation of candidate CQs by an LLM. The validity of our proposal for candidate CQ generation is then evaluated by conducting two experiments. In the first, the accuracy of the generated CQs is assessed with respect to the published CQs for the same ontology using CORAL [8], a comprehensive repository of CQs, together with a dedicated CQ dataset [20]. In the second, we solicit an ontology developer for an assessment of the CQs generated from one of their ontologies, and we ask them specifically to identify any questions that in hindsight would have been useful CQs at the time of designing the ontology. The results confirm the hypothesis that generative models can be used to retrofit usable CQs.
The paper is structured as follows: our research is framed with a discussion on the background in Section 2, and the proposed approach is presented in Section 3. Section 4 presents the empirical evaluation whilst results are discussed in Section 5. Finally, the conclusions and future trends are outlined in Section 6.

BACKGROUND
Several efforts have investigated approaches that identify CQ templates and patterns, or define a Controlled Natural Language that facilitates their formalisation into a target logic or query language [12]. Both Ren et al. [24] and Bezerra and Freitas [6] analyse a set of ontologies to identify CQ patterns. Ren et al. [24] define a set of CQ "archetypes", i.e. syntactic patterns of CQs to be filled in by domain experts; e.g. "Which [CE1] [OPE] [CE2]?", where CE1 and CE2 are class expressions (or individuals, in certain cases), and OPE is an object property expression. These patterns support ontology engineers in the formulation of machine processable CQs that can be used for ontology testing. Similarly, Bezerra and Freitas [6] identify several patterns that can be instantiated with elements from the ontology vocabulary in order to specify CQs that can be used for the automatic testing and validation of an ontology.
Wiśniewski et al. [28] catalogued 106 distinct CQ patterns determined over the CQs of five ontologies: Software Ontology (SWO), Stuff Ontology (Stuff), African Wildlife Ontology (AWO), Dementia Ambient Care ontology (Dem@Care), and Ontology of Datatypes (OntoDT) [28]. These were analysed and organised in different categories for each of the ontologies. Keet et al. [13] built on this work with the proposal of CLaRO, a template-based controlled natural language to author CQs with 93 templates and their 41 variants.
Recent advances in Large Language Models (LLMs) have positioned them as a promising approach for the automatic generation of questions [16] in natural language. Auto-regressive LLMs (such as the GPT family [18]) are deep learning models trained on huge corpora of data in order to predict the next word in a sequence, given all of the previous words encountered. In particular, a novel paradigm for text generation is emerging, where LLMs are given a 'prompt' in order to generate a desired output [16]. In this context, prompting consists of prepending a string to the context given to the LLM [14]; this string includes some control element, such as a keyword, to guide the generation of the text.

THE RETROFIT-CQ PIPELINE
The RETROFIT-CQ pipeline has been designed to explore the feasibility of retrofitting CQs based on the ontology vocabulary and structure. It does this through a number of stages: the ontology is parsed to identify and extract the triples representing statements in the ontology; these are then used to generate prompts that are fed into an LLM in order to generate the candidate CQs. We investigate whether different zero-shot prompts and the use of different LLMs have an effect on the CQs generated. Figure 1 illustrates the main steps in the RETROFIT-CQ pipeline: 1) triple extraction; 2) question generation; and 3) question filtration. Each of these steps is detailed in the following subsections.

Extracting Triples from Ontologies
This task parses an ontology to generate statements in the form of RDF triples, from which we extract the triple components (s,p,o) representing the 'Subject', 'Predicate', and 'Object' of a statement respectively. In this preliminary study we assume that resources in the triples are represented by HTTP URIs and have readable local names, i.e. names that are meaningful for human readers [4]. Ontologies with opaque local names [4] (e.g. Wikidata Q-items) are therefore excluded. Likewise, we exclude from the list of generated triples those that have blank node identifiers as subject or object, since they would require dedicated processing.
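The extraction step described above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: it assumes the triples have already been read from the ontology (for instance with RDFLib) and are available as plain strings, that blank nodes carry the N-Triples `_:` prefix, and that the local name is whatever follows the last `#` or `/`. The `example.org` namespace and the function names are hypothetical, and literal objects are not treated specially.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def is_blank(term: str) -> bool:
    """Assumption: blank nodes are written with the N-Triples '_:' prefix."""
    return term.startswith("_:")


def local_name(uri: str) -> str:
    """Return the readable part of a URI: everything after the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[cut + 1:]


def extract_statements(triples: Iterable[Triple]) -> List[Triple]:
    """Keep (s, p, o) statements whose subject and object are not blank nodes."""
    statements = []
    for s, p, o in triples:
        if is_blank(s) or is_blank(o):
            continue  # blank nodes would require dedicated processing
        statements.append((local_name(s), local_name(p), local_name(o)))
    return statements


EX = "http://example.org/videogame#"  # hypothetical namespace
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
sample = [
    (EX + "Multiplayer", RDFS + "subClassOf", EX + "Achievement"),
    (EX + "Player", RDFS + "subClassOf", "_:b0"),  # dropped: blank node object
]
statements = extract_statements(sample)
print(statements)
```

Only the first sample triple survives, reduced to its three local names, which is the form passed on to the question generation step.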

LLM-based Question Generation
In this step we generate the set of prompts that instruct an LLM to generate questions regarding the list of statements (triples) produced in Section 3.1, as illustrated in Algorithm 1. For the purpose of this investigation, we consider different prompt instructions: a general one (Prompt 1) and two with added context (Prompts 2 and 3). The rationale for using entire statements, rather than individual triple components separately, is to provide a contextual boundary for the LLM, thereby limiting the generation of out of scope questions. For example, consider a Solar System ontology that contains the triple ['Hippocamp', 'type', 'Solar_System_Satellite']. For an LLM (e.g. ChatGPT 3.5) the term 'Hippocamp' can equally refer to the 'Hippocampus' in brain anatomy and to the 'Hippocamp' moon in astronomy. Hence, given the term alone, it will generate questions that refer to both these meanings, e.g. 'What is the primary function of the Hippocamp?' and 'Is Hippocamp a satellite of Neptune?'. By using the entire statement we provide a form of disambiguation.

Algorithm 1 (closing steps only; the earlier steps and the two values joined into the file name are not recoverable here):
    filename ← "questions_" + … + "_" + … + ".csv"
    save_to_csv(questions, filename)
    end for
    end for
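The generation loop can be sketched along the following lines. This is an illustration under stated assumptions, not the authors' Algorithm 1: the function names are hypothetical, the model call is abstracted as an `ask_llm` callable, and `canned_llm` is a stub that returns the two example questions quoted in the text instead of contacting a real model.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Row = Tuple[str, Triple, str]


def generate_questions(
    statements: List[Triple],
    prompts: Dict[str, str],
    ask_llm: Callable[[str], List[str]],
) -> List[Row]:
    """Ask the model about every statement with every prompt instruction."""
    rows: List[Row] = []
    for prompt_id, instruction in prompts.items():
        for triple in statements:
            # The whole statement is appended, not its components one by one,
            # so the model sees the term together with its context.
            prompt = f"{instruction} {list(triple)}"
            for question in ask_llm(prompt):
                rows.append((prompt_id, triple, question))
    return rows


def canned_llm(prompt: str) -> List[str]:
    """Stub standing in for a real LLM call (assumption, for illustration)."""
    if "Solar_System_Satellite" in prompt:
        return ["Is Hippocamp a satellite of Neptune?"]
    return ["What is the primary function of the Hippocamp?"]


prompts = {"P1": "Based on <statement>, generate a list of relevant question"}
statements = [("Hippocamp", "type", "Solar_System_Satellite")]
rows = generate_questions(statements, prompts, canned_llm)
term_only = canned_llm("Hippocamp")
print(rows)
print(term_only)
```

Passing the model call in as a parameter keeps the loop independent of any particular LLM, which mirrors the paper's aim of a pipeline that does not depend on the choice of prompt or model.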

Question Filtration
In this step we eliminate redundant questions by detecting duplicates, as well as questions that refer to the modelling primitive used, such as "Is Multiplayer a class?". These questions should be excluded, as CQs are designed to be independent of the chosen modelling style. In this step we also eliminate questions that require some subjective assessment or that require text generation, e.g. "Could you envision a future where multiplayer games abandon traditional achievements in favour of more dynamic, player-driven goals and objectives? Why or why not?". The primary aim of CQs is to scope the ontology and to provide context in terms of how, where, when, why, and who [25]; therefore questions that require the generation of a narrative are not suitable CQs. At the end of this step, we have a set of well formed generated CQs.
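A minimal sketch of such a filter is given below. It is not the pipeline's implementation: `difflib.SequenceMatcher` from the Python standard library is swapped in for the approximate string matching, and the 0.9 threshold, the modelling-term pattern and the list of subjective markers are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import List

# Hypothetical patterns: questions about the modelling primitive itself.
MODELLING = re.compile(
    r"\b(a|an)\s+(class|subclass|object property|datatype property|individual)\b"
)
# Hypothetical markers of questions that call for an opinion or a narrative.
SUBJECTIVE_MARKERS = ("could you envision", "why or why not", "do you think")


def normalise(question: str) -> str:
    """Lower-case and collapse whitespace before comparing questions."""
    return re.sub(r"\s+", " ", question.strip().lower())


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Approximate string match; the threshold is an assumed value."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold


def filter_questions(questions: List[str]) -> List[str]:
    kept: List[str] = []
    for question in questions:
        low = normalise(question)
        if MODELLING.search(low):
            continue  # tied to the modelling style, not to the domain
        if any(marker in low for marker in SUBJECTIVE_MARKERS):
            continue  # needs a subjective or narrative answer
        if any(is_near_duplicate(question, other) for other in kept):
            continue  # redundant with a question already kept
        kept.append(question)
    return kept


raw = [
    "Does every player have a username?",
    "Does every player have a username ?",
    "Is Multiplayer a class?",
    "Could you envision a future where multiplayer games abandon "
    "traditional achievements? Why or why not?",
    "What is the username of the player?",
]
candidates = filter_questions(raw)
print(candidates)
```

Keyword lists like these are brittle; they are used here only to make the three exclusion criteria concrete.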

Implementation of the Method
The RETROFIT-CQ pipeline is implemented in Python 3.10.12, and is available in a GitHub repository as supplementary material. RDFLib 7.0.0 is used to process the ontologies and the RDF statements, and FuzzyWuzzy 0.18.0 is used to detect and remove duplicate questions by using approximate string matching.
In order to mitigate the bias on the questions generated by the choice of a specific LLM, we run our experiments using three LLM systems, with the following configurations:
(1) gpt-3.5-turbo model API: released by OpenAI. We use the pre-trained model with the maximum requested token value set to 4,096 tokens. Note that this is the current maximum value for the gpt-3.5-turbo model, with 1 token approximately corresponding to 4 characters of English text.
(2) gpt-4 model API: released by OpenAI. We use the pre-trained model with the maximum requested token value set to 8,192 tokens (the current maximum value for the gpt-4 model).

EMPIRICAL EVALUATION
To evaluate the RETROFIT-CQ pipeline, two experiments were conducted: the first quantitatively matches the CQs generated with our pipeline against the corresponding CQs published for each ontology considered in our study, whereas the second solicits a qualitative assessment of the generated CQs from the ontology developer. We emphasise that the main aim of these experiments is to explore the feasibility of the proposed approach, thereby addressing the research question identified in Section 1, rather than to provide a comprehensive evaluation of the ability of LLMs to generate usable CQs. The first experiment evaluates the pipeline across three existing ontologies and their related CQs. For this experiment we utilised three ontologies from the CORAL CQ repository [8], one of which has also been used by Wiśniewski et al. [28]: Video Game [19]; VICINITY Core; and Dem@care. They were selected on the basis that they were produced by different developers and that each had a significant number of published CQs. The characteristics of these ontologies are included in Table 3. In the second experiment, we retrofit CQs to an independently developed ontology for which no explicit CQs were produced. This was then followed by an interview with the ontology developer to explore, at least anecdotally, the intent of an ontology designer when building an ontology. We selected the Solar System Ontology (with 337 triples), developed within our research group for a different project.

Generating Candidate CQs from Repositories of Requirements
The aim of this exploratory experiment is to assess the feasibility of the proposed approach, thereby addressing our research question, Q1: To what extent is it possible to generate "usable" CQs from existing ontologies that are representative of the scope and tasks for which the ontology was designed? In the experiment we exploit LLMs to automatically generate questions in natural language, a task for which LLMs have been successfully employed [16]. We use three different formulations of the prompt, adding further contextual information, to assess whether it would affect the questions generated. This allows us to address an additional, secondary research question, Q2: To what extent does the addition of specific context result in a more accurate generation of CQs? After having extracted the triples from each of these three ontologies, we use the selected LLMs, gpt-3.5-turbo, gpt-4, and Llama-2-70b-chat, to generate the questions from three prompt types:
• P1 - General Questions: this prompt instructs an LLM to generate questions for a given statement: ["Based on <statement>, generate a list of relevant question" + statement].
• P2 - Competency Questions: this prompt instructs the LLM to generate CQs for a given statement by explicitly providing the definition of CQs: ["Based on the <statement>, generate a list of competency question. Definition of competency questions: the questions that outline the scope of an ontology and provide an idea about the knowledge that needs to be entailed in the ontology." + statement].
• P3 - Use of a Role with Competency Questions: this prompt adds context by specifying the role of "Ontology Engineer" and instructs the LLM to generate CQs for a given statement by including the definition of CQs: ["As an ontology engineer, generate a list of competency questions based on the <statement>. Definition of competency questions: the questions that outline the scope of ontology and provide an idea about the knowledge that needs to be entailed in the ontology" + statement].
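As a rough sketch, the three prompt types can be instantiated by concatenating the quoted instruction with a statement. The instruction strings are copied from the text as printed; the separator between instruction and statement and the list-style rendering of the triple are assumptions, and `build_prompt` is a hypothetical helper name.

```python
from typing import Tuple

Triple = Tuple[str, str, str]

CQ_DEFINITION = (
    "Definition of competency questions: the questions that outline the scope "
    "of an ontology and provide an idea about the knowledge that needs to be "
    "entailed in the ontology."
)

PROMPT_TEMPLATES = {
    "P1": "Based on <statement>, generate a list of relevant question",
    "P2": "Based on the <statement>, generate a list of competency question. "
    + CQ_DEFINITION,
    "P3": "As an ontology engineer, generate a list of competency questions "
    "based on the <statement>. Definition of competency questions: the "
    "questions that outline the scope of ontology and provide an idea about "
    "the knowledge that needs to be entailed in the ontology",
}


def build_prompt(prompt_id: str, triple: Triple) -> str:
    """Append the statement to the chosen instruction (separator is assumed)."""
    return f"{PROMPT_TEMPLATES[prompt_id]} {list(triple)}"


triple = ("Multiplayer", "subClassOf", "Achievement")
p1 = build_prompt("P1", triple)
p3 = build_prompt("P3", triple)
print(p1)
```

All three prompts are zero-shot: none of them includes example CQs, so any difference in output comes from the added definition or role alone.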
The prompts generate questions for each extracted statement of type (s,p,o), where we filter out statements whose subject or object is a blank node, as discussed in Section 3. Table 1 illustrates some of the questions generated by each prompt for the triple (Multiplayer, subClassOf, Achievement) from the Video Game ontology. The candidate CQs resulting from the question filtration step are validated to check whether they match the design-stage CQs reported for these ontologies. In order to mitigate the effect of paraphrasing, or of the use of different morphological structures (e.g. plurals), on the similarity assessment, we employed SBERT, a variant of the pretrained BERT approach that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings, and that can be used for semantic similarity and paraphrase detection [23]. The results are summarised in Table 2, where for each ontology we report the following performance metrics for each prompt and LLM: number of generated questions (No. of Q.), mean questions per triple (Mean Q/T), filtered questions, i.e. the final output (No. Candidate CQs), number of candidate CQs validated against existing CQs (No. Validated CQs), and traditional precision, recall and F-measure values.
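The validation step can be illustrated with the sketch below. It rests on several assumptions that the text does not spell out: a candidate counts as validated when its cosine similarity to at least one design CQ reaches a threshold (0.8 here, a hypothetical value), precision is the share of validated candidates, and recall is the share of design CQs matched by at least one candidate. Toy two-dimensional vectors stand in for SBERT sentence embeddings.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def evaluate(
    candidates: List[Vector], design: List[Vector], threshold: float = 0.8
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under the assumed matching rule."""
    validated = sum(
        1 for c in candidates if any(cosine(c, d) >= threshold for d in design)
    )
    matched = sum(
        1 for d in design if any(cosine(c, d) >= threshold for c in candidates)
    )
    precision = validated / len(candidates) if candidates else 0.0
    recall = matched / len(design) if design else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy vectors standing in for sentence embeddings (assumption).
design_vecs = [(1.0, 0.0), (0.0, 1.0)]
candidate_vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (1.0, 1.0)]
precision, recall, f1 = evaluate(candidate_vecs, design_vecs)
print(precision, recall, round(f1, 3))
```

In this toy case three of the four candidates clear the threshold and both design CQs are matched. Because many candidates are generated per triple, recall can sit close to 1 while precision varies widely, which is the pattern reported in the results.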

Retrofitting CQ Results
The results obtained by retrofitting CQs to the Video Game, VICINITY Core and Dem@care ontologies show that all models generate a significant number of candidate CQs, as evidenced by the high recall scores. Our approach achieves a recall of 0.95 or above for all prompts and LLMs, with the three lowest scores recorded on the VICINITY Core ontology for gpt-3.5-turbo with Prompts 2 (0.95), 3 (0.95) and 1 (0.97), respectively. Therefore, we match the majority of the design CQs catalogued in CORAL for the three ontologies, and we confirm the usability of our generated CQs as they accurately identify the design CQs. The results, however, are not as clear when we consider precision. There are a number of design CQs that are not matched, as well as CQs that are not relevant, as can be seen in the column "(UnmatchedCQs) %" in Table 3. The worst overall precision is recorded for VICINITY Core, where precision varies between 0.39 and 0.80. For Prompt 3, which is where we provided the role of the ontology developer and the definition of CQs, all of the LLMs record lower precision: 0.46 for gpt-3.5-turbo, 0.55 for gpt-4 and 0.64 for Llama-2-70b-chat.
Regarding the choice of different LLMs, gpt-3.5-turbo yields the lowest precision for each of the three ontologies and prompts, except in the case of Prompt 3 when applied to the triples extracted from Dem@care, which achieves the best precision across the models. Adding contextual information to the prompt seems to yield a limited improvement in precision and in some cases no improvement at all (as in the case of VICINITY Core, where the highest precision for gpt-3.5-turbo is obtained when executing Prompt 1, and the precision for Prompt 3 over the three LLMs is almost always worse than the precision for Prompts 1 and 2). Some of these variations can be explained by the way the LLMs formulate questions about classes and their properties, especially for Prompts 2 and 3 where we specify the definition of CQs. In these cases, the LLMs formulate questions by asking about the property and its relation to the class, or conversely, about the class and its relation to the property. For example, for the triple where username is a datatype property whose domain is the class Player in the Video Game ontology, the LLMs generate the following questions for Prompt 3:
• What is the relationship between the username and the player?
• Is there a player associated with every username?
• Does every player have a username?
However, the injection of the definition of CQ and the ontology developer role results in candidate CQs expressed in terms of ontological modelling, and therefore far from what an ontology developer or engineer would write. Such a difference in phrasing style can potentially undermine the similarity matching of the CQs. For example, the question "How does a username connect a player to a text string?" generated by Prompt 3 used with gpt-4 does not match the design CQ "What is the username of the player?", despite having the same meaning. Some of the design CQs in the ontologies, however, are missed completely by this process. The reasons for this are discussed in more detail in Section 5.

Ontology Developer-Led Evaluation
The results obtained from the numerical evaluation presented in the previous section provide several useful insights for this exploratory study. However, the assessment of similarity between the design CQs and the candidate CQs cannot fully take into account the aims and intentions of the ontology engineer when writing the CQs. We therefore applied the RETROFIT-CQ pipeline to the Solar System Ontology and asked the ontology developer to evaluate the candidate CQs. The Solar System Ontology represents astronomical knowledge that is intended to support the generation of Multiple Choice Questions (MCQs) for exam papers aimed at secondary school students [3]. We interviewed the ontology developer to assess the candidate CQs and to identify: 1) correct candidate CQs, and 2) candidate CQs that did not match the design CQs but that in hindsight could have been considered valid for the purpose and scope of the ontology.
The methodology for this evaluation is similar to that used in the previous evaluation (and reported in Section 4.2); however, for this evaluation we only explore the use of Prompt 1 with the three LLMs. Our aim is to assess whether, even with the simplest of settings, the LLMs generate questions that an ontology engineer would consider usable and that reflect their design aims. The interview conducted with the ontology developer confirmed that the CQs generated by the pipeline accurately captured the initial requirements (with a precision over 0.75) and that all the design CQs were matched. Interestingly, the developer identified a number of additional candidate CQs that in hindsight captured some of the intended meaning of the ontology and could have been included in the initial requirement elicitation phase. Some of these CQs refer to specific named individuals, e.g. "How long has Tethys been in operation?", and were formulated differently. Others, however, were not included in the design CQs but capture the knowledge modelled in the ontology; e.g. "What are the technical criteria required for a celestial body to be classified as a planet?", or "What is composed of silicate rocks or metals?". Although ontology developers write CQs that are representative of the requirements of the ontology, there is no guarantee that the list produced is exhaustive or comprehensive, and thus by retrofitting CQs from the ontology, additional CQs can emerge. This additional set of CQs could also be useful when evaluating the ontology design by: 1) translating CQs into SPARQL queries; and 2) querying the populated ontology to stress test the ontology design process and anticipate unintended uses of the ontology.

DISCUSSION
The motivation for our work is to facilitate the reuse of ontologies that have no associated available CQs, by retrofitting a viable set of candidate CQs for such ontologies that reflect the original aims of the ontology engineer [1,5]. The results from the exploratory experiments described in Section 4 are encouraging and indicate the viability of using generative approaches to retrofit competency questions from existing ontologies when their design CQs are not available. We use the LLMs in their default configurations, as the focus of this exploratory study is the CQ generation, rather than an analysis of the performance of different LLMs. Even when using a combination of: 1) default settings, 2) prompts that do not use any examples or external KBs for inferring the related information, and 3) a pipeline that is agnostic with respect to the choice of prompt or LLM, we can still achieve recall values that are very close to 1. However, LLMs are very effective in generating natural language questions that are amenable to being paraphrased and reformulated, which is one of the reasons why we see such a large variation in precision values (ranging from 0.3956 to 0.9088). This seems to suggest that limiting the creativity of the LLMs can bring results that are more deterministic and produce fewer questions. In general, adding contextual information in the form of the definition of CQs seems to improve precision; however, with some LLMs it also increases the complexity and length of the questions that are framed in terms of ontological modelling, and these may thus be less similar to the types of CQs written by practitioners (as discussed in Section 4). One positive consequence of the RETROFIT-CQ approach is that the resulting CQs provide a snapshot of how the intended model has been represented in the ontology and can detect any unintended modelling consequences, as highlighted by the results of the ontology developer based validation. Furthermore, there are a small number of design CQs that are not matched, as shown in Table 3, which warrants a closer investigation. An analysis of the design questions that are unmatched reveals that these fall under (a combination of) 5 main categories: 1) questions requiring calculation or some aggregation function, e.g. "Who are the top 3 players in the game?"; 2) multi-hop questions (spanning more than one triple), e.g. "What functional areas are of clinical relevance for the home and nursing home environments?"; 3) CQs with no corresponding or ambiguous ontology content, e.g. "Which devices are located at UNIKL?"; 4) CQs that are poorly phrased, e.g. "Which properties from a panic button observed in events?"; and finally, 5) CQs whose answer involves some level of subjectivity from the developer's perspective, e.g. "Which devices can I see?".
Questions from category 1 require some level of calculation over the information modelled in the ontology, and LLMs have well documented limitations when reasoning over mathematical notions [11], which can lead to the generation of incorrect candidate CQs. Categories 3, 4, and 5 are all caused by the lack of consistent guidelines for writing competency questions [28], and display some of the problems identified in [24]. For example, the CQ given above in category 3 is not matched because the ontology does not contain a triple whose object is UNIKL (i.e. it does not contain any class or named individual whose label is UNIKL), and hence it violates one of the presuppositions in [24], which states that any element mentioned in a CQ should occur in the ontology (Presupposition Question Type). Some CQs are phrased in poor English, which increases the likelihood that they are not matched by any question generated by LLMs, which have been effectively used for Automatic Question Generation [16]. In other cases, the labels that are used to refer to the ontology elements can lead to confusion. For example, the Video Game ontology contains the triple (Multiplayer subClassOf Achievement). If the triple is taken in isolation (e.g. when an ontology developer who wants to reuse an ontology reads it without considering the documentation) it can be misleading, and this ambiguity is reflected in the questions generated by the LLM, as seen in Table 2. However, the documentation clarifies that the class Multiplayer models a Multiplayer Achievement, which would have been a more appropriate label. This type of issue is particularly troublesome when ontology developers who are seeking to reuse an ontology try to match terms they used in their requirements.
Category 2 questions highlight a different problem. Although our pipeline is configured to only generate single hop (as opposed to multi hop) questions, as we operate on individual statements (s,p,o), the results still suggest that we match the majority of design CQs (as the recall is close to 1). This also suggests that those design CQs are themselves single hop, possibly due to the way knowledge has been modelled. However, if we consider examples of enterprise ontologies, whose model is typically an abstraction of one or more database schemas, we see that CQs are typically more complex and often involve more than one entity; e.g. the CQ "How many orders were placed in a given time period per their status?" was used as an example in [25]. Therefore, future work should address the generation of candidate CQs from several triples at the same time.

CONCLUSIONS
This paper presents an exploratory study in retrofitting CQs on existing ontologies. It proposes a pipeline that exploits LLMs in order to automatically generate natural language questions about each triple in the ontology. In this study, we considered 3 LLMs, gpt-3.5-turbo, gpt-4 and Llama-2-70b-chat, and we investigated the use of different prompts, where: Prompt 1 asks the LLM to generate questions about a triple; Prompt 2 also adds the definition of CQ; and Prompt 3 extends this further by adding the role of ontology engineer. We evaluate the pipeline over three ontologies and their respective competency questions, and observe that our approach has a recall close to 1, i.e. the CQs generated match the design CQs, but with varying precision. We have investigated the reasons for the variations in precision, and we conducted an experiment with an ontology developer, who was asked to assess the veracity of the generated CQs. Furthermore, we analyse the potential reasons for the observed performance. As future work, we plan a more comprehensive evaluation to confirm the findings of this paper with respect to a larger corpus of ontologies and CQs. We will also investigate how adding further tasks, such as paraphrasing, translating, and comparing, can fully exploit the LLMs' language capabilities to produce a more comprehensive set of prompt templates.

Table 4 reports the result of applying RETROFIT-CQ to the Solar System Ontology and, for each LLM, lists: the number of generated questions (No. of Q.); mean questions per triple (Mean Q/T); the number of candidate CQs (No. of Candidate CQs), which also corresponds to the number of filtered questions; the number of validated CQs (No. of Validated CQs), representing the number of candidate CQs that were validated by the developer; and the precision (based on the ratio of validated CQs to the number of candidate CQs).

Table 2: Summary for each prompt in the LLMs: number of generated questions (No. of Q.), mean questions per triple (Mean Q/T), filtered questions in the final output (No. of Candidate CQs), number of candidate CQs validated against existing CQs (No. of Validated CQs), and performance metrics including Precision, Recall and F1 score.

Table 3: Descriptive statistics for unmatched CQs from RETROFIT-CQs based on different prompts: ontology name, LLMs, unmatched CQ count and unmatched percentage ((UnmatchedCQs) %), Mean, Standard Deviation (Std), word count range (Min, 0.25, 0.50, Max). The Std value of 0 occurs when the CQs have the same number of words and value, whereas (-) indicates that the value is not available due to there being only one unmatched CQ.

Table 4: Solar System Ontology. For each LLM using Prompt 1 we report: number of generated questions (No. of Q.), mean questions per triple (Mean Q/T), filtered questions (No. of Candidate CQs), developer-validated candidate CQs (No. of Validated CQs), and the Precision.