Knowledge Synthesis using Large Language Models for a Computational Biology Workflow Ecosystem

An understanding of the molecular basis of musculoskeletal pain is necessary for the development of therapeutics, their management, and possible personalization. One in three Americans use OTC painkillers, and one tenth use prescription drugs to manage pain. The CDC also estimates that about 20% of Americans suffer from chronic pain. As the experience of acute or chronic pain varies due to individual genetics and physiology, it is imperative that researchers continue to find novel therapeutics to treat or manage symptoms. In this paper, our goal is to develop a seed knowledgebase computational platform, called BioNursery, that will allow biologists to computationally hypothesize, define and test molecular mechanisms underlying pain. In our knowledge ecosystem, we accumulate curated information from users about the relationships among biological databases, analysis tools, and database contents to generate biological analysis modules, called π-graphs, or process graphs. We propose a mapping function from a natural language description of a hypothesized molecular model to a computational workflow for testing in BioNursery. We use a crowd computing feedback and curation system, called Explorer, to improve proposed computational models for molecular mechanism discovery and to grow the knowledge ecosystem. Since the pain knowledge ecosystem does not yet exist, we validate our approach over a similar application in fertility research.


INTRODUCTION
Evidence driven biology is usually very expensive because it demands laboratory experiments to validate testable hypotheses. Much of the evidence sought today is also available in the form of data [18,31] and knowledge [5,26] in various databases that can be used to inform hypothesis generation and refinement prior to laboratory testing. While computational tools and modeling [6,9] are being used by some researchers at great expense, there is a paucity of hypothesis testing tools that leverage the huge volume of complementary and redundant data, computational tools and models, and structured knowledge to aid researchers in generating novel theses.
For example, consider the fact that Americans are most likely the largest group of consumers of painkillers and opioids in the world. They spend more than $18 billion on pain medications, and pain contributes to about $635 billion in lost productivity [22,25]. While many pain medications generally manage pain, variations in genetics and physiology of individual patients often contribute to serious side effects such as acute liver failure, spinal cord damage, opioid addiction, and respiratory complications. Idiopathic pain is also difficult to treat and manage due to multiple or unknown etiologies. It is therefore imperative that new therapeutics are continuously developed to ease the incidence, prevalence and cost of treating and managing pain.
These needs notwithstanding, to investigate the prevailing state of addiction and pain and develop novel interventions, an inquisitive biologist has no simple way of posing the following query to a computational system and receiving a satisfactory scientific response: "Define the expression profiles of spinal cord pain gate interneurons in superficial layers of the dorsal horn (SDH) that play major roles in processing touch/thermal/pain signals from skin nociceptors. The interneurons can be either excitatory or inhibitory. They receive sensory input from dorsal root ganglia neurons and deliver these signals to projection neurons that relay the mechanical/thermal pain signals to the brain." The response should be sufficiently informative to allow one to formulate the hypothesis "Novel cell type specific ion channels or G protein coupled receptors expressed by SDH neurons may contribute to sensory information processing and can be modulated pharmacologically to mitigate transmission of pain signals to the sensory cortex in disease states." for laboratory testing. However, one way an informed biologist could gather this evidence is to follow the steps outlined below.
(1) Harvest data from the single cell database SeqSeek [27], and cluster expressed genes of different cell types. Medlock et al [19] describe a number of interneuron cell types based on gene expression data. These genes may be sufficient to complete the clustering and cell type identification, but if not, then other gene marker sets need to be identified from other public databases.
(2) Ion channels and G protein coupled receptors are widely considered important pharmacological targets for mitigating pain. Many of these protein complexes are broadly expressed in most neurons; however, we are interested in those with restricted expression profiles. To find them, the data from SeqSeek can be correlated with spatial expression patterns of the genes in the Allen mouse spinal cord atlas [1]. Spatial protein expression databases may also be used to identify the channels of interest.
(3) Cells that express restricted channels need to be identified using cell-type specific gene marker groups published in the literature or other databases.
(4) Where possible, the densities and electrophysiological properties of the channels should be determined so they can be modeled and inserted into computational pain gate models.
The question we address in this article is whether a generalized computational hypothesis testing tool can be developed that not only suggests the steps above, but also computes the model and responds affirmatively. We discuss the outline of a possible system that leverages available resources and a knowledge ecosystem for biological research, in particular, the pain research outlined above. We do so by presenting an interaction with the recently introduced smart query system ChatGPT [24] about pain models and identifying its potential as a hypothesis generator, its limitations as a trusted platform and its current handicap as a computational tool for hypothesis validation. Based on these observations, we propose a new biological hypothesis generation and testing system architecture, called BioNursery, in which we aim to determine if a proposed computational strategy is a valid model, and can be successfully tested over available digital resources. If valid, the corresponding computational model will be added to the BioNursery ecosystem. If not, we will submit the hypothesized model to the crowd (the community) for curation and re-evaluation until it is sufficiently refined. BioNursery thus serves as a knowledge ecosystem in which biological hypotheses are constantly generated, tested and validated, refined and re-used in creating new knowledge to advance science and its understanding.

LARGE LANGUAGE MODELS AS PART OF A HYPOTHESIS TESTING ECOSYSTEM
The recently introduced ChatGPT and BardAI are being touted as the "know it all" paradigm for question answering, as well as a computational platform for a no-code code generation system (e.g., Microsoft's GitHub Copilot) for the general public. Such fictitious and blatantly made-up references raise serious doubts about the accuracy and validity of the seemingly interesting and testable hypothesis of pain models we are interested in. Unfortunately then, it is difficult for a thoughtful scientist to trust the computational models suggested by ChatGPT. Interestingly, when it was asked if the model in Sec 1 proposed by the user could be an alternative candidate solution, ChatGPT responded as follows: "Your approach to achieving your goal of defining the expression profiles of spinal cord pain gate interneurons in the superficial layers of the dorsal horn (SDH) is reasonable." It also made the following completely reasonable observation:
This approach combines transcriptomic and proteomic data from publicly available databases to identify interneuron subtypes, ion channels with restricted expression patterns, and associated gene markers. This approach can be an approximation of the expression profiles of spinal cord pain gate interneurons in the SDH, but it may not capture all of the subtle differences between subtypes that can be revealed by single-cell sequencing. Nonetheless, it can be a useful starting point for identifying targets for further investigation.

BIOLOGICAL CONCEPTS AS 𝜋-GRAPHS
In BioNursery, we model online biological databases and analysis tools as functions that accept a set of inputs of specific types and return the results as a table. They are treated as resources and their functional abstraction is called a process. In our system, we string these processes in meaningful ways to construct computational pipelines. An algorithmic construction of such pipelines is aided by a knowledge ecosystem. This ecosystem is maintained and improved by the active participation of the community using a crowd computing architecture. In BioNursery, a hypothesis is a graph, more precisely a workflow, over a set of biological resources, and is called a π-graph. A node in a π-graph encodes a biological concept or hypothesis, and its computational interpretation, e.g., gene-disease association. We require that each hypothesis satisfy several consistency criteria to be useful. It is perhaps easy to see that a π-graph, or part of it, can be abstracted as a black box of a biological concept, i.e., a summarized [23] process with defined input-output behavior. Technically, a summarized π-graph, or a concept, is an admissible π-graph with a defined or unique root in which the root primarily characterizes the concept. We explain the idea using Fig 1.
The public internet resource DisGeNET is a discovery platform containing collections of gene-disease associations. DisGeNET integrates data from expert curated repositories, GWAS catalogues [30], animal models and the scientific literature. DisGeNET data are homogeneously annotated with controlled vocabularies and community-driven ontologies. Additionally, several original metrics are provided to assist the prioritization of genotype-phenotype relationships.
The schematic representation of the DisGeNET database as a resource is shown at the top of Fig 1. This is the view a user generally has of individual processes in BioNursery. The middle figure captures the fact that genes identified with "testicular defects" are selected from the MIK database [16] (i.e., process), and fed to the DisGeNET process, consequently resulting in the π-graph yielding gene-disease associations (GDA) for genes annotated with testicular defects, and conceptually summarized into a predefined process TestGDA as shown in Fig 1. Finally, although we can compose a similar π-graph by prepending a KEGG [17] process to the DisGeNET process, it will not yield a complete π-graph, because the corresponding processes have mismatched output-input parameters (discussed shortly). In BioNursery, our goal is to summarize π-graphs as high level implementations of biological concepts and use them whenever possible. Therefore, enhanced π-graphs consist of edges involving both processes and summarized processes.
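The coupling and summarization ideas in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the BioNursery implementation: the `Process` class, the `can_couple` and `summarize` helpers, and the semantic type names are all assumptions introduced here, and differences in gene ID formats are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """Functional abstraction of a resource: typed inputs in, a table out."""
    name: str
    accepts: frozenset   # semantic types the process needs as input
    returns: frozenset   # semantic types of the columns it returns


def can_couple(upstream: Process, downstream: Process) -> bool:
    """An edge is usable only if upstream supplies every input downstream needs."""
    return downstream.accepts <= upstream.returns


def summarize(name: str, chain: list) -> Process:
    """Collapse a well-coupled chain into one black-box concept."""
    for up, down in zip(chain, chain[1:]):
        if not can_couple(up, down):
            raise ValueError(f"{up.name} -> {down.name} has mismatched output-input types")
    return Process(name, chain[0].accepts, chain[-1].returns)


# Hypothetical type signatures, chosen only to mirror the example in the text.
mik = Process("MIK", frozenset({"Phenotype"}), frozenset({"Gene", "ChrLoc", "Disease"}))
disgenet = Process("DisGeNET", frozenset({"Gene"}), frozenset({"Disease", "Score"}))
kegg = Process("KEGG", frozenset({"Pathway"}), frozenset({"PathwayMap"}))

test_gda = summarize("TestGDA", [mik, disgenet])
```

Under these assumed signatures, `can_couple(mik, disgenet)` holds while `can_couple(kegg, disgenet)` does not, which mirrors the incomplete KEGG to DisGeNET composition described above.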

ARCHITECTURE OF BIONURSERY
The interaction with ChatGPT and the discussion in Sec 2 sufficiently highlight two realities: large language models (LLMs) such as ChatGPT or BardAI, or perhaps more appropriately BioMedLM, have significant potential in hypothesis generation and testing using a simple natural language prompt from a naive user, but they cannot be fully trusted because they have the capacity to convincingly fake the truth. The question then becomes whether it is possible to leverage the power of LLMs and shorten the distance between hypothesis generation and testing by using LLMs to generate the models and then validating the generated models by inserting humans in the loop as curators in the knowledge gathering ecosystem. The BioNursery architecture aims to answer this question affirmatively.
Fig 2 shows the main components of the BioNursery system and their functional relationships. In this system, user queries are processed by the user interface module, which transmits the query either to the BioSmart query translator via the digital assistant (when a process description is not included in the query, as in the infertility Gene-Disease Association example in Sec 6.2) for the generation of a process description, or directly to the BioSmart query translator (as in the pain example in Sec 2), for onward processing.
The BioSmart system analyzes the natural language description of the suggested computational procedure and maps it into a set of possible π-graphs using the BioNursery knowledgebase. Recall that the BioNursery knowledgebase includes all curated process descriptions in the form of conceptual resource descriptions (CRD) and node description of resources (NDR). If BioSmart fails to generate a π-graph, it initiates a crowd computing request by forwarding the missing process description specifications to the community curator database for help. Once a response is submitted by members of the community, it enters the system and is checked for completeness. Once a complete π-graph becomes available, it is then sent to the graph generator to generate a template for a Needle [12] query sequence as an implementation of the workflow, more specifically, a testable graph, or a testable hypothesis.
The testable graph is then converted to a VisFlow program, executed by the VisFlow system [21], and the responses are forwarded to the user for consideration. If the generated response is accepted by the user as valid, it is forwarded to the curation system for community curation. Once the procedure, or the π-graph, surpasses a credibility threshold, it is appended to the BioNursery knowledgebase as a valid process. The π-graph remains searchable and usable at all times, but starts with lower credibility until it reaches the credibility threshold and enters the curated knowledge pool. However, inclusion in the knowledge pool does not alter the attained credibility of the hypotheses.
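The promotion rule described above can be illustrated with a small sketch. The threshold value, the class names, and the way a credibility score arrives are all assumptions made for illustration; the text does not specify how scores are computed or what the threshold is.

```python
from dataclasses import dataclass, field

CREDIBILITY_THRESHOLD = 0.8  # assumed value; the text does not specify one


@dataclass
class Hypothesis:
    name: str
    credibility: float = 0.0


@dataclass
class KnowledgeBase:
    searchable: dict = field(default_factory=dict)   # every submitted graph
    curated: set = field(default_factory=set)        # names that reached the threshold

    def submit(self, h: Hypothesis) -> None:
        """A new graph is searchable immediately, with its initial credibility."""
        self.searchable[h.name] = h

    def update_credibility(self, name: str, score: float) -> None:
        """Record a new score and promote the graph once it reaches the threshold."""
        h = self.searchable[name]
        h.credibility = score
        if score >= CREDIBILITY_THRESHOLD:
            self.curated.add(name)   # promotion leaves the score itself untouched
```

In this sketch a graph stays usable before promotion, and entering the curated pool does not modify its attained credibility, matching the two properties stated above.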

Components of BioNursery
The BioNursery system has four interacting components as shown in Fig 2: the digital assistant (for user interaction), the BioNursery knowledge ecosystem (hypotheses dictionary as curated biological knowledge), natural language to hypothesis mapping (using graph representations of knowledge), and a computational hypothesis validator (as a workflow query processing engine over biological databases). These four major components accept a user query, generate a hypothesis using an LLM, map the hypothesis into a computational strategy using the BioNursery knowledgebase, translate the proposed strategy into an executable workflow query and return the response as the test response. In the event the system fails to generate a computational strategy, or the user disagrees with the generated response, a community feedback and curation request is generated with the goal of correcting or refining the knowledge behind the apparent failure.
4.1.1 The Digital Assistant. BioNursery uses ChatGPT as its digital assistant to generate query responses. Users interact with BioNursery through a front-end user interface that encapsulates its four main components in an attempt to present BioNursery as just a query answering system similar to Alexa and Siri, even though it performs significantly more tasks in the background to help maintain the knowledge ecosystem. The user interface accepts queries from the user, forwards the query to ChatGPT, collects the model from ChatGPT, forwards the model to the BioSmart system [14], and receives a response from the hypothesis validator when a response is computable (or from BioSmart, when a response cannot be successfully computed) for onward transmission to the user.

4.1.2 Computational Model to Strategy Mapper. Since the computational models generated by an LLM are always in a natural language of choice, they need to be converted or mapped to a computational language such as Python [3] or SQL [7] for execution.
While text to standard programming language mapping is maturing, such mapping technologies for non-traditional languages remain elusive and need to be considered on a case-by-case basis [4]. Experience with text to programming language mapping suggests [11] that the chances of successful mapping are higher when the target language is declarative [2]. In the case of workflows over heterogeneous biological databases, the need for schema matching and wrapper applications complicates workflow execution manyfold, and not many suitable declarative languages are available for data integration as part of a workflow language.
VisFlow [21] is a visual language that uses a declarative data integration and workflow language BioFlow [11] as its query engine.
In BioNursery, we too leverage BioFlow, the only declarative workflow language with data integration support available for workflow processing. However, BioFlow requires numerous resource specific details that can only be supplied by a dictionary or users. In other words, those details need to be defined a priori manually for a mapping algorithm to leverage. The ad hoc nature of supported queries in digital assistants over arbitrary resources, some never visited by BioNursery, makes it difficult to pre-fabricate the resource descriptions and amass them in the system knowledgebase. Instead, we design a two-level resource description structure. The first type of descriptions, the π-graphs, is highly abstract and independent of lower level details. This type of abstract resource description can also be used to support knowledge curation and collection by the research community (discussed next), and to construct abstract workflows to isolate meaningful ones for expansion. Consequently, we adapt the BioSmart system [14] to map natural language computational strategies to abstract workflow graphs, from which we generate executable BioFlow queries.
4.1.3 Workflow Query Generator. Since the π-graphs abstract individual resource descriptions in terms of their I/O behaviors, applicable tools, online locations, etc., they can be used directly as standalone applications, e.g., retrieving gene-disease association from the DisGeNET database. However, when these resources are part of a global workflow in which data moves from one resource to another in a pipeline, several aspects of resource compatibility, such as schema heterogeneity, data heterogeneity, and computational platform disparities, become salient. We note here that the query generator algorithm Needle for π-graph to VisFlow translation is capable of resolving a wide range of disparities. The algorithm is also capable of ranking the candidate workflows in descending order of relevance relative to the user query, allowing the selection of the most relevant hypothesis to be tested.

Crowd Computing as an Open Knowledge Ecosystem of Biological Hypotheses
As discussed in Sec 4, the BioNursery system relies mainly on three components: an LLM-based digital assistant, a process graph constructor and ranker, and an executable workflow query generator. The latter two components rely heavily on a fourth component, which we call the Open Knowledge Ecosystem (OKE), created and curated by the community of researchers and users using a crowd computing platform similar to CrowdCure [15], called the Crowd Curation System (CCS). CCS has access to three databases: a community of curators, a testable hypotheses repository (THR), and a biological resource description knowledgebase (RDK). Combined, they serve as the OKE of BioNursery.
Both THR and RDK are in constant flux. If a biological resource such as a database or an online computational tool is missing a description in OKE, a request for crowd help to describe it is received by CCS from the user interface of BioNursery. CCS then communicates this task as a broadcast to all relevant community users, who then respond to the request within a defined time frame, and the resource description enters the RDK as an accessible textual resource declaration, i.e., the create resource statements discussed in Sec 5. A graph, more precisely a node, representation corresponding to the resource description is also registered in the THR. A testable hypothesis in general is an abstract graph of resource descriptions. The workflow generator uses these descriptions to generate executable workflow queries.
These resource descriptions and the hypothesis graphs are constantly curated to ensure relevance and accuracy. Since the resource descriptions are either contributed by community users, or are harvested by automated web-crawlers, they can be erroneous either due to perception error or algorithmic deficiencies. Since meaningful hypothesis construction depends on the completeness and accuracy of the process descriptions, the hypotheses in turn can also be flawed. It is therefore imperative that some form of revision and corrective opportunity is available. In BioNursery, the crowd computing model uses the well developed Information Source Tracking (IST) method [29] to assign credibility to the resource descriptions. We refer readers to [29] for a discussion on the IST method, and to [15] for a discussion on CrowdCure in which IST has been used to assign credibility to stored information.

RESOURCE DESCRIPTION LANGUAGE
The resource description language we introduce is adapted from the language called Needle [12]. In Needle, we develop the concepts of processes and resources separately, and use an algorithm to construct admissible π-graphs from the process descriptions and stored knowledge. The following example illustrates the idea.
DisGeNET allows programmatic access to its content through several mechanisms. For example, direct HTTP access is allowed for specific genes using get or post methods. The problem is that many of the column names in DisGeNET tables use terms for human consumption and require subjective interpretation that cannot be discovered by a machine without additional help or structural information such as an ontology. A machine also needs help to discover the get, post, or the browser protocol to access the GDA information. Additionally, a machine also needs to discover the fact that DisGeNET uses Entrez IDs to search GDA information.
A more efficient access protocol using REST API is also supported. In particular, DisGeNET supports APIs to access GDA by genes, UniProt Accession, disease (two different APIs), source, and evidence by genes and diseases. For each API, a description page is provided for human consumption to help fabricate programmatic communication protocols. The API key required to access information using these APIs is listed on a page as well. The important caveat is that while all pertinent information is available, a generic machine interpreter is unlikely to be able to decipher and exploit such helpful information autonomously.

Needle Process Description
To aid in automatic construction of the π-graphs, we adapt create process statements along the lines of Needle's create webtable statements as follows. The statement below helps understand the functionalities of the browser-based access of GDA in DisGeNET and captures the three necessary components, i.e., the process identifier, the resource address and its features, and the input and output. From this statement, we are able to reconstruct the URLs of GDA descriptions for a given gene. The $ sign in the postfix clause indicates substitutions available in the accepts clause. On the other hand, the process description for API access by genes is constructed as follows in a similar but distinct way. Depending on the access type we would like to adopt, an algorithmic construction of the retrieval protocol is now possible from either of these statements.
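The URL reconstruction step can be illustrated with a short Python sketch. The dictionary below is a hypothetical, simplified stand-in for a create process statement, not its actual syntax; the field names, the position of `$GeneID` in the postfix, and the placeholder gene ID are assumptions made for illustration.

```python
from string import Template

# Hypothetical, simplified stand-in for a create process statement.
process = {
    "name": "DisGeNETb",
    "at": "https://www.disgenet.org",
    "access": "browser",
    "postfix": "/browser/1/1/0/$GeneID",   # $GeneID refers to an accepts parameter
    "accepts": ["GeneID"],
}


def build_url(proc: dict, **inputs: str) -> str:
    """Substitute accepted inputs into the postfix and append it to the address."""
    missing = [p for p in proc["accepts"] if p not in inputs]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    return proc["at"] + Template(proc["postfix"]).substitute(inputs)


url = build_url(process, GeneID="12345")   # arbitrary placeholder ID
```

Using `string.Template` keeps the `$` convention of the postfix clause, so a description contributed by a curator can be turned into a request URL without any resource specific code.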

Resource Description
A resource description for DisGeNET is constructed from its process description to complete the knowledge about this internet resource in the following format. The primary function of the resource description is to help the hypothesis generation algorithm find, assemble and construct executable workflows as the implementation of an admissible π-graph by selecting the most credible components from its knowledgebase. Recall that most of the resource descriptions in BioNursery are contributed by a wide range of community members, some with limited domain expertise or credibility as curators.
% resource identifier
create resource DisGeNETb (
  % resource narrative for machine consumption
  narrative "This browser access accepts an Entrez gene ID and returns its disease association",
  % process contributors
  contributors {Alex, Abebi},
  % applicable data integration tools
  meta: matcher {Cupid, OntoMatch}, wrapper {FastWrap}, mapping {Determination Process: Semantic Type},
  % process testers and validators
  validators {Alex, Maya},
);
The resource description above links the process description DisGeNETb and additionally captures the essential information required for the construction of a credible workflow for a hypothesis. In particular, it helps operation by including the list of effective schema matchers and wrappers. It also lists term mapping exceptions under the mapping clause that must be used, and that thus override any mapping decision by any schema mapping algorithm, e.g., the pair Determination Process: Semantic Type under mapping is one such mapping. One of the meta entries also lists the users who validated the accuracy of this resource description.

Engineering Resource Descriptions
The creation of the resource descriptions and the process involved is inherently complicated and subjective, and requires a fundamentally manual effort. To help ensure fidelity, we have designed a crowd enabled knowledge engineering toolkit, called Explorer, using which a novice user is able to generate the resource descriptions of any biological database or analysis site, such as DisGeNET or GENAVi, respectively. In this extraction tool, we have entries such as URL (primary and postfix), type of access (one of Browser, API, Get, Post), input parameters, output table scheme, key fields, contributor, validator, schema matching tool used, wrapper used, custom matching, etc. Based on the selection of access type, the form entries change slightly to adjust to the specificity dictated by the selection.
Explorer brings a single step resource at a given URL for examination and probing. In Explorer, the left panel has an exploration task or request, and a fillable form that resembles a resource description template mimicking a combination of the create resource and create process statements. The right panel has a browser page at the URL listed in the left panel, i.e., the DisGeNET A1CF gene page. Users are able to choose the categories of items they are inspecting and select by highlighting. For example, when the user selects scheme and highlights the column names, Explorer copies the attribute names into the left panel form, and allows the user to choose the data types of these attributes.
In the event the exploration request includes identifying input and output behaviors corresponding to a specific set of parameters (or attributes), Explorer will enable selection of the available set of schema matchers and wrappers to map and extract the listed attributes. For example, if the request included input: gene symbol and output: disease phenotype, reference, and Cupid schema matching is used, the correspondence might include gene symbol: A1CF, disease phenotype: disease, reference: last ref. However, it will also allow the user to override the mapping to, say, disease phenotype: disease, reference: first ref, and the override takes priority. On saving the observation, Explorer also registers the user's ID as both contributor and validator for reliability assignment purposes at a later stage.
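The priority of user overrides over matcher output can be sketched as a dictionary merge. The function name and the dictionary representation are assumptions for illustration; only the correspondence values are taken from the example above.

```python
def resolve_mapping(matcher_output: dict, overrides: dict) -> dict:
    """User-supplied overrides take priority over the schema matcher's suggestions."""
    return {**matcher_output, **overrides}


# Correspondences a matcher such as Cupid might propose (values from the example).
suggested = {"gene symbol": "A1CF", "disease phenotype": "disease", "reference": "last ref"}
# The curator prefers the first reference instead of the last one.
override = {"disease phenotype": "disease", "reference": "first ref"}

final = resolve_mapping(suggested, override)
```

Because the overrides are merged last, entries the user did not touch keep the matcher's suggestion, while any overridden entry replaces it.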

Crowd Validation of 𝜋-graph Components
Recall that a process description is an elementary π-graph, and in general, a π-graph is a directed acyclic graph of processes. While any user is able to validate a process, or an entire π-graph, a user can specifically request a validation of a π-graph or its part. Such a request becomes an active validation task, and all users of BioNursery receive a validation request notification and are able to initiate a validation.
Any BioNursery user can initiate a validation request, or remove it from the request queue. A request queue is a database-wide list that all users are able to see and act on. Users initiate a validation request by appending it to the queue. Such a request is a pair consisting of the process name or identifier and a natural language description of the validation request. A validation request is a crowd help request to verify correctness of a π-graph in general, and a process in particular. Such a request is made in plain natural language with a pointer to the π-graph stating the nature of validation being sought. Once initiated, the crowd query becomes public and appears on users' dashboards as a community help notification.
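A minimal sketch of such a request queue follows. The class and method names are hypothetical, and the notification of users' dashboards is omitted; the sketch only shows the pair structure of a request and the append and remove operations.

```python
from typing import NamedTuple


class ValidationRequest(NamedTuple):
    process_id: str     # process name or identifier
    description: str    # natural language description of what to validate


class RequestQueue:
    """Database-wide list of open validation requests, visible to all users."""

    def __init__(self) -> None:
        self._requests: list = []

    def initiate(self, process_id: str, description: str) -> ValidationRequest:
        """Append a new request to the queue and return it."""
        req = ValidationRequest(process_id, description)
        self._requests.append(req)
        return req

    def remove(self, req: ValidationRequest) -> None:
        """Withdraw a request from the queue."""
        self._requests.remove(req)

    def pending(self) -> list:
        """Return a copy of the open requests in submission order."""
        return list(self._requests)
```

For example, `RequestQueue().initiate("DisGeNETb", "Please verify the output columns")` would register a request pointing at the DisGeNETb process.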

PROCESS COUPLING FOR ADMISSIBILITY
The construction process of the π-graphs from the process descriptions also needs to consider two additional factors. The first is structural compatibility to ensure admissibility, and the second is schema heterogeneity for data integration toward admissibility. As an illustration, consider the MIK database, which optionally returns filtered genes based on male infertility phenotypes using a drop down selection menu. Databases often support options to return tables using simple filters, and such process descriptions are expressed in BioNursery as follows.
create process MIKDB
  at http://mik.bicnirrh.res.in/mip.php
  access browser
  postfix /mip.php/
  accepts filter (Phenotype String)
  (Symbol GeneSymbol primary key, ChrLoc string, Disease string);
The process descriptions above contain enough information about the resources for an algorithmic construction of CGI or API calls to the sites for data retrieval. However, extracting disease information for the genes found in MIKDB by linking them to the genes in DisGeNET needs to address issues related to type compatibility. Additionally, the column name Symbol does not match the input column name of the DisGeNET process, i.e., Genes. Therefore establishing schema correspondence is essential [8]. Furthermore, the MIKDB database returns the genes using gene symbols, while DisGeNET expects the entries in Entrez gene ID format. Such a conversion is conceptually challenging. Researchers have developed numerous biological ID mapping tools [13] whose performance depends on various complex choices. The important issue here is that BioNursery must recognize such mapping needs and, in this instance, insert one additional process as shown in Fig 3.
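The insertion of an ID mapping step can be sketched as follows. This is an illustrative simplification, not the Needle algorithm: each process is reduced to a single input and output semantic type, and the `couple` function and the converter table are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    name: str
    in_type: str    # semantic type of the single input column
    out_type: str   # semantic type of the column passed downstream


def couple(chain: list, converters: dict) -> list:
    """Return a copy of chain with ID-mapping processes inserted where types differ."""
    result = [chain[0]]
    for nxt in chain[1:]:
        prev = result[-1]
        if prev.out_type != nxt.in_type:
            key = (prev.out_type, nxt.in_type)
            if key not in converters:
                raise ValueError(f"no known mapping from {key[0]} to {key[1]}")
            result.append(converters[key])
        result.append(nxt)
    return result


mikdb = Process("MIKDB", "Phenotype", "GeneSymbol")
disgenet = Process("DisGeNET", "EntrezID", "Disease")
mapbase = Process("MapBase", "GeneSymbol", "EntrezID")

pipeline = couple([mikdb, disgenet], {("GeneSymbol", "EntrezID"): mapbase})
```

With these assumed types, the resulting pipeline is MIKDB, MapBase, DisGeNET, which corresponds to the extra process inserted between the two databases.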

Executable Workflows
In BioNursery, we aim to support queries of the form: Find all genes implicated in obesity related male infertility using the genes in the table crrews. To execute this query, BioNursery generates the π-graph in Fig 4. Ultimately, the natural language query is rewritten as the BioFlow workflow query below, where we replace the MapBase and DisGeNET tables with two extract statements to retrieve the tables from the MapBase and DisGeNET databases, respectively.
select Disease, Type, N_genes, Score_gda, EL_gda, N_PMIDs, First_Ref
from (with mapbase as
        (extract GeneSymbol, GeneID
         using matcher S-match wrapper Web-Prospector
         from https://www.mapbase.org
         submit crrewbase)
      extract Type, N_genes, Score_gda, EL_gda, N_PMIDs, First_Ref
      using matcher S-match wrapper Web-Prospector
      from https://www.disgenet.org/browser/1/1/0/
      submit mapbase)
where Disease = 'obesity' and Score_gda > 0.01
In the above query, the mapbase table is computed as follows:
extract GeneID
using matcher S-match wrapper Web-Prospector
from https://www.mapbase.smartdblab.org
submit crrewbase
In the above statement, S-Match [10] is a schema matcher, and Web-Prospector [20] is a wrapper. The extract statement above submits each gene in the crrewbase table to retrieve their corresponding EntrezIDs, which are needed to access the DisGeNET database. This is because inspection of the DisGeNET process description shows that any access to content must occur via an Entrez gene ID handle. It is thus necessary for Needle to obtain a mapping of a gene symbol to its EntrezID from MapBase. Note that the extract statement only returns GeneID because GeneSymbols are no longer necessary for the cross table join.
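The data flow of the query above can be mimicked in plain Python to make the order of operations explicit. This sketch is not BioFlow: the schema matcher and wrapper are replaced by dictionary lookups, and every table value below is made up for illustration.

```python
# Toy stand-ins for the three tables; every value below is made up for illustration.
crrewbase = [{"GeneSymbol": "G1"}, {"GeneSymbol": "G2"}]
mapbase_lookup = {"G1": "101", "G2": "102"}          # GeneSymbol -> GeneID
disgenet_rows = {
    "101": [{"Disease": "obesity", "Score_gda": 0.30},
            {"Disease": "asthma", "Score_gda": 0.50}],
    "102": [{"Disease": "obesity", "Score_gda": 0.005}],
}


def run_query(genes, id_map, gda, disease, min_score):
    """Map symbols to IDs, fetch associations per ID, then apply the where clause."""
    gene_ids = [id_map[g["GeneSymbol"]] for g in genes if g["GeneSymbol"] in id_map]
    rows = [row for gid in gene_ids for row in gda.get(gid, [])]
    return [r for r in rows if r["Disease"] == disease and r["Score_gda"] > min_score]


result = run_query(crrewbase, mapbase_lookup, disgenet_rows, "obesity", 0.01)
```

The three stages in `run_query` correspond to the inner extract over MapBase, the outer extract over DisGeNET, and the final where clause on Disease and Score_gda.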
The EntrezID is then submitted to the DisGeNET database using the extract statement below.
Could you suggest a procedure to identify genes associated with male infertility using semen expression and pathway data?
We rely on our community curated knowledge ecosystem called BioNursery and show that there is a good chance that we can soon develop an automated hypothesis testing system and an open knowledge ecosystem for molecular modeling of pain. We do not insist on a flawless generation of a computational procedure. Instead, we propose a mechanism that expects possibly erroneous suggestions or mappings that can eventually be corrected. The model proposed relies on expert and user participation in the knowledge ecosystem in their role as curators. The crowd computing model accommodates a reliability assignment mechanism [28] that tracks user curation efforts to identify reliable curators using a trust model that changes over time and adjusts to new realities.
The current implementation of natural language to π-graph mapping also needs improvement because of its reliance on Needle create webtable syntax. To accelerate automation, the mapping algorithm needs to better recognize the modified create process statements. Finally, we plan to explore a new and dedicated language model similar to GPT-3.5/GPT-4 for the sole purpose of understanding molecular mechanisms, which will improve the generation of strategies for computational models.
In Fig 1 and related figures, a dashed arrow represents manual selection of filtering conditions, and a dashed rectangle process represents manual operations.
The concept TestGDA in Fig 1 is a shorthand for the π-graph in the middle panel of Fig 1, i.e., TestGDA is a summarized graph. A closer inspection of the process description in Sec 5 should reveal two discrepancies. First, the MIKDB database returns a table with three columns (Symbol, ChrLoc, Disease), while the DisGeNET process only needs a set of genes, labeled genes. Therefore, a projection operation is required to filter out the ChrLoc and Disease columns from the returned table.

Figure 3: Insertion of ID mapping step for semantic coupling.
The π-graph in Fig 4 is generated in a similar fashion to MIKDB, and establishes type compatibility for genes in the two databases by linking the MapBase process.

Figure 4: π-graph corresponding to the query.

extract Type, N_genes, Score_gda, EL_gda, N_PMIDs, First_Ref
using matcher S-match wrapper Web-Prospector
from https://www.disgenet.org/browser/1/1/0/
submit mapbase
This statement serves as the select clause of the with statement which uses the EntrezIDs in the MapBase table to access DisGeNET. The BioFlow query then displays the results as shown in Fig 5.

CONCLUSION AND FUTURE RESEARCH
Though we have discussed BioNursery features using simple, intuitive and small examples throughout the article, our goal in this paper was to show that BioNursery holds promise for constructing and testing complex and arbitrary hypotheses of the form shown as the π-graph in Fig 6, generated against the query below using available knowledgebases.

Figure 5: Results of the query in Sec 6.2.

Figure 6: Computational model of an infertility-related hypothesis discussed in Sec 7 by BioNursery system.