Thought Graph: Generating Thought Process for Biological Reasoning

We present the Thought Graph as a novel framework to support complex reasoning and use gene set analysis as an example to uncover semantic relationships between biological processes. Our framework stands out for its ability to provide a deeper understanding of gene sets, significantly surpassing GSEA by 40.28% and LLM baselines by 5.38% based on cosine similarity to human annotations. Our analysis further provides insights into future directions for biological process naming, and implications for bioinformatics and precision medicine.


INTRODUCTION
The systematic study of human disease necessitates an in-depth understanding of the links between diseases, drugs, phenotypes, genes, and biological processes [4]. Analyzing gene sets that share common biological functions, locations, or regulatory mechanisms can reveal patterns in gene behavior across health and disease states, contributing to the advancement of precision medicine for cancer treatment [7]. Yet, the task of identifying biological processes from gene sets is fraught with challenges. Individual genes often display weak signals, and when strong signals are present, they rarely converge on a singular biological theme [7]. This complexity is compounded when different research groups studying the same biological systems arrive at vastly divergent conclusions.
In response to these challenges, our paper introduces the Thought Graph framework, which addresses two critical aspects. First, it adopts a Tree-of-Thought (ToT) [11] architecture to facilitate thought expansion with a Large Language Model (LLM), ensuring inclusive yet precise coverage of biological processes across varying specificity levels. Thought expansion is strategically directed with the assistance of a voter LLM, which guides the decision-making for future steps. This design aims to mitigate the potential discrepancies in human annotations encountered by researchers while ensuring the quality of the generated processes. Second, our framework prioritizes the integration of domain-specific external knowledge bases to understand the semantics of connections within the Thought Graph. Consequently, it creates semantic relationships like "is-a"
and "part-of" among various thought steps. This strategy not only facilitates complex decision-making processes but also ensures a more nuanced and interconnected understanding of biological systems, facilitating data interoperability and knowledge integration. Our novel contributions can be summarized as follows: (1) We propose Thought Graph as a complex reasoning framework that generates diverse yet precise entities to tackle potential annotation discrepancies in biological processes. (2) Thought Graph can generate thought graphs with edge semantics by recalling external knowledge (e.g., the Gene Ontology) to build rich semantics among thought steps. (3) We have successfully applied Thought Graph to biological process generation with significant improvement over SOTA methods, surpassing GSEA by 40.28% and LLM baselines by 5.38% in cosine similarity score, and identified the optimal number of complex reasoning steps by balancing specificity and accuracy.

RELATED WORK

LLM Reasoning
Prompt strategies attempt to decompose a complicated problem into a sequence of smaller sub-problems so that the problem becomes more manageable [12]. One popular line of study is the Chain-of-Thought (CoT) [9] series, which structures prompts to encourage the LLM to step through its reasoning process, such as Least-to-Most prompting [12] and Self-Consistency with CoT (CoT-SC) [8]. However, these prompting strategies only utilize linear reasoning paths and struggle in tasks that require exploration and strategic lookahead. Alternatively, Tree of Thoughts (ToT) [11] and Graph of Thoughts (GoT) [3] excel in these sorts of tasks. The effectiveness of LLM-based prompting frameworks is hindered by inherent limitations such as self-bias and hallucination. To address this, our work introduces the semantics of edges within our Thought Graph through in-context learning, offering structural information.

Knowledge Graph for LLM Reasoning
LLMs exhibit limitations in integrating new knowledge and occasionally generate hallucinations. A survey [1] on knowledge-graph-based knowledge augmentation in LLMs reveals that using knowledge graphs (KGs) as a source of external information has promising results in reducing hallucinations. For example, MindMap [10] has developed a prompt pipeline enabling LLMs to comprehend and integrate KG input with their implicit knowledge. In our approach, we give the LLM examples from the Gene Ontology knowledge graph to enable edge semantics.

LLM Reasoning in Biomedical Domain
With the rise of LLMs, recent studies explore their application in various biomedical tasks. The gene set biological process naming task was formulated by [5] as inputting a gene set to an LLM and outputting a biological process name that is predominant in the system and correctly describes the function of the gene set. It is challenging because it requires the LLM to accurately understand and interpret complex biological concepts, including the nuanced roles of genes in various cellular contexts and their interactions within intricate biological networks. Although their results [5] have shown that GPT-4 provides better biological process names than the conventional Gene Set Enrichment Analysis (GSEA) [7], the performance is still far from perfect.

METHODOLOGY

Problem Formulation
Given a gene set S = {g_1, g_2, ..., g_n}, where each g_i is a gene, the objective G = F(S) is to design a framework F that generates a tree-structured graph G = (V, E) representing the terms (e.g., biological processes or pathways) associated with the genes in S. In this graph, V is the set of nodes, and E is the set of edges between these nodes.

Infrastructure of Thought Graph
Our framework adapts ToT [11] as a graph generator to produce a curated tree graph G, named the Thought Graph. The Thought Graph contains terms as the nodes V and their dependencies as the edges E. ToT uses self-reflection to prune and explore only relevant paths. The result, after exploration, is a Thought Graph that illustrates the reasoning path, together with a final answer selected from the last layer of the graph as the term that best describes the gene set S = {g_1, g_2, ..., g_n}.

Thought expansion.
The Thought Graph process with d steps proceeds in a breadth-first fashion to generate a tree of depth d. At each step, the process expands the tree by generating a set of candidate nodes. The first step generates a set of general "high-level" terms that describe the gene set, and subsequent steps iterate on the candidate terms by proposing more specific but related terms.
Step 1 (Initial Expansion). The first step is unique from all subsequent steps because its task is to generate the initial set of k candidate terms T_1 = {t_1^1, ..., t_k^1}, where t_i^l denotes term i from layer l. This set of candidate terms is generated with an "initial prompt" p that takes the gene set as input: T_1 ∼ LLM(p(S)). This expansion is then conducted recursively for d − 1 further steps (excluding the initial expansion). For the final layer, the candidate terms t_1^d, ..., t_k^d are presented to the LLM to choose the final answer.
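The breadth-first expansion described above can be sketched as follows. Here `propose` and `vote` are hypothetical stand-ins for the proposal and voter LLM calls (the real framework prompts GPT-4 at each step), and the dictionary-based graph is our illustration rather than the paper's implementation.

```python
def expand_thought_graph(gene_set, propose, vote, depth=5, keep=2):
    """Expand candidate terms layer by layer, keeping the top-voted ones.

    propose(gene_set, parents) -> list of candidate terms (LLM call)
    vote(gene_set, candidates, k) -> the k best candidates (voter LLM call)
    """
    graph = {"nodes": [], "edges": []}
    # Step 1: initial expansion from the gene set alone.
    frontier = propose(gene_set, parents=None)
    graph["nodes"].extend(frontier)
    for _ in range(depth - 1):
        # The voter LLM selects which candidates to expand further.
        frontier = vote(gene_set, frontier, k=keep)
        next_frontier = []
        for term in frontier:
            # Each kept term is refined into more specific child terms.
            children = propose(gene_set, parents=[term])
            for child in children:
                graph["nodes"].append(child)
                graph["edges"].append((term, child))
            next_frontier.extend(children)
        frontier = next_frontier
    # The final answer is chosen from the last layer.
    final = vote(gene_set, frontier, k=1)[0]
    return graph, final
```

With depth=5 and keep=2, this matches the configuration described in Section 4: five layers, two top-voted candidates expanded per step.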

Thought Graph
The Thought Graph output provides a representation of the stepwise reasoning process and integrates edge and node semantics for domain-specific context. Each node v ∈ V is a unique biological process, arranged hierarchically to reflect varying levels of specificity. The edges E represent the relationships between these processes. Specifically, we use four pre-defined relations from the Gene Ontology (GO): is a, part of, has part, and regulates. These relations establish a hierarchy where, for instance, if A is a subtype of B, A is deemed more specific than B. This approach helps to elucidate the nuanced relationships between different biological processes, as detailed in the GO database.
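A minimal way to represent these typed edges is sketched below. The relation set follows the four GO relations listed above; the class and method names are our illustration, not the paper's implementation.

```python
# The four Gene Ontology relations used as edge labels.
GO_RELATIONS = {"is_a", "part_of", "has_part", "regulates"}

class ThoughtGraph:
    def __init__(self):
        self.nodes = set()   # biological process terms
        self.edges = []      # (child, relation, parent) triples

    def add_edge(self, child, relation, parent):
        """Record a typed edge between two biological process terms."""
        if relation not in GO_RELATIONS:
            raise ValueError(f"unknown GO relation: {relation}")
        self.nodes.update({child, parent})
        self.edges.append((child, relation, parent))

    def more_specific_than(self, a, b):
        """An 'is_a' edge means a is a subtype of (more specific than) b."""
        return (a, "is_a", b) in self.edges
```

For example, adding the edge `("peptide transport", "is_a", "transport")` encodes that "peptide transport" is the more specific of the two terms.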

EXPERIMENT & EVALUATION

Data Collection
The GO database [2] forms the basis of our study. We specifically use a dataset compiled by Hu et al. [5] from the Biological Process branch of the Gene Ontology, consisting of 12,214 human gene sets, each annotated with a biological process name and description. Due to constraints in financial and computational resources, we randomly select 100 samples from this dataset for evaluation.

Baselines and Model Description
Our evaluation framework includes one domain-specific tool and five LLM baselines. GSEA (gene set enrichment analysis) [7] is a statistical method for associating the expression of groups of genes with biological processes. Our LLM baselines involve different approaches. Input-Output (IO) Prompting with zero-shot and zero-shot-9 prompts generates one and nine unique terms for a single gene set, respectively, with no examples, while few-shot includes five question-answer examples. Chain-of-Thought (CoT) employs the two top pathways from Thought Graph for detailed step-by-step prompting. The approach by Hu et al. [5] integrates expert-curated prompts with specific guidelines that solicit post-hoc critical analysis. For all LLM instances, we use GPT-4 (gpt-4-1106-preview) in Chat Completion mode with temperature 0.7. In Thought Graph, we set the number of steps to five and vote on two samples at each step to proceed.

Evaluation Methods
We use two evaluation metrics: cosine similarity and similarity percentile. Cosine similarity measures the semantic similarity of the predicted term to the ground-truth term, from 0 (no similarity) to 1 (identical). We calculate similarity using embeddings from SapBERT [6], a masked language model trained to model medical entity relations. After calculating the similarity between the predicted and ground-truth terms, we also calculate the similarity between the predicted term and all 12,214 terms in our dataset to form a null distribution. The percentile score is the percentile of the similarity between the predicted term and the ground-truth term within this null distribution. We also report the proportion of similarity percentiles greater than 99% as a proxy for accuracy.
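The two metrics can be computed as below. The embedding vectors here are generic NumPy arrays standing in for SapBERT term embeddings, which we assume are computed separately; the function names are ours.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def percentile_score(pred_emb, truth_emb, all_term_embs):
    """Percentile of the predicted-vs-ground-truth similarity within the
    null distribution of predicted-vs-all-dataset-terms similarities."""
    target = cosine(pred_emb, truth_emb)
    null = np.array([cosine(pred_emb, t) for t in all_term_embs])
    # Fraction of the null distribution strictly below the target score.
    return float((null < target).mean())
```

In the paper's setup, `all_term_embs` would hold embeddings of all 12,214 dataset terms, and a prediction counts toward the accuracy proxy when its percentile exceeds 0.99.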
Among the nine nodes that receive positive votes (indicated as green nodes in Fig. 1), the one with the highest similarity score is selected as the best score (b), while the score of the node predicted by Thought Graph is recorded as the predicted score (p). To establish a fair baseline comparison, we implemented IO zero-shot-9 to generate nine answers and selected the best of these for evaluation.

Performance Evaluation
Overall Performance: Table 1 indicates that Thought Graph (b) achieves the top performance in both cosine similarity (65.06%) and similarity percentile (95.05%). In particular, we posit that IO zero-shot learning emphasizes coverage across a wide range of biological process names (diversity), while CoT focuses on an in-depth exploration of these names (specificity), whereas our framework is designed to balance both. Thought Graph (b) outperforms IO zero-shot-9 (b) and CoT, indicating that depth without breadth, or vice versa, is insufficient. Thought Graph and the other LLM baselines outperform GSEA, and we also noticed that GSEA cannot provide any terms 26% of the time, highlighting the advantage of the LLMs. In addition, Thought Graph (p) scores lower than the few-shot and Hu et al. baselines. This may be the result of our decision to constrain the final answer to the last layer. However, the fact that Thought Graph (b) outperforms all baselines, including zero-shot-9, assures us that our approach to generating candidate sets of terms is promising and adept at generating a correct answer, though further optimization is needed.
Thought Graph Analysis: Layer-by-layer analysis in Fig. 2 demonstrates increasing performance from layers 1 to 3, followed by a decrease in layers 4 and 5. This trend suggests a trade-off between specificity and accuracy, with layer 3 as the optimal level by a small margin. While the performance at layer 1 is lower, this is largely because our initial prompt specifically requests "high-level" terms and generates only three of them. As expected, the variance in mean similarity scores increases with the number of layers, as deeper layers explore deeper and more distant parts of the ontology, but it stabilizes after layer 3. In the later layers, more specific terms are often voted out in favor of more accurate, general terms, demonstrating the ability of the voting mechanism to dynamically moderate specificity. Though our results reflect a modest sample size, layer 3 emerges as an early candidate for the optimal depth.

CONCLUSION
Thought Graph represents an advancement in the field of gene ontology and bioinformatics. Integrating gene set analysis with semantic graphs allows for a more nuanced and comprehensive understanding of biological processes. The effectiveness of the Thought Graph in mapping complex gene interactions and functions has been demonstrated, showing its potential to outperform existing methods. This novel method not only enhances the accuracy of gene set analysis but also opens avenues for research in understanding genetic influences on various biological processes. Future work can expand on this foundation, exploring broader applications and measuring uncertainty in complex reasoning.

Figure 1:
Figure 1: The flowchart presents the application of the Thought Graph to the Gene Ontology (GO) database. First, Thought Graph uses a gene set and an initial prompt to generate three Biological Processes (BPs). Then, a voter evaluates and selects the best BP (dark green) and second-best BP (light green), which are more accurately descriptive of the gene set. Each chosen BP, along with a subsequent prompt, is used to generate two additional, more specific BPs. This procedure is conducted recursively until the Thought Graph has reached five layers. Finally, a voter chooses the final answer from the last layer.

Figure 2:
Figure 2: The distribution of the mean similarity score at each layer using Thought Graph (p). The blue line denotes the median of layer 3.

(Figure 1 excerpt: the voter prompt reads "Given a set of genes and proposed BPs describing the system, your task is to vote on the two best BPs describing the system. Here is the set of genes: [GENE SETS]. Here are the BPs for you to vote on: [CANDIDATES].")

Table 1:
Table 1: Mean cosine similarity, mean cosine similarity percentile, and proportion of percentiles above 99% for a domain-specific tool and seven LLM methods on 100 GO data samples.