Retrieval-Augmented Large Language Models for Adolescent Idiopathic Scoliosis Patients in Shared Decision-Making

As health-related decision-making evolves, patients increasingly seek help from additional online resources such as "Dr. Google" and ChatGPT. Despite their potential, these tools have limitations, including the risk of inaccurate information, a lack of specialized medical knowledge, the risk of generating unrealistic outputs (hallucinations), and significant computational demands. In this study, we develop and validate an innovative shared decision-making (SDM) tool, Chat-Orthopedist, based on retrieval-augmented large language models, to help adolescent idiopathic scoliosis (AIS) patients and families prepare for a meaningful discussion with clinicians. Firstly, we establish an external knowledge base with information on AIS disease and treatment options. Secondly, we develop a retrieval-augmented ChatGPT to feed LLMs with AIS domain knowledge, providing accurate and comprehensible responses to patient inquiries. In addition, we perform a cyclical process of human-in-the-loop evaluations for system validation and improvement. Chat-Orthopedist may optimize SDM workflow by enabling better interactive learning experiences, more effective clinical visits, and better-informed treatment decision-making.


INTRODUCTION
Adolescent idiopathic scoliosis (AIS) is a structural, lateral, rotated curvature of the spine that impacts 1-4% of children within the at-risk age group of 10-16 years [34]. If left untreated, scoliosis may lead to altered spinal mechanics and degenerative changes, resulting in pain, loss of spinal mobility, possible function loss or disability, and decreased quality-of-life [33]. Common treatment options for scoliosis include observation, bracing, and spinal fusion surgery [17]. The decision to perform interventions on AIS patients depends on multiple factors, including patient maturity, curve characteristics, curve magnitude, location of the curve, and the possibility of progression [22]. For pediatric patients during growth periods, the primary goals of interventions are to limit or halt the progression of the deformity, restore trunk balance, and prevent long-term consequences [15]. Non-surgical treatments, such as bracing and physiotherapy, aim to reduce the need for operations by preventing curve progression, but their effectiveness has not been rigorously assessed despite their widespread use [14]. On the other hand, all types of spinal surgery carry significant risks, both short-term (a mean infection rate of approximately 3.6%) and long-term (re-operation risk of 8.3% with a mean follow-up of 14.9 years) [19]. Consequently, there is an important clinical need for adolescent patients diagnosed with moderate to severe idiopathic scoliosis, who often face a critical decision between observation and intervention, to have access to validated tools that can assist in their treatment decision-making.
Figure 2: Clinical workflow and usability of the proposed SDM tool with LLM-enabled dialogue system. The proposed SDM tool serves as a chatbot to answer questions from AIS patients and help them to prepare knowledge about diseases and treatment for the coming clinical visit. Besides, clinicians could review patient questions and responses to understand questions from patients, prepare for clinical visits, and improve the quality of responses. Created with BioRender.com.
For AIS patients and families, the decision to have surgery or not can be overwhelming [26,35]. Shared decision-making (SDM) is a collaborative approach that enables patients, families, and physicians to jointly participate in the medical decision-making process and reach a consensus on treatment plans [1]. Specifically, clinicians possess knowledge about diseases, tests, and treatments, while patients and families are familiar with their bodies, daily life circumstances, and healthcare expectations. Healthcare providers explain treatments and alternatives to patients, assisting them in choosing the option that best aligns with their values and preferences (e.g., treatment benefits, surgical complications, pain, rehabilitation, and cost) and achieving the ideal of evidence-based and patient-centered medicine [5]. Thus, SDM mechanisms play a crucial role in empowering patients, families, and clinicians to collaboratively identify treatment solutions that are intellectually, practically, and emotionally appropriate [11]. However, current SDM tools (e.g., pre-recorded videos, reading materials) often lack effective knowledge sharing with patients, resulting in less efficient clinical visits and conversations. Existing SDM approaches [3,11,16] usually rely on educational materials and passive knowledge sharing from clinicians to patients during clinical visits, which may limit patients' motivation to actively seek information and prepare for effective consultations.
The rapid advancement of conversational and chat-based language models has led to remarkable progress in artificial general intelligence. Large language models (LLMs) have demonstrated remarkable capabilities because of pre-training on a vast corpus with reinforcement learning from human feedback [4,24]. By utilizing LLMs, we can convert conventional passive knowledge sharing in SDM (e.g., pre-recorded videos or reading materials) to an active knowledge inquiry, such as patient-clinician question-and-answer (Q&A) (Figure 1). However, adapting LLMs to biomedicine has been rarely explored due to the lack of medical domain knowledge [6,30,42]. Moreover, generative models are susceptible to producing hallucinated information and often struggle with logical reasoning in the context of complex inferences. In addition, other concerns, such as computational costs and model transparency, further impede the adoption of LLMs in real-world clinical settings [7,12,28,36,39].
To address these challenges, we propose an innovative SDM tool for AIS patients and families, leveraging a retrieval-augmented ChatGPT to equip LLMs with AIS disease and treatment knowledge (Figure 2). With a retriever and an external knowledge base, the proposed SDM tool could augment ChatGPT or any other LLM by leveraging external resources (e.g., search engines, medical papers, treatment guidelines) to answer queries related to clinical concepts and treatment recommendations. This framework mitigates the need for time-intensive and costly fine-tuning and facilitates timely updates without re-training the entire model. Notably, we perform a human-in-the-loop assessment for validation and improvement with a diverse group of targeted users and domain experts. Furthermore, the SDM tool may provide a positive clinical impact by minimizing human biases in treatment recommendations, since the LLM's responses are grounded in a large body of objective domain knowledge.
The main contribution of our work is four-fold:
• We develop an innovative SDM tool for AIS patients and families to facilitate comprehensive pre-operative consultations, thereby improving patient treatment outcomes and the efficiency of clinical visits.
• We validate the proposed tool with targeted user groups and domain experts to quantitatively and qualitatively demonstrate the feasibility of adopting LLMs in clinical settings.
• We employ a retrieval-based framework to augment LLMs with the most recent domain-specific knowledge, thereby significantly improving the computation efficiency.
• We effectively mitigate the risk of hallucination with an external knowledge base. In addition, the knowledge base enhances model transparency by identifying source information and enabling human-in-the-loop validation.

RELATED WORKS
Recent advancements in general-domain LLMs [4,24] have demonstrated exceptional capabilities in following instructions and generating responses that closely mimic human conversation. However, few LLMs have been adapted or fine-tuned for the biomedical domain [6,23]. As a result, when generating responses related to domain-specific topics, standard LLMs often fall short of providing sound medical advice. Due to challenges such as insufficient domain knowledge and computational costs, only a few LLMs [10,40] have been adapted for biomedicine by fine-tuning open-source LLMs (typically LLMs with 6.5B-13B parameters) for medical consultation. For example, ChatDoctor [40] has fine-tuned LLaMA [31] (with 7B parameters) to answer clinical questions based on 100k real-world patient-physician conversations from an online medical consultation site. Similarly, MedAlpaca [10] has also fine-tuned LLaMA [31] with publicly available medical datasets, such as Anki Medical Curriculum flashcards, for biomedical Q&A tasks. However, several significant challenges exist when attempting to implement LLMs in practical clinical settings (Table 1). First, models specific to the medical domain often utilize comparatively smaller-scale LLMs (e.g., LLaMA [31] compared to ChatGPT with 175B parameters), which may present less accurate and robust representations [23]. Second, the fine-tuning of even these smaller LLMs, typically comprising 7 to 13 billion parameters, is both computationally demanding and cost-intensive [40]. Furthermore, the introduction of new knowledge necessitates the complete retraining of the model, imposing additional burdens on developers. Third, in general, LLMs are susceptible to hallucination and struggle to represent the comprehensive long tail of knowledge from the training corpus [2,8,18,20].
To solve these challenges, we propose to leverage retrieval-augmented language models [13,27,38] to access medical knowledge from an external database for enabling domain expertise, reducing computational costs, minimizing hallucination, and enhancing coverage. Our goal is to improve and accelerate LLMs for clinical use cases by incorporating patient interaction and clinical prompting into dialogue systems. Compared with existing applications of LLMs in healthcare, the proposed Chat-Orthopedist aims to answer patient questions during SDM, using an external training corpus in conjunction with an off-the-shelf retrieval model. This approach allows for asynchronous updates with new knowledge, eliminating the need to retrain the entire model. To the best of our knowledge, our proposed work represents one of the first attempts to leverage the advantages of retrieval-augmented LLMs (>175B model parameters) for SDM in clinical research and practice.

METHODOLOGY
Given the rapid advancement in AI, it is feasible to facilitate online clinical conversations to provide necessary knowledge support for patients in SDM. To equip LLMs with domain-specific knowledge related to scoliosis, we introduce Chat-Orthopedist, a retrieval-augmented ChatGPT, for AIS patient Q&A during pre-operative SDM. The proposed Chat-Orthopedist comprises three key components: an external AIS knowledge base, a retriever, and an LLM. With the user query as input, the retriever seeks out the most relevant content from an external knowledge base, which contains additional information that is not typically stored within the LLM's parameters. Once a subset of the most relevant content has been identified, this information is integrated into the prompts, thereby augmenting the inherent capabilities of the original LLM. Paired with the user's query, this enriched context is then conveyed to the LLM for an improved response with optimized domain knowledge.

Knowledge Base Establishment
In Chat-Orthopedist, we collect external knowledge from multiple evidence-based and physician-authored clinical knowledge data resources (Figure 3). As paragraph sizes vary across sources (or even within the same source), we apply chunking to segment the supportive materials into manageable units. Each resulting chunk consists of 2000 tokens, thereby standardizing the information input irrespective of its original source. The use of an external knowledge base offers the flexibility to readily update existing materials in alignment with the latest advancements and clinical recommendations. Specifically, such updates are achieved without retraining the entire model, thereby optimizing the computation efficiency while maintaining relevance in an ever-evolving field.
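The chunking step described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes whitespace splitting as a stand-in for whatever tokenizer was actually used, and the function names `chunk_tokens` and `chunk_document` are hypothetical.

```python
def chunk_tokens(tokens, chunk_size=2000):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def chunk_document(text, chunk_size=2000):
    """Chunk one source document into text units of at most chunk_size tokens."""
    # Assumption: whitespace tokens approximate the real tokenizer's tokens.
    tokens = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(tokens, chunk_size)]


if __name__ == "__main__":
    # Hypothetical 4500-token document, used only to show the chunk sizes.
    document = "word " * 4500
    print([len(chunk.split()) for chunk in chunk_document(document)])
```

Note that the final chunk of a document may be shorter than 2000 tokens; the paper does not say whether such remainders are padded, merged, or kept as they are.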

Retrieval Process
To fully capture the semantic information from the external knowledge base, the proposed retriever in Chat-Orthopedist follows a dense retrieval approach. In the retriever, we use a dense encoder $E_P(\cdot)$ to map all the text passages into $d$-dimensional real-valued vectors. We then build an index for all the $M$ passages that we will use for retrieval. During the retrieval process, to ensure congruity in encoding between the user's query and the corresponding external knowledge, we adopt a uniform dense encoder to map the user queries into $d$-dimensional vectors, where $E_Q(\cdot)$ is the user query encoder. Furthermore, it facilitates the retrieval of the $k$ passages whose vector representations are closest to the vector corresponding to the question, thereby aligning the user's inquiry with the most relevant knowledge extracts. We use the dot product between the high-dimensional vectors to define the similarity between a passage $p$ and a user query $q$:
$$\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)$$
Regarding the encoders, we adopt an off-the-shelf sentence transformer model, MP-Net [29], for both query and passage encoding purposes. Specifically, we take the representation at the [CLS] token as the output, with the embedding dimensionality $d$ set to 768. This retrieval strategy promotes coherence between queries and relevant passages.

Table 1: Comparison of different online dialogue systems or search engines for healthcare SDM, including conventional search engines (e.g., Google search), fine-tuned LLMs (e.g., LLaMA with 6.5B parameters), instruction-tuned LLMs (ChatGPT with 175B parameters), and our proposed retrieval-augmented LLM, Chat-Orthopedist.
Tool                # Parameters    Instruction Type
Google Search       -               -
LLaMA [31]          7-65B           -
ChatGPT [25]        >175B           -
ChatDoctor [40]     13B             Tuning
MedAlpaca [10]      7-13B           Tuning
Chat-Orthopedist    >175B           Prompting
Comparison criteria also include: Human-Like Dialogue, Knowledge Update, No Hallucination, Source Transparency, Multi-Source Reasoning.
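The dense retrieval step (encode passages, index them, encode the query, rank by dot product, keep the top k) can be illustrated with a small sketch. It is not the authors' implementation: `toy_encode` is a hypothetical hashed bag-of-words stand-in for the MP-Net sentence encoder, the dimensionality is reduced from 768 to 64, and the passages are invented placeholders.

```python
import math
import zlib

DIM = 64  # toy dimensionality; the paper sets the embedding size to 768


def toy_encode(text, dim=DIM):
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 gives a hash that is stable across Python processes.
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_index(passages):
    """Pre-compute one vector per passage (the retrieval index)."""
    return [(passage, toy_encode(passage)) for passage in passages]


def retrieve(query, index, k=2):
    """Return the k passages whose vectors score highest against the query."""
    q_vec = toy_encode(query)
    ranked = sorted(index, key=lambda item: dot(q_vec, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]


if __name__ == "__main__":
    # Placeholder passages, not taken from the paper's knowledge base.
    passages = [
        "bracing may limit curve progression during growth",
        "spinal fusion surgery carries a risk of infection",
        "observation is common for small curves",
    ]
    index = build_index(passages)
    print(retrieve("does bracing limit curve progression", index, k=2))
```

With a real sentence encoder the same ranking logic applies; only `toy_encode` would be swapped out, and an approximate nearest-neighbor index would typically replace the linear scan for large knowledge bases.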

Augmented LLM
To augment the LLMs with domain-specific knowledge, it is important to include the retrieved materials in the context for model reference, as shown in Figure 4. However, the data originating from various sources in the knowledge base play different roles in SDM.
The appropriate source of information can vary depending on the specific situation and the nature of the knowledge sought. For example, if a patient's question primarily pertains to the diagnosis, priority is given to retrieving information from guidelines related to physician diagnosis. If the patient is asking about complications related to treatment, we recommend seeking information from clinical trial-related papers. If the model cannot answer the question with information from both sources, it then turns to Google Search for additional assistance. Consequently, to model the logical reasoning underpinning the knowledge from these disparate resources, we regard the retrievers from these various knowledge sources as distinct tools. We then apply the architecture of ReAct [38], which synergizes reasoning and acting, to leverage the LLM's reasoning ability to induce and modify the actions.

Reasoning Over Steps. To enable the model to understand which data resource to retrieve knowledge from, we enable Chat-Orthopedist to retrieve from a single source of data at each step and learn to reason over these steps to combine the knowledge from different sources. Consider Chat-Orthopedist as an agent that interacts with the environment through different retrieval tools and obtains information from different sources. At time step $t$, the agent needs to decide which action $a_t \in \mathcal{A}$ to take, where $\mathcal{A}$ is the action space, containing retrieval operations from the pre-defined external knowledge bases. To learn how to make the decision, the agent follows a policy $\pi(a_t \mid c_t)$, where $c_t = (o_1, a_1, \dots, o_{t-1}, a_{t-1}, o_t)$ is the context to the agent and $o_t$ is the observation obtained from the environment at the current step. For example, in Figure 1, when $t = 2$, Chat-Orthopedist learns from the past records $o_1, a_1, o_2$ that the previous action "Google Search" did not return enough information to answer the question and determines that the next action, $a_t$, is to seek help from the external knowledge base.
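The step-wise interaction between the agent and its retrieval tools can be sketched as a small loop. This is only an illustration under stated assumptions: in the paper the LLM itself chooses each action, whereas here a fixed rule (`policy`) stands in for that choice so the example runs without a model; the two tool functions are hypothetical stubs, and the tool order (knowledge base first, web search as fallback) is one possible ordering, not necessarily the one shown in the paper's figure.

```python
def search_knowledge_base(query):
    """Hypothetical stub: pretend the knowledge base only covers bracing."""
    if "brac" in query.lower():
        return "Guideline excerpt on bracing (placeholder text)."
    return None


def google_search(query):
    """Hypothetical stub standing in for a web-search tool."""
    return f"Web result for: {query} (placeholder text)."


# Action space: one retrieval operation per source.
ACTIONS = {
    "knowledge_base": search_knowledge_base,
    "google_search": google_search,
}


def policy(context):
    """Stand-in for the policy over actions given the context.

    Assumption: a fixed rule (try each untried tool in order) replaces the
    LLM's reasoning purely so the sketch is runnable.
    """
    tried = [action for action, _ in context["steps"]]
    for name in ACTIONS:
        if name not in tried:
            return name
    return None


def run_agent(query, max_steps=3):
    """Take one retrieval action per step until an observation is usable."""
    context = {"query": query, "steps": []}  # history of (action, observation)
    for _ in range(max_steps):
        action = policy(context)
        if action is None:
            break
        observation = ACTIONS[action](query)
        context["steps"].append((action, observation))
        if observation is not None:
            return observation, context["steps"]
    return None, context["steps"]


if __name__ == "__main__":
    answer, steps = run_agent("What is the long-term effect of AIS?")
    print([action for action, _ in steps])
    print(answer)
```

The key design point carried over from the text is that each step touches a single source and the full history of actions and observations is kept, so the next decision can depend on what earlier retrievals did or did not return.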

Prompt Engineering. We concatenate information from the following dimensions to augment LLMs with domain-expert knowledge and the reasoning ability to organize information from different resources:
• Data Source Descriptions: We offer a brief description of what the source contains and when the model needs to consult it. For example, for the Google Search source, we use the description: "Google Search is a portal to access public web pages. When you think you cannot answer the questions correctly only with information from the knowledge base, you may seek information in this source."
• Few-Shot Exemplar: To enable the model to retrieve information from these varied resources in the appropriate manner (format of call functions), we present three different exemplars to briefly guide the generation.
• Historical Reasoning Records: As reasoning and action are synergized step by step, we incorporate all historical reasoning records to enable the model to comprehend the historical states and the most recent environmental feedback.
The model can then identify the most suitable next action in line with the current context.
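The three prompt components listed above could be concatenated along the following lines. The template wording, section labels, and the function name `build_prompt` are assumptions made for illustration; only the Google Search description is quoted from the text, and the exemplar and history entries are placeholders.

```python
def build_prompt(source_descriptions, exemplars, history, question):
    """Concatenate source descriptions, few-shot exemplars, and past steps."""
    parts = ["You can use the following data sources:"]
    for name, description in source_descriptions.items():
        parts.append(f"- {name}: {description}")
    parts.append("Examples:")
    parts.extend(exemplars)
    parts.append("Previous steps:")
    for i, (thought, action, observation) in enumerate(history, start=1):
        parts.append(f"Thought {i}: {thought}")
        parts.append(f"Action {i}: {action}")
        parts.append(f"Observation {i}: {observation}")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


if __name__ == "__main__":
    sources = {
        "Google Search": (
            "Google Search is a portal to access public web pages. When you "
            "think you cannot answer the questions correctly only with "
            "information from the knowledge base, you may seek information "
            "in this source."
        ),
    }
    # Placeholder exemplar and history, not taken from the paper.
    exemplars = ["Question: <example question>\nAction 1: <example tool call>"]
    history = [("<reasoning so far>", "<tool call>", "<retrieved text>")]
    print(build_prompt(sources, exemplars, history, "<patient question>"))
```

Keeping the history as explicit thought, action, and observation triples mirrors the ReAct-style records the text describes, so each new call sees both earlier decisions and the latest environment feedback.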

Statistical Analysis
We performed multiple quantitative and qualitative examinations to comprehensively assess the feasibility, acceptability, and effectiveness of adopting LLMs in SDM tool development.
3.4.1 Usability test. Following the System Usability Scale (SUS) standard [9], we first conducted a usability study of our SDM tool to evaluate its effectiveness and usability for targeted users (i.e., parents). The SUS (user form) is a 4-item questionnaire designed to assess information accuracy, response clarity, answer relevance, and ease of understanding. The evaluation uses a scale ranging from 1 to 5, with 5 representing the most positive response.
3.4.2 Knowledge test. We performed two parallel 6-item, multiple-choice knowledge tests on two distinct user groups (those with and without access to our SDM tool) to determine the intervention's effects on knowledge of the relevant AIS disease condition and associated treatment options. A univariate analysis was subsequently conducted to compare the level of patient knowledge between the control group and the SDM group. The difference in scores from the knowledge test between these two groups was analyzed using Mann-Whitney U tests, with statistical significance defined as a p-value < 0.05.
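For readers unfamiliar with the Mann-Whitney U test, the sketch below computes the U statistics and a two-sided p-value from the normal approximation with tie correction (no continuity correction). It is an independent illustration, not the SPSS or Prism procedure used in the study, and the per-participant scores are invented for the example.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y, two-sided p) using the tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, u2, 1.0  # all observations identical
    z = (min(u1, u2) - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, u2, p_value


if __name__ == "__main__":
    # Hypothetical 0/1 scores on one knowledge question (1 = correct).
    sdm_group = [1, 1, 0, 1, 1, 0, 1, 1]
    control_group = [0, 0, 1, 0, 0, 0, 1, 0]
    u_sdm, u_control, p = mann_whitney_u(sdm_group, control_group)
    print(u_sdm, u_control, p)
    print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

For small samples an exact test is generally preferable to the normal approximation; statistical packages typically choose between the two automatically.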

3.4.3 Human-in-the-loop. We recruited a multidisciplinary team of orthopedic surgeons and researchers from multiple sites to conduct an iterative human-in-the-loop analysis and clinical validation. Frequently asked questions (FAQs) with generated answers, along with retrieved source information from the knowledge base, were reviewed by the expert team for a comprehensive evaluation and further improvement.

3.4.4 Tool comparison. We conducted an extensive comparative analysis of the responses generated by different tools with the expert team, including conventional search engines (Google), LLMs (ChatGPT), and retrieval-augmented LLMs (Chat-Orthopedist). Specifically, we integrated the Google search and ChatGPT API into our SDM tools to ensure a single-blinded experimental design. Beyond the original 4-item SUS, we designed an additional questionnaire (expert form) to provide a more comprehensive comparison of the three tools employed in SDM settings. This expanded questionnaire further encompasses factors such as completeness, fluency, credibility, verifiability, level of aggressiveness, and ethical concerns.

RESULTS AND DISCUSSIONS
Our central objective was to evaluate the feasibility and acceptability of the proposed SDM tool as a knowledge-sharing instrument in clinical settings. Consequently, our analyses of collected data were primarily descriptive and qualitative, supplemented by several inferential statistical analyses. We refrain from attributing statistical significance to purely descriptive results. All statistical analyses were carried out using IBM-SPSS version 28.0 and Prism version 9. Specifically, the following hypotheses were tested:
• Hypothesis 1: The mean scores on the usability test with the SUS questionnaire will indicate positive satisfaction (mean item scores >3.0) with the developed SDM tool.
• Hypothesis 2: The mean scores from the knowledge tests for parents with access to the SDM tool (the SDM group) will be significantly higher in comparison to the mean scores of parents without access (the control group).
[...] for both conventional search engines (Google) and LLMs (ChatGPT).
The analyses for Hypotheses 1 and 3 were primarily descriptive, based on the examination of the distributions of item scores and total scores across various measures. In addition, the analyses for Hypotheses 2 and 4 employed Mann-Whitney U tests, t-tests, and ANOVA tests to perform comparative statistical analyses, with a focus on establishing statistical significance (i.e., p-value < 0.05).

Case Studies
We present an example of patient Q&A outcomes using Chat-Orthopedist and other potential SDM tools, such as Google and ChatGPT, in Figure 5. Specifically, the patient is asking a question about the long-term effect of AIS. Google search provides low-quality source information with wrong answers, since untreated AIS will have a severe influence on both the physical and mental health of patients [34]. Similarly, ChatGPT only focuses on the short-term effect on physical condition and thus fails to provide a comprehensive answer. Chat-Orthopedist successfully recognizes that the query asks about long-term effects. With the external knowledge base and Google search engine, Chat-Orthopedist provides a more accurate and comprehensive answer regarding the potential long-term physical effect and self-reported outcomes, by reasoning from multiple sources (e.g., Google and the external knowledge base). This also demonstrates that the established knowledge base is able to provide domain-specific information to facilitate SDM for AIS patients and parents. In Chat-Orthopedist, the external knowledge base contains multiple evidence-based and physician-authored clinical knowledge data resources, including PubMed clinical papers (e.g., meta-analysis, case studies), Scoliosis Research Society's (SRS) practice guidelines, UpToDate, and Dynamed. Specifically, the proposed framework can be readily generalized to incorporate

Usability Tests (Hypothesis 1)
A total of 128 targeted users (i.e., parents) were included in the usability test and final analysis after quality control. The demographic characteristics of participants represent a diverse national user group of parents from all genders, ages, regions, and income levels. Figure 6 provides descriptive statistics of summarized outcomes of all questions in the SUS (user form) from multiple perspectives. The summary plot reveals that the proposed SDM tool has attained average scores and standard deviations (mean±std) of 3.57±0.84, 3.83±0.73, 3.82±0.76, and 3.84±0.77 for accuracy, clarity, relevance, and simplicity (i.e., ease of understanding), respectively. Given that all mean item scores surpass 3.5, the usability test outcome reflects a broad positive satisfaction among the targeted user groups. From the detailed distribution, we can observe a relatively consistent and robust mean item score among different examples, with accuracy ranging from 3.45 to 3.69, clarity from 3.69 to 3.98, relevance from 3.77 to 3.90, and simplicity from 3.56 to 4.09.

Knowledge Tests (Hypothesis 2)
Eighty-three intervention group users (with access to the proposed SDM tool) and 70 control group users were included in the final analysis. We selected the top 6 questions as knowledge tests from FAQs of AIS patients and families summarized by surgeons. Table 2 provides descriptive statistics and Mann-Whitney U test outcomes for the two groups in the knowledge test. We also present the difference between the two groups for each question in estimation plots (Figure 7) for more direct visualization. The assumption of normality for the scores of the knowledge tests was confirmed, thus negating the necessity for any data transformations. The SDM group obtained a significantly higher average score than the control group on each question (Q1-Q6), as indicated by the Mann-Whitney U test results (p < 0.0001). The average score for the control group across all questions was 0.2667 out of 1, whereas the SDM group yielded a significantly higher average score of 0.6044. The difference between these group means was 0.3378 (95% CI: 0.2532-0.4223), which was statistically significant (p < 0.0001). This indicates a consistent and significant improvement in the knowledge test with the SDM approach compared to the control, suggesting the potential effectiveness of the proposed tool for knowledge enhancement.

Human-in-the-Loop (Hypothesis 3)
We recruited a multidisciplinary team with two orthopedic surgeons and 15 researchers from multiple sites to conduct an iterative human-in-the-loop analysis and clinical validation. The expert team reviewed the generated answers to the most popular patient questions identified by the clinical team, along with retrieved source information from the knowledge base. We report the SUS (expert form) evaluation in Figure 8. For the quality of generated answers, we can observe an overall positive evaluation on accuracy, clarity, completeness, relevance, fluency, and simplicity, with mean item scores of 4.46±0.73, 4.61±0.61, 3.89±1.07, 4.50±0.77, 4.63±0.63, and 4.40±0.90, respectively. Given that the majority of mean item scores are higher than 4.0, the SUS outcomes convey an overall positive assessment of the quality of the generated answers. Moreover, similar results on ChatGPT-generated responses further demonstrate the feasibility of adopting LLMs in clinical settings.
In addition, we collected narrative comments from participants and the expert team during user feedback, clinical panels, and usability sessions to further evaluate the acceptability and effectiveness of the proposed SDM tool in the clinical workflow. Based on the results of the panel evaluation, we conclude that the proposed SDM tool delivers relatively accurate and unbiased information regarding the risks and benefits of various treatment options for AIS patients and their families. Moreover, after reviewing the source information of generated responses in our established knowledge database, we observe that Chat-Orthopedist could properly infer or cite evidence-based conclusions in published treatment guidelines, scholarly reviews, and clinical papers. Successfully retrieving informative knowledge via search is critical in ensuring the quality of the generated response. Through expert evaluation of source information, we can update the knowledge base in a timely manner by eliminating low-quality resources and incorporating more relevant and recent materials. Compared with the original ChatGPT or other LLMs, the retrieval-augmented framework could potentially mitigate the hallucination and black-box nature of LLMs, thereby promoting the adoption of LLMs in clinical settings.

SDM Tool Comparisons (Hypothesis 4)
We compared Google, ChatGPT, and our proposed Chat-Orthopedist using the same SUS (expert form) evaluation as SDM tools for AIS patient care. The expert team reviewed the generated answers to the same question set by Google and ChatGPT. Specifically, we only provided original source information from Google, due to the lack of such information in ChatGPT. As an exploratory hypothesis test, we collected 150 evaluation records for each tool from a single evaluation perspective. Through the descriptive statistics, the mean item scores of the three tools are presented in Figure 8. In terms of the quality of generated responses, it can be observed that Chat-Orthopedist outperforms in the areas of accuracy, clarity, and relevance, whereas ChatGPT excels in completeness, fluency, and simplicity. Responses from the search engine (Google) were generally found to be of relatively low quality in most aspects of answer quality. However, when assessing the credibility and verifiability of source information, ChatGPT receives the lowest evaluation results. This could be attributed to the 'black-box' nature of LLMs, which hinders the provision of clear source information. Additionally, all three tools obtain similar performance with respect to low levels of aggressiveness and minimal ethical concerns. The overall scores of Google, ChatGPT, and our proposed Chat-Orthopedist are 4.22, 4.25, and 4.46, respectively. In addition, Table 3 presents the inferential statistical analysis via t-tests and ANOVA test with Greenhouse-Geisser correction. In summary, both the descriptive and inferential analyses demonstrate that LLM-enabled tools significantly outperform conventional search engines in the SDM process for AIS knowledge sharing. Specifically, Chat-Orthopedist illustrates its feasibility, acceptability, and effectiveness by augmenting ChatGPT with more accurate and relevant domain-specific knowledge. The clear identification of source information further improves credibility and enables validation through a human-in-the-loop approach.

Limitations and Future Works
Our design process for Chat-Orthopedist incorporated in-depth user feedback into model development. We also summarized potential limitations and action items for promoting LLM-enabled SDM tool adoption in pediatric healthcare. Firstly, the responses generated must be easily comprehensible. Given the target audience, which includes AIS patients and their parents, it is crucial to consider the simplicity of the generated answers. This is particularly important in light of the evolving role of the growing patient in decision-making. It is also important to ensure that users of various ages, cultural backgrounds, and educational levels can derive benefit from this tool. Secondly, our current knowledge base is constructed through a systematic screening of existing materials and papers based on titles or keywords. An in-depth review of this knowledge base could prove instrumental in further enhancing the success rate of information retrieval. Thirdly, further demonstrations, including clinical trials assessing the efficiency of

CONSIDERATIONS OF LLMS FOR HEALTH
During human-in-the-loop iterative sessions for model validations and improvements, we focused on design preferences to improve the usefulness and usability of LLM tools that provide AI-generated feedback on patient-provider communication during SDM and clinical encounters. Drawing insights from clinical feedback and extensive literature reviews, we distilled several major considerations with potential solutions for widely adopting LLMs in real-world healthcare for future studies. Firstly, we have observed that LLMs encounter difficulties with less prevalent factual knowledge, which may lead to hallucinated or less reliable generations [21]. To enhance the credibility of LLMs, we have leveraged a 'knowledge brain' grounded in Google search and authoritative databases in Chat-Orthopedist. Moreover, this retrieval-augmented framework can access timely updated domain-specific information to address patients' inquiries based on a trustworthy knowledge source, which is important for clinical settings with minimal tolerance for errors or hallucinations [40].
Secondly, LLMs with over 100 billion parameters, such as GPT-3.5, are usually commercially restricted and not open-sourced [4,25]. On the other hand, even open-source LMs like LLaMA-13B or LLaMA-65B [31] require significant computational resources for local fine-tuning. For example, fine-tuning BLOOM-176B requires 72 A100 GPUs, each with 80GB memory and costing $15k apiece [27]. This substantial resource requirement makes LLMs largely inaccessible for researchers and developers with limited resources. Consequently, the common approach for downstream AI applications is transitioning from fine-tuning specialized models towards prompting generalist models (e.g., in-context learning) [41].
Thirdly, most current medical LLMs [10,40] lack adequate safeguards to assure accurate medical diagnoses and recommendations. Given the requirements of real-world clinical practice, responsible AI is a critical prerequisite for adopting LLMs in healthcare. Because LLMs such as ChatGPT are accessible only through black-box APIs, where users submit queries and receive responses, a trade-off emerges between model capability and model transparency. In Chat-Orthopedist, we improve model transparency by presenting source information together with reasoning (e.g., chain-of-thought) and acting (e.g., action plan generation).
Fourthly, simply scaling the model size has not proven sufficient for achieving high performance on complicated tasks such as reasoning in intricate clinical scenarios [32]. One potential challenge in pediatric healthcare is the need for age-appropriate understandability, given the changes in education level during growth. Another challenge, arising in few-shot or zero-shot learning scenarios, is the diagnosis of rare diseases. To address weaknesses in model reasoning, techniques such as chain-of-thought [32] or tree-of-thought [37] can be readily implemented through prompts, serving as intermediate steps towards problem-solving for improved reasoning and easier understanding. These techniques enable models to decompose multi-step problems into intermediate stages and offer interpretable insight into model behavior.
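As a minimal sketch of how chain-of-thought prompting can be implemented purely through prompt text, the snippet below assembles a prompt that asks a model to list its intermediate reasoning steps before giving a plain-language answer. The function name `build_cot_prompt`, the template wording, and the `reading_level` parameter are hypothetical illustrations, not the prompts used in Chat-Orthopedist, and no model is actually called.

```python
def build_cot_prompt(question: str, reading_level: str = "a 12-year-old") -> str:
    """Assemble a chain-of-thought style prompt (hypothetical template)."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return (
        "You are answering a question from a patient or family member.\n"
        f"Question: {question}\n"
        "First, think through the problem step by step and list each "
        "intermediate step under 'Reasoning:'.\n"
        f"Then give a short 'Final Answer:' that {reading_level} could understand."
    )


# Example question is illustrative only.
prompt = build_cot_prompt("Does a brace stop a spinal curve from getting worse?")
print(prompt)
```

Keeping the reasoning and the final answer under separate labels is one way to let a reviewer inspect the intermediate steps while the patient reads only the simplified answer.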

CONCLUSION AND BROADER IMPACT
In this study, we developed and validated an innovative SDM tool, Chat-Orthopedist, to prepare AIS patients and families for a meaningful discussion with clinicians. The human-in-the-loop usability tests demonstrate the effectiveness of the proposed LLM-enabled SDM tool in delivering accurate and unbiased information on disease knowledge and treatment options. In addition, we discussed several critical considerations and potential solutions for the wide adoption of ChatGPT-like LLMs in clinical practice. From a clinical perspective, successfully implementing the proposed SDM tool could assist AIS patients by increasing access to the clinical knowledge required for effective medical consultations. It could also help clinicians improve the efficiency of clinical visits and reduce their workload. From a technical perspective, this work may promote the adoption of LLMs in real-world clinical applications by addressing challenges associated with generalizing AI at scale for domain-specific tasks. This study may serve as a pilot examination of the feasibility, acceptability, and effectiveness of LLM-enabled SDM in pediatrics. We expect that the cross-disciplinary combination of LLMs and SDM will ultimately improve the treatment outcomes of children with AIS in a family-centered and collaborative environment.

Figure 3: Automated pipeline for large-scale external domain-specific knowledge database establishment using multiple formats of raw materials, including document parsing, chunking, vectorization, and storage.
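The chunking, vectorization, and storage steps named in this caption can be sketched as follows. This is a toy illustration under stated assumptions: the bag-of-words `vectorize` function is a stand-in for whatever embedding model the pipeline actually uses, the in-memory `KnowledgeBase` class is a stand-in for a vector store, the chunk sizes are arbitrary, and the two sample documents are placeholder text. Document parsing is assumed to have already produced plain text.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Chunking: split text into overlapping windows of words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]


def vectorize(text):
    """Vectorization (toy): lowercase token counts instead of a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class KnowledgeBase:
    """Storage (toy): in-memory list of (source, chunk, vector) entries."""

    def __init__(self):
        self.entries = []

    def add_document(self, source, text, **chunk_kwargs):
        for chunk in chunk_text(text, **chunk_kwargs):
            self.entries.append((source, chunk, vectorize(chunk)))

    def search(self, query, k=1):
        """Return the k stored chunks most similar to the query, with their sources."""
        query_vec = vectorize(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[2]), reverse=True)
        return [(source, chunk) for source, chunk, _ in ranked[:k]]


# Placeholder documents; file names and sentences are illustrative only.
kb = KnowledgeBase()
kb.add_document("bracing.txt", "Bracing aims to prevent curve progression during growth.")
kb.add_document("surgery.txt", "Spinal fusion surgery is considered for severe curves.")
top = kb.search("What does bracing do?", k=1)
print(top)
```

Storing the source name alongside each chunk means a retrieved passage can be shown to the user together with where it came from.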

• Hypothesis 3: The mean score of each item on the SUS questionnaire (expert form) will indicate positive satisfaction (mean item scores >3.0) with the LLMs-enabled tools.
• Hypothesis 4: As an exploratory hypothesis, the mean scores of the SUS questionnaire (expert form) of the proposed SDM tool will significantly exceed the mean scores

External Knowledge Base
Action Input: "adolescent idiopathic scoliosis growth"
Observation: Curves progress in approximately two-thirds of skeletally immature patients before they reach skeletal maturity ... Curves measuring ≤30° at the end of growth typically do not progress.
Thought: That was helpful, but I want to know more about the long-term effects.
Action: Google Search
Action Input: "long-term effects of adolescent idiopathic scoliosis"
Observation: Dec 8, 2016 ... The authors concluded that idiopathic scoliosis may lead to severe pain, cardiopulmonary compromise, lack of social participation, and early ...
Thought: I have a better understanding of adolescent idiopathic scoliosis now.
Final Answer: Adolescent idiopathic scoliosis can cause problems during growth and development and may lead to long-term effects such as severe pain and cardiopulmonary compromise. Treatment options include observation, bracing, and surgery, depending on the risk of curve progression.

Adolescent idiopathic scoliosis can cause physical problems such as back pain, limited mobility, and difficulty breathing if left untreated. Treatment options depend on the severity of the curve and may include bracing, physical therapy, or surgery.

Figure 5: Comparison of three potential SDM tools in answering AIS disease and treatment-related questions.Actions and observations shrouded in pink are incorrect conclusions or imperfect inferences, whereas those in green are correct.

Figure 6: Descriptive statistics for average results of all questions in usability test from accuracy, clarity, relevance, and simplicity perspectives.

Figure 7: Estimation plots (a)-(f) with descriptive statistics to show the difference in knowledge test results between the SDM group and the control group. In each plot, the right side shows the 95% confidence interval. The mean scores from the knowledge tests for the SDM group are significantly higher in comparison to the mean scores of the control group in (g) and (h).

Figure 8: Descriptive statistics for expert evaluation results for comparison of Google, ChatGPT, and Chat-Orthopedist.
Dynamed: https://www.dynamed.com
other interventions and other diseases by updating our existing knowledge base to encompass the new knowledge domain.

Table 3: Inferential statistical analyses for comparison of different Q&A tools, including (1) t-test between Google and ChatGPT (Group 1), (2) t-test between Google and Chat-