Human-Algorithmic Interaction Using a Large Language Model-Augmented Artificial Intelligence Clinical Decision Support System

Integration of artificial intelligence (AI) into clinical decision support systems (CDSS) poses a socio-technological challenge that is impacted by usability, trust, and human-computer interaction (HCI). AI-CDSS interventions have shown limited benefit in clinical outcomes, which may be due to insufficient understanding of how health-care providers interact with AI systems. Large language models (LLMs) have the potential to enhance AI-CDSS, but haven’t been studied in either simulated or real-world clinical scenarios. We present findings from a randomized controlled trial deploying AI-CDSS for the management of upper gastrointestinal bleeding (UGIB) with and without an LLM interface within realistic clinical simulations for physician and medical student participants. We find evidence that LLM augmentation improves ease-of-use, that LLM-generated responses with citations improve trust, and HCI varies based on clinical expertise. Qualitative themes from interviews suggest the perception of LLM-augmented AI-CDSS as a team-member used to confirm initial clinical intuitions and help evaluate borderline decisions.



INTRODUCTION
Artificial intelligence (AI) and machine learning (ML) algorithms have the potential to provide value to clinicians in their already-complex clinical workflows. AI interventions in medicine in randomized controlled trials have shown limited improvement in clinical outcomes, with no clear evaluation of the human-AI interaction [76]. To optimize the beneficial effect of these AI interventions and mitigate their potential harms, there is a need to study and involve clinicians as end-users of ML technology in order to create and iterate on systems that improve clinical workflows [117]. ML-based clinical decision support systems can achieve practical clinical relevance when they work seamlessly with existing workflows [116] that utilize the electronic health record (EHR) [85].
Previous efforts at qualitatively characterizing healthcare provider interaction with AI systems have spanned multiple domains, including alert systems [88,92], risk estimators [10,12], and image retrieval [16]. These studies usually explore user interactions as single-user systems between providers and ML models, or integrate interactions between patients and ML models [40]. While these single-user interactions may be salient in specific care settings, such as patients in an outpatient clinic, the complexity of modern medicine has led to a transition from receiving care from a single provider to receiving care from provider teams. This includes cooperation among physicians and medical providers across specialties, training levels, and responsibilities [31]. There is a paucity of studies that evaluate the behavior of provider teams and their user experience with AI clinical decision support systems (AI-CDSS). One challenge to shifting this paradigm is the accessibility of these AI systems to multiple users.
Recently, large language models (LLMs) have emerged as systems with the potential to aid clinical decision-making. Recent exploratory studies have assessed LLMs' ability to answer clinical questions [97,120]. LLMs are accessible to users across a spectrum of expertise, which provides an opportunity to design AI-CDSS that can be used in a team-based setting. LLMs' ability to generate text answers to clinical questions positions them as potentially useful tools that providers can interact with much as they interact with human team members and experts [95]. Previous work has shown that adoption of ML tools in medicine is more likely when clinicians view the tool as a "partner" that enhances their expertise, similar to the role of team members in the clinical team [36]. To our knowledge, there are no studies that evaluate the usability of LLM-based systems for active clinical workflows in a team-based clinical decision setting. Our study seeks to understand the user patterns that emerge when physicians utilize AI-CDSS and LLMs to make clinical decisions in a live simulated clinical workflow.
We developed a risk-prediction machine learning model trained on data from patients with upper gastrointestinal bleeding (UGIB). We introduce an interactive dashboard for visualization of risk and GutGPT, an LLM trained on gastroenterology guidelines for UGIB [52]. Limiting GutGPT's context to the risk-prediction ML model or UGIB guidelines places bounds on the LLM in an effort to limit hallucinations and response variability. To test the implementation of the dashboard and GutGPT, we designed a randomized controlled trial to determine how physician and medical student teams utilize and interact with these systems in a series of simulated patient encounters. UGIB is an instructive disease process because it is an acute, high-stakes, time-constrained clinical problem with a clear value proposition for risk assessment (in our case, assisted by a high-performing AI-CDSS) that necessitates strong teamwork within provider teams and frequent inter-specialty collaboration for optimal patient care. Evidence-based management of UGIB requires considering guidelines authored by professional societies. The guidance in these guidelines can be difficult to parse quickly while applying it to unique patient scenarios [6]. LLM integration into GutGPT is designed to incorporate patient data to give answers that apply the guidelines to complicated situations. To our knowledge, this is the first study to deploy an LLM-based CDSS in a clinical simulation to assess usability, trust, and interaction patterns.
Themes were generated from post-simulation interviews, survey data, and query data from GutGPT inputs. We found that LLM output format and integration into the electronic health record (EHR) influence the perception of usability and may affect the adoption of AI-CDSS technology into physicians' workflows. Physicians' trust in AI systems was limited by preconceived notions of AI-CDSS being unreliable or untrustworthy, and improved after increased use of the system with exposure to answers with high-reliability features (e.g., detailed citations). We contribute to the human-computer interaction (HCI) community by delineating barriers to widespread adoption of AI-CDSS in physician workflows, providing understanding of factors that influence physician trust in AI, and presenting three design principles for LLM-augmented AI-CDSS.

RELATED WORKS
2.1 Upper Gastrointestinal Bleeding
UGIB is one of the most common causes of hospitalization for gastrointestinal disease in the United States, accounting for over 400,000 emergency department (ED) visits a year [74]. It is a common cause of hospital readmission, morbidity, and mortality [74]. Common etiologies for UGIB include peptic ulcer disease, esophageal varices, and esophagitis, and the diagnosis of UGIB may rest on symptoms ranging from the obvious, such as bright red blood in gastric contents or stool, to the insidious, such as fatigue and dark stools [18,105].
Multiple elements from patient-reported clinical history, physical examination, and laboratory measurements may suggest UGIB. Diagnosis and management of UGIB usually require interplay between different medical providers and staff in the healthcare environment. For example, patients with UGIB often first present to the ED, where they are seen and assessed by emergency medicine physicians and clinical staff who provide an initial diagnosis and management. If the condition is deemed severe enough to require specialist evaluation, the emergency medicine physician initiates communication with specialist gastroenterologists and internal medicine physicians to consider admission to the hospital with urgent endoscopic evaluation [98]. Risk stratification to identify patients who are "very low risk" and can be discharged from the ED is the first key management decision for the provider caring for a patient with UGIB; identifying very-low-risk patients is recommended by national guidelines for the management of patients with acute UGIB [52]. However, it is possible that these patients may require urgent care: high-risk patients with UGIB can deteriorate quickly if they have uncontrolled bleeding and may require hospital-based interventions such as transfusion of red blood cells or interventions to stop bleeding [48]. No existing studies evaluate the implementation of AI-CDSS for UGIB risk assessment. Our paper provides qualitative themes of provider behaviors when interacting with an AI-CDSS for UGIB risk assessment in a simulated environment.

Clinical Decision Support Systems with and without Artificial Intelligence
Clinical Decision Support Systems (CDSS) have existed in healthcare for decades as an attempt to reduce errors made by medical staff [44,63]. CDSS are designed to improve healthcare delivery by providing relevant, timely, and useful clinical knowledge to providers that helps them make decisions regarding diagnosis, prognosis, and treatment [72]. The most basic CDSS usually function by matching the characteristics of an individual patient to a computerized clinical knowledge base. Patient-specific assessments or recommendations are made and subsequently presented to the clinician for a decision [94]. The clinician's role is to combine these evaluations with their own prior knowledge to make the final decision.
Despite only providing simple diagnostic support, early forms of rules-based CDSS (i.e., a treatment that is suggested when a certain part of patient history is flagged) still showed effectiveness in clinical decision support by identifying high-risk patient groups and reducing cases of misdiagnosis [50,66]. CDSS are able to support many aspects of the healthcare process, including disease prevention, screening, diagnosis, treatment, and follow-up [29], while reducing medical costs by minimizing side effects from drug treatments [60,73]. In the modern era of electronic health records (EHR), CDSS are often integrated into the EHR [68]. However, many clinicians have expressed concerns regarding their trust in CDSS when introduced into their workflows [75].
In the era of increasing data volume and computational capacity, modern CDSS integrate the use of ML and AI in AI-CDSS [61]. AI interventions in healthcare have been studied in randomized controlled trials, with a steadily increasing number of Food and Drug Administration (FDA)-approved medical ML applications [76]. AI-CDSS have evolved to provide predictive clinical insights using medical data available across multiple domains, with over 500 clinical prediction models built on EHR data published and 44 published reports of implementation studies [56,115]. A slight majority of EHR-based AI-CDSS implemented in published studies have demonstrated some improvement in clinical outcomes after implementation [56]. For example, CDSS have showcased the capacity to predict the probability of diabetic complications among individuals with diabetes and to guide clinicians on the optimal timing for diagnostic tests [37,90]. The issue of trust remains challenging for clinician adoption of AI-CDSS. This includes a lack of justification of model predictions [77], which has been partially addressed with the emergence of "Explainable" AI [7]. Explanations from AI models can be categorized as global or local, offering an explanation of the entire model or of single predictions, respectively [4]. Ante-hoc, or inherently explainable, methods are understandable on their own, while post-hoc explainability methods communicate information about an output after the model produces it [7]. Recent advances in explainable AI-CDSS to improve clinician trust have spanned domains of text, graphical, and image explanations, among others. For example, a convolutional neural network to aid glaucoma diagnosis used class activation mapping to generate heat maps for image analysis [24]. An AI-CDSS for identifying women at risk for gestational diabetes mellitus used Shapley additive explanations to graphically represent model features [25]. Stakeholder analysis suggests that clinicians prefer AI-CDSS with feature importance and transparency regarding how confident or uncertain the model was in its predictions; clinicians also indicated that ML tools had to be tested in real clinical situations so users could grasp their strengths and weaknesses and develop sustainable trust [101]. Our paper provides qualitative and quantitative descriptions of usability for an AI-CDSS with post-hoc explainability methods, contributing towards the understanding of clinician trust when utilizing AI-CDSS.

Human-Computer Interaction in Artificial Intelligence-Clinical Decision Support Systems
Currently, CDSS are implemented in conjunction with clinicians' medical knowledge, intuition, and willingness to incorporate such systems into their decision-making process. Thus, HCI inherently plays a crucial role in the design of CDSS [85]. This is particularly relevant in healthcare, where providers have suffered from the unintended consequences of high alert burden in EHR CDSS, such as sepsis alerts, caused by system design processes that are not physician-centered [70,104,113,114]. Poorly designed CDSS can lead to nonadherence, high override rates, and "alert fatigue," in which clinicians neglect the alert, thereby reducing the effectiveness and potential benefits of the system [65].
To prevent such adverse effects and maximize CDSS usability, several methodological approaches for usability engineering and cognitive task analysis have been developed [51]. Notable examples include heuristic evaluation of medical device interaction and patient safety [118], cognitive factor analysis for GUI evaluation in tele-mental health psychotherapy services [3], the Task, User, Representation and Function (TURF) framework for EHR usability [119], an ethnographic study to create guidelines on designing electronic communicable disease reporting systems [89], and natural language querying to resolve time-event dependencies in clinical information systems [86]. These serve as frameworks for exploring different HCI methods to evaluate and develop CDSS.
HCI becomes particularly crucial when it comes to AI-CDSS, as the complexity and lack of usability of sophisticated computational systems like AI may discourage clinician use [93]. Indeed, the difficulty in explaining modern AI-based systems that have a "black-box" nature may also hinder integration into clinicians' workflows [32,107]. To address these issues, the first step of HCI should be to provide training to users about the inner workings of AI-CDSS and their strengths and weaknesses [17]. The goal of HCI frameworks for developing AI-CDSS is to seamlessly integrate them into pre-existing healthcare information and clinician workflows [67]. Since AI cannot completely emulate physicians' mental models and physicians are unable to access large amounts of data to draw conclusions, AI-CDSS should be designed as interactive systems that physicians can use to support their cognitive processes, as part of a human-AI collaboration paradigm [85,111]. While explainability plays an important role in trust in AI tools, other factors are also vital to clinician adoption. Even in studies where the ML tool underlying the CDSS was opaque, trust increased when adoption was endorsed by colleagues or superiors [35]. Trust is enhanced by reference to resources that are familiar to clinicians; previous research in AI-CDSS has found that clinicians preferred evidence-based explanations of outputs over model features [41]. Physicians' cognitive model for risk stratification and management incorporates information from reputable clinical guidelines. AI-CDSS that deliver guideline-driven advice mimic the role that human teammates play in the medical team [59]. In this paper, we provide a unique perspective by studying AI-CDSS in a team setting, where different team dynamics may affect perceptions of interactions with AI-CDSS.

Large Language Models in Artificial Intelligence-Clinical Decision Support Systems
LLMs represent a subset of AI models that excel in diverse natural language understanding and generation tasks; they are autoregressive, predicting the next word in a given context [79]. These models owe their proficiency to the massive scale of their transformer-based neural networks (with billions of parameters) and extensive training on vast corpora of text [14].
In the realm of healthcare, LLMs' exceptional natural language processing capabilities render them a powerful tool to be integrated into EHRs, which are vast repositories of patient data that include substantial amounts of unstructured note text. The potential of LLMs has already attracted significant research and commercial attention, with partnerships established between electronic health record vendors and AI companies with cutting-edge LLMs. Examples include the collaboration between Microsoft and Epic on integrating GPT-4-powered services into the EHR, as well as the incorporation of the Google-designed Med-PaLM 2 healthcare AI chatbot into Meditech [15,19].
ChatGPT is a well-known application of LLMs [71]. Its underlying LLM, the Generative Pre-trained Transformer (GPT), is trained on diverse online text sources to produce human-like responses in versatile conversational interactions [79]. Since its release in November 2022, active investigations into ChatGPT's potential in healthcare have spanned research, practice, and education [57,87]. ChatGPT's ability to process health-related information from the EHR and to interact with users in natural language offers unique opportunities for a wide scope of potential applications in clinical decision support [27]. By comparing GPT-3.5-powered ChatGPT's responses to human medical experts' answers to clinical questions in multiple subspecialties, several studies in general medicine, radiology, and pediatrics have suggested the adequacy of LLMs for providing decision support throughout the pathway of clinical care, from diagnosis to treatment recommendations [42,80,81].
However, these studies also reveal limitations in ChatGPT, including the opacity of its training data, the phenomenon of hallucinations, and limited model explainability [42,58,80,81]. Recent studies suggest that new LLMs reproduce and amplify human biases [49]. Misuse of ML tools in the healthcare environment can also promote over-reliance on these tools, leading to errors when clinicians delegate verification and safety checks [61]. While certain strategies and frameworks for ChatGPT-based CDSS have been suggested to address these limitations [27], an AI-CDSS that queries reliable clinical guidelines with guardrails could ameliorate many of these complaints while reducing response variability. There is an urgent need for a deeper understanding of user behavior when integrating an LLM in clinical workflows to further develop design principles and usage guidelines for real-world adoption. Our study seeks to elucidate specific patterns of user behavior with LLMs within simulation scenarios.

Medical Simulation
Medical simulation can be defined as a technique to replace or amplify real experiences with immersive guided interactive experiences to replicate aspects of the real world [28]. Simulations can have enough fidelity with real clinical environments that they can be used to study human factors and behaviors that contribute to the effectiveness of a provider team [38]. In medicine, simulation has a key role in maintaining and promoting patient safety and quality improvement for high-stakes scenarios where provider error could have adverse effects on patient outcomes.
In 1999, the Institute of Medicine released a report on medical errors that revolutionized the approach towards patient safety [47]. The report highlighted simulation as a key driver of healthcare improvement [109]. Simulation allows for improvement both for individual practitioners and for provider teams. On the individual level, simulation centers can help medical trainees practice skills and techniques in a safe environment to prepare for situations in which real patients might be at risk, such as learning central line insertion techniques to increase successful insertion rates [9] and lower central line infection rates [8]. On the provider team level, simulation studies have been successful at studying the human factors involved in domains of teamwork [82] and team communication [11]. As medicine has increased in complexity, providers increasingly work in teams to provide clinical care for disease management.
Beyond training individuals and provider teams on existing best practices and protocols, simulation centers can also add value in the testing of new medical devices and advanced technology, such as AI/ML [62,96]. Usability testing for EHR technology, anesthesia machines, and numerous other medical devices is frequently performed in simulation centers [53,64]. AI and ML products in medicine are considered software as medical devices (SaMD) [103] and require rigorous real-world clinical deployment and evaluation [102]. Simulation center environments remain underutilized for testing these SaMDs within the development pipeline.
LLMs have rapidly evolving capabilities relevant to clinical applications, and solutions integrating LLMs into AI-CDSS are potential SaMDs that may be integrated into the clinical workflow [43]. Existing partnerships between EHR vendors and LLM companies provide a trajectory for LLMs to be used by providers in routine clinical care [15,34]. However, clinician skepticism remains a formidable challenge [78]. Among several concerns regarding safety is the potential for hallucinations that result in fabricated citations [21], which may lead to errors when integrated into high-stakes clinical environments. No study to our knowledge has used medical simulation to test LLM-augmented AI-CDSS, which we believe may be useful for developers of AI healthcare systems to facilitate clinical evaluation and safety testing by understanding user needs and behavior.
Our paper demonstrates the feasibility of using a simulation setting to test and evaluate usability, trust, and human-AI-CDSS interaction for an EHR-integrated, LLM-augmented AI-CDSS.

METHODS
3.1 GutGPT
GutGPT is an in-house CDSS designed and developed to provide a natural language-based interface for two tasks: guideline-based question answering and an interactive dashboard for risk prediction. It is built on top of a high-fidelity ML model validated using an existing clinical dataset. With patient data automatically loaded at launch, GutGPT provides patient-specific predictions of the risk for hospital-based intervention and grounds its reasoning on this information to generate responses to clinicians' questions. Formulation, development, and implementation of the dashboard and chatbot tools were performed by a multidisciplinary team. Practicing clinicians in this team directly contributed to the creation of GutGPT and oversaw building the tools from their genesis to experimental trial.
GutGPT's risk-prediction machine learning model was developed using electronic health records (EHR) of patients presenting with signs or symptoms of gastrointestinal bleeding at a large health system. The inputs to the model include demographic data (age and sex), nursing assessments, lab test results, personal medical history, and medication classes in the form of Clinical-Classification-Software codes [1]. We consider a composite binary variable as the outcome, where a value of 1 signifies a high-risk patient who required a hospital-based intervention, such as red blood cell transfusion or an intervention to stop bleeding, or who died of any cause within 30 days, and 0 otherwise. Multiple machine learning (ML) and deep learning models were explored, including LASSO regression [100], random forests with honesty [110], gradient boosted trees [23], and feedforward neural networks with 2 and 5 layers [84]. Data pre-processing included dimensionality reduction via LASSO regression applied to the patients' medical history and medication classes, tuned using 10-fold cross-validation. A random forest with honesty was then applied to the variables with non-zero coefficients, in addition to the demographic, nursing assessment, and lab test variables. This final model exhibited the highest true negative rate (TNR) at a true positive rate (TPR) of 99%, the threshold recommended by national UGIB guidelines for identifying very-low-risk patients [52]. The model had an AUC of 0.91 (0.88-0.93) on an internal validation set and 0.92 (0.90-0.95) on an external validation set (data from a different hospital). At the 99% sensitivity threshold, our model exhibited a specificity of 0.46 on the internal validation set and 0.33 on the external validation set, which outperforms existing recommended clinical risk scores.
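To make the evaluation criterion concrete, the following is a minimal sketch of how specificity at a fixed 99% sensitivity can be computed from predicted risk scores. It is illustrative only and is not the study's evaluation code: the function names and the toy scores and labels are assumptions introduced here.

```python
import math


def threshold_at_sensitivity(scores, labels, target_tpr=0.99):
    """Return the highest cutoff whose sensitivity is at least target_tpr.

    A patient is flagged as "not very low risk" when score >= cutoff.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive case")
    # Number of positive cases that must score at or above the cutoff.
    needed = math.ceil(target_tpr * len(positives))
    return positives[needed - 1]


def specificity_at_threshold(scores, labels, cutoff):
    """Fraction of negative cases (label 0) that score below the cutoff."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("need at least one negative case")
    return sum(s < cutoff for s in negatives) / len(negatives)


# Hypothetical toy data, not taken from the study.
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80, 0.90]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff = threshold_at_sensitivity(toy_scores, toy_labels)
print(cutoff, specificity_at_threshold(toy_scores, toy_labels, cutoff))
```

Fixing sensitivity first and then comparing specificity reflects the clinical priority described above: very few high-risk patients should be missed, and among models meeting that constraint, the one that safely clears the most low-risk patients is preferred.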
The interactive dashboard displays risk predictions with interpretability plots for the ML model used within GutGPT. Users can visualize partial dependence plots (PDPs), individual conditional expectation (ICE) plots, and accumulated local effects (ALE) plots for any covariate in the model, assisting their understanding of the effect of selected covariates on the model's predicted risk [69]. These interpretability plots were incorporated through an iterative process in which a multidisciplinary team including clinicians, data scientists, statisticians, and human factors experts worked to enhance users' understanding of the ML model's decision-making process and to ensure alignment with their clinical mental model.
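As a rough illustration of what ICE and PDP curves represent, the sketch below sweeps one covariate over a grid while holding a patient's other values fixed (ICE), then averages those curves across patients (PDP). The `toy_risk` function and its coefficients are hypothetical stand-ins for a fitted model and carry no clinical meaning; the dashboard's actual implementation is not described at this level of detail.

```python
def ice_curve(predict, row, feature, grid):
    """Predicted risk for one patient as `feature` sweeps over `grid`."""
    # {**row, feature: value} copies the row, so the original is not modified.
    return [predict({**row, feature: value}) for value in grid]


def partial_dependence(predict, rows, feature, grid):
    """Average the ICE curves over all patients to obtain the PDP."""
    curves = [ice_curve(predict, row, feature, grid) for row in rows]
    return [sum(column) / len(column) for column in zip(*curves)]


def toy_risk(row):
    """Hypothetical stand-in model; coefficients are arbitrary."""
    risk = 0.9 - 0.05 * row["hemoglobin"] + 0.002 * row["age"]
    return min(1.0, max(0.0, risk))


patients = [{"hemoglobin": 8, "age": 50}, {"hemoglobin": 12, "age": 70}]
grid = [6, 10, 14]
print(ice_curve(toy_risk, patients[0], "hemoglobin", grid))
print(partial_dependence(toy_risk, patients, "hemoglobin", grid))
```

The same `ice_curve` call with a single grid value corresponds to a "what-if" query in which one covariate is changed for one patient and the risk is recomputed.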
Furthermore, users can modify patient covariate values in real time and observe how the predicted risk of hospital-based intervention or 30-day mortality changes accordingly. The dashboard also provides other information to help contextualize the risk with regard to the general population of patients with acute gastrointestinal bleeding. For example, it reports a patient similarity index, quantifying how similar the queried patient is to patients in the training data. To facilitate population-level understanding, histograms depict the distribution of each variable and highlight the target patient's value relative to all patient values in the training data.
When a user types a question into the chatbot interface, GutGPT classifies the query as either a question about the predicted risk from the ML model or a question about clinical management based on the guideline recommendations. For both types, the structured data fields stored in the EHR are automatically loaded into GutGPT for individualized prediction. If the query pertains to the risk prediction of GIB, GutGPT retrieves the interactive dashboard, extracts information relevant to the user's query, and provides an interpretation of the graphically presented information in natural language. For example, for a question regarding the predicted risk itself, GutGPT generates a paragraph stating the risk score of the specific patient, with an addendum according to clinical guidelines. It notes that a risk score below the 99% sensitivity threshold should be considered "very low risk" according to the American College of Gastroenterology (ACG), and "not very low risk" otherwise.
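The routing and labeling logic described here can be sketched as follows. How GutGPT actually classifies a query is not specified in the text, so the keyword matcher below is purely an assumed placeholder; the keyword list, function names, and message wording are likewise hypothetical.

```python
# Hypothetical keyword list; the real classifier is not described in the text.
RISK_KEYWORDS = ("risk", "score", "probability", "predict", "dashboard")


def route_query(question):
    """Label a question as a 'risk' or 'guideline' query (placeholder logic)."""
    text = question.lower()
    return "risk" if any(keyword in text for keyword in RISK_KEYWORDS) else "guideline"


def risk_statement(score, cutoff):
    """Describe a predicted risk relative to the 99% sensitivity cutoff."""
    label = "very low risk" if score < cutoff else "not very low risk"
    return f"Predicted risk score: {score:.2f} ({label} at the 99% sensitivity threshold)."


print(route_query("What is this patient's predicted risk?"))
print(route_query("When should endoscopy be performed?"))
print(risk_statement(0.02, cutoff=0.05))
```

Separating routing from response generation keeps the two query types bounded to their own context (model output or guideline text), which matches the stated goal of limiting hallucinations and response variability.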
GutGPT can also answer a user's questions regarding clinical management by drawing upon care recommendations for a patient's profile based on the ACG guidelines for the management of upper GIB [52]. These guidelines are organized into discrete sections, including pre-endoscopic and endoscopic management, summary of evidence, recommendations, and conclusions. Preprocessing includes separating each section into text chunks and converting each chunk into a vector embedding using OpenAI's text embedding model. When a user types a question, the query is converted into a vector embedding and then compared with the vector embeddings of the guideline text chunks. This process enables the retrieval of the most relevant sections of the guidelines through a similarity search. The retrieved portions of the guidelines are then combined with the user's question, along with the patient's EHR data. Instructions on text and reference formatting are also provided in the prompt, which is then supplied to the GPT model to generate a response for the user.
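The retrieval-then-prompt pipeline above can be sketched in a few lines. This is an illustration under stated assumptions, not GutGPT's code: the bag-of-words `embed` function stands in for the learned embedding model, cosine similarity is assumed as the similarity measure, and the chunk text, prompt wording, and patient fields are placeholders rather than guideline content.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vector, embed(chunk)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, patient, k=1):
    """Combine retrieved excerpts, patient data, and the question into one prompt."""
    context = "\n".join(retrieve(question, chunks, k))
    facts = ", ".join(f"{key}={value}" for key, value in patient.items())
    return (
        "Answer using only the guideline excerpts and cite them.\n"
        f"Guideline excerpts:\n{context}\n"
        f"Patient data: {facts}\n"
        f"Question: {question}"
    )


# Placeholder chunks; these are not actual guideline statements.
guideline_chunks = [
    "pre-endoscopic management: placeholder text about risk stratification and transfusion",
    "endoscopic management: placeholder text about endoscopic therapy and timing",
]
print(build_prompt("What is the timing of endoscopic therapy?",
                   guideline_chunks, {"hemoglobin": 8.1, "age": 67}))
```

In a production system the chunk embeddings would be computed once and stored, so that only the query needs to be embedded at question time.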

Participants and Simulation
We recruited 31 participants from various medical education levels. Of those participants, 9 were Emergency Medicine (EM) resident physicians, 6 were internal medicine (IM) resident physicians, and 16 were medical students (MS). They were placed into provider teams of 2-4 participants, for a total of 12 provider teams across the study period. Efforts were made to recruit resident physicians and medical students of all experience levels. IM resident physicians ranged from training levels of post-graduate year 1 (PGY-1) to PGY-3. EM resident physicians ranged from PGY-1 to PGY-4. Medical students ranged from second-year to fourth-year students (including students taking a research year and MD-PhD candidates in the research portion of their degrees).
We sought to include both internal medicine and emergency medicine, as these two specialties have frequent contact with patients who have UGIB. Resident physicians of different experience levels were solicited to help identify trends in the experience of using the AI system based on training level. Medical students were sought to further diversify the participant pool based on experience, because medical students are less likely to be familiar with UGIB management and ACG guidelines than resident physicians. Each provider team performing the simulation activities consisted solely of one training category (EM, IM, or MS). However, within that category, provider teams had varying experience levels. This was done to mimic a typical provider team in clinical environments.
The randomized controlled study comprises two arms (see Figure 3). Each provider team was randomized to either the GutGPT arm or the dashboard arm, separately for each of the two phases of the study (Risk Assessment and Content Assessment). If a team was randomized to the GutGPT arm, a workstation with access to GutGPT, the interactive dashboard, and any internet tool was available to the participants. If a team was randomized to the dashboard arm, the workstation could only access the interactive dashboard and any other internet tool. During the Risk Assessment phase, the participants underwent three risk scenarios in which they decided whether to admit the simulated patient to the hospital, observe in the ED, or discharge from the ED. During the Content Assessment phase, the participants underwent two scenarios which tested their medical management of simulated patients. For all phases, the provider teams were presented with cases of UGIB, and the order in which the scenarios were presented was randomized. They interacted with a SimMan full-body mannequin (Laerdal) for the interview and physical exam of the simulated patient. A gastroenterology specialist voiced the patient, and their voice was broadcast through a speaker in the mannequin. In the simulated clinical environment, the simulated patient's chart was accessible through a workstation that mimicked the electronic medical record, complete with past medical history, laboratory values, and medications. To simulate clinical team dynamics, the most senior member was assigned to use the dashboard and/or GutGPT interface. The other members filled the remaining roles on the clinical team and gathered data from the mannequin and the EHR. Figure 4 displays the number of sessions where the chatbot feature was accessible by the provider team for each simulation scenario. This study was deemed exempt by a university Institutional Review Board.
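The allocation scheme described above (independent arm assignment per phase, plus a randomized scenario order) can be sketched as follows. The text does not state the randomization procedure, so simple unrestricted randomization with a seeded generator is assumed here; the function names and seed are hypothetical.

```python
import random

ARMS = ("GutGPT", "dashboard")
# Number of scenarios in each phase, as described in the text.
PHASES = {"Risk Assessment": 3, "Content Assessment": 2}


def randomize_team(team_id, rng):
    """Assign one team an arm and a scenario order for each phase."""
    plan = {"team": team_id}
    for phase, n_scenarios in PHASES.items():
        order = list(range(1, n_scenarios + 1))
        rng.shuffle(order)  # randomized order of scenario presentation
        plan[phase] = {"arm": rng.choice(ARMS), "scenario_order": order}
    return plan


def randomize_study(n_teams=12, seed=0):
    """Build an allocation plan for all teams using a reproducible seed."""
    rng = random.Random(seed)
    return [randomize_team(team, rng) for team in range(1, n_teams + 1)]


for plan in randomize_study()[:2]:
    print(plan)
```

Because arms are drawn independently per phase under this assumption, a team may use GutGPT in one phase and the dashboard alone in the other, and arm sizes are not guaranteed to be balanced.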

Data Collection
To understand the user interaction pattern of clinicians with AI-CDSS in realistic clinical simulations, we collected three types of data: qualitative interviews, GutGPT chatbot conversations, and quantitative surveys.

3.3.1 Post-simulation Qualitative Interview. We conducted brief one-on-one, semi-structured interviews with each participant in separate rooms directly after participants had finished both the Risk-focused and Care Management scenarios. Each interview lasted 5-10 minutes. The interviews began with general reflections on participants' experiences interacting with GutGPT during the session. Then, the researchers asked about the participant's willingness to use GutGPT in their clinical decision-making process in a real clinical situation, and asked them to elaborate on their reasoning. Next, the researcher asked for feedback on the user interface of the Chatbot and the Dashboard features within GutGPT. Each interview session was audio recorded with participants' consent.

3.3.2 Post-trial Quantitative Survey. We administered an adapted version of the System Usability Scale (SUS) [13] to participants immediately after they completed the Risk-focused scenarios. We retained four positive items that encompassed themes similar to those covered in the original scale (Table 1). The abbreviated SUS survey was appended to an already-long survey administered after the Risk scenarios for a different experiment. We chose the positively-keyed items from the SUS because they matched the existing positively-keyed survey items, which helped prevent participant confusion or errors and reduced survey time to avoid an excessively long simulation. We note that this portion of data collection was added later, resulting in 22 available observations out of the 31 participants.

GutGPT Chatbot Prompting History.
All conversations between GutGPT and the participants, including all question inputs ("prompts"; see Table 2 for examples) from the participants and response outputs from the LLM-augmented chatbot, were automatically recorded in text format along with information about their corresponding simulation sessions and scenarios.

Analysis
3.4.1 Qualitative Analysis of Participant Interviews. Our team of three researchers led the analysis of the 31 participants' interview data and regularly discussed emerging themes. We used rapid qualitative analysis methods to effectively extract insights from our data [33]. For the rapid analysis, we created an interview summary template that asked each reviewer to consider initial impressions, system usability, and the role of GutGPT in clinical decision-making. We began by holistically reviewing the transcripts to familiarize ourselves with the data and then moved to a paragraph-level reading. Subsequently, all three researchers compiled significant quotes and observations from the interview summaries onto a shared research board. Through an iterative process, we categorized and organized these notes into common themes and broader feedback categories. This analysis led to the identification of several key insights outlined in this paper: usability in managing various aspects of AI-CDSS and variations in chatbot utilization based on medical specialties and levels of training. Our team's interdisciplinary composition, combining expertise in HCI, clinical practice (including specialized knowledge in UGIB), and AI/ML, facilitated a comprehensive understanding of our participants, especially when adhering to a user-centered research framework.
3.4.2 Quantitative Analysis of Post-trial System Usability Scale Responses. In the SUS survey, participants rated the usability of GutGPT for each statement using a 5-point Likert scale, from "strongly disagree" to "strongly agree". We recorded the number of responses in each sentiment category for each statement. We quantified participants' average attitude towards each SUS statement by assigning a numerical score to each sentiment category: "strongly disagree" as -2, "disagree" as -1, "neutral" as 0, "agree" as 1, and "strongly agree" as 2. With this assignment, the mean score for a statement is the average of these values weighted by the frequency of responses in each category, and is centered at zero ("neutral"). A positive mean value thus suggested general agreement with the statement, while a negative mean value indicated disagreement. Additionally, the average attitude was calculated separately according to whether the GutGPT chatbot was accessible (Figure 5).
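The scoring scheme described above can be illustrated with a short sketch. This is not the study's analysis code: the function name `summarize_statement` and the example responses are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import Counter
from statistics import median, mode, stdev

# Mapping described in the text: sentiment category -> zero-centered score.
LIKERT_SCORES = {
    "strongly disagree": -2,
    "disagree": -1,
    "neutral": 0,
    "agree": 1,
    "strongly agree": 2,
}


def summarize_statement(responses):
    """Summarize the responses to a single SUS statement (hypothetical helper)."""
    scores = [LIKERT_SCORES[r.lower()] for r in responses]
    counts = Counter(scores)
    # Mean weighted by the frequency of each category; this equals the
    # plain arithmetic mean of the individual scores.
    weighted_mean = sum(score * n for score, n in counts.items()) / len(scores)
    return {
        "mean": weighted_mean,
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "sd": stdev(scores) if len(scores) > 1 else 0.0,
        "median": median(scores),
        "mode": mode(scores),
    }


# Hypothetical responses for illustration only; not study data.
example = ["agree", "agree", "neutral", "strongly agree", "disagree"]
summary = summarize_statement(example)
print(summary)
```

A positive `mean` corresponds to overall agreement with the statement, and a negative one to overall disagreement. Running the same function separately on responses from sessions with and without chatbot access would give the per-arm averages.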
It is important to note that we determined a threshold of 85 participants in each arm for the experiment to detect an effect size of Cohen's f² = 0.1 in the technology acceptance metrics of UTAUT (Unified Theory of Acceptance and Use of Technology [108]) with 80% statistical power. While UTAUT data were used in another study in our GutGPT series [20], we adhered to this threshold for the sake of the overall study's coherence. At the time of this manuscript, enrollment for the study is ongoing. Therefore, the quantitative scores presented herein are primarily indicative of observed trends and are not suited to conclusive statistical significance testing.

Table 1. SUS statements retained in the survey.
1 "I thought the system was easy to use."
2 "I would imagine that most people would learn to use this system very quickly."
3 "I found the various functions of this system were well integrated."
4 "I felt confident using the system."

Table 2. Example prompts from participants.
"what is the patient's age"
10 "Do you give both octreotide and vasopressin or one"

3.4.3 Quantitative Analysis of Chatbot Prompting Pattern. We measured the frequency and length of questions asked by the participants when using the GutGPT chatbot, taking into consideration their medical education level and the type of clinical scenario (risk versus content). As described in Section 3.2, participants were randomly assigned access to the GutGPT chatbot for the content and risk scenarios separately. In addition, we conducted simulation sessions with medical student and resident physician teams, with varying numbers of sessions for each group. Hence, to ensure fair comparison, we tallied the total number of sessions in which the chatbot was available for each medical education level (medical students or resident physicians), type of scenario (risk or content), scenario (A, B, or C for risk scenarios, and A or B for content scenarios), and combination of these conditions. The question frequency in each situation (e.g., by provider teams of medical students in risk scenario A) was then calculated by dividing the total number of questions typed into the chatbot by the corresponding total number of sessions in which chatbot usage was allowed (Figure 6).
Similarly, the average length of questions was defined as the total word count of questions asked in a situation divided by the corresponding number of questions (Figure 7). To maintain simplicity and consistency, a "word" here referred to a continuous string of text between spaces. Under this definition, abbreviations such as "yo" (short for "year old") were considered single words.
As with the survey data, the insufficient number of participants restricts robust statistical analysis, so these statistics indicate trends rather than support definitive statistical significance testing.
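The two prompt metrics defined above can be sketched as follows. The data layout (tuples of group, scenario, and prompt text), the function name `prompt_metrics`, and the example values are assumptions made for illustration; they do not reproduce the study's logging format or data.

```python
from collections import defaultdict


def prompt_metrics(prompts, sessions_with_chatbot):
    """Compute question frequency and mean question length per condition.

    prompts: iterable of (group, scenario, text) tuples, one per question.
    sessions_with_chatbot: dict mapping (group, scenario) to the number of
        sessions in which the chatbot was available (assumed to be > 0).
    """
    n_questions = defaultdict(int)
    n_words = defaultdict(int)
    for group, scenario, text in prompts:
        key = (group, scenario)
        n_questions[key] += 1
        # A "word" is any run of characters between whitespace, so an
        # abbreviation such as "yo" counts as one word.
        n_words[key] += len(text.split())

    result = {}
    for key, n_sessions in sessions_with_chatbot.items():
        q = n_questions[key]
        result[key] = {
            # Questions per session in which the chatbot was available.
            "frequency": q / n_sessions,
            # Mean words per question; undefined when nothing was asked.
            "mean_length": n_words[key] / q if q else None,
        }
    return result


# Hypothetical prompt log and session counts; not study data.
prompts = [
    ("MS", "risk-A", "55 yo male with melena"),
    ("MS", "risk-A", "Should I admit the patient?"),
    ("RES", "content-A", "Do you give both octreotide and vasopressin or one"),
]
sessions = {("MS", "risk-A"): 2, ("RES", "content-A"): 1, ("RES", "risk-B"): 3}
metrics = prompt_metrics(prompts, sessions)
print(metrics)
```

Normalizing by the number of chatbot-enabled sessions, rather than by all sessions, keeps conditions comparable when the randomization gives groups different numbers of sessions with chatbot access.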

Elucidating Design Principles.
After completing the initial quantitative and qualitative analysis, we re-examined the research board created during rapid analysis (Section 3.4.1). In conjunction with quotes and sentiments from the qualitative analysis, quantitative results from the SUS and prompt data were analyzed with the goal of extracting principles for the effective use of AI-CDSS in clinical care. Qualitative themes and preliminary conclusions from quantitative data were pooled into common themes and insights that constitute the three design principles outlined in Section 5.4.

FINDINGS
First, we explore in Section 4.1 user behavior with the LLM chatbot through quantitative and qualitative analysis of user-generated prompts, user reaction to the generated responses, and how either the LLM chatbot or the interactive dashboard affected the clinical workflow. Then, we focus in Section 4.2 on the effect of clinical context and user characteristics on human-computer interaction with the LLM chatbot, such as the type of clinical task expected, varying levels of prior exposure to AI-CDSS or clinical expertise, and provider team dynamics. Finally, we describe provider concerns specifically pertaining to trust in Section 4.3.

Usability
4.1.1 System Usability Score. With the score assignment described in Section 3.4.2, the reported agreement with SUS statements one through four from our participants could be summarized as follows, using the format mean (standard deviation): 0.75 (0.698), 0.7 (0.954), 0.55 (0.921), and 0.35 (1.014), respectively. The medians were consistently near 1 (corresponding to the Likert scale response "Agree") for the first three statements and 0.5 for the fourth, while the most frequent score (mode) across all respondents was 1 for each statement (Figure 5). The calculated average, median, and mode were all positive, suggesting, albeit without sufficient statistical power, that participants had an overall positive attitude towards our model's usability, regardless of their access to the chatbot feature.
4.1.2 User-Generated Prompts. LLM chatbots like ChatGPT have garnered significant attention from both the public and the media, given their widespread availability online [99]. As a result, many people have first-hand experience with LLMs and have set expectations for LLM performance and functionality. When interacting with GutGPT, many participants drew indirect or direct comparisons with other LLMs they had used previously. Eight participants referenced other LLM chatbots they had used in response to questions about GutGPT's usability. Many cited their familiarity with ChatGPT as a reason for finding GutGPT easy to use, noting similarities between the two. One IM resident physician answered: "I think it was extremely user friendly, I had used ChatGPT before and so it seemed pretty similar to it." An IM resident physician also reported: "I immediately knew what to do when I started using it, it's just like ChatGPT." Many of these participants noted that prior experience with similar systems facilitated a smooth transition to GutGPT. We found a higher average prompt frequency per scenario (3.9 versus 2.4) and a higher average word count per prompt (15.3 versus 11.0) in content scenarios compared to risk scenarios, with similar frequency and word counts regardless of clinical expertise level.
Unlike the interactive dashboard, the chatbot requires direct user input to produce an output. While this allows for personalized questions, it also means users must craft questions they believe the chatbot can answer. This extra decision-making step proved challenging for many participants, particularly for those unfamiliar with chatbots. An EM resident physician explained: "The hardest part is AI is brand new to everybody, we don't really know the right questions to ask it or what it can and can't do. What is it going to give me appropriate data for... is it going to mislead me because I don't understand it?" However, as participants interacted more with the system as the trial progressed, their comfort grew. "At first I wasn't really sure what it knew and didn't know and how to make sure the questions I asked were the appropriate questions, it got easier as I went," reported an IM resident physician. To make the transition to use easier, several participants recommended a frequently asked questions (FAQ) section or autocomplete functionality similar to that of email clients or search engines.

GutGPT Text Responses. Clinicians value clinical decision support systems that are easy to use and deliver desired information quickly and intuitively. Seven of the participants suggested that GutGPT's text output was too lengthy for efficient use. One participant commented, "It puts out large blocks of text at times, especially when citing sources... that takes a while to read through." Time pressure is a significant concern for all physicians, but it's especially pressing for EM physicians who see a large number of patients during their shifts in the emergency department. They must process vast amounts of data and make numerous clinical decisions in short periods of time. An EM resident physician noted about the text output: "I like the response, however I can see myself saying 'this is taking too long to read' on shift, and I don't think I would do it for every patient, I would probably do it for patients I'm a little unsure about." In addition to the volume of text in the typical chatbot responses, participants also commented on the structure of the responses. Three participants pointed out that the information was often presented as a dense paragraph, making it hard to skim or quickly comprehend. They suggested using bullet points or emphasizing key management principles for a clearer presentation, rather than the uniform format of GutGPT's outputs.

Integration into Existing Workflow.
Resident physicians and medical students are accustomed to using the EHR to acquire patient information and aid their clinical decision-making. Users' experience with the tool's EHR integration varied based on their usage patterns and trust in the model's incorporation of patient data. Several contrasted this with traditional CDSS data entry. A medical student noted, "I thought it made [GutGPT] very different than existing clinical prediction tools because I don't need to input every detail myself because they are already incorporated in and it makes me more comfortable that I'm able to use such information." Five participants said they were not pleased with the EHR integration of the chatbot. While laboratory and vital data were populated into the chatbot's risk calculations, some participants still took considerable time entering this data as text in their chat queries. As a result, several participants indicated during the post-trial interview that this data entry significantly slowed down their interaction with GutGPT and emphasized the need to refine this feature. When prompted to consider using GutGPT in a real clinical situation, a medical student indicated: "I think it could definitely help, I feel like it would probably be dependent on its integration into Epic."
A common refrain from participants dealt with the scope of information the chatbot could access to generate its responses. Clinical workflows in evaluating UGIB by IM and EM physicians usually involve calculating the Glasgow-Blatchford score (GBS), a tool that stratifies patients with suspected UGIB into high or low risk bleeds [55]. High risk bleeds are more likely to require hospital intervention, while patients with low risk bleeds can likely be safely discharged. Physicians and medical students frequently use online medical reference tools such as MDCalc to access the GBS. During the simulations, several participants tried to use GutGPT to score their simulated patients on the GBS. Often they would directly query GutGPT to calculate the GBS for the patient. As GutGPT is an LLM trained on clinical gastroenterology guidelines, it does not have access to clinical calculators. Thus, when questioned about the GBS, GutGPT typically either stated its inability to compute the score or listed the GBS's components without performing the actual calculation. Such responses understandably frustrated participants; three spoke about it during the qualitative interview. The prevailing suggestion was to integrate GBS calculation within the chatbot. An IM resident physician remarked, "it's easier to go onto MDCalc and do [a GBS calculation] rather than using GutGPT if it's going to say 'well this is what GBS means, but we don't actually have the data to pull'." Participants preferred to access familiar CDSS and expected GutGPT to be able to access those tools. A number of participants volunteered answers indicating that the dashboard was not straightforward to interpret quickly. An IM resident physician explained, "I couldn't figure out how to use the dashboard in time," a sentiment echoed by others.
Participants found the graphical representation challenging to interpret. An EM resident physician likened it to "looking at the engine of a car that you just bought and you have no idea how it actually runs", highlighting its complexity. The resident physician further reasoned that while potentially beneficial, learning how to use the tool would demand significant time. The intricacy of the graphical representation deterred some from further interaction, believing it wasn't worth the time they could otherwise spend interacting with the scenario, collaborating with teammates, or using other clinical tools. Nine participants reported difficulty in interpreting the graphs.

Human-Computer Interactions
4.2.1 Clinical Tasks. The quantitative analysis of the GutGPT chatbot conversations reveals a consistent interaction pattern among provider teams of clinicians, irrespective of their clinical expertise level. A provider team typically prompted the chatbot between 1 and 5 times per simulation scenario, with an average of 3 times (Figure 6). The questions had an average length of 13 words (Figure 7).
Notably, when faced with content scenarios, participants asked an average of 1.5 more questions than in risk scenarios. This difference can possibly be attributed to the varying complexity of tasks in these two types of scenario. In risk scenarios, the only task for the participants is to decide whether to admit the patient. A brief query like "Should I admit the patient?" or a single search for risk scores could suffice. In contrast, content scenarios require the participants to make a series of management decisions, necessitating reference to medical guidelines and adaptability to changing patient conditions. Ideally, a single query to our LLM chatbot could supply comprehensive guidance, but participants frequently probed further for detailed information or clarification, leading to an increased number of questions to the chatbot.
A related observation is that the participants asked many more questions in Scenario B of the risk assessment phase compared to the other two risk scenarios. Scenario B is a "borderline" case, where the decision to admit or discharge is not as straightforward as in the other cases because the patient's medical data could support either decision. This elevates the case's complexity and requires additional decision support.

4.2.2 Familiarity with the System. Alongside the difference in tasks, the sequence of risk scenarios before content scenarios might contribute to the observed difference in the prompting frequency. As participants became more familiar with GutGPT through risk scenarios, their confidence and willingness to use the system were likely to be higher in the subsequent content scenarios. This might lead to an increase in their interactions with the model, including its chatbot feature.
The reduced average word count per question in content scenarios also suggests a possible influence of familiarity with the system on clinician-chatbot interactions. By inspecting the prompting data, we find that longer questions often unnecessarily repeat patient information embedded in the model, indicating user unfamiliarity. Shorter questions in content scenarios could therefore signify improved user understanding and more efficient interactions.
However, we note that this interpretation should be taken with caution, since the conversation between the users and the GutGPT chatbot for each scenario appears to follow a pattern where the users list out the patient's details in the first question and ask follow-up questions without repeating the information. Therefore, the lower average word count of questions in the content scenarios could simply be a result of participants asking more follow-up questions in those scenarios.

Level of Clinical Expertise. While quantitative comparisons of participating medical students' and resident physicians' prompt data revealed similar model interaction patterns, qualitative post-simulation interviews indicate that the purposes and experiences of the interactions differed across levels of clinical expertise. Compared to the resident physicians, participating medical students reported more frequently that GutGPT was "super helpful" and could "provide expertise" given their knowledge level. Nine medical students responded in this way, compared to only two resident physicians. Some medical students even positioned GutGPT in leadership roles such as "consultant" and "attending", while resident physicians recognized GutGPT more as a "partner" or "team member" that performs an assistant role in decision-making. One medical student mentioned adopting the model's suggestions even if they conflicted with their own judgment. In contrast, half of the resident physicians who performed the leader role during the simulation reported that the recommendations generated by the model did not affect their decisions.

4.2.4 Team Composition and Dynamics. In addition to individual-level factors such as medical expertise, the composition and dynamics of provider teams are expected to shape clinician-model interactions as well. Our provider team-based simulation design did not capture this aspect in the quantitative data, but the post-simulation interviews provided some valuable insights. The roles of the team members in the simulation were designed to reflect real-life clinical teams, with senior members in executive decision-making roles (using the dashboard and/or GutGPT) and junior members more responsible for gathering information (e.g., interviewing the mannequin and gathering lab data). That perspective informed some of the participants' responses in the interview. As an IM resident physician placed as a junior member stated: "I was focused on the patient so I didn't get a good look [at the GutGPT system], but it seemed like a useful response from my brief look." Several other IM resident physicians and medical students in the junior member role shared similar comments. Another junior IM resident physician said, "I let [the senior team member] deal with the model, and I just worried about the history and physical." Some participants also remarked that their adoption of an AI system depends on team dynamics. A medical student projected that their usage of the model would depend on other provider team members' opinions toward the model: "I think I would be more likely to use it if my attending wanted me to use it." Several participants' opinions diverged on whether using GutGPT would be more conducive to good clinical practice in a team or alone. An IM resident physician noted, "Typing into the chatbot takes you away from the primary focus of the patient, I would only use it if I'm part of a team for that reason." Others appreciated the input of another source that could function like a team member. Two participants reported that the perceived benefits of using the model could depend on the number of people present in the provider team. The context the medical team works in also made a difference for some participants. A medical student said, "I think I would feel weird using it in acute situation," indicating that the setting a team practices in could influence adoption.

Trust
Clinicians' trust in the AI-CDSS they interact with plays a large role in whether they decide to adopt the tool into their workflow [32,107]. In our interviews, several participants discussed their trust of AI-CDSS, some reflecting on their general attitudes about AI-CDSS, and others on their trust of GutGPT after interacting with it first-hand.
While talking about their general attitudes surrounding AI-CDSS, participants voiced concerns about the moral and legal implications of fully adopting these tools in healthcare. Many stated that they "did not believe that clinicians should 100% rely on the model," while citing reasons such as how they were "concerned with liability and responsibility if [they] followed the model and the patient had a bad outcome." Some participants who had prior knowledge of AI systems also reasoned that the AI may output false responses. For instance: "I was concerned that the chatbot will hallucinate, which is particularly bad in medicine." Participants reported they would be less willing to employ LLM-augmented CDSS when there are inaccuracies in the information they output.
Participants also provided direct feedback on their level of trust in GutGPT, based on their experience using the tool during the simulation scenarios. The most commonly cited reason why participants could not trust the chatbot's outputs was that they did not know what data the chatbot was drawing from. For example, one participant expressed dissatisfaction as the chatbot "did not provide any citations", and another participant said "it would be nice to have hyperlinks of sources and knowledge of where the AI pulled from." This is consistent with the findings of [26] that having fully transparent insight into how an AI generates its output is central to a clinician's decision to utilize the tool. Further, some participants believed that an AI chatbot could not fully replace a clinician's intuition, and therefore could not be trusted fully. One participant claimed that collecting atmospheric and "emotional" data when entering a patient room is an important part of their workflow, so a limitation of GutGPT was that it did not have such information. We believe that further exploring the implications of this feedback is important for establishing trust in LLM-augmented AI-CDSS in the future.

DISCUSSION
Our qualitative and quantitative findings suggest that an LLM-augmented AI-CDSS may increase ease of use (Section 5.1), must address challenges with user trust (Section 5.2), and elicits different user patterns based on clinical context and user background (Section 5.3). In Section 5.4 we synthesize our findings into three principles for building LLM-augmented AI-CDSS that can meaningfully enhance the work of provider teams in clinical care.

Large Language Models May Increase Ease of Use for AI-CDSS, but Familiarity Affects User Perceptions
Familiarity is a key aspect of usability, as users are more likely to find recognizable features intuitive. GutGPT was designed with popular AI systems like ChatGPT and text messaging platforms in mind. The participants' opinions on usability sharply differed between the LLM-augmented AI-CDSS (GutGPT) and the interactive dashboard AI-CDSS alone. GutGPT had a recognizable interface that may have contributed to its positive initial reception, which is supported by interview feedback from many participants as well as survey results suggesting that the system was perceived as easy to use. While the chatbot was seen as immediately intuitive by a large portion of participants, the majority of participants commented that the dashboard was difficult to interpret or not worth taking the time to interpret. However, the qualitative results are more nuanced: despite GutGPT being easier to use than the interactive dashboard, participants remarked on an "activation energy" required to use a chatbot that is not an issue for AI-CDSS without an LLM. For example, the interactive dashboard requires no user input for a risk score to be displayed. For GutGPT, there were specific user remarks about a hesitancy to use the system, stemming from uncertainty about how to form prompts, that must be overcome. This hesitancy can be especially problematic if the user is unfamiliar with LLMs, as reflected by participants who found themselves at a loss on what questions to ask GutGPT. The slight disagreement with the SUS statement "I felt confident using the system" among participants who used the GutGPT chatbot compared to those who used the dashboard alone could be explained by unfamiliarity with the system as well, given that no one disagreed with the statements that "the system was easy to use" and "most people would learn to use this system very quickly". Uncertainty with use can be problematic in a real clinical situation, as many competing demands make clinical care increasingly time-constrained. Typing a query and reading a response in natural language places a further demand on the clinician that could be costly from a time and cognitive-load perspective. Reassuringly, difficulty in prompting faded as participants became familiar with the chatbot and became comfortable working with it. From these observations, the initial approachability was a key factor in allowing participants to experiment with the chatbot and eventually become accustomed to it. However, the dashboard's interface seemed too steep a challenge to interpret in a short simulated case and was ignored in many trials. User familiarity with the interface is not the only experience that matters; in clinical decision-making, familiarity with existing and traditional CDSS can hinder use of AI-CDSS. In UGIB, the traditional CDSS is a clinical score, the GBS. Since this CDSS is familiar to providers when caring for patients with UGIB, it was natural that participants reached for this CDSS rather than utilizing the AI-CDSS in our study. This finding mirrors those of another usability study of AI-CDSS [111], where the frustration at using the AI-CDSS came in part from an incomplete understanding of the technical capabilities of AI-CDSS. Interestingly, participants also expressed a desire for the text output from GutGPT to mirror their preferred clinical reference styles. Many desired bullet points or highlighted management steps in the text output, similar to medical reference texts like UpToDate. Possible solutions proposed by users include clear statements of the AI-CDSS capabilities to prevent frustrations that impair usability, as well as the functionality to access traditional CDSS.
The LLM effect on usability for AI-CDSS is consistent with "Unremarkable AI", an idea that stresses unobtrusiveness as crucial to successful adoption of AI-CDSS [116]. If the ideal implementation scenario for an AI-CDSS is one that fits smoothly into the existing workflow of a clinician with little deviation, effort should be made to craft AI-CDSS that resemble existing tools or applications that clinicians have confidence in navigating. Ideally, AI-CDSS would complement activities that physicians already perform in their jobs. Much of physician responsibility involves data collection, writing clinical notes in the EHR, and deciding which tests and treatments to order. EHR integration is an important factor in achieving the familiarity that will promote use of an AI-CDSS. EHRs not only contain patient data but also offer clinical calculators, like MDCalc, and clinical pathways to guide diagnostic and treatment choices. During the trials, participants expressed the desire that any AI tool be integrated seamlessly with EHRs. Adequate integration addresses the time pressure and ease of use that many participants alluded to in their answers: an embedded assistant within the EHR that is quickly accessible and helpfully collates relevant patient data. Borrowing from the TURF framework for EHR usability, a system attains acceptable usability when it is easily "learnable" and requires little mental effort to use [119].

Large Language Models Require Justification with Citations to Promote Trust
We understand trust of algorithmic interfaces as Kizilcec does: "an attitude of confident expectation in an online situation of risk that one's vulnerabilities will not be exploited" [46]. Overall, participants expressed a lack of trust in LLM-augmented CDSS and indicated that this lack of trust would deter them from adopting the tool into their workflow. This is consistent with findings by Rousseau et al. that trust plays an important role in determining whether one is willing to adopt new technologies, particularly those involving AI [83]. However, our qualitative interviews showed that one factor that may positively affect trust in GutGPT responses was the presence of relevant citations, which may indicate a need for transparency regarding the data used to generate the responses. Clinicians are inundated with vast amounts of information that they must sift through to make evidence-based diagnostic and management decisions. Clinical guidelines from medical professional societies can be lengthy and difficult to parse for details pertaining to a specific patient. The primary literature from which the guidelines are constructed can be even lengthier and is sometimes contradictory. In response, clinical reference websites such as UpToDate and ClinicalKey have risen in popularity, offering concise, aggregated information with relevant citations. Many participants reported that GutGPT chat outputs were hard to trust because some did not provide citations identifying the source of the information. When the chat included citations, participants specifically emphasized how useful they found the response. Clear communication about the data used to generate responses from LLM-augmented CDSS is consistent with other studies, which found that high-quality labeling leads to higher perceived training data credibility, which in turn enhances users' trust in AI [22]. It is thus imperative to be transparent about the data from which the LLM generates its responses; when providing a recommendation for clinical management, directly relevant citations should be displayed with every generated response. While websites like UpToDate have made evidence-based clinical decision-making easier, they are still general reference materials. They need to be tailored to individual patient scenarios and might not cover unique clinical situations. This represents an opportunity for chat-based AI-CDSS, as information from primary sources can be presented to the clinician in easy-to-understand natural language.
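The recommendation that every response carry directly relevant citations can be illustrated with a minimal sketch. It assumes a hypothetical `ChatResponse` structure and `render_response` helper; neither is described as part of GutGPT, and the choice to flag uncited replies rather than suppress them is an assumption made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChatResponse:
    """Hypothetical container for one chatbot reply (not GutGPT's actual schema)."""
    text: str
    citations: list[str] = field(default_factory=list)


def render_response(response: ChatResponse) -> str:
    """Format a reply for display, always making its citation status explicit."""
    if not response.citations:
        # Assumption: an uncited reply is flagged rather than silently shown.
        return f"{response.text}\n[No supporting source available for this response]"
    sources = "\n".join(
        f"[{i}] {ref}" for i, ref in enumerate(response.citations, start=1)
    )
    return f"{response.text}\nSources:\n{sources}"
```

Under this design a clinician never sees a recommendation without either its sources or an explicit notice that none were found, which is one possible way to surface the transparency that participants asked for.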

Human-Computer Interactions Vary By Clinical Tasks and Team Dynamics, But Large Language Model Usage Metrics Are Similar
We found that participants interacted with GutGPT differently based on the clinical task required. Teams using GutGPT in content scenarios submitted more queries to the chatbot than teams in risk scenarios: an average of 3.9 queries in content scenarios compared with 2.4 in risk scenarios. This indicates that the chatbot was used more heavily when making decisions about a care plan and less for risk assessment. Teams in risk scenarios were asked to determine the risk assessment for the simulated patient, essentially sorting the patient into one of three risk categories. Teams in content scenarios were asked to stabilize and treat the simulated patient, an open-ended situation in which the management options are numerous and unstructured. Choosing the "correct" management decisions requires clinical expertise and familiarity with UGIB guidelines. GutGPT's guideline-driven recommendations may therefore be perceived as more helpful in these management situations. This differential use is consistent with the paradigm that workflow incorporation of CDSS depends on the needs of human practitioners [94].
We also found differences in how teams interacted with GutGPT according to their level of clinical expertise. Our qualitative interviews suggested that provider teams with more clinical expertise (resident physicians) usually interacted with the AI-CDSS to confirm their own impression or decision, whereas those with less real-world experience (medical students) attributed more expertise to the AI-CDSS and interacted with the system with more deference. The qualitative interview data also indicated that team dynamics contributed to potential use behaviors. The medical team can be a hierarchical structure in which group dynamics are often modeled after senior members [106]. Participants assigned as junior team members volunteered that they would be more likely to use an AI-CDSS if such tools were accepted by superiors and peers, indicating that the social expectations of the medical team are an important influence on AI-CDSS adoption and continued use. Division of labor in the medical team also tracks seniority, with junior members functioning primarily as data gatherers and reporters while senior members shoulder a larger burden in decision-making, resource allocation, and planning. These roles were reproduced in our simulations: junior members reported that they did not spend time familiarizing themselves with the system but instead dove into their roles of interviewing and examining the simulated patient. Because our simulated teams approximate real clinical teams, these findings show the importance of engaging the key stakeholders when targeting AI-CDSS. In a busy medical team, junior members may find lengthy interactions with an AI-CDSS poorly suited to their role, while more senior members might be better situated to devote time and cognitive energy to using AI-CDSS properly. Our findings can be placed in context with existing literature suggesting that interactive technologies are highly dependent on team processes and can influence leadership and team management [30,54]. These team dynamics are particularly important to consider when deploying technologies such as LLM-augmented AI-CDSS in environments with heterogeneous teams.
Interestingly, we found that provider teams had similar patterns of prompt generation and length across different levels of clinical expertise, which suggests a baseline of, on average, 3 prompts of 13 words each for provider teams using LLM-augmented AI-CDSS in time-limited, high-stakes scenarios. This benchmark is particularly valuable because, to our knowledge, we are the first group to measure usage of an LLM-augmented AI-CDSS under real-world clinical simulation conditions.
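As a rough illustration of how usage metrics of this kind (prompts per session and words per prompt) could be derived from chat logs, consider the sketch below. The `usage_metrics` function, the whitespace-based word count, and the toy inputs are all assumptions made for illustration; they are not the study's analysis code or data.

```python
from statistics import mean


def usage_metrics(sessions: list[list[str]]) -> tuple[float, float]:
    """Return (mean prompts per session, mean words per prompt).

    Each session is the list of prompts one provider team submitted.
    Words are counted by splitting on whitespace (an assumption).
    Raises statistics.StatisticsError if there are no sessions or no prompts.
    """
    prompts_per_session = [len(prompts) for prompts in sessions]
    words_per_prompt = [len(p.split()) for prompts in sessions for p in prompts]
    return mean(prompts_per_session), mean(words_per_prompt)


# Illustrative inputs only; these are not data from the study.
toy_sessions = [
    ["should i admit this patient", "what is the risk of intervention"],
    ["can you calculate the score"],
]
avg_prompts, avg_words = usage_metrics(toy_sessions)
```

Grouping the sessions by scenario type or by level of expertise before calling the function would give the stratified comparisons discussed above.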

Design Principles
Drawing from the user-model interaction insights gleaned from our study, we propose three design principles for AI-CDSS with LLM-augmented interfaces:
5.4.1 Comprehensive Usability Focus. The common frustration reported in our study with both dashboard graph interpretation and chatbot prompting underscores the necessity of crafting an integrated solution that gives due attention to improving the usability of both the algorithmic output and the LLM-augmented user interface. While enhancing the interpretability of algorithm-generated outputs remains crucial for AI-CDSS, equal importance should be placed on giving users clear guidance on how to interact with, and what to expect from, new technologies like LLMs. Moreover, a strong design should prioritize seamless integration of these functionalities, an aspect with which users of our LLM-augmented AI-CDSS expressed dissatisfaction in the SUS survey. Participants placed special emphasis on EHR integration, which is a common refrain in usability studies of AI-CDSS [112].
5.4.2 Customized Deployment Strategies. As reported in their interviews, participants formed different perceptions of our model's usability and role in clinical decision-making through the same simulation setup, according to their own levels of clinical expertise and roles in the workflow. This emphasizes the importance of tailoring the model's deployment strategies to accommodate the varying medical specialties and specific needs of different users within the healthcare ecosystem. As Sendak et al. highlight, stakeholders of varying specialties and expertise should be engaged to provide and iterate on feedback for LLM-based AI-CDSS [91].
5.4.3 Understanding and Navigating Team Dynamics. Our study provides preliminary evidence that provider team composition (e.g., in a team or alone) and dynamics (e.g., other team members' perception of the model) exert a complex influence on clinician-CDSS interactions. While further investigation is necessary for a deeper understanding of this topic, the design of AI-CDSS should prioritize adaptability and customization to accommodate diverse team compositions. Additionally, strategies such as training in the use of emerging technology may be implemented to ensure effective and harmonious clinician-AI interactions.

LIMITATIONS
Medical simulation is primarily designed as an educational exercise to facilitate the acquisition of skills by medical trainees in an environment that emulates some of the practical realities of interacting with a patient. However, the simulation environment is an imperfect approximation of a real clinical environment. Use of a simulation mannequin, the lack of distractions, and the abridged time-course of a simulation are examples of factors that prevent medical simulation from achieving strict fidelity with the clinical environment. As a result, medical simulation is a calmer environment than the clinical one, which could encourage AI-CDSS use in situations where the time and social pressures of the real clinical environment might cause trainees to fall back on familiar traditional CDSS. Medical simulation is also an environment in which experimentation is welcomed: participants took time to test out and interpret the dashboard and chatbot, luxuries that might not have been afforded to them in the clinical environment. While we provide a quantitative snapshot of potential usage patterns of an LLM-augmented AI-CDSS, the interactions were pooled across all members of the provider team and could not depict individual-level user behaviors (e.g., how model interactions vary based on the user's role in the provider team). Likewise, our study did not capture scenarios in which the clinician interacts with the model independently; some participants touched on this aspect in their qualitative interviews, presenting contradictory views on model usage in such situations. All participants in this study were trainees who had not yet qualified to practice independently. The majority (76.7%) were under the age of 29, younger than the average independently practicing attending physician. Trainees are in a period of rapid learning of tools and methods that help in clinical care, so it would likely be easier for them to adopt new technologies in their clinical workflows than for more experienced clinicians. Clinicians with more experience are more confident in their clinical decision-making and might be less willing to incorporate a new tool into their workflow. Younger people are also more likely to have higher acceptance of AI technology [45].
Another limitation of our study arises from the ongoing and rapid advancement of AI. New models, architectures, and techniques emerge with improvements in their capabilities and performance, so the usability challenges associated with AI-powered systems are likely to shift quickly. One immediate example is that the response latency issue of GutGPT has been mitigated by recent improvements in GPT-3.5-Turbo's inference time.
Lastly, our research plan could be improved. The present study design struggled to distinguish among the effects of increased model familiarity from learning, task complexity, and model performance on our data, especially the prompting patterns. Further exploration of the dashboard's usability through an independent assessment is needed to establish a baseline for better evaluating the value of LLM integration. The truncated SUS survey may limit comparison with established usability standards, and because the SUS survey was administered only for risk scenarios, it could not reveal users' perception of usability across the whole trial.

FUTURE WORK
Our work to evaluate GutGPT and develop a more comprehensive understanding of clinicians' attitudes toward the AI-CDSS is still ongoing.
We will continue recruiting participants for our usability research to reach an effective sample size for statistical testing. We will improve our study design to address the limitations described at the end of Section 6. More relevant data, such as performance metrics of the LLM component and the time spent on each simulation scenario with or without the chatbot, will be collected and assessed in future trials for a better understanding of user-model interactions.
We acknowledge the importance of iterative design in the human-centered approach. We plan to extend our current understanding of user preferences in the following research directions: 1) for usability, we plan to provide guidance on query construction and evaluate its effect on decreasing the initial activation energy that hampers use of the chatbot; 2) for trust, we plan to explore a more active role for an LLM as a team member that listens to and summarizes provider team interactions during the clinical decision process; 3) for user-computer interactions, we plan to customize a workflow that allows individuals to interact with the LLM-augmented AI-CDSS and that integrates individual user prompts with the provider team's interaction with the LLM-augmented AI-CDSS.
We plan on updating GutGPT to reflect the three design principles for LLM-augmented AI-CDSS that we proposed in Section 5.4. To improve the usability of the LLM-augmented user interface, one potential solution is to provide guidance on writing queries: this can be achieved through query suggestions generated from commonly asked questions, or through "query building blocks" in which clinicians simply click on components of queries to quickly build their prompt. To better serve users with different levels of clinical expertise in UGIB, customizable modes might be developed: clinicians with fewer years of training could default to model responses with more detailed explanations and more references that provide the required expertise, while those at a higher training level could choose to receive more concise replies for faster decision-checking. To address difficulties in model interactions in special cases, such as when clinicians work alone in emergency care, advanced features like real-time speech recognition might be added to GutGPT to enable automatic patient-interview summarization that streamlines the clinician's workflow. We plan on implementing these design changes to GutGPT before testing on additional participants, further examining how these changes influence user behavior at both the individual and team levels.
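The "query building blocks" idea can be sketched as follows. This is only an illustration under stated assumptions: the `build_query` function, the block keys, and the block texts are hypothetical and are not part of GutGPT.

```python
def build_query(blocks: dict[str, str], selected: list[str]) -> str:
    """Join the clicked building blocks, in click order, into one prompt."""
    unknown = [key for key in selected if key not in blocks]
    if unknown:
        raise KeyError(f"Unknown building blocks: {unknown}")
    return " ".join(blocks[key] for key in selected)


# Hypothetical block texts, written for illustration only.
BLOCKS = {
    "role": "Acting as a consulting gastroenterologist,",
    "task_risk": "estimate this patient's risk of in-hospital intervention.",
    "task_plan": "suggest the next steps in management.",
    "cite": "Cite the guideline supporting each recommendation.",
}

prompt = build_query(BLOCKS, ["role", "task_plan", "cite"])
```

A clinician clicking "role", "task_plan", and "cite" would obtain a complete prompt without typing, which is the kind of reduction in initial effort that the proposed guidance aims for.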
We also believe that user experience should be studied under conditions that are difficult to achieve in the physical simulation environment. Medical simulations are a valuable training tool that has been found to enhance clinical competence at the undergraduate and postgraduate levels [2]. Even so, medical trainees struggle when transitioning into a real clinical setting [5] because of discrepancies including the static physical environment and the lack of environmental distractions in simulation rooms. Virtual Reality (VR)/Augmented Reality (AR) solutions have the potential to improve the level of realism and enhance learning in simulation studies of LLM-augmented AI-CDSS [2,39]. Transitioning to VR/AR simulation would allow scalable research on new AI-CDSS like GutGPT, increase the capacity to introduce diversity in medical training (including patient "dummies" of diverse demographics), and increase flexibility in creating and extending simulation environments. We are optimistic about this transition and are interested in investigating how it affects future AI-CDSS HCI research.

CONCLUSION
In this paper, we sought to extract insights from healthcare providers after they interacted with an LLM-augmented AI-CDSS in simulated clinical scenarios. We present preliminary findings from a randomized controlled trial with 31 participants arranged in provider teams who underwent simulated scenarios of UGIB with an interactive dashboard AI-CDSS with or without an LLM. We find that LLM-augmented AI-CDSS increases ease of use, and that trust can be improved through transparency in the form of supporting citations in the responses. We found an apparent baseline utilization pattern for the LLM-augmented AI-CDSS of approximately 3 prompts averaging about 13 words per prompt in each scenario across all participants, though the perception of the LLM-augmented AI-CDSS in human-computer interaction varied by clinical expertise: medical students appreciated the model's expertise, while physicians used the model as a check on their intuition. Senior and junior members of the clinical team displayed different behaviors towards AI-CDSS, with greater engagement from senior-level decision-makers. These insights underscore the importance of closely involving healthcare providers in the design and implementation of AI-CDSS. In light of our findings, we propose three fundamental design principles that can guide future refinements of GutGPT and the broader spectrum of AI-CDSS.

Figure 1: GutGPT chatbot interface. The chat interface on the left adopts a typical conversation-like design. Figures on the right display the patient's vital data and their effect on the hospital-based-intervention risks predicted by the underlying ML model, in the context of patients in the training database.

Figure 2: GutGPT dashboard interface. The left column displays the hospital-intervention risk for the current patient at the top and has sliders below for users to calibrate the model by adjusting the patient's vitals, labs, medications, and more (not entirely captured). The same figures as in the chatbot interface are displayed on the right.

Figure 3: Flowchart depicting the study design. The study comprises two phases, with randomization occurring separately at each phase.

Figure 4: Bar plot displaying number of sessions with the GutGPT chatbot.

Figure 5: Participant responses to the System Usability Scale survey. Data from participants who had access to the chatbot feature are colored in turquoise, while data from those who did not use the chatbot are in purple and hatched. The total height of each bar reflects the combined data.

Figure 6: Number of user prompts entered into the GutGPT chatbot in each scenario per simulation session. Plot (b) displays the same data as in (a), stratified by level of clinical expertise.

4.1.5 Interactive Dashboard Usability. The interactive dashboard displays patient risk scores either numerically or via a graphical representation. One medical student noted their preference for the interactive dashboard over the chatbot interface: "I trust much more when I see the numbers than the words... I have seen other AIs that are text-based and I've personally experienced that they are not working well, so I'm less inclined to trust it." They found it easier to comprehend the explainability of the interactive dashboard, which fostered trust in the system. Five other participants did not find the dashboard as intuitive as the chatbot: "a quick glance didn't tell me how to assess [the dashboard], but the [chatbot], I caught on pretty quick what the goal was, how to use it, how to interpret it."

Figure 7: Average user prompt length submitted to the GutGPT chatbot in each scenario. Plot (b) displays the same data as in (a), stratified by level of clinical expertise.

Table 1: SUS Statements in the Post-trial Survey. Responses were on a 5-point Likert scale.

Table 2: Example Question Inputs to the GutGPT Chatbot
-old woman with history of hypertension presents with one day of hematemesis in the setting of persistent vomiting for the past two days. Her vitals are 110/70 without other hemodynamic changes, and she takes no medications other than Calcium and amlodipine. What is a relevant differential and should we admit her?"
3 "Hello GutGPT, acting as a consulting gastroenterologist, write a consult note for a 70 y.o. male who presents with chest pain and a week ago with melena, with PMH of heart failure? THe patient is currently taking aspirin, statin and ACE inhibitor. What should the next steps for management be?"
4 "Labs are all normal, I think this pt should be discharge do you agree?"
5 "should i admit this patient"
6 "What is this pt's risk of in-hospital intervention?"
7 "Can you help me calculate the patients GBS score"
8 "what is the next best steps in management for a patient with GBS score of 7"
9