Towards AI-Driven Healthcare: Systematic Optimization, Linguistic Analysis, and Clinicians’ Evaluation of Large Language Models for Smoking Cessation Interventions

Creating intervention messages for smoking cessation is a labor-intensive process. Advances in Large Language Models (LLMs) offer a promising alternative for automated message generation. Two critical questions remain: 1) How to optimize LLMs to mimic human expert writing, and 2) Do LLM-generated messages meet clinical standards? We systematically examined the message generation and evaluation processes through three studies investigating prompt engineering (Study 1), decoding optimization (Study 2), and expert review (Study 3). We employed computational linguistic analysis in LLM assessment and established a comprehensive evaluation framework, incorporating automated metrics, linguistic attributes, and expert evaluations. Certified tobacco treatment specialists assessed the quality, accuracy, credibility, and persuasiveness of LLM-generated messages, using expert-written messages as the benchmark. Results indicate that larger LLMs, including ChatGPT, OPT-13B, and OPT-30B, can effectively emulate expert writing to generate well-written, accurate, and persuasive messages, thereby demonstrating the capability of LLMs in augmenting clinical practices of smoking cessation interventions.


INTRODUCTION
Smartphone-delivered message intervention has become an essential part of an effective smoking cessation program. Meta-analytical reviews have demonstrated the efficacy of receiving smartphone-delivered message interventions via short message service (SMS) or smartphone applications in promoting both short-term self-reported quitting behaviors and long-term abstinence [82,85,93]. Compared to traditional face-to-face counseling interventions, a short intervention message can offer real-time and tailored support when individuals are vulnerable to lapse or relapse of smoking, thereby significantly enhancing tobacco abstinence rates [21,43,70,97]. Research in the domain of smoking cessation treatment and behavioral intervention broadly calls for the integration of smartphone-delivered message interventions to improve treatment efficacy [31,57,98].
Despite the well-established efficacy of smartphone-delivered message interventions, maintaining user engagement and intervention effectiveness relies on the provision of novel and non-repetitive content [69]. Repetition in messages can elicit perceptions of redundancy and a feeling of information overload, which further diminish the perceived utility of the received information and attenuate attitudinal and behavioral changes [86]. Qualitative feedback echoes this observation; for example, participants who received a text messaging intervention designed to provide nutrition education and encourage better dietary choices suggested that future designs of text messaging interventions should avoid content repetition and offer an opportunity to learn new information with each interaction message [14]. Consequently, the development of effective smoking cessation treatment necessitates the creation of substantial, high-quality intervention messages, a labor-intensive process for experts that poses a significant challenge.
In the field of Natural Language Processing (NLP), the advent of Large Language Models (LLMs) has significantly facilitated the capability of message generation. These LLMs, equipped with vast training datasets, can not only mimic human writing but also draw upon the knowledge embedded in the training data to produce fluent text [12,15]. Previous research has examined the feasibility of LLMs for message generation in diverse health-related contexts. For example, Karinshak et al. [41] compared COVID-19 pro-vaccination messages generated by the GPT-3 model with human-authored messages released by the Centers for Disease Control and Prevention (CDC). Their findings revealed that messages produced by GPT-3 were perceived by crowdworkers as more effective, presenting stronger arguments, and evoking more positive attitudes about vaccination. Similarly, Lim and Schmalzle [50] undertook a comparison of health awareness messages on folic acid generated by the BLOOM-7B1 model with tweets on the same topic. Their study demonstrated that LLM-generated messages were rated higher in terms of message quality and clarity, compared to human-written tweets. Furthermore, computational text analysis indicated that LLM-generated messages exhibited similar characteristics to those written by humans in terms of sentiment, reading ease, and semantic content.
While prior research has supported the feasibility of LLMs in message generation [15,41,50], using LLMs to generate intervention messages for smoking cessation treatment necessitates the exploration of two critical questions: 1) How can LLMs be optimized to mimic human expert writing, and 2) Do LLM-generated messages meet clinical standards to be safely implemented in tobacco treatment programs? To address these pivotal questions, the present study conducted a systematic examination of the message generation and evaluation processes of LLMs within the context of smoking cessation intervention.
In the message generation process of LLMs, both the prompt choice and the decoding method serve as critical determinants that influence the quality of the generated text [3]. To refine LLMs to emulate human expert writing, this study conducted prompt engineering (Study 1) and decoding optimization (Study 2), involving the testing of five prompts and eight decoding methods to generate intervention messages across five LLMs (see Figure 1). The evaluation of LLM performance in message generation is a multi-faceted task, requiring a comprehensive assessment of message quality, coherence, relevance, grammaticality, and accuracy. Consequently, the employment of commonly adopted automatic metrics in LLM assessment is often oversimplified and inadequate in capturing the nuanced linguistic properties and overall text quality [4,9,75,81]. In recognition of these challenges, our study introduced a comprehensive evaluation framework encompassing diversity, quality, and efficiency. We built upon the computational linguistic approach and utilized the Linguistic Inquiry and Word Count (LIWC) tool to analyze linguistic features of LLM-generated messages [87], using intervention messages written by tobacco treatment experts as the benchmark. In Studies 1 and 2, we identified the prompt and decoding methods that most closely resembled human expert writing. These strategies were subsequently recommended and subjected to evaluation in terms of their clinical utility (Study 3).
While LLMs have demonstrated their capacity to generate effective public health messages [41,50] and have exhibited knowledge levels comparable to third-year medical students [29,44,83], their applicability in the high-stakes healthcare context necessitates rigorous evaluations to ensure both reliability and safety. Previous research underscores the importance of expert review in increasing the validity of LLM evaluations, especially for tasks requiring specialized expertise [9,23,38,48,56,95]. Therefore, to establish the efficacy of LLMs in generating intervention messages for clinical use, we invited certified tobacco treatment specialists (TTS) to conduct an expert review (Study 3). These specialists rigorously evaluated the messages generated by the five LLMs from Studies 1 and 2, and the newly released ChatGPT, on message quality, accuracy, credibility, and persuasiveness, using expert-written messages as the benchmark. Drawing upon their extensive clinical experience in tobacco treatment counseling, they further evaluated whether the generated messages met the standards in TTS training and were ready for use in clinical practice. Our results indicate that larger LLMs, including ChatGPT, OPT-13B, and OPT-30B, can effectively emulate human expert writing to generate well-written, accurate, credible, and persuasive messages, thereby demonstrating the efficacy of LLMs in augmenting smoking cessation interventions in clinical settings.
Our paper contributes to the current research landscape in several key aspects. First, we conducted a systematic examination of message generation and evaluation with LLMs, involving prompt engineering, decoding optimization, and expert review, conducted across six state-of-the-art LLMs, thereby ensuring the generalizability of our findings. Furthermore, we proposed an innovative and comprehensive evaluation framework for LLM assessment, integrating automatic metrics, linguistic attributes, and expert evaluation, offering a multifaceted assessment that delves into the diversity, quality, and efficiency aspects of LLM performance. To date, this is the first study to adopt the computational linguistic approach for the assessment of LLMs and to establish the efficacy of LLMs in specialized healthcare contexts through expert reviews.
This paper is organized as follows: Sections II and III address the first research question on prompt engineering and decoding optimization. Section II provides an overview of the prompt engineering literature, introduces the LLMs employed for message generation, presents a multi-faceted evaluation framework, and assesses prompts accordingly. Section III reviews literature on decoding parameters and examines how decoding methods influence message generation. Section IV addresses the second research question by incorporating expert reviews of LLM-generated intervention messages. Section V discusses future applications and limitations of LLMs in the healthcare context, and Section VI concludes.

Related Work
Large language models (LLMs) are deep learning models trained to understand and generate natural language. Autoregressive LLMs work by using a series of input tokens (words or fragments thereof) to generate subsequent tokens [15,73]. Grounded in the transformer architecture, a cutting-edge neural network design characterized by its massive parameter sizes, LLMs can successfully understand patterns within the input tokens through a self-attention mechanism [89]. In recent years, the efficacy of LLMs has been examined and validated for various natural language tasks, including automatic summarization, machine translation, and question answering [15,60]. Leading open-source models in the field include GPT-J-6B, the BLOOM series, and the OPT series. For the purposes of this research, we employed models compatible with our computational resources, namely GPT-J-6B [90], BLOOM-7B1 [78], and OPT 6.7B, 13B, and 30B [100]. GPT-J-6B is an open-access alternative to GPT-3 and was trained on the Pile dataset [11], a predominantly English-centric text corpus that combines sources such as English Wikipedia and PubMed Central. BLOOM introduces modifications to the conventional Transformer architecture [89], including the incorporation of ALiBi positional embeddings [72] and an embedding LayerNorm. BLOOM was trained on the ROOTS dataset [46], a multi-lingual text corpus. The architecture of OPT is similar to GPT-3, and the model was trained on a corpus comprising sub-datasets of RoBERTa [53], a subset of the Pile, and a subset of the Pushshift dataset [8]. The primary language of OPT's corpus is English. We used LLMs of diverse sizes, ranging from 6 billion parameters (as in GPT-J-6B) to 30 billion parameters (as in OPT-30B), to identify the optimal model for message generation in the healthcare context and to enhance the generalizability of our findings.

Prompt engineering for LLMs
We employed the "tuning-free prompting" approach [25,52,92,102] to generate intervention messages, a method in which the parameters/weights of the model are fixed and in-context learning is used [15]. This approach leverages the input context to guide the model's responses without the need for task-specific fine-tuning. For example, to instruct LLMs to generate intervention messages for smoking cessation, we can provide a handful of intervention messages as examples to help models infer the format of the message output. The number of examples provided differentiates the learning process into three categories: zero-shot, one-shot, and few-shot. In zero-shot learning, the model is given a natural language description of the task without any examples, whereas one-shot and few-shot learning provide the model with one or a few context-relevant examples. As concluded by Brown et al. [15], LLMs are meta-learners that integrate outer-loop gradient descent learning with in-context learning; few-shot learning, by prepending context-specific examples, can greatly enhance model performance and adaptation to task contexts, especially in areas requiring specialized knowledge and messaging styles, such as smoking cessation intervention.
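For concreteness, the three learning setups differ only in how the prompt string is assembled before being passed to the model. A minimal sketch follows; the exemplar texts are hypothetical placeholders, not messages from the study's expert-written library.

```python
# Sketch: assembling zero-shot, one-shot, and few-shot prompt strings
# for tuning-free prompting. Exemplar texts below are hypothetical.

def build_prompt(instruction, exemplars=()):
    """Concatenate a task instruction with optional in-context examples."""
    lines = [instruction]
    lines += [f"- {msg}" for msg in exemplars]
    return "\n".join(lines)

instruction = "Write motivational messages to encourage people to quit smoking:"

zero_shot = build_prompt(instruction)  # task description only, no examples
one_shot = build_prompt(instruction, ["Cravings pass in a few minutes."])
few_shot = build_prompt(instruction, [
    "Cravings pass in a few minutes.",
    "Every smoke-free day improves your health.",
    "Call a friend when you feel the urge to smoke.",
    "Deep breathing helps you ride out a craving.",
])
```

The few-shot string prepends context-specific examples to the instruction, which is the configuration the study relies on.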
For message generation tasks using few-shot learning, a prompt is defined as the initial input provided to the LLM for generating message output [52]. In our study, this input comprises both the task instruction and the message exemplars authored by human experts. For example, when prompting the BLOOM-7B1 model with the task instruction "Write motivational messages to encourage people to quit smoking:" along with four sample messages written by human experts, the model returned the following message: "If you know that you are having a strong craving, you can take a deep breath and relax your muscles. When you can relax, you can focus on something else, like reading or watching TV. Do not attempt to fight the urge to smoke. Instead, take the next deep breath and wait for it to pass." Both the prompt and the accompanying examples can influence the quality and relevance of the model output. Prompting serves multiple purposes: setting the context, clarifying the nature of the task, guiding response format, improving relevance and accuracy, and reducing ambiguity and potential biases [15,53,73]. As highlighted by the PromptBench benchmark [103], LLMs are sensitive to prompts, underscoring the importance of prompt engineering for optimal model performance. Therefore, enhancing model efficacy necessitates prompt engineering, an iterative refinement process that involves rephrasing the prompt for clarity and precision, providing context-relevant examples, and specifying the response format.
In this study, we employed manual template engineering to generate prompts [52] (see Table 1). The first three prompts were adapted from [15], with each subsequent prompt version increasing in length and detail. The first version provided a broad instruction with no context ("Message:"), the second version provided a specific theme for the message output ("Messages to help you quit smoking:"), and the third version instructed explicitly on both message theme and desired tone ("Write motivational messages to encourage people to quit smoking:"). Further, the fourth version introduced a variation in placement by positioning the task instruction after the example messages (example messages + "Write messages like the previous ones:"). By prepending message exemplars before the general instruction "Write messages like the previous ones:", we were able to test whether providing context-related exemplars and having LLMs infer the task instructions, without explicitly conditioning on the task, theme, or tone, would lead to more relevant and accurate output. Lastly, the fifth version adopted a structured format [51,79], using tags to label the task and examples ("Task: Write messages that are on the same topic" + Message 1: … + Message 2: …). Structured prompting was introduced to help LLMs better distinguish between task instructions and exemplars, and its effectiveness in increasing the accuracy, relevance, and context appropriateness of the model's output was further examined. The message example dataset contains intervention messages from existing research on smoking cessation intervention, written by tobacco treatment experts [16,17,35,36].
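The five prompt variants described above can be sketched as plain string templates wrapped around four exemplars. The sketch below is illustrative only; the exemplar texts are hypothetical placeholders rather than messages from the expert-written library.

```python
# Sketch of the five prompt versions from Table 1, assembled as strings
# around four exemplar messages (hypothetical placeholders).

EXEMPLARS = [
    "Cravings usually pass within a few minutes.",
    "Every smoke-free day improves your lung health.",
    "Call a friend when the urge to smoke hits.",
    "Slow, deep breaths can help you ride out a craving.",
]

def join_examples(examples):
    return "\n".join(examples)

# v1: broad instruction, no context
def prompt_v1(ex): return "Message:\n" + join_examples(ex)

# v2: instruction with a specific theme
def prompt_v2(ex): return "Messages to help you quit smoking:\n" + join_examples(ex)

# v3: instruction with explicit theme and tone
def prompt_v3(ex): return ("Write motivational messages to encourage people "
                           "to quit smoking:\n" + join_examples(ex))

# v4: examples first, instruction last
def prompt_v4(ex): return join_examples(ex) + "\nWrite messages like the previous ones:"

# v5: structured format with tagged task and examples
def prompt_v5(ex):
    tagged = "\n".join(f"Message {i + 1}: {m}" for i, m in enumerate(ex))
    return "Task: Write messages that are on the same topic\n" + tagged
```

Each function returns a complete prompt string that would then be passed to a model for generation.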

Perplexity and Computational Linguistic Analysis
This study first evaluated prompt performance with the perplexity measure [6]. Perplexity is a standard metric for evaluating the quality of language models and quantifies how "surprised" the model is when seeing a passage of text [24]. A passage of text with too high a perplexity may contain language errors or nonsensical content [30], while too low a perplexity may signify repetitive and uninteresting text [34].
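To make the metric concrete, perplexity is the exponential of the average negative log-likelihood the model assigns to the observed tokens. A minimal sketch, using hypothetical per-token probabilities rather than output from any of the studied models:

```python
import math

# Sketch: perplexity of a token sequence from the per-token probabilities
# a language model assigns to the observed tokens. Higher perplexity
# means the model is more "surprised" by the text.

def perplexity(token_probs):
    """token_probs: model probability assigned to each observed token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)
```

For example, a model that assigns a uniform probability of 0.25 to every token in a passage has a perplexity of exactly 4 on that passage.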
The words we use in communication reflect not only our identity but also the social and relational context in which communication takes place. Linguistic features, such as emotional tone or the use of complex words, allow individuals to infer the medical expertise of their healthcare provider, subsequently influencing their trust in and adherence to health-related messages [88]. To explore the extent to which LLMs emulate human expert writing, we examined the psychometric properties and linguistic features in LLM-generated messages, using human expert writing as the benchmark. We adopted the computational linguistic approach, which involves quantifying linguistic patterns and psychometric properties in a given text by investigating word usage within predefined psychometric categories [87]. In this study, five critical linguistic features, including word count, clout, emotional tone, authenticity, and the use of complex words, were chosen to evaluate the properties of the intervention messages [13]. The clout score is a proxy for relative social status, confidence, or leadership. A high clout score indicates greater expertise, certainty, and confidence in communication, whereas a low score suggests a tentative style of expression [40,59]. Authenticity reflects the varying degrees of personal and disclosing styles in discourse [63]. A higher authenticity score indicates honest, personal, and disclosing communication, whereas a lower score implies a distanced and impersonal style. Emotional tone ranges from negative (values < 50) to positive (values > 50), with a midpoint of 50 on the 100-point scale representing a neutral emotional tone [26]. Scores for clout, authenticity, and emotional tone are standardized composite variables transformed to a scale ranging from 1 to 100. Complex words refers to the proportion of words that are 7 letters or longer [13].
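Of the five features, word count and the complex-word proportion are simple surface measures. LIWC itself is proprietary software, but a minimal sketch of the complex-word feature as defined above (proportion of words with 7 or more letters) looks as follows; this mirrors only the stated definition, not LIWC's implementation.

```python
import re

# Sketch of one LIWC-style surface feature: the proportion of "complex"
# words, defined in the text as words that are 7 letters or longer.
# This is NOT the LIWC implementation, only the stated definition.

def complex_word_ratio(text):
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    long_words = [w for w in words if len(w) >= 7]
    return len(long_words) / len(words)
```

For instance, in "quit smoking improves breathing", three of the four words have 7+ letters, giving a ratio of 0.75.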
The five selected linguistic features signify the quality, credibility, and trustworthiness of writing in the context of substance use intervention and health communication. Partch and Dykeman [68] analyzed the linguistic attributes of text messages used in substance use disorder treatments and found that, in comparison to everyday Twitter posts, intervention messages exhibited greater clout, maintained neutral emotional tones, and displayed lower authenticity. Additionally, Toma & D'Angelo [88] posited that medical advice messages with a higher word count and frequent utilization of complex words were perceived as more authoritative. Together, previous literature has identified linguistic markers that bolster the perception of credibility and trustworthiness in health intervention messages: a higher word count and greater clout reduce uncertainties and provide comprehensive and assertive explanations; neutral emotional tones and impersonal writing foster a psychologically detached and objective stance; and the employment of complex words signifies cognitive complexity and thereby enhances the perceptions of formality, expertise, and professionalism.

Method
A library of 899 intervention messages written by tobacco treatment experts and validated in clinical trials served as the training and validation data [16,17,35]. Drawing upon research in few-shot learning, which has demonstrated significant improvements in text generation performance when increasing the number of examples from 1 to 4, with diminishing returns beyond 4 examples [101], we opted to use 4 message exemplars in combination with each of the 5 prompt versions under examination (refer to Table 1 for the 5 prompt versions and Table 2 for an example of the prompt input with message exemplars). These exemplars were randomly selected from the human-written message library and used as input for 5 state-of-the-art LLMs, namely GPT-J-6B, BLOOM-7B1, and OPT 6.7B, 13B, and 30B. The message generation process was repeated 100 times for each LLM, utilizing the Transformers library from Hugging Face. Computational tasks were performed on a server with an AMD Ryzen Threadripper PRO 3955WX (16 cores, 32 threads) and two Nvidia RTX A6000 48GB GPUs running Ubuntu 20.04.4 LTS. The initial process yielded a total of 21,638 messages.
To ensure relevancy of LLM-generated messages, this study employed a two-step filtering process (see Figure 2). In the first step, BLEU-4 scores were computed to measure repetition between each message and the rest of the generated messages from the same LLM [67,104]. The BLEU-4 metric evaluates the similarity of up to 4 consecutive words between pairs of intervention messages, producing scores that range from 0 to 1, with values nearing 1 indicating high similarity between messages. We set the threshold at 0.5 to discard redundant messages with BLEU-4 scores equal to or larger than 0.5. In the second step, messages were filtered out based on three criteria tailored to the context of smoking cessation and the characteristics of the example dataset: (1) the presence of words like "app", "apps", or "applications". This criterion was applied because the example data contained messages designed for smartphone-based smoking cessation interventions, a small portion of which aimed to promote app engagement rather than smoking cessation itself. (2) The inclusion of underscores ("_"), which indicated placeholders for information to be inserted. And (3) messages consisting of fewer than six words. This threshold was chosen because expert-written messages contain at least six words, and messages shorter than six words tend to lack the necessary informativeness for effective intervention. Subsequent message filtering based on repetition (BLEU-4 ≥ 0.5) and the three context-specific criteria resulted in a final count of 11,558 messages. We combined the filtered LLM-generated messages (N=11,558) with the original human-written messages (N=899) to create the final sample for evaluation (N=12,457).
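The two filtering steps can be sketched as follows. The BLEU-4 function below is a simplified, unsmoothed pairwise scorer written for illustration; it is not the exact implementation used in the study, and real BLEU implementations typically apply smoothing to the n-gram precisions.

```python
import math
from collections import Counter

# Simplified sketch of the two-step filter: (1) an unsmoothed BLEU-4-style
# pairwise similarity score, (2) the three context-specific criteria.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, 5):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        total = sum(c_ngrams.values())
        if total == 0:
            return 0.0
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        if clipped == 0:
            return 0.0  # unsmoothed: any zero precision yields a score of 0
        precisions.append(clipped / total)
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

def passes_criteria(message):
    """The three context-specific criteria from the second filtering step."""
    words = message.lower().split()
    if any(w.strip('.,!?') in {"app", "apps", "applications"} for w in words):
        return False        # criterion 1: app-engagement vocabulary
    if "_" in message:
        return False        # criterion 2: placeholder underscores
    return len(words) >= 6  # criterion 3: at least six words
```

A message would be retained only if its BLEU-4 score against every previously kept message stays below 0.5 and `passes_criteria` returns True.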
Five prompts were rigorously assessed based on three key dimensions: diversity, quality, and efficiency of the generated messages. Diversity was evaluated using two primary indicators: perplexity scores and the pass rate of repetition filtering. Perplexity scores for each prompt version and each LLM were computed, utilizing the perplexity score from the expert-writing condition as the gold standard. The selection of the optimal prompt was predicated on its proximity to this reference perplexity. Importantly, this study refrained from the pursuit of lower perplexity scores, as such metrics could potentially signify a lack of diversity in the generated messages [34,67]. Quality was ascertained through a comprehensive analysis of critical linguistic features, using expert-written messages as the benchmark. Linguistic attributes, including word count, clout, emotional tone, authenticity, and the use of complex words, were quantified for each generated message using the Linguistic Inquiry and Word Count (LIWC) software (LIWC-22, version 1.3.0) [13]. Efficiency was gauged by calculating the average number of messages retained after both repetition and criteria-based filtering were applied to each iteration. This measure served as a practical indicator of each prompt's efficiency in generating messages that meet quality standards.

Results
In the initial message generation phase, each of the five prompt versions yielded an average of 6.7 to 13.8 messages per iteration. The initial repetition filtering step discarded between 23.5% and 72.8% of these messages. Subsequent criteria-based filtering removed an additional 3.2% to 7.0% of these messages. As a result, the proportion of messages retained, relative to the initial count generated by each prompt version, ranged from 20.2% to 73.3%. On average, the human-authored messages comprised 26.16 words (SD=10.80). They exhibited high clout (M=76.62, SD=29.90), a low degree of authenticity (M=43.74, SD=37.24), and a neutral emotional tone (M=48.79, SD=38.97). These messages contained approximately 22% complex words (SD=10.74).
Five one-way analysis of variance (ANOVA) tests with post-hoc multiple comparison analyses were conducted to compare linguistic features of messages generated with the five different prompts against expert-written messages. Results indicated that messages generated with the prompts significantly differed from human-written content in terms of word count (F(5, 12451) = 172.42, p < .001, partial η² = .07), clout (F(5, 12451) = 293.29, p < .001, partial η² = .11), authenticity (F(5, 12451) = 102.95, p < .001, partial η² = .04), emotional tone (F(5, 12451) = 2.97, p = .011, partial η² = .00), and the use of complex words (F(5, 12451) = 48.20, p < .001, partial η² = .02). Subsequent comparisons using Dunnett's test were conducted between each prompt and the expert-writing condition (Figure 3). Results demonstrated that compared with human-written messages, messages generated with the five different prompts consistently had fewer words and used less complex vocabulary (all p values < .001). Compared with expert-written messages, messages generated with prompt version 4 exhibited higher authenticity scores, whereas messages generated with the other four prompts had lower authenticity scores (all p values < .001). In addition, both the generated messages and the expert-written messages used a relatively neutral emotional tone (all p values > .05). Messages generated with prompt versions 1 and 5 displayed clout similar to human writing, whereas messages generated with prompt versions 2 and 3 had significantly higher clout and those generated with prompt version 4 had significantly lower clout. The ANOVA test with post-hoc multiple comparison analysis revealed significant differences in perplexity between each prompt and the expert-writing condition (F(5, 99) = 20.87, p < .001, partial η² = .51). Post-hoc Dunnett's test further indicated that LLMs using all five prompts had lower perplexity (M ≈ 4.29 to 4.79; SD ≈ .39 to .66) compared to messages written by human experts (M = 6.54, SD = .63, all p values < .001).
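The F statistic underlying the ANOVA tests above is the ratio of between-group to within-group mean squares. A minimal self-contained sketch of the computation (the study additionally ran Dunnett's post-hoc tests, which are not shown here):

```python
# Sketch: one-way ANOVA F statistic computed from raw groups, the test
# used above to compare linguistic features across prompt conditions.

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)                       # number of conditions
    n = sum(len(g) for g in groups)       # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # between-group sum of squares: deviation of group means from grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares: deviation of observations from group means
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_b, df_w = k - 1, n - k
    f_stat = (ss_between / df_b) / (ss_within / df_w)
    return f_stat, df_b, df_w
```

When all group means coincide, the between-group sum of squares is zero and F = 0; larger F values indicate group means that differ relative to within-group noise.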

Discussion
Based on a comprehensive evaluation encompassing diversity, quality, and efficiency, our findings highlight the significant impact of different prompts on the ability of LLMs to generate intervention messages, assessed against the expert-written benchmark. Despite advances in LLMs, a discernible gap remains between machine-generated and human-written messages. Across all five prompt versions, LLM-generated messages exhibited significant deviations from the expert-writing condition in terms of word count, authenticity, the use of complex vocabulary, and model perplexity. Specifically, messages generated by LLMs were less detailed (evidenced by a reduced word count), more formal and monotonous (with lower authenticity, except for those generated with prompt version 4), less professional (utilizing less complex vocabulary), and less diverse (indicated by lower perplexity scores). Notably, our findings indicate that the sequencing of task instructions after message examples, as in prompt version 4, adversely impacted the performance of all evaluated LLMs, extending the ordering effects of training examples on model performance to the relative placement of instructions and examples within prompts [101]. In particular, the "example + instruction" arrangement led to highly repetitive outputs, with a repetition rate of 72.8% compared to rates ranging from 23.5% to 33% for the other prompts. Additionally, the quality of messages generated using this prompt was suboptimal, as evidenced by a 7% message discard rate due to criteria-based filtering, compared to discard rates of 3.2% to 3.8% for the other prompts. Furthermore, linguistic features of these messages diverged significantly from human writing across all critical dimensions. Finally, the production efficiency of messages generated using this prompt was markedly lower, with a pass rate of 20%, in contrast to pass rates ranging from 62.8% to 73.3% for the other prompts.
Prompt versions 1, 2, and 3 in our study featured task-specific instructions that ranged from a general directive ("Message:") to more detailed guidelines with a constrained tone ("motivational") and theme ("smoking cessation"). Existing literature posits that detailed and specific prompts act as semantically meaningful task instructions, thereby facilitating more efficient model learning, akin to how specific task instructions enhance human learning efficiency [15,61,79]. This aligns with a common assumption in few-shot learning research, which suggests that optimal model performance necessitates expertly crafted, clear, and accurate task descriptions [54,79]. Contrary to these expectations, our findings indicated that the general instruction (prompt version 1) outperformed its more detailed and specific counterparts (prompt versions 2 and 3) across metrics of message diversity, quality, and efficiency. Specifically, messages generated using prompt version 1 exhibited performance closely aligned with human expert-generated messages in terms of perplexity and key linguistic features such as clout and emotional tone. Moreover, after applying the two-step filtering process, prompt version 1 yielded an average of 7.1 messages per iteration, achieving higher efficiency for message generation compared to 6.6 messages for prompt version 2 and 6.8 messages for prompt version 3.
The finding that a general prompt outperformed detailed and specific ones, although counterintuitive, in fact echoes emerging research that questions the necessity for models to receive semantically meaningful instructions [61,71,91]. Empirical studies have shown that, in a few-shot learning context, models perform comparably well when given irrelevant or misleading instructions as opposed to clear and specific directives. This suggests that prompts may serve more to help models learn the distribution of the input text than to provide explicit task instructions [45,60,91]. Furthermore, our findings align with research by Yang et al. [97], which demonstrated that prompts with a structured format (referred to as schema-based prompts, e.g., "Title: ....; Author: ....") consistently outperform natural language (NL)-based prompts (referred to as template prompts, e.g., "The title is ....; The author is ....") in few-shot learning across various NLP tasks. In our study, both prompt version 1 ("Message:") and version 5 ("Task: Write messages that are on the same topic" + Message 1: … + Message 2: …) adopted a structured format and outperformed the natural language sentences employed for task-specific instructions in prompt versions 2 and 3. Despite comparable quality in the messages generated by prompt versions 1 and 5, the more succinct version 1 exhibited higher efficiency, yielding an average of 7.1 messages per iteration, as compared to an average of 3.5 messages generated by prompt version 5. These findings suggest a general pattern within the context of few-shot learning: model performance benefits from both the brevity and the structured nature of the prompt.

Related Work
A decoding method interprets an LLM's output probabilities for the subsequent token (word, sub-word, or character) in a sequence and selects the most suitable one for text generation [1]. Various decoding methods exist, each with its own set of parameters. The selection of decoding methods and parameter values needs to balance the quality and diversity of the message output. In this study, we employed three decoding methods, namely temperature sampling, top-k sampling, and nucleus sampling, due to their extensive application in text generation tasks [1,16,36].
Temperature sampling modulates the softmax output probabilities using a temperature parameter T. As the temperature value approaches 0, the distribution becomes peakier, thereby favoring the most probable tokens and leading to more deterministic but potentially less diverse output. Conversely, a temperature close to 1 keeps the distribution largely unchanged, increasing the likelihood of sampling less probable tokens. Hashimoto et al. [32] explored the balance between diversity and quality in temperature annealing across single-sentence natural language generation tasks, including summarization, story generation, and chitchat dialogue. Their findings revealed a quality-diversity trade-off: while lowering the temperature (T=0.7, as compared to T=1) improved the quality of generated text, it also reduced diversity and introduced repetition issues across all three tasks. Similarly, Holtzman et al. [34] observed that temperatures exceeding 0.9 yielded message diversity akin to human writing, as measured by self-BLEU scores, and that temperatures above 0.7 effectively mitigated repetition issues. In the healthcare context, Schmalzle and Wilcox [80] conducted a pilot test employing temperature settings of 0.3, 0.5, 0.7, and 1 with a fine-tuned 355M GPT-2 model, aiming to create messages about folic acid. Their results indicated that T=0.7 produced the most balanced outputs in terms of both quality and diversity, whereas T=1 led to incoherent outputs and T<0.5 produced text that was highly homogeneous with the training text. In light of these observations, previous studies generally recommend an optimal temperature threshold within the range of [0.7, 1] for text generation tasks.
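The effect of the temperature parameter on the softmax distribution can be sketched in a few lines (a minimal NumPy illustration of the standard formulation, not any model's internal implementation):

```python
import numpy as np

def temperature_softmax(logits, t):
    """Softmax over logits divided by temperature t; lower t -> peakier."""
    z = np.asarray(logits, dtype=float) / t
    z -= z.max()                  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Lower temperature concentrates probability mass on the most likely token,
# making sampling more deterministic; t near 1 leaves the distribution flatter.
peaky = temperature_softmax([2.0, 1.0, 0.0], t=0.2)
flat = temperature_softmax([2.0, 1.0, 0.0], t=1.0)
```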
Temperature sampling is frequently used in conjunction with other techniques such as top-k or top-p sampling. In top-k sampling, the k most probable subsequent tokens are filtered, and the probability mass is then redistributed exclusively among these k options. This approach introduces a degree of stochasticity into the decoding process, thereby enriching the diversity of the generated text relative to other methods like greedy decoding or beam search. Previous research commonly recommended combining top-k sampling with a temperature setting of T=0.7, as in [28,73]. The optimal value of k may vary depending on the specific task and requirements [37]. For example, Fan et al. [28] employed k=2 for summarization tasks and k=40 for text generation. Similarly, in [37], k=10 was recommended for story generation, aiming to produce text that is coherent, contextually relevant, and diverse. Nucleus sampling, also known as top-p sampling, deviates from the fixed-number approach inherent in top-k sampling. Instead, it selects tokens from the smallest subset that has a cumulative probability exceeding a predefined threshold p [34]. This method allows for dynamic adjustments in the number of tokens considered at each decoding step, thereby enhancing the diversity and creativity of the generated text. Holtzman et al. [34] observed that nucleus sampling with p=0.95 closely matched human performance in terms of both perplexity and self-BLEU scores.
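The two truncation strategies can be sketched as probability-filtering steps (a simplified NumPy illustration of the idea, not the exact implementation of any decoding library):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]          # indices of the k largest
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]        # descending probability
    cum = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cum, p)) + 1
    out = np.zeros_like(probs)
    out[order[:n_keep]] = probs[order[:n_keep]]
    return out / out.sum()
```

Unlike the fixed cutoff in `top_k_filter`, the nucleus size in `top_p_filter` shrinks when the distribution is peaked and grows when it is flat, which is the dynamic-adjustment property described above.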
In message generation tasks, the selection of decoding methods is critical for achieving the optimal balance between diversity and quality [18,28,32,33,37]. A systematic comparison of temperature sampling, top-k sampling, and nucleus sampling has indicated that these sampling strategies can yield comparable performance with tuned hyperparameters [62,99]. For example, a configuration with k=500 and t=0.8 was found to perform closely to k=30. However, when the emphasis is on quality over diversity, nucleus sampling has been shown to outperform other decoding methods, as recommended in [99]. Based on previous findings, this study investigated the efficacy of eight decoding methods (refer to Table 3), encompassing temperature sampling, top-k sampling, and nucleus sampling, in the generation of intervention messages for smoking cessation. Specifically, version 3 adopted the top-p value of 0.95 recommended by [33], which we reduced to 0.9 in version 2 to examine potential improvements in model performance. Version 5 employed a temperature setting of 0.7, as suggested by [80] for message generation tasks, and we increased the value to 0.9 in version 4 to explore the impact of temperature variation on model performance. For top-k sampling, we followed the advice of [73], using k=40 in version 6 and adjusting it to k=30 in version 7 to study its effects. Furthermore, we adopted a hybrid approach for versions 1 and 8: version 8 combined versions 5 and 6, and version 1 adjusted the values of p, k, and temperature, aiming to strike a balance between restriction and flexibility to ensure both diversity and coherence in the message output.
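The eight configurations can be summarized as follows, reconstructed from the parameter values reported above; parameters the text does not specify for a given version are left as None rather than guessed.

```python
# Decoding configurations examined in Study 2, as reported in the text.
# None marks a parameter the text does not specify for that version.
DECODING_VERSIONS = {
    1: {"top_p": 0.9,  "top_k": 50,   "temperature": 0.8},   # hybrid
    2: {"top_p": 0.9,  "top_k": None, "temperature": None},
    3: {"top_p": 0.95, "top_k": None, "temperature": None},
    4: {"top_p": None, "top_k": None, "temperature": 0.9},
    5: {"top_p": None, "top_k": None, "temperature": 0.7},
    6: {"top_p": None, "top_k": 40,   "temperature": None},
    7: {"top_p": None, "top_k": 30,   "temperature": None},
    8: {"top_p": None, "top_k": 40,   "temperature": 0.7},   # combines v5 and v6
}
```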

Method
We replicated the message generation and evaluation process from Study 1. The recommended prompt version 1 from Study 1 was adopted. Eight decoding methods were applied to each LLM, as detailed in Table 3. In alignment with Study 1, each LLM produced messages across 100 iterations. This initial phase resulted in a total of 18,731 messages, which subsequently underwent the same filtering process (see Figure 2). The refined LLM-generated messages (N=15,575) were amalgamated with the original human-authored messages (N=899) to constitute the final evaluation sample (N=16,474). Furthermore, the eight decoding methods were systematically assessed on the same dimensions as in Study 1, including diversity, quality, and efficiency.

Results
We replicated the analysis process from Study 1, including descriptive analysis of message filtering and multiple comparisons of linguistic features and perplexity. In the initial message generation phase, each of the eight decoding methods yielded an average of 3.4 to 7.1 messages per iteration. The initial repetition filtering step discarded between 4.3% and 23.8% of these messages. Subsequent criteria-based filtering removed an additional 1.4% to 4.6% of these messages. As a result, the proportion of messages retained, relative to the initial count generated by each decoding version, ranged from 72.3% to 93.8%.

Discussion
Results revealed that temperature sampling (version 5, t=0.7) and a hybrid approach (version 1, p=0.9, k=50, t=0.8; version 8, k=40, t=0.7) exhibited perplexity scores close to that of the expert-writing condition. Consistent with prior studies [28,33,73,80], decoding methods with higher p values (version 2, p=0.9; version 3, p=0.95) or higher temperature settings (version 4, t=0.9) were found to sample too many unlikely tokens, resulting in messages that were diverse yet incoherent. The perplexity findings also echoed the repetition filtering process: decoding methods (versions 1, 5, 8) that exhibited lower, reference-approximating perplexity scores had a higher proportion of messages (7.2% to 7.7%) discarded due to repetition; conversely, decoding methods with higher perplexity scores (versions 2, 3, 4) demonstrated lower rates of repetition (3.7% to 4.5%). These coherent findings between perplexity and repetition filtering reconfirm the coherence-diversity trade-off: decoding methods favoring the most probable tokens tend to produce more deterministic and less diverse message outputs, thereby raising the likelihood of repetition issues [32,34].
Moreover, decoding versions 6 (k=40) and 7 (k=30), which were designed to produce more deterministic and coherent outputs through lower k-values, still displayed significantly higher perplexity scores compared to the expert-writing condition. On the other hand, decoding versions 1 and 8, despite employing equal or higher k-values (version 1, p=0.9, k=50, t=0.8; version 8, k=40, t=0.7), achieved perplexity scores closely approximating human performance. This suggests that a hybrid approach, incorporating nucleus sampling with top-k and temperature adjustments (as in version 1), or top-k sampling with temperature adjustments (as in version 8), may offer a more balanced model performance in terms of message quality and diversity [34].

Related Work
The development of LLMs has revolutionized the NLP field and offered new opportunities for augmenting clinical practices [2,41,50]. However, their application in clinical contexts has raised ethical concerns related to misinformation and message quality [49]. Consequently, the evaluation of LLMs' applicability in clinical practices necessitates a thorough examination of their safety and quality, considering subdimensions such as credibility/trustworthiness, contextual relevance, and accuracy [74]. While automated metrics like perplexity and BLEU scores are commonly used due to their cost-effectiveness, speed, and repeatability, they have been criticized for their limitations in assessing linguistic properties and overall text quality [4,75,76,81]. Automated metrics can be under-informative and may not provide an accurate representation of text quality. For instance, low BLEU scores, often interpreted as indicative of poor text quality, may actually result from correct but unconventional phrasing [4]. Moreover, incremental improvements in BLEU scores, typically in the range of 1-2 points as observed in most experimental studies, corresponded to true improvements only about half of the time when subjected to human evaluations [58]. In addition, previous studies noted a lack of correlation between automated metrics and human evaluations, highlighting the limitations of relying solely on automated metrics for comprehensive LLM evaluation [55,64,76]. When applying LLMs in healthcare, automatic metrics fall short in assessing model performance for safety and quality concerns. Therefore, human evaluation continues to be a critical element in the assessment of LLMs, frequently serving as the gold standard against which automated metrics are compared [48,64,66]. This underscores the importance of incorporating human judgment into the evaluation framework to capture the nuanced aspects of text quality and safety that automated metrics may not fully encapsulate.
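As a reference point for the discussion of automated metrics, perplexity itself is straightforward to compute from per-token log-probabilities (a minimal sketch of the standard definition):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model that assigns probability 0.5 to every token has perplexity 2:
# the model is, on average, "choosing" between two equally likely tokens.
ppl = perplexity([math.log(0.5)] * 4)
```

The simplicity of this quantity is precisely why it is cheap and repeatable, and also why it says nothing directly about accuracy, credibility, or style.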

Human Evaluations of Message Generation with LLMs
Human evaluation typically involves recruiting annotators to manually assess the quality of text generated by LLMs. This approach closely approximates real-world application scenarios and provides a more comprehensive and nuanced understanding of a model's performance. Evaluation criteria often include fluency, informativeness, relevance, coherence, accuracy, clarity, grammaticality, and appropriateness [22,47,56]. However, due to budgetary constraints, human evaluations often rely on crowdsourced workers from platforms such as Amazon Mechanical Turk rather than on specialized experts [23,38,48,56,95]. Although this approach is cost-effective, it has received criticism for potentially compromising the validity of evaluations. Research indicates that non-expert evaluations may not consistently align well with expert assessments, especially for tasks that require specialized expertise, such as disinformation detection [9]. Crowdsourced annotators have been observed to focus on superficial textual features, such as text length and grammatical precision, over more substantive criteria like content accuracy and consistency [23,95]. Further analysis has suggested that heuristics employed by non-expert annotators, including features like nonsensical and repetitive text, grammatical issues, rare bigrams, and long words, can be flawed and misleading in the evaluation of generated text. This raises significant concerns in contexts that are sensitive to safety and reliability, such as messages that provide health behavior recommendations, where the deployment of LLMs necessitates rigorous evaluation to ensure both reliability and safety. The lack of specialized expertise in the evaluation process could compromise its validity, thereby posing the risk of disseminating inaccurate or unsafe information through LLM-generated outputs. Therefore, while human evaluation remains a critical component of LLM assessment, the qualifications and expertise of the evaluators should be carefully considered, especially in high-stakes contexts.

ChatGPT
During this study, ChatGPT, a state-of-the-art LLM, attracted considerable attention for its exceptional performance in natural language tasks [65]. Operating on a closed-source paradigm, ChatGPT is powered by GPT-3.5, an LLM trained on OpenAI's 175-billion-parameter foundation model. The training process utilized a vast corpus of text data from the internet and employed a combination of reinforcement and supervised learning methods. ChatGPT has outperformed its GPT-based predecessors in linguistic capabilities and has been evaluated for its applicability in various sectors, including healthcare education, research, and practice [44,77]. For example, a study by Gilson et al. [29] assessed the efficacy of ChatGPT in a medical setting and found that ChatGPT was capable of utilizing logical reasoning and external information to provide accurate, coherent, and contextually relevant answers in question-answering scenarios, demonstrating a level of competency comparable to that of third-year medical students. Given ChatGPT's proven utility in medical contexts, this study further includes ChatGPT as an additional LLM and examines its feasibility in generating intervention messages for smoking cessation.

Method
Distinct from other LLMs, ChatGPT incorporates an additional fine-tuning process utilizing reinforcement learning from human feedback [66]. This methodology involves ranking a broad spectrum of responses to diverse prompts based on human-labeled feedback, thereby enabling the model to discern between high-quality and suboptimal outputs. Given this unique fine-tuning process, it is not advisable to assume that the optimal prompt identified in Study 1, which was empirically validated with other LLMs, would yield similar results when applied to ChatGPT. To address this issue, a separate evaluation was conducted to identify the most effective prompt for ChatGPT.
In this evaluation, the five prompts from Study 1 were initially tested on ChatGPT, each across 20 iterations, to compare their effectiveness in message generation. The initial generation phase returned an average of 3 to 11 messages per iteration, except for prompt version 5, which returned only 1 message per iteration. Repetition filtering removed 17% of the messages generated using prompt version 1 but had no impact on messages generated with other prompt versions. Criteria-based filtering further discarded 1% of the messages generated with prompt version 2. Prompt versions 3 and 4 demonstrated superior performance, yielding multiple messages per iteration, all of which passed both repetition and criteria-based filtering, thereby achieving a 100% pass rate. Compared with version 3, version 4 returned more messages per iteration and was therefore recommended and applied to ChatGPT for subsequent message generation, taking into account both quality and efficiency considerations.
Following the message generation process in Study 1, prompt version 4, combined with four message exemplars randomly selected from the expert-written message library, was used as input for the ChatGPT interface for 100 iterations, using its default decoding method. The initial process yielded a total of 493 messages. Subsequent message filtering discarded 0 messages for repetition (BLEU-4 ≥ 0.5) and 4 messages for context-specific criteria, resulting in a final count of 489 messages.
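The repetition-filtering step can be sketched as follows. Note that this uses a simplified clipped 4-gram precision in place of full BLEU-4 (an assumption for illustration; the study's exact BLEU implementation is not specified here):

```python
from collections import Counter

def four_grams(text):
    """Multiset of word 4-grams in a message."""
    toks = text.split()
    return Counter(tuple(toks[i:i + 4]) for i in range(len(toks) - 3))

def overlap_4gram(candidate, reference):
    """Clipped 4-gram precision of candidate against reference
    (a simplified stand-in for BLEU-4)."""
    cand, ref = four_grams(candidate), four_grams(reference)
    if not cand:
        return 0.0
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

def filter_repetitions(messages, threshold=0.5):
    """Discard any message whose overlap with an already-kept message
    reaches the threshold, mirroring the BLEU-4 >= 0.5 criterion."""
    kept = []
    for msg in messages:
        if all(overlap_4gram(msg, prev) < threshold for prev in kept):
            kept.append(msg)
    return kept
```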
Founded in 2008, the Council for Tobacco Treatment Training Programs (www.ctttp.org) has developed an interdisciplinary approach to implementing training standards [84] for Tobacco Treatment Specialists (TTS). The TTS training ensures that specialists are well-prepared to address the dynamic and complex needs of tobacco users in diverse settings. Seven certified TTS were invited to participate in an expert review of messages generated by LLMs in Study 3, using expert-written messages as the benchmark (see Table 4 for examples of intervention messages and scores from expert evaluation). Messages generated using prompt version 1 and decoding version 1 from Study 2 were chosen for review due to their superior performance and efficiency. Each specialist was randomly assigned to review a set of 100 messages, with approximately 16 messages stratified and randomly selected from each of the five LLM message banks (N=624 for GPT-J-6B, N=300 for Bloom 7B1, N=498 for OPT 6.7B, N=537 for OPT 13B, and N=599 for OPT 30B), the ChatGPT message bank (N=489), and the expert-written message library (N=899).
Drawing on a widely validated message evaluation protocol for health interventions [20,42], each message was evaluated using 10-point scales for quality, accuracy, credibility, and persuasiveness. Message quality was assessed using a single item adopted from [19]: "Considering both content and style, how well-written is the message?" with scores ranging from 1 (poorly written) to 10 (very well-written). Accuracy was assessed by a single item: "How accurate is this message? (Accurate refers to no misinformation and no factual errors)" [39]. Credibility was assessed by a single item asking "How credible does this message seem to you?" [5]. Persuasiveness was assessed by a single item asking "To what extent do you feel this message can help smokers avoid smoking?" [7]. Additionally, a binary measure was employed asking whether the message met the standards in TTS training, making it ready to use in smoking cessation interventions.

Discussion
Our findings offer compelling evidence for the capabilities of larger LLMs like ChatGPT, OPT-13B, and OPT-30B in generating high-quality intervention messages for smoking cessation. Notably, ChatGPT not only met but exceeded human performance on critical evaluation metrics. Specifically, it outperformed human experts in terms of message quality and generated a significantly higher rate of messages that met the TTS standards, as assessed by certified TTS. These results demonstrate the efficacy of LLMs in generating intervention messages for clinical practices. The superior performance of ChatGPT in generating ready-to-use, high-quality intervention messages implies its potential as a valuable tool for healthcare professionals in the field of tobacco treatment and suggests possibilities for broader adoption in various healthcare contexts.

GENERAL DISCUSSION AND LIMITATION
Collectively, these three studies present a thorough exploration of message generation and evaluation with LLMs in the specialized context of smoking cessation interventions. Based upon a comprehensive evaluation framework that integrates automated metrics, linguistic attributes, and expert assessments, we have demonstrated that a succinct prompt with general instructions (Study 1) and a hybrid decoding approach, which incorporates nucleus sampling with top-k and temperature adjustments (Study 2), achieved optimal performance on message diversity, quality, and generation efficiency, closely approaching the level of human expert writing. Furthermore, expert review (Study 3) revealed that larger LLMs, including ChatGPT, OPT-13B, and OPT-30B, can effectively emulate human expert writing to generate well-written, accurate, credible, and persuasive messages, with about half of the generated messages meeting the clinical standards to be directly adopted in smoking cessation interventions. This research makes several substantial contributions to the field: 1) it identifies optimal prompt and decoding strategies for message generation in specialized healthcare contexts, offering generalizable insights across five state-of-the-art LLMs and thereby establishing a roadmap for their effective deployment in healthcare; 2) it adopts a computational linguistic approach and introduces a comprehensive evaluation framework that assesses LLM performance in terms of message diversity, quality, and efficiency, offering a robust framework for LLM evaluations in future studies; 3) it underscores the potential of LLMs to augment health interventions by generating a substantial volume of high-quality, accurate, and persuasive messages, thereby offering the opportunity to enhance the scalability and efficacy of future health interventions; and 4) it offers practical guidelines for healthcare professionals on the adoption of LLMs in clinical practice, addressing key considerations such as model optimization, message filtering, comprehensive evaluation, and expert review. Importantly, our findings demonstrate that advanced LLMs like ChatGPT, OPT-13B, and OPT-30B can serve as valuable adjuncts in clinical settings for content generation. Moreover, this research addresses, for the first time, the safety and quality concerns associated with the use of LLM-generated messages in healthcare contexts through rigorous review by clinicians.

Optimizing LLMs for Message Generation in Healthcare
Optimizing LLMs for message generation necessitates a nuanced understanding of both the nature of the messages to be generated and the specific characteristics of the LLMs employed. Results from Study 1 indicate that a succinct prompt, featuring general instructions placed prior to message exemplars, yielded optimal performance in the generation of healthcare-related messages. This observation is consistent with existing literature [45,60,61,71,91], which posits that, in the context of few-shot learning, LLMs derive greater benefit from message exemplars for learning the distribution of input text than from explicit task instructions. This finding is particularly true for the generation of specialized content, such as healthcare messages, which exhibit unique linguistic patterns that distinguish them from general, everyday discourse [68,87,88]. In a similar vein, we hypothesize that the suboptimal performance observed with prompt version 4, which positioned the task instruction after the message exemplars, may be due to a disruption of in-context learning of the latent concept about smoking cessation. A Bayesian interpretation of in-context learning posits that a list of independent and identically distributed (IID) training examples provided in the prompt enables the LLM to locate a hidden concept shared among these examples [94,96]. The task instruction after these examples may have shifted the posterior probability distribution of the learned concept, leading to inferior performance.
Intriguingly, despite its suboptimal performance across the five other LLMs, prompt version 4 yielded the most effective results when applied to ChatGPT. This underscores the notion that prompt selection should be tailored to the specific attributes of the LLM in use. ChatGPT's unique fine-tuning process [66] enables it to better interpret and act upon task instructions within the prompt. In our task, the instruction placed after the examples may have allowed ChatGPT to better act on the latent concept learned from the IID training examples. This capability distinguishes ChatGPT from other LLMs and aligns with recent findings in the literature [91].
Moreover, Study 2 identified the optimal decoding method for generating intervention messages, specifically recommending a hybrid decoding approach that combines nucleus sampling with top-k and temperature adjustments (decoding version 1, p=0.9, k=50, t=0.8). This approach outperformed the restricted top-k sampling methods (decoding version 6, k=40, and version 7, k=30) by offering a balanced performance in terms of both message quality and diversity, consistent with existing literature [34]. Given the critical importance of safety and quality in healthcare messaging, our study aimed for LLMs to closely emulate human expert writing. The decoding method recommended in Study 2, tailored to this specific requirement, utilizes a restrictive strategy to prioritize accuracy over creativity. Consistent with prior research, our study suggests that different tasks necessitate different decoding methods. For instance, in tasks where accuracy is paramount, such as summarization, translation, or message generation for a healthcare context, more restrictive decoding methods are often employed (e.g., k=2, as in [28]). Conversely, tasks prioritizing diversity, like creative writing, advertising copywriting, or storytelling, may benefit from more flexible decoding methods (e.g., k=100, as cited in [10]). Therefore, we suggest customizing decoding methods to align with the specific requirements of the NLP task in future studies.
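In practice, the recommended hybrid setting maps directly onto standard sampling parameters. A sketch, expressed as Hugging Face-style `generate()` keyword arguments; the surrounding model call is illustrative, and `max_new_tokens` is an assumed value, not one reported in the study:

```python
# Study 2's recommended hybrid decoding (version 1) expressed as sampling
# keyword arguments in the style of Hugging Face transformers' generate().
HYBRID_DECODING = dict(
    do_sample=True,       # stochastic sampling rather than greedy/beam search
    top_p=0.9,            # nucleus sampling threshold
    top_k=50,             # top-k restriction
    temperature=0.8,      # softmax temperature
    max_new_tokens=60,    # assumed length budget for a short SMS-style message
)

# Illustrative usage, assuming a loaded causal LM `model` and `tokenizer`:
#   inputs = tokenizer(prompt, return_tensors="pt")
#   outputs = model.generate(**inputs, **HYBRID_DECODING)
```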

Evaluating LLMs in Healthcare Contexts
The evaluation of LLMs is important for ensuring their safe and effective deployment in healthcare settings. Traditional automatic metrics like perplexity or BLEU often prove inadequate for comprehensively assessing model performance in the context of automatic message generation [55,64,76]. Moreover, these metrics can yield results that are difficult to interpret [4]. To address these limitations, our study proposes augmenting automatic metrics with computational linguistic analysis. This multi-faceted approach allows for a more nuanced and interpretable understanding of model performance. Specifically, our results indicate that perplexity fails to capture subtle differences in messaging style as reflected in linguistic features. This means that LLMs can produce messages that differ significantly from those written by experts, despite having similar perplexity. Conversely, a significant discrepancy in perplexity scores between LLMs and human experts does not necessarily indicate differences in linguistic patterns either. Our work integrates computational linguistic analysis to contextualize model performance in message generation tasks, focusing on critical linguistic features. This approach provides a more holistic and interpretable evaluation of LLMs' capabilities. Furthermore, it is worth noting that even in the best-performing case of ChatGPT, approximately 30% of generated messages did not meet the TTS standards for clinical use. Therefore, while LLMs can augment smoking cessation interventions, our results emphasize the importance of subjecting LLM-generated messages to systematic evaluation before presenting them to patients. In particular, we propose a practical guideline for healthcare professionals on the integration of LLMs in clinical practice. We advocate that the process of LLM-based automated message generation should encompass model optimization, message filtering, comprehensive evaluation, and expert review to ensure its safety, quality, and effectiveness for clinical application.

Limitations
The current study is subject to several limitations. First, this study primarily focused on prompt engineering and decoding optimization in the message generation process. Subsequent research may improve model performance through two approaches: 1) incorporating model fine-tuning as a third strategy, and 2) pre-processing the human-written messages before prompting. In particular, the expert-written intervention messages adopted in this study were originally designed for a smartphone-based smoking cessation intervention, which covered a wide range of topics, such as coping with smoking urges, motivation to quit smoking, and mood change and anxiety during withdrawal. Therefore, to improve model performance, future studies could pre-categorize expert-written messages by topic and increase the coherence of message examples in prompting. Second, despite the engagement of certified tobacco treatment specialists for the expert assessment of messages generated by LLMs, the sample size of these specialists was relatively small (N=7, reviewing a total of 700 messages). This limitation was contingent on the availability of TTS specialists within our tobacco treatment group, which could introduce bias and potentially compromise the robustness of the evaluations. Consequently, the assessments may not comprehensively represent the spectrum of opinions and clinical practices within the field. Third, the cross-sectional nature of the expert review, combined with the volume of messages (N=100) assigned to each specialist for evaluation, resulted in an out-of-context assessment of the messages. For instance, certain messages may have been crafted to address specific scenarios, such as experiencing social pressure to smoke or coping with stress eating during nicotine withdrawal. Consequently, the validity of the evaluation concerning message persuasiveness was compromised when devoid of this contextual information. Future research could better balance the quantity of messages and the need for in-context review. By providing contextual information specific to each message, evaluators could assess the messages' effectiveness within the scenarios they are intended to address. This approach would likely yield a more nuanced and accurate understanding of the messages' persuasive effectiveness, thereby improving the validity of future research.

CONCLUSION
We conducted a systematic examination of the message generation and evaluation process with LLMs to address two critical questions regarding the applicability of LLMs in healthcare: 1) How to optimize LLMs to mimic human expert writing, and 2) Do LLM-generated messages meet clinical standards to be safely implemented in tobacco treatments? Through three studies on prompt engineering, decoding optimization, and expert review, we identified optimal prompt and decoding strategies across state-of-the-art LLMs for message generation in healthcare, using human expert writing as the benchmark. We proposed a comprehensive evaluation framework encompassing automatic metrics, linguistic attributes, and expert review to assess LLMs in the healthcare context on diversity, quality, and efficiency. Further, LLM-generated messages were evaluated by certified TTS on message quality, accuracy, credibility, and persuasiveness. Drawing upon their extensive clinical experience in tobacco treatment counseling, the TTS concluded that larger LLMs, including ChatGPT, OPT-13B, and OPT-30B, can effectively emulate human expert writing to generate well-written, accurate, credible, and persuasive messages, thereby demonstrating their applicability in clinical practices.

Figure 1: An overview of the three studies, including prompt engineering (Study 1), decoding optimization (Study 2), and expert review (Study 3).

Figure 2: Two-step message filtering with sample sizes at each step for Studies 1 and 2.

Figure 3: Confidence intervals for the Dunnett test for prompt mean - human mean.

Figure 4: Confidence intervals for the Dunnett test for decoding mean - human mean.

Proc SIGCHI Conf Hum Factor Comput Syst. Author manuscript; available in PMC 2024 June 21.