Task Supportive and Personalized Human-Large Language Model Interaction: A User Study

Large language model (LLM) applications, such as ChatGPT, are powerful tools for online information-seeking (IS) and problem-solving tasks. However, users still face challenges initializing and refining prompts, and their cognitive barriers and biased perceptions further impede task completion. These issues reflect broader challenges identified within the fields of IS and interactive information retrieval (IIR). To address them, our approach integrates task context and user perceptions into human-ChatGPT interactions through prompt engineering. We developed a ChatGPT-like platform integrated with supportive functions, including perception articulation, prompt suggestion, and conversation explanation. Findings from our user study demonstrate that the supportive functions help users manage expectations, reduce cognitive load, better refine prompts, and increase engagement. This research enhances our understanding of how to design proactive and user-centric systems with LLMs. It offers insights into evaluating human-LLM interactions and highlights potential challenges for underserved users.


INTRODUCTION
The release of ChatGPT has sparked considerable interest in the interaction between humans and AI. This interest has led to a rising number of individuals employing large language models (LLMs) for various purposes such as task assistance, entertainment, education, and even as an alternative to traditional search engines [3,5]. Despite the prevalence of ChatGPT, users still face challenges formulating prompts, and cognitive barriers and biased perceptions further impede task completion [26,34]. These issues reflect broader challenges identified within the fields of information seeking (IS) and interactive information retrieval (IIR), particularly concerning task context and user perceptions, such as task topic and type, user intent, topic familiarity, and task expectations [2,14,16,23,30].
Previous IIR studies have underscored the complexity of integrating these fluctuating user perceptions into a predominantly static search system. Fortunately, the evolution of LLMs marks a transformative phase in information access (IA) paradigms, introducing a promising avenue for incorporating more nuanced interaction data through the conversational context between the user and the generative IA (GIA) system [24]. Therefore, it becomes crucial to explore methodologies for embedding task context and user perceptions into ChatGPT interactions and to evaluate their impact on user experience and task completion, which are essential criteria for evaluating IIR and Human-AI Interaction [1,17,19,26].
To achieve this, we developed a task platform that emulates the official ChatGPT interface and incorporates the GPT-3.5-turbo model. To support users with the challenges mentioned above, we designed and implemented three supportive functions: 1. Perception Articulation: allows users to clarify their perceptions, including topic familiarity and expected task complexity. The articulated perceptions are then passed to ChatGPT through prompt engineering to enrich the context information; 2. Prompt Suggestions: generates prompt revisions and follow-up questions, aiding users who struggle with prompt formulation; 3. Conversation Explanation: generates explanations for the ongoing conversation (i.e., the pair of the user's prompt and ChatGPT's response) so that users can better comprehend ChatGPT's interpretation of the conversation.
To validate our approach, we conducted a naturalistic user study involving 16 participants, comprising college students and crowdsourced workers, who worked on self-defined tasks. These tasks spanned various lengths and covered diverse topics including creative writing, professional development, and specific programming questions.
Our analysis underscores the effectiveness of the supportive functions, illuminating their role in facilitating user experience and task completion. The findings reveal that these functions proved instrumental in managing user expectations, reducing cognitive load, guiding prompt refinement, and increasing user engagement. This research further enhances our understanding of designing proactive and user-centric systems with LLMs, offering insights into evaluating human-LLM interactions from both the system and user ends, and underscoring potential challenges for underserved users in this new era of AI.

Capability of ChatGPT
LLMs have emerged as a groundbreaking development in the realm of artificial intelligence, leveraging sophisticated architectures trained on extensive data to understand and emulate human-like text generation [20]. One such application is ChatGPT, which has seen the most significant growth in its user base. One key to accessing the versatility of LLMs is prompt engineering, a method that uses specific information and instructions in the input to optimize the content and format of ChatGPT's output [18]. Previous studies have examined using ChatGPT and other LLMs in educational settings, where they can personalize content delivery and foster enhanced learning experiences [4,8,29,35]. In addition, ChatGPT has demonstrated potential in providing emotional support by playing therapeutic roles based on user sentiment and need [29].

LLMs as Information Access Systems
Incorporating LLMs in information access systems introduces transformative prospects from GIA, particularly through multi-turn interactions that resemble traditional IIR processes [24,26]. However, harnessing LLMs in information access systems presents pronounced challenges. Users encounter difficulties in query formulation or interpreting search results, often due to cognitive barriers, which are common when initializing and refining prompts during interactions with LLMs [23,34]. These barriers often stem from task context and user perceptions, such as lack of prior knowledge, low familiarity level with the topic, high complexity, and inappropriate expectations, leading to potential misconceptions about how LLMs interpret and respond [6,7,19,23,30,31].

Evaluating Human-LLM-Interaction
In recent literature, the evaluation of human-LLM interaction has garnered significant attention, especially as these models become increasingly used in human tasks. A comprehensive approach to this evaluation has been proposed, encompassing aspects such as task performance, user experience, and general "Human-AI eXperience" (HAX) [1,10].
Given their similar interaction processes and challenges, there are also notable parallels between the evaluation of human-LLM interaction and the assessment of IIR. Both areas highlight the significance of user experience and perception. Another critical facet that enhances user trust and comprehension is explainability in AI, which merits deeper exploration [9,33].

RESEARCH QUESTIONS
With respect to the challenges and opportunities in human-LLM interaction and the roles of task context and user perceptions in IS and IIR research, our study investigates the following research question. RQ: How can we provide support with task context and user perception information to mitigate user challenges in tasks when interacting with ChatGPT?
To answer this question, we explore our methodology for collecting and integrating features related to task context and user perception, importing these features into the system through prompt engineering, and developing supportive functions to enhance user assistance in the task.

System design and supportive functions
We developed an interface similar to the official ChatGPT interface, with GPT-3.5-turbo as the model, and integrated questionnaires and supportive functions. The interface and the prompt templates for the supportive functions are shown in Figure 1.
To interact with the system, users first complete the pre-task questionnaire, highlighted in the upper left of the interface. The pre-task questionnaire collects features of task context and user perceptions, including task topic and type, familiarity level, and expectations (e.g., expected task complexity, time spent, and outcome). Detailed descriptions of these features are given in Table 1. Within the pre-task questionnaire, perception articulation is implemented through two generative features: familiarity level and expected complexity. Unlike traditional surveys that use Likert scales to measure these two features, our approach uses ChatGPT to generate five degrees of familiarity or task complexity, each with a description and an example. This function aims to help participants better comprehend and select the option that most closely aligns with their perceptions. The chosen degree, along with its description and example, is then formatted in the prompt template to enrich the context of the main prompt template, which is shown in the left dashed box in Figure 1. We developed this main prompt template with several components, including role, description, narrative, aspect, and format, following a prior study on designing effective prompts [28]. This template aims to provide ChatGPT with comprehensive background information about task context and user perceptions.
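As a rough illustration of how such a main prompt template could be assembled, the Python sketch below combines the pre-task questionnaire features into the five components named above (role, description, narrative, aspect, format). The class names, field names, and the wording of each component are hypothetical and chosen only for illustration; the templates actually used in the study are those shown in Figure 1.

```python
from dataclasses import dataclass


@dataclass
class PerceptionChoice:
    """A user-selected degree (1-5) with its generated description and example."""
    degree: int
    description: str
    example: str


@dataclass
class TaskContext:
    """Pre-task questionnaire features (hypothetical field names)."""
    topic: str
    task_type: str
    familiarity: PerceptionChoice
    complexity: PerceptionChoice
    expected_time: str
    expected_outcome: str


def format_perception(label: str, choice: PerceptionChoice) -> str:
    """Render one articulated perception as a line of context."""
    return (f"{label}: level {choice.degree} of 5. "
            f"{choice.description} Example: {choice.example}")


def build_main_prompt(ctx: TaskContext) -> str:
    """Assemble the five template components; all wording is an assumption."""
    narrative = "\n".join([
        format_perception("Topic familiarity", ctx.familiarity),
        format_perception("Expected task complexity", ctx.complexity),
        f"Expected time: {ctx.expected_time}",
        f"Expected outcome: {ctx.expected_outcome}",
    ])
    sections = [
        ("Role", "You are an assistant supporting a user on a self-defined task."),
        ("Description", f"Task topic: {ctx.topic}. Task type: {ctx.task_type}."),
        ("Narrative", narrative),
        ("Aspect", "Match depth and vocabulary to the stated familiarity and complexity."),
        ("Format", "Respond in plain text."),
    ]
    return "\n\n".join(f"{name}:\n{body}" for name, body in sections)
```

Keeping the perception choice as a structured record, rather than a bare number, mirrors the idea that the selected degree travels together with its description and example into the prompt.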
After this pre-task questionnaire, we implement the second supportive function, prompt suggestions, by applying the main prompt template to the ongoing conversation. The suggestions are displayed in separate tabs at the bottom of the interface. Furthermore, for each conversation, we implement the conversation explanation function in the rating questionnaire. This function generates five explanation options, allowing users to select the one that best aligns with their intent. We also include an explanation utility question, which allows users to rate the utility of the chosen explanation on a five-point scale. The purpose of the conversation explanation function is to present ChatGPT's interpretation of each prompt or response to users and to investigate the potential of these explanations for enhancing user engagement and experience.

Table 1 (conversation rating questionnaire rows):
Explanation: Five potential explanations for the corresponding {user prompt} and {ChatGPT response}.
Explanation utility: Five-degree options with descriptions from "very poor" to "excellent".
Conversation usefulness: Five-degree options from "not useful" to "extremely useful".
Conversation satisfaction: Five-degree options from "very unsatisfied" to "very satisfied".
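The two conversation-level functions can be sketched in the same spirit. The Python fragment below fills hypothetical templates for prompt suggestions and conversation explanations and parses a numbered list of options out of a model reply. The template wording, the function names, the default of three suggestions, and the assumption that the model answers with a numbered list are all illustrative; only the count of five explanation options comes from the study design.

```python
import re

# Hypothetical templates; {main_prompt} stands for the task-context prompt.
SUGGESTION_TEMPLATE = (
    "{main_prompt}\n\n"
    "Conversation so far:\n{conversation}\n\n"
    "Suggest {n} numbered options: revisions of the user's last prompt "
    "or follow-up questions."
)

EXPLANATION_TEMPLATE = (
    "{main_prompt}\n\n"
    "User prompt: {user_prompt}\n"
    "ChatGPT response: {chatgpt_response}\n\n"
    "Give {n} numbered, distinct explanations of how this exchange "
    "can be interpreted."
)

# Matches lines such as "1. text" or "2) text".
_OPTION = re.compile(r"^\s*\d+[.)]\s+(.+?)\s*$")


def suggestion_prompt(main_prompt: str, conversation: str, n: int = 3) -> str:
    """Fill the suggestion template; n=3 is an arbitrary illustrative default."""
    return SUGGESTION_TEMPLATE.format(
        main_prompt=main_prompt, conversation=conversation, n=n)


def explanation_prompt(main_prompt: str, user_prompt: str,
                       chatgpt_response: str, n: int = 5) -> str:
    """Fill the explanation template; five options follow the study design."""
    return EXPLANATION_TEMPLATE.format(
        main_prompt=main_prompt, user_prompt=user_prompt,
        chatgpt_response=chatgpt_response, n=n)


def parse_options(reply: str, expected: int = 5) -> list[str]:
    """Extract numbered options from a reply and check their count."""
    options = [m.group(1) for line in reply.splitlines()
               if (m := _OPTION.match(line))]
    if len(options) != expected:
        raise ValueError(f"expected {expected} options, found {len(options)}")
    return options
```

Validating the number of parsed options before display is one simple way to avoid showing an incomplete set of choices when the model's reply deviates from the requested format.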

Participant recruitment
We targeted two distinct user groups for our research: college students at a research university and crowdsourced workers from Amazon Mechanical Turk (MTurk). The recruitment process consisted of two steps. Step 1: participants were asked to complete a registration survey. This survey collected background information, including demographics and prior experience with ChatGPT. Step 2: we then asked participants whether they wished to proceed to the remote user study. Those who opted in reported the tasks they planned to perform with ChatGPT. We required participants to report task plans with three anticipated task lengths: short (less than 30 minutes), medium (1 to 2 hours), and long (3 hours or more).
In keeping with the naturalistic study setting, participants were allowed to edit the task plan when they had new task ideas and to complete their planned tasks within five days. Before their own tasks, they performed a warm-up task to become familiar with the platform interface and the study process. After the user study, participants could opt for an interview in which we sought their feedback and insights on their experiences. In these interviews, we specifically explored their views on the task experience using our platform, the effectiveness of the supportive functions, and any suggestions or opinions they might have. Participants received $5 for the step 1 registration survey and $50 for the step 2 user study. This compensation exceeds the minimum wage threshold, and our research received approval from the Institutional Review Board (IRB).
The decision to choose two distinct participant groups aimed at broadening user diversity. While past studies focused on college students as early adopters of ChatGPT, they still highlighted the need for a more heterogeneous user group. Consequently, our participant pool includes college students from diverse fields such as Computer Science, Library and Information Science, and Public Health. Additionally, we incorporated crowd workers to ensure an even broader user spectrum. However, we restricted crowd workers to the 18-25 age range to facilitate a comparative analysis between the two groups.

Analysis
For this small-scale user study, we utilized a descriptive analysis by presenting the tasks users performed on our platform and explaining how the platform influenced user experience and assisted them in task completion. In addition, we delved into the interview data as case studies to illuminate users' experiences and insights.

Participants and tasks
In total, 16 participants enrolled in the user study, comprising 8 college students and 8 crowd workers. The college students came from various academic backgrounds, including computer science, library/information science, and public health, ranging from sophomores to graduates. The crowd workers specialized in fields such as information technology and business, and were either pursuing or had already obtained their bachelor's degrees. Out of the 16 participants, six completed the tasks according to their task plans and participated in the interview (3 college students and 3 crowd workers), while the remainder finished at least the warm-up task.
Table 2 presents the average results of the tasks. Excluding the warm-up task, there were 29 tasks in total, comprising 10 short tasks, 13 medium tasks, and 6 long tasks. There were notable differences between the college student group and the crowd worker group, especially concerning the number of prompts, the number of prompt suggestions used, and task duration.
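A descriptive summary of this kind can be computed with a simple group-by over task records. The Python sketch below averages prompt counts and adopted suggestions per participant group and task length, and reports the share of prompts that came from suggestions. The record fields and the `adoption_share` measure are assumptions made for illustration, and any values passed in are placeholders rather than data from the study.

```python
from collections import defaultdict
from statistics import mean


def summarize(records):
    """Group task records by (group, length) and compute descriptive averages.

    Each record is a dict with hypothetical keys:
    'group', 'length', 'prompts', and 'adopted' (prompts taken from suggestions).
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[(record["group"], record["length"])].append(record)

    summary = {}
    for key, rows in grouped.items():
        total_prompts = sum(r["prompts"] for r in rows)
        total_adopted = sum(r["adopted"] for r in rows)
        summary[key] = {
            "tasks": len(rows),
            "mean_prompts": mean(r["prompts"] for r in rows),
            "mean_adopted": mean(r["adopted"] for r in rows),
            # Share of all prompts in this cell that came from suggestions.
            "adoption_share": total_adopted / total_prompts if total_prompts else 0.0,
        }
    return summary
```

Reporting the adoption share alongside the raw averages makes it easier to compare groups that submit very different numbers of prompts.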
College students submitted approximately 5 to 6 prompts in short or medium tasks, though the duration of medium tasks was about double that of short tasks. They submitted over 20 prompts in the long task, completing the task in almost two days. Across all task lengths, they adopted about one prompt from the suggestions on average. Conversely, crowd workers spent more time and used more prompts on short tasks than college students did, but they spent less time on medium and long tasks. This discrepancy could stem from their reliance on prompt suggestions, as nearly all their prompts were derived from these suggestions. Regarding the conversation ratings, college students had a positive experience (high usefulness, explanation utility, and satisfaction) in short and medium tasks but a moderate experience in long tasks. Crowd workers generally had a positive experience, except for the moderate explanation utility in long tasks.
Note on Table 3: Task topics and types in "()" provide clarifications for ambiguous topics or types that users entered in the pre-task questionnaire, as further inferred from actual conversations.
To gain deeper insight into the participants' experiences, Table 3 provides a summary and examples of task topics and types, grouped by users and expected task length. College students (especially computer science students) engaged in tasks that included learning new topics (in short and medium tasks) and solving specific programming problems (in medium and long tasks). In contrast, crowd workers did not specify clear task topics in the pre-task questionnaire. They used vague terms like "JK," "learning," and "developing," primarily intending to engage in casual conversations with ChatGPT. Consequently, given their input and heavy reliance on the prompt suggestion function, those suggestions led the conversations towards topics such as "J.K. Rowling", "online learning platforms", and "learning programming languages".

Interview case study on task experience
We further present insights from the interviews as case studies, examining the impact of the supportive functions on user experience and task completion. Table 4 outlines participants' backgrounds, prior experiences with ChatGPT, task experiences during this study, and insights. We interviewed three computer science college students, P1, P2, and P3, all of whom showed considerable enthusiasm and engagement in both the tasks and subsequent interviews. Additionally, we interviewed crowd workers P4, P5, and P6. We delve into detailed feedback from P1, P2, and P3 and summarize the concerns raised by P4, P5, and P6.
P1, a self-identified ChatGPT expert, considered ChatGPT useful but not comparable to human experts. In this study, P1 recognized the value of perception articulation in "guiding expectations". In addition, the conversation explanations helped P1 "balance expectations" and satisfaction. For instance, P1 had previously experienced ChatGPT's shortcomings in understanding the specific rules of haikus, so they (a gender-neutral alternative to he/she) had low expectations and started with some simple haiku rules in this study. Surprisingly, P1 found that the generated haikus followed the correct rules. However, inconsistencies arose when P1 increased the rule complexity. After reviewing the conversation explanations, P1 adjusted their expectations, acknowledging ChatGPT's limitations but remaining satisfied.
P1 also noticed that explanations were more detailed for tasks with low familiarity but more concise and "straight to the point" for more familiar topics. However, P1 found the prompt suggestion function sometimes misaligned with their intended task direction, particularly in specific programming problems. Nonetheless, P1 appreciated the guidance from suggested prompts when exploring unfamiliar content, especially in the internship-related question. They also highlighted the study platform's effectiveness in maintaining focus on specific tasks, thereby enhancing engagement and efficiency.
P2 had previously found ChatGPT unexpectedly efficient in suggesting coding solutions. However, they did not anticipate fully completing tasks from ChatGPT's responses but expected to "get a broad idea" to guide their own solution development. In this study, P2 recognized how the platform interpreted task context and user perceptions in the pre-task questionnaire. For example, when ChatGPT's response exceeded P2's knowledge, they realized that they had overestimated their familiarity with the topic and reassessed it. The conversation explanation function allowed P2 to understand where any miscommunication began and to refine prompts. For example, in the complex C++ coding task, P2 gained confidence and refined prompts through iterative interactions and could use ChatGPT to "build all remembered functions in one prompt", attributing this improvement to learning from the explanations. However, P2 became dissatisfied when the code generated by ChatGPT was incompatible with an updated game development package. They then realized that ChatGPT had limited knowledge of this development package and discontinued the conversation in favor of alternative resources. Regarding the prompt suggestions, P2 found them useful in initializing short tasks, but overly broad and generalized in longer or more complex tasks.
P3 self-identified as proficient with ChatGPT, but they had experienced miscommunication with ChatGPT's interpretations. In this study, P3 provided unique insights when revising their resume. P3 observed redundancy and paraphrasing issues among the explanation options. As for the prompt suggestions, P3 felt that they could sometimes distract from their initial intents. Nevertheless, P3 acknowledged the suggestions' potential to unveil new perspectives, such as unconsidered resume layouts. P3 also emphasized the effort required to modify ChatGPT's output to their preferred style and content.
The crowd workers (P4, P5, and P6) worked in IT-related fields, and their previous interaction with ChatGPT was primarily work-related. In this study, as the tasks summarized in Tables 2 and 3 show, they appeared hurried in their completion and relied heavily on prompt suggestions, possibly due to their focus on Human Intelligence Tasks (HITs). Their feedback during interviews remained nonspecific, centered more on task completion for compensation than on the study's investigative purpose. It seems that they misunderstood the purpose of this study, or that they were outliers among the target users, which raises concerns discussed in the next section.

DISCUSSION AND CONCLUSION
In this closing discussion, we incorporate insights derived from our study, focusing on enhancing human-LLM interaction and identifying prevalent challenges during these interactions.

Insights and Implications for System Evaluation: The interview insights suggest distinct evaluation metrics at both the system and user ends. For the system, we propose task reliability, which reflects the LLM's deep understanding of and capability with task-specific knowledge beyond text completion. This reliability could be evaluated through the LLM's ability to describe different levels of task familiarity and complexity, so that it can better recognize the user's task state given those levels. Another evaluation aspect could be conversation explainability, reflecting the LLM's ability to interpret the conversation from a task-aware perspective. Enhancements of both task reliability and explainability might include graph-based methods that represent the task in a knowledge structure with contextual information that enhances situational awareness [22,32]. On the user end, task engagement could be measured by users' willingness to refine prompts, which may be influenced by their expectations of the LLM's capability [17,34].
Proactive User Interface and System Design: Our study offers insights for proactive user interface and system design, which extends previous work on proactive IR [15]. Our approach involved utilizing questionnaires and specific interface tabs, yet the need for more seamless integration emerged. Following established AI interaction guidelines, a separate interface, such as a sidebar or secondary tab, could be deployed to provide supportive functions without interrupting the ongoing conversation with ChatGPT. This proactive interface could also adopt a conversation-based mechanism for capturing user perceptions and providing suggestions, assisting the main ChatGPT interaction. Such a system could leverage a fine-tuned LLM for analyzing ongoing conversations and offering task-aware recommendations [11,12,13,25]. This insight represents a step towards more intuitive, assistive AI that caters to diverse user needs without over-complicating the interaction.
Recognizing the Unique Position of Crowd Workers: This study revealed unexpected findings concerning crowd workers' interactions with ChatGPT. These anomalies may be attributable not solely to users' misunderstanding of the study's purpose but also to the inherent challenges these users faced in formulating their own tasks. HITs for crowd workers are usually straightforward and repetitive tasks, like data labeling, even though they still require instructions and a gold standard to ensure accurate labeling [28]. Although crowd workers can produce text resembling task plans to meet the HIT criteria, this text does not necessarily mirror their real information needs [21,27]. Formulating their "own" tasks or even identifying their information needs may be inherently complex tasks. Additionally, the emphasis on task-centric LLM use in this study might overshadow challenges with exploratory or open-ended user intentions. Therefore, it is crucial to comprehend the diverse information needs and usage contexts among more diverse user categories, so that the

Fig. 1. User study platform and prompt templates for supportive functions. Yellow boxes highlight the components for the questionnaires and supportive functions. Grey boxes contain features (including generative features) collected through the questionnaires. Solid arrows indicate the features collected in the pre-task questionnaire, subsequently utilized in prompt suggestions and conversation explanations through prompt engineering. Dashed and dotted boxes contain prompt templates, with {variable features}. Dotted arrows indicate the application of prompt templates in implementing the supportive functions.

Table 1. Features in the pre-task questionnaire and the conversation rating questionnaire.

Table 2. Average results of the tasks and user experience. * Duration reflects the gap between the start and end of the task, not the exact amount of time spent on the task.

Table 3. Summary and examples of task topics and types grouped by expected length and participant group.

Table 4. Summary of user experience and insights from the interview.