Analysing Utterances in LLM-Based User Simulation for Conversational Search

Clarifying underlying user information needs by asking clarifying questions is an important feature of modern conversational search systems. However, evaluation of such systems through answering prompted clarifying questions requires significant human effort, which can be time-consuming and expensive. In our recent work, we proposed an approach to tackle these issues with a user simulator, USi. Given a description of an information need, USi is capable of automatically answering clarifying questions about the topic throughout the search session. However, while the answers generated by USi are both in line with the underlying information need and in natural language, a deeper understanding of such utterances is lacking. Thus, in this work, we explore utterance formulation of large language model (LLM)–based user simulators. To this end, we first analyze the differences between USi, based on GPT-2, and the next generation of generative LLMs, such as GPT-3. Then, to gain a deeper understanding of LLM-based utterance generation, we compare the generated answers to the recently proposed set of patterns of human-based query reformulations. Finally, we discuss potential applications as well as limitations of LLM-based user simulators and outline promising directions for future work on the topic.


INTRODUCTION
Conversational information retrieval, also known as conversational search, refers to the process of retrieving relevant information in response to a natural language conversation or query. The primary goal of a conversational search system is to satisfy the user's information need by retrieving relevant information from a given collection. To successfully do so, the system needs to have a clear understanding of the underlying user need. Since users' queries are often under-specified and vague, the mixed-initiative paradigm of conversational search allows the system to take the initiative in the conversation and ask the user clarifying questions or issue other requests. Clarifying the
user information need has benefited both the user and the conversational search system [4,32,77], providing a solid motivation for such mixed-initiative systems.
However, evaluating the described mixed-initiative conversational search systems takes considerable work [47]. The challenge arises from the expensive and time-consuming user studies required for holistic evaluation of conversational systems [20]. Such studies require real users to interact with the search system for several conversational turns and provide answers to potential clarifying questions prompted by the system. A relatively simple solution is to conduct offline corpus-based evaluation [4]. However, this limits the system to selecting clarifying questions from a pre-defined set of questions, which does not transfer well to the real-world scenario. Moreover, such offline evaluation remains limited to single-turn interaction, as the pre-defined questions are associated with corresponding answers and unaware of previous interactions. User simulation has been proposed to tackle the shortcomings of corpus-based and user-based evaluation methodologies. A simulated user aims to capture the behaviour of a real user, i.e., being capable of having multi-turn interactions on unseen data, while still being scalable and inexpensive like other offline evaluation methods [7,56,78].
In this article, we extend our conversational User Simulator (USi), proposed in Sekulić et al. [61], to explore utterance formulation of large language model (LLM)-based user simulators. Given an initial information need, USi interacts with the conversational system by accurately answering clarifying questions prompted by the system. The answers align with the underlying information need and help elucidate the intent. Moreover, USi generates answers in fluent and coherent natural language, making its responses comparable to those of real users.
We experiment with two LLM-based approaches to simulate users. First, we base our proposed user simulator on a large-scale transformer-based language model. We fine-tune GPT-2 [49] to generate answers to posed clarifying questions. This method was presented in our recent paper [61]. Second, we use in-context learning, that is, prompting, a few-shot technique made possible with the next generation of LLMs, such as GPT-3 [12], LLaMa [71], and Chinchilla [28]. A GPT-3-based method, ConvSim, was recently proposed by Owoicho et al. [44]. Both methods generate answers to clarifying questions in line with the initial information need, simulating the behaviour of a real user. In the first case, we ensure this through a specific training procedure, resulting in a semantically controlled language model. With the GPT-3-based simulator, we utilise in-context learning (i.e., prompting) to guide the model into following specific steps to answer posed questions.
We evaluate the feasibility of our approaches with an exhaustive set of experiments, including automated metrics and human judgements. First, we compare the quality of the answers generated by our methods and several competitive sequence-to-sequence baselines by computing several automated natural language generation (NLG) metrics. In Sekulić et al. [61], we found that the GPT-2-based model significantly outperforms the baselines. This work extends the experiments to ConvSim [44], a GPT-3-based model, and finds even stronger performance. However, as automated NLG metrics often yield unreliable evaluations, we further analyse a crowdsourcing study conducted in Owoicho et al. [44] to assess how natural and accurate the generated answers are compared with answers written by humans. Furthermore, we extend this evaluation setting to a multi-turn conversational scenario. The crowdsourcing judgements show significant differences in both the naturalness and the usefulness of answers generated by USi and those generated by ConvSim. The ConvSim model outperforms USi, especially in the multi-turn setting, while its performance compared with human responses remains similar. Next, we perform a qualitative analysis of utterance reformulations generated by our LLM-based approaches in response to clarifying questions. We map our findings to recently proposed patterns for conversational recommender systems [79] and find that user simulators tend to rewrite the original query to further explain the underlying information need. However, we note that the types of such reformulations highly depend on the training data and the prompts given to the models. Finally, we discuss the applications and future work of LLM-based user simulators.
In summary, our contributions are the following.
-We compare two streams of LLM-based user simulators using automated NLG metrics.
-We analyse the types of utterances generated by LLM-based methods.
-We discuss in detail the potential applications of LLM-based user simulators, their cost, and their limitations. Moreover, we outline potential future work in the space of user simulation, aimed at going beyond answering clarifying questions.
The rest of the article is organised as follows. Section 2 reviews related work on the topic. Section 3 describes a user's role in conversational search system evaluation and the desirable characteristics of a simulated user. In Section 4, we motivate and describe in detail the implementation of the two approaches to user simulation, covering both USi [61] and ConvSim [44]. In Section 5, we design several experiments to answer key research questions on the feasibility of the proposed methods. In Section 6, we then analyse patterns identified in the simulators' responses, compare them to human-generated answers, and extend an existing set of patterns for utterance reformulations. We present the results in Section 7. In Section 8, we discuss the advantages and shortcomings of the approaches and outline our aspirations for future work. In Section 9, we present our conclusions.

RELATED WORK
Our work is part of a broad area of conversational information retrieval and user simulation.In this section, we present an overview of the relevant work on the topics.

Conversational Search
Recent advancements in conversational agents have stimulated research in conversational information access [16,67,75] that started many years earlier [18]. The report from the Dagstuhl Seminar N. 19461 [5] identifies conversational search as one of the essential areas of information retrieval (IR) in the upcoming years. Radlinski and Craswell [50] propose a theoretical framework for conversational search, highlighting multi-turn user-system interactions as one of the desirable properties of modern conversational search. This property is tied to the mixed-initiative paradigm in IR [30], where the system is not merely passive, but can take the initiative and prompt the user with engaging content, such as clarifying questions.
Clarification has attracted considerable attention from the research community, including studies on human-generated dialogues on question-answering (QA) forums, utterance intent analysis, and asking clarifying questions [11]. Asking clarifying questions is beneficial for both the conversational search system and the user. For example, Kiesel et al. [32] studied the impact of voice query clarification on user satisfaction and found that users like to be prompted for clarification. Moreover, Aliannejadi et al. [4] proposed an offline evaluation methodology for asking clarifying questions and showed the benefits of clarification in terms of improved performance in document retrieval once the question is answered. Hashemi et al. [27] proposed a Guided Transformer model for document retrieval and next clarifying question selection in a conversational search setting. Zamani et al. [77] proposed reinforcement learning-based models for generating clarifying questions and the corresponding candidate answers from weak supervision data. Sekulić et al. [59] proposed a GPT-2-based model for generating facet-driven clarifying questions.
Although extensive work related to clarification in search exists, effective and efficient evaluation methodologies for mixed-initiative approaches are still lacking.
Another research direction in the conversational search area is multi-turn passage retrieval, led by the TREC Conversational Assistant Track (CAsT) [19] and the Interactive Knowledge Assistance Track (iKAT) [2]. The system needs to understand the conversational context and retrieve appropriate passages from the collection. As a further improvement, Ren et al. [54] introduced the task of conversations with search engines, where the system generates a short, summarised response from the retrieved passages. Other studies in the area of conversational search include user intent classification [48], response ranking [19,58,63], document features for clarifying questions [62], user engagement prediction [39,60], and query rewriting [46,62,72].
In the field of natural language processing (NLP), researchers have studied question ranking [51] and generation [52,74] in dialogue. These studies usually rely on large amounts of data from query logs [53], industrial chatbots [74], and QA websites [51,52,70]. For example, Rao and Daumé [51] developed a neural model for question selection on an artificial dataset of clarifying questions and answers extracted from QA forums. Their later study proposed an adversarial training mechanism for generating clarifying questions given a product description from Amazon [52]. Unlike these studies, we study user-system interaction in an IR setting. The user's information need is presented as a short query (versus a long, detailed post on StackOverflow), resulting in a ranked list of relevant documents. Furthermore, the IR system can ask clarifying questions to elucidate the user's information need, which the user then needs to answer.

User Simulation in Information Retrieval
Given the complexity of human-computer interactions and natural language, there has been an ongoing discussion in the NLP community about the credibility of automatic evaluation metrics based on text overlap [43]. Metrics such as BLEU and ROUGE, which try to judge a system's output solely based on how much overlap it has with a reference utterance, cannot capture the performance of the system accurately [8]. Hence, human annotation should be performed to evaluate a system's performance when a generative model is used in summarisation and machine translation tasks. Moreover, the evaluation of a system becomes even more complex if there is an ongoing interaction between the user and the system. Not only must the generated utterance be evaluated, but the evaluation must also be able to incorporate a human response. For this reason, researchers adopt human-in-the-loop techniques to mimic human-computer interactions and further perform human annotation to evaluate the whole system's performance (in response to humans). Recent work by Lipani et al. [38] proposes a metric for offline evaluation of conversational search systems based on a user interaction model.
To alleviate the need for time-consuming and expensive human evaluation, researchers have proposed replacing the user with a user simulation system [56,66]. Simulation in IR has been studied since as early as 1973 [17], when it was used to generate pseudo-documents and pseudo-queries to analyse the performance of literature search systems. This work was followed by Griffiths [26], who proposed a general simulation framework for IR systems. Tague et al. [69] later studied the problems of user simulation in bibliographic retrieval systems. User simulation for evaluation was first proposed in 1990 by Gordon [25], who introduced a framework for generating simulated queries. This work has long been followed in the literature to study various hypothetical user and system actions (e.g., issuing 100 queries in a session) that cannot be performed in a real system [6]. In particular, Azzopardi [6] proposed to study the cost and gain of user and system actions and studied the effect of different strategies using simulated queries and actions of users (e.g., clicking on relevant documents). Mostafa et al. [42] studied different dimensions of users' interests and their impact on user modelling and information filtering. Diaz and Arguello [21] adapted an offline vertical selection prediction model in the presence of user feedback for user simulation.
More recently, there has been research on simulating users to evaluate the effectiveness of systems [15,56,66,76,78]. For example, Carterette et al. [15] proposed a conceptual framework for investigating various aspects of simulations: system effectiveness, user models, and user utility. With the recent developments of conversational systems, more attention has been drawn towards simulating users in a conversation. Sun et al. [66] proposed a simulated user for evaluating conversational recommender systems based on predefined actions and structured response types. Kim and Lipani [33] extend their work by offering a multi-task neural model that predicts user action, satisfaction, and an utterance in conversational recommender systems.
Closer to our work, Salle et al. [56] proposed a user simulator for information-seeking conversations in which the simulator takes an information need and responds to the system accordingly. However, we would like to draw attention to several limitations of their work. Even though their proposed simulator takes an information need as input and aims to answer the system's request according to that need, it does not generate responses. In other words, the approach is limited to predicting the relevance of the system's utterance to the user's information need and selecting an appropriate answer from a list of human-generated answers. The simulator is thus applicable only if predefined pools of clarifying questions and their answers are available. In this work, we take one step further and generate human-like answers in natural language. Similarly, the work by Zhang and Balog [78], which simulates users for recommender system evaluation, uses structured data and response types. This work proposes a simulator that generates natural language responses based on unstructured data.
Finally, Zhang et al. [79] study query reformulations in conversational recommender systems. They identify several types of query reformulations and find that users often reformulate their query by repeating or rephrasing the previous utterance, or by further expressing their information needs. In this work, we analyse reformulations of user utterances and utterances generated by our simulators in mixed-initiative conversational search systems.

USER SIMULATION
In this section, we explain a user's role in evaluating conversational search systems. We also discuss several desired characteristics of a user simulator and propose two simulation methods, with a focus on answering clarifying questions.

User's Role in Conversational Search System Evaluation
Previous work on task-oriented dialogue systems and conversational search systems mainly evaluates the performance of the systems in an offline setting using a corpus-based approach [20]. However, such offline evaluation cannot accurately reflect the nature of conversational systems, as the evaluation is possible only at a single-turn level. Thus, to adequately capture the nature of the conversational search task, it is necessary to involve users in the evaluation procedure [10,36]. User involvement allows proper evaluation of multi-turn conversational systems, in which users and systems take turns in a conversation. However, even though such an approach most precisely captures the performance of the systems in a real-world scenario, the involvement of users in the evaluation is tiresome, expensive, and unscalable. To ease the evaluation of dialogue systems while still accurately capturing overall performance, a simulated user approach has been proposed [66,78]. The simulated user is intended to provide a substitute for real users [7], as it is easily scalable, cheap, fast, and consistent. Next, we formally describe the characteristics of a simulated user for conversational search system evaluation.

Problem Definition
As mentioned, evaluating conversational search systems is challenging due to the necessity of human judgements at each turn of interaction with the search system. In this work, we aim to ease the procedure of evaluating certain types of mixed-initiative conversational search systems. Specifically, we provide a simulated user that, given an information need, is capable of answering various clarifying questions prompted by any modern conversational search system.
Our simulated user U is first initialised with a given information need in. The simulated user U formulates its need in the form of an initial query q, which is then given to a general mixed-initiative conversational system S. The system S elucidates the information need in through a series of clarifying questions cq. We do not go into the details of the implementation of such a system, but different approaches have been proposed in recent literature [4,27]. Next, the simulated user U needs to provide an answer a to the system's question. The answer a needs to align with the user's information need in.

Single-turn responses.
Formally, user U needs to generate an answer a to the system's clarifying question cq, conditioned on the initial query q and the original user's intent in:

a = U(q, cq, in). (1)

The user U is expected to answer the question in line with its information need, not just based on a potentially vague and under-specified query, as traditional chatbots would be inclined to do.

Conversation history-aware user.
The system can take further initiative and ask additional clarifying questions. Thus, our simulated user U also needs to track the conversation flow. Formally, at conversational turn i, U generates an answer given by

a_i = U(q, cq_i, in, H), (2)

where H is the conversational history, consisting of the interactions between the user and the system up to the current turn: H = {(cq_j, a_j)}, where j ∈ [1 . . . i − 1]. The following section explains how we model the described simulated user.
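The single-turn and history-aware formulations above can be summarised in a minimal interface sketch. This is a hypothetical illustration of the simulated user U, not the actual USi implementation; the class, method, and stub-answer text are our own:

```python
from dataclasses import dataclass, field

@dataclass
class SimulatedUser:
    """Illustrative interface for a simulated user U (not the real USi model)."""
    information_need: str                          # the information need `in`
    initial_query: str                             # the initial query `q`
    history: list = field(default_factory=list)    # H = [(cq_1, a_1), ...]

    def answer(self, clarifying_question: str) -> str:
        """Generate an answer a_i conditioned on `in`, `q`, cq_i, and H.
        A real simulator would invoke an LLM here; we return a stub string."""
        answer = (f"(answer to '{clarifying_question}' "
                  f"given need '{self.information_need}')")
        # Record the turn so that later answers can depend on the history H.
        self.history.append((clarifying_question, answer))
        return answer
```

A conversational search system would simply call `answer()` once per clarifying question, with the growing `history` standing in for H.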

SIMULATION METHODOLOGY
In this section, we motivate and describe the two approaches to user simulators for answering clarifying questions. The first is based on semantically controlled text generation via fine-tuning an LLM, specifically GPT-2 [49], as proposed in Sekulić et al. [61]. The other approach is based on in-context learning (prompting) with the next generation of LLMs, GPT-3 [12], as proposed by Owoicho et al. [44].

Semantically Controlled Text Generation
We define generating answers to clarifying questions as a sequence generation task. Thus, we employ language modelling as our primary tool for generating sequences. The goal of a language model (LM) is to learn the probability distribution p_θ(x) of a sequence x = (x_1, . . . , x_n) of length n, where θ are the parameters of the LM. Current state-of-the-art language models, such as GPT-2, learn the distribution in an auto-regressive manner, i.e., formulating the task as next-word prediction:

p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, . . . , x_{i−1}). (3)

However, recent research showed that transformer-based LLMs, although generating text of near-human quality, are prone to "hallucination" [22] and generally lack semantic guidance [55].
Thus, we fine-tune a semantically conditioned LM with a specific fine-tuning technique and careful input arrangement. As mentioned in the previous section, answer generation must be conditioned on the underlying information need. To this aim, we learn the probability distribution of generating an answer a:

p_θ(a | in, q, cq) = ∏_{i=1}^{|a|} p_θ(a_i | a_{<i}, in, q, cq), (4)

where a_i is the current token of the answer and a_{<i} are all the previous ones, whereas in, q, and cq correspond to the information need, the initial query, and the current clarifying question from Equation (1), respectively.
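Numerically, the auto-regressive factorisation above just multiplies per-token conditional probabilities. A toy worked example (the probability values are invented for illustration):

```python
def sequence_probability(token_conditionals):
    """Product of per-token conditionals p(a_i | a_<i, in, q, cq).

    `token_conditionals` holds one probability per answer token, each already
    conditioned on the preceding tokens and on (in, q, cq)."""
    prob = 1.0
    for cond in token_conditionals:
        prob *= cond
    return prob

# e.g., a three-token answer with conditionals 0.5, 0.5, 0.8
# has overall probability 0.5 * 0.5 * 0.8 = 0.2
```

In practice the model works with log-probabilities to avoid numerical underflow on long sequences, but the factorisation is the same.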

GPT-2-based simulated user.
GPT-2 is a large-scale, transformer-based LM trained on a dataset of 8 million web pages, capable of synthesising text of near-human quality [49]. As it is trained on a highly diverse dataset, it can generate text on various topics, which can be primed with an input sequence. GPT-2 has previously been used for various text-generation tasks, including dialogue systems and chatbots [13]. Therefore, it suits our task of simulating users by generating answers to clarifying questions in a conversational search system.
We base our proposed user simulator USi on the GPT-2 model with language modelling and classification losses, i.e., DoubleHead GPT-2. In this variant, the model learns to generate the appropriate sequence through the language modelling loss, as well as to distinguish the correct answer from a "distractor". This has been shown to improve sequence generation [49] and demonstrated superior performance over the language-modelling-loss-only GPT-2 in the initial stage of our experiments. The two losses are linearly combined.
Single-turn responses. We formulate the input to the GPT-2 model, based on Equation (4), as

x = [bos] in [SEP] q [SEP] cq [SEP] a [eos], (5)

where [bos], [eos], and [SEP] are special tokens indicating the beginning of the sequence, the end of the sequence, and a separator, respectively. The information need in, initial query q, clarifying question cq, and target answer a are tokenised before constructing the full input sequence to the model. Additionally, we construct segment embeddings, which indicate the different segments of the input sequence, namely, in, q, cq, and a.
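The single-turn input arrangement with its special tokens and segment labels can be sketched as follows. A toy whitespace tokeniser stands in for GPT-2's BPE tokeniser, and the function name is our own; token names follow the text:

```python
def build_input(in_text, query, cq, answer):
    """Arrange [bos] in [SEP] q [SEP] cq [SEP] a [eos], with one
    segment label per token (used to build segment embeddings)."""
    tokens, segments = ["[bos]"], ["in"]
    for seg_name, text in [("in", in_text), ("q", query),
                           ("cq", cq), ("a", answer)]:
        words = text.split()                 # toy word-level tokenisation
        tokens += words
        segments += [seg_name] * len(words)
        # Segments are separated by [SEP]; the answer closes with [eos].
        tokens.append("[eos]" if seg_name == "a" else "[SEP]")
        segments.append(seg_name)
    return tokens, segments
```

At training time the token sequence is mapped to vocabulary ids and the segment labels to segment-embedding ids before being fed to the model.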
When training the DoubleHead variant of the model, we formulate the first part of the input as described above. Additionally, we sample the ClariQ dataset for distractor answers and process them like the original answer, based on Equation (5). Therefore, the DoubleHead GPT-2 variant accepts as input two sequences, one ending with the original target answer and the other with the distractor answer. It then needs not only to learn to model the target answer, but also to distinguish between the original and distractor answers by providing a binary label indicating which of the two is the correct one. When possible, we ensure that if the target answer starts with "Yes", the distractor answer starts with "No", to enforce the connection between the answer, the clarifying question, and the information need. Likewise, if the answer starts with "No", we sample a distractor answer that begins with "Yes". Note that USi does not generate answers that begin strictly with a "yes" or a "no".
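The polarity-aware distractor sampling described above can be sketched as follows. This is a hypothetical helper of our own; in the actual training procedure the candidate pool comes from ClariQ:

```python
import random

def sample_distractor(target_answer, candidate_pool, rng=random):
    """Pick a distractor answer whose yes/no polarity opposes the
    target's, falling back to any candidate when no opposite exists."""
    target = target_answer.strip().lower()
    if target.startswith("yes"):
        opposite = [c for c in candidate_pool
                    if c.strip().lower().startswith("no")]
    elif target.startswith("no"):
        opposite = [c for c in candidate_pool
                    if c.strip().lower().startswith("yes")]
    else:
        opposite = []
    pool = opposite or candidate_pool
    return rng.choice(pool)
```

Pairing each "Yes"-prefixed target with a "No"-prefixed distractor (and vice versa) forces the classification head to rely on the information need rather than on surface polarity alone.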
Conversation history-aware model. The conversation history-aware model calls for a different input and training formulation. The input to the history-aware GPT-2 is constructed as

x = [bos] in [SEP] q [system] cq_1 [user] a_1 . . . [system] cq_i [user] a_i [eos], (6)

where [user] and [system] are additional special tokens indicating the conversational turns of the (simulated) user and the conversational system, respectively.
Inference. During inference, we omit the answer a from the input sequence, as our goal is to generate the answer to a previously unseen question. To generate answers, we use a combination of state-of-the-art sampling techniques to produce a textual sequence from the trained model. We utilise temperature-controlled stochastic sampling with top-k [23] and top-p (nucleus) filtering [29]. After initial experiments and consultation of previous work, we fix the temperature parameter to 0.7, k to 0, and p to 0.9.
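The decoding strategy (temperature-controlled sampling with nucleus filtering) can be illustrated on a toy next-token distribution. This is a plain-Python sketch of the technique, not the library implementation used for USi:

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Temperature scaling + nucleus (top-p) sampling over {token: logit}."""
    # 1. Temperature-scaled softmax (subtracting the max for stability).
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exps.values())
    probs = {t: e / z for t, e in exps.items()}
    # 2. Keep the smallest set of top tokens whose cumulative mass >= top_p.
    kept, cum = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # 3. Renormalise over the nucleus and sample.
    total = sum(p for _, p in kept)
    r, acc = rng.random() * total, 0.0
    for tok, p in kept:
        acc += p
        if acc >= r:
            return tok
    return kept[-1][0]
```

Lower temperature sharpens the distribution, while top-p truncation discards the unreliable low-probability tail, trading diversity for coherence.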

Prompt-Based Text Generation
In this section, we describe the prompt-based generation method. We follow Owoicho et al. [44] and utilise the recently developed GPT-3 [12] to answer posed clarifying questions. To this end, we use prompting [24], a method to describe the task for the LLM to perform without requiring further fine-tuning. The prompt is a chunk of text that describes the task we are interested in, preferably giving several examples of the task being executed. We want to generate the answer a to the clarifying question cq. As mentioned, the answer must align with the underlying information need description.
Prompt-based generation has several potential advantages over the previously introduced fine-tuning-based approach. For example, only a couple of examples must be given to the model, thus mitigating the need to create task-specific datasets. As such, prompt-based few-shot learning can adapt to various tasks. While we focus solely on utilising such methods for answering clarifying questions in this work, they can be used for other user-specific utterance-generation tasks, such as providing explicit feedback [44]. Nonetheless, prompting has only recently become a viable method, due to significant advancements in LLMs. However, this next generation of LLMs requires significantly more processing power and is not feasible to run on single-compute nodes, which consequently raises the cost of such methods. Thus, fine-tuning medium-sized LLMs, such as GPT-2, might still be a desirable path. In this article, we compare the two methods across several aspects and discuss the potential advantages of one over the other.
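A few-shot prompt in this style might be assembled as follows. This is our own hypothetical template for illustration, not the exact ConvSim prompt from Owoicho et al. [44]:

```python
def build_prompt(information_need, query, clarifying_question, examples):
    """Assemble a few-shot prompt for answering a clarifying question
    consistently with a given information need (illustrative template)."""
    header = ("You are a search-engine user with a specific information need. "
              "Answer the system's clarifying question consistently "
              "with that need.\n\n")
    # Few-shot demonstrations of the task being executed.
    shots = ""
    for ex in examples:
        shots += (f"Information need: {ex['need']}\n"
                  f"Query: {ex['query']}\n"
                  f"Clarifying question: {ex['cq']}\n"
                  f"Answer: {ex['answer']}\n\n")
    # The actual instance; the LLM completes the final "Answer:" line.
    task = (f"Information need: {information_need}\n"
            f"Query: {query}\n"
            f"Clarifying question: {clarifying_question}\n"
            f"Answer:")
    return header + shots + task
```

The resulting string would be sent to the LLM's completion endpoint; the model's continuation of the trailing "Answer:" line is taken as the simulated user's answer.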

EVALUATION METHODOLOGY
In this section, we describe our methodology for evaluating the proposed user simulation methods. We compare text generated by our simulators to human-generated text with regard to multiple aspects. First, we use automated NLG metrics to assess the differences between the two simulation methods. Second, we employ crowdsourcing to evaluate the usefulness and naturalness of the generated answers. Finally, we perform a qualitative analysis of the simulators' utterance reformulations and map them onto recently identified patterns (see [79]). All of the comparisons are performed in both single- and multi-turn settings.

Single-Turn Conversational Data
For training and evaluating our proposed simulated user USi, we utilise two publicly available datasets, Qulac [4] and ClariQ [3]. Both datasets aim to foster research on asking clarifying questions in open-domain conversational search. Qulac was created on top of the TREC Web Track 2009-12 collection. The Web Track collection contains ambiguous and faceted queries, often requiring clarification when addressed in a conversational setting. Given a topic from the dataset, clarifying questions were collected via crowdsourcing. Then, given a topic and a specific facet of the topic, workers were employed to gather answers to these clarifying questions. This results in tuples of (topic, facet, clarifying_question, answer). Most of the topics in the dataset are multi-faceted and ambiguous, meaning that the clarifying questions and answers must align with the actual facet. ClariQ is an extension of Qulac created for the ConvAI3 challenge [3] and contains additional non-ambiguous topics. Relevant statistics of the datasets are presented in Table 1. We utilise these datasets by feeding the corresponding elements into Equation (4). Specifically, facet from Qulac and ClariQ represents the underlying information need, as it describes in detail the intent behind the issued query. The clarifying question represents the currently asked question, whereas answer is our language modelling target.

Multi-turn Conversational Data
A significant drawback of Qulac and ClariQ is that they are both built for single-turn offline evaluation. A conversational search system will likely engage in a multi-turn dialogue to elucidate user needs. To bridge the gap between single- and multi-turn interactions, we construct multi-turn data that resembles a more realistic interaction between a user and the system. Our user simulator USi is then further fine-tuned on this data.
To acquire the multi-turn data, we set up a crowdsourcing-based human-to-human interaction. At each conversational turn, a crowdsourcing worker is tasked to behave as a search system by asking a clarifying question on the topic of the conversation. Then, another worker is tasked to provide the answer to that question, considering the underlying information need and the conversation history, imitating a real user's behaviour. We collect 500 conversations of depth up to three, i.e., we have three sequential question-answer pairs for a topic and its facet.
We construct several edge cases to further study the effects of specific clarifying questions on the search experience. In such cases, the clarifying question prompted by the search system is considered faulty, as it is either a repetition, off-topic, unnecessary, or completely ignores the user's previous answers. We obtain answers to these questions to provide more realistic data for the training of our model, making our simulated user as human-like as possible. These clarifying questions are intended to simulate a conversational search system of poor quality and provide insight into users' responses to such questions. We employ workers to provide answers to an additional 500 clarifying questions of poor quality, up to a depth of two. The specific edge cases and their descriptions with examples are presented in Table 2. We publicly released the acquired multi-turn datasets in Sekulić et al. [61]. In this work, we use the multi-turn data to evaluate both the fine-tuning-based and the prompting-based approach to generating answers to clarifying questions in a conversational setting.

Research Questions
We aim to evaluate whether our proposed simulated user can replace real users in answering clarifying questions of conversational search systems, which would make evaluating such systems significantly less troublesome. Overall, we aim to answer four main research questions, extending the list from Sekulić et al. [61]: RQ1: To what extent are the answers generated by the two simulation methods in line with the underlying information need? RQ2: How coherent and natural is the language of the generated answers? RQ3: How do LLM-based simulators behave in multi-turn interactions? RQ4: What are the advantages and disadvantages of either simulation methodology? In order to address these questions, we first compute several NLG metrics to compare the generated answers to the oracle human answers from ClariQ. As several NLG metrics have received criticism from the NLP community, especially since they do not correlate well with the text's coherence, we perform a crowdsourcing study to evaluate the naturalness of the generated answers. To evaluate whether the generated answers align with the information need, we conduct an additional crowdsourcing study and assess the usefulness of the answers. Finally, we perform a qualitative analysis of the generated answers by identifying certain patterns in utterance formulations.
As in Sekulić et al. [61], we compare our LLM-based user simulators with two sequence-to-sequence baselines. The first baseline is a multi-layer bidirectional long short-term memory (LSTM) encoder-decoder network for sequence-to-sequence tasks [68]. 1 The second baseline is a transformer-based encoder-decoder network, based on Vaswani et al. [73]. We perform a hyperparameter search to select the models' learning rate, number of layers, and hidden dimension. Both baselines are trained with the same input as our primary model.

Automated NLG Metrics
We first study the language-generation ability of USi and the baselines. We compute several standard metrics for evaluating generated language. We use two widely adopted metrics based on n-gram overlap between the generated and the reference text: BLEU [45] and ROUGE [37]. Next, we compute the EmbeddingAverage and SkipThought metrics to capture the semantics of the generated text, as they are based on the word embeddings of each token in the generated and the target text. The EmbeddingAverage metric is then defined as the cosine similarity between the means of the word embeddings in the two texts [35]. The models are trained on the ClariQ training set and evaluated on the unseen ClariQ development set. We evaluate on ClariQ's development set because the test set does not contain question-answer pairs; we hold out a small portion of the training set for development. The answers generated by USi and the baselines are compared against oracle answers from ClariQ, written by humans.
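The EmbeddingAverage computation described above reduces to a cosine similarity between two mean vectors. A minimal sketch, assuming a token-to-vector map `emb` as a stand-in for real pre-trained word embeddings:

```python
from math import sqrt

def embedding_average_similarity(generated, reference, emb):
    """EmbeddingAverage [35]: cosine similarity between the mean word
    embeddings of the generated and the reference texts.
    `emb` maps a token to its embedding vector (a hypothetical stand-in)."""
    def mean_vec(tokens):
        vecs = [emb[t] for t in tokens]
        # Average each dimension over all token vectors.
        return [sum(dim) / len(vecs) for dim in zip(*vecs)]

    g, r = mean_vec(generated), mean_vec(reference)
    dot = sum(a * b for a, b in zip(g, r))
    norm = sqrt(sum(a * a for a in g)) * sqrt(sum(b * b for b in r))
    return dot / norm
```

Identical texts score 1.0; texts whose mean embeddings are orthogonal score 0.0.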

Response Naturalness and Usefulness
To simulate a real user, the responses generated by our model need to be fluent and coherent. Thus, we study the naturalness of the generated answers. We define naturalness as an answer being natural, fluent, and likely to have been produced by a human. Similar notions of fluency [14] and humanness [57] have been used for evaluating generated text. We also assess the usefulness of the answers generated by our simulated user. We define a useful answer as one that is in line with the underlying information need and guides the conversation towards the topic of the information need. This definition of usefulness relates to similar metrics in previous work, such as adequacy [65] and informativeness [16].
We perform a crowdsourcing study to assess the naturalness and usefulness of generated answers to clarifying questions. We use Amazon Mechanical Turk to recruit workers based in the United States with at least a 95% task approval rate. The study was done in a pair-wise setting, i.e., each worker was presented with several answer pairs. One answer in each pair was generated by our model, and the other was written by a human, taken from the ClariQ collection. The workers' task was to judge which answer was more natural or more useful, depending on the study. The workers were provided with the context, i.e., the initial query, facet description, and clarifying question.
We annotate 230 answer pairs for naturalness and 230 answer pairs for usefulness, each judged by two crowdsource workers. We define a win for our model if both annotators voted our generated answer as more natural/useful, and a loss for our model if both voted the human-generated answer as more natural/useful. If the two workers voted differently on an answer pair, we count it as a tie. With this study, we aim to shed light on research questions RQ1 and RQ2, i.e., whether the generated answers are natural and in line with the underlying information need compared with human-generated answers. Additionally, we compare Transformer-seq2seq to USi.
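The win/loss/tie aggregation described above reduces to a small decision rule over the two annotators' votes; a sketch with illustrative labels:

```python
def judge_pair(vote_a, vote_b):
    """Aggregate two annotators' pairwise votes into a win, loss, or tie
    for the simulated user; each vote is 'model' or 'human'."""
    if vote_a == vote_b:
        # Unanimous verdict: win if both picked the model's answer.
        return "win" if vote_a == "model" else "loss"
    return "tie"  # annotators disagree
```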
We also compare the two LLM-based simulation approaches in a multi-turn conversational setting. The results of both single- and multi-turn comparisons are presented in Section 7.2.

RESPONSES TO CLARIFYING QUESTIONS
In this section, we analyse human- and simulator-generated answers to posed clarifying questions. Specifically, we conduct expert annotation to identify patterns in the given answers, grounding our findings in prior work. To this end, we analyse the answers in light of patterns identified by Krasakis et al. [34], who focus on the Qulac dataset [4]. Krasakis et al. [34] find that users' answers vary in polarity and length. For example, the user can answer with a short negative answer, such as "No", but also potentially provide a longer answer, e.g., "No, I'm looking for X instead". Naturally, the answer can also be of positive polarity, depending on the information need and the prompted clarifying question. Furthermore, we compare the generated answers to patterns identified by Zhang et al. [79]. Although Zhang et al. [79] focus on query reformulations in conversational recommender systems, we find a high overlap in findings. Thus, we map their proposed query reformulation types to answers in mixed-initiative conversational search. Finally, we analyse answers to faulty clarifying questions proposed by Sekulić et al. [61].

Response Patterns
We analyse answers to prompted clarifying questions in light of previously identified utterance reformulation types [79]. In other words, we map and expand the existing utterance reformulation ontology for conversational recommender systems to answer formulations in conversational search. While certain differences exist between recommender and search systems, our initial analysis suggested that the shared conversational setting incites similar user behaviours. In their study, Zhang et al. [79] analyse how users reformulate their utterances in subsequent turns given a prompt from the conversational recommendation agent about its lack of understanding of the user's needs. Similarly, in conversational search, we have the user's initial query, the clarifying question prompted by the search system, and the user's answer. Thus, we analyse these answers through the lens of reformulations of the user's initial query.
Zhang et al. [79] identify seven utterance reformulation behaviours: (1) start/restart - the user starts to present their need; (2) repeat - the user repeats the previous utterance without significant change; (3) repeat/rephrase - the user repeats the last turn with different wording; (4) repeat/simplify - the user repeats the last utterance with a simpler expression, reducing complexity; (5) clarify/refine - the user clarifies or refines the expression of an information need; (6) change - the user changes the information need (topic shift); (7) stop - the user ends the search session. We encourage the interested reader to refer to Zhang et al. [79] for a more elaborate explanation of the reformulation types. In our analysis, we focus specifically on answers to clarifying questions; thus, some utterance types are not observed in our setting. Namely, by the design of our research setting, described in Section 4, we do not deal with utterance types (1) start/restart, (6) change, or (7) stop. However, we add two additional categories, mostly to deal with edge cases: (8) hallucination - when the provided answer is not in line with the underlying information need, and (9) short answer - when the answer is just "no" or "yes". Examples of the observed utterance types are presented in Table 4.
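Two of the categories above, (2) repeat and (9) short answer, are easy to approximate with string heuristics. A hypothetical sketch (the annotation in this work was done by experts, not by rules like these):

```python
def tag_answer(previous_utterance, answer):
    """Heuristically tag an answer as 'short answer', 'repeat', or 'other'.
    A rough approximation of two of the reformulation types from [79]."""
    def norm(s):
        # Normalise case and trailing punctuation before comparing.
        return s.strip().lower().rstrip(".!?")

    if norm(answer) in {"yes", "no"}:
        return "short answer"
    if norm(answer) == norm(previous_utterance):
        return "repeat"
    return "other"
```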

Responses to Faulty Clarifying Questions
To gain further insight into designing a reliable user simulator for conversational search evaluation, we must make the simulator resilient to unexpected system responses. For example, if a conversational search system responds with an off-topic clarifying question or an unrelated passage, the simulated user needs to react in a natural, human-like manner. However, to design such a simulator, we first need to learn how real users react to faulty responses from the search system. To this end, we acquired a dataset of human responses to faulty clarifying questions. The published dataset is multi-turn and can thus be used to improve our multi-turn user simulator model.
Examples from the acquired dataset are presented in Table 2. The dataset contains several scenarios in which a conversational search system asks follow-up clarifying questions. We acquired a dataset of 1,000 conversations, with crowd workers assuming the user role and responding to clarifying questions. Initial analysis of the crowd workers' answers offers several insights. In the case of appropriate clarifying questions (Natural), users tend to respond naturally by refining their information needs, as expected. However, in the case of faulty clarifying questions (repeat, off-topic, or similar), users either repeat their previous answer (20% of analysed answers), expand their last reply with more details on their information need (23%), or rephrase the previous answer with different wording (37%). Next, we aim to evaluate the resilience of our proposed USi to such faulty questions by analysing its correspondence to human-generated answers.

RESULTS
In this section, we present the results of the evaluation methods described above. First, we show the performance of user simulation approaches as measured by automated NLG metrics, followed by a crowdsourcing-based study on response usefulness and naturalness. Finally, we qualitatively analyse the generated utterances.

Automated NLG Metrics
The performance of the baseline models and our simulated user models, as evaluated by the automated NLG metrics described in Section 5.4, is presented in Table 3. USi significantly outperforms all baselines on all computed metrics on the ClariQ data. Even though LSTM-seq2seq has shown strong performance on various sequence-to-sequence tasks, such as translation [68] and dialogue generation [64], it performs relatively poorly on our task. A similar outcome is observed for Transformer-seq2seq. We hypothesise that the poor performance on this task is due to limited training data, as the success of these seq2seq models on various tasks was conditioned on large training sets. Our GPT-2-based model does not suffer from the same problem, as it has been pre-trained on a large body of text, making fine-tuning sufficient to capture the essence of the task, namely generating answers to clarifying questions. An interesting observation is that the GPT-3-based model, ConvSim [44], performs worse than the GPT-2-based USi. We attribute this result to the aforementioned unreliability of automated NLG metrics: they capture only exact matches between the wording of the generated answer and the gold answer, largely failing to account for differences in vocabulary between two answers that convey the same message. Thus, in the next section, we report more reliable crowdsourcing-based annotations of the generated answers.

Naturalness and Usefulness
Table 6 presents the results of the crowdsourcing study on usefulness and naturalness, comparing answers generated by USi and humans, as described in Section 5.5. Both in terms of naturalness and usefulness, we observe a large number of ties, i.e., the two workers annotating an answer pair did not agree on which one is more natural/useful. Since we are comparing answers generated by our GPT-2-based simulated user with answers written entirely by humans, this result favours our proposed model. Moreover, the difference between losses and wins for our model is relatively small (38% losses, 17% wins) for naturalness and even smaller for usefulness (32% losses, 23% wins). We conduct a trinomial test for statistical significance [9], an alternative to the Sign and binomial tests that considers ties. 2 Regarding naturalness, we reject the null hypothesis of equal

Repeat/rephrase: repeat the last utterance using different wording.
U: I want to learn about rock art. S: Do you want pictures of rock art? U: Yes, I want information on rock art.

Repeat/simplify: repeat the last utterance in simpler language.
U: Find me information about the sales tax in Illinois. S: Do you want to know what the Illinois state tax rate is? U: Yes, I want to know about the taxes in the state.

Clarify/refine: refine the information need expressed in the last utterance.
IN: Find the homepage for the NASA Voyager mission. U: Tell me about Voyager. S: Are you interested in where the Voyager has been? U: No, I am interested in the website for the mission.

Other: short response, none of the above applicable.
U: "All men are created equal" S: Are you looking for the Declaration of Independence at the National Archives? U: No.

Hallucination: response not grounded in the information need.
IN: How is workplace diversity achieved and managed? U: Tell me about diversity. S: Would you like the legal definition of diversity? U: Yes, I would like to know about the definition of diversity.

The patterns are extended from the list proposed by Zhang et al. [79].
performance with p < 0.05, i.e., human-generated answers are judged more natural than those generated by USi. Nonetheless, the 45% of ties between USi- and human-generated answers suggests the high quality of the generated text. Regarding usefulness, we accept the null hypothesis of equal performance with p = 0.43, i.e., there is no statistically significant difference between the performance of human annotators and USi. Table 5 presents the comparison between Transformer-seq2seq and USi. We observe a win of the proposed USi over the baseline by a large margin: our GPT-2-based model significantly outperforms the baseline (p < 0.05) both in terms of naturalness (50% wins and 3% losses) and usefulness (66% wins and 3% losses). This finding is in line with the automated evaluation of generated answers.

Table 5. Results on Naturalness and Usefulness of Responses, USi vs. Transformer-Seq2Seq Baseline [61] vs. Human-Generated Answers. The † sign indicates a statistically significant difference under the trinomial test with p-value < 0.05.
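The trinomial test [9] used here can be sketched as follows: under the null hypothesis, wins and losses are equally likely, the tie probability is estimated from the data, and the p-value sums the probabilities of all outcomes at least as extreme as the observed win-loss difference. A minimal enumeration-based sketch, not the authors' implementation:

```python
from math import comb

def trinomial_test(wins, losses, ties):
    """Two-sided trinomial test for paired comparisons with ties [9]."""
    n = wins + losses + ties
    p_tie = ties / n              # tie probability estimated from the sample
    p = (1 - p_tie) / 2           # P(win) = P(loss) under the null hypothesis
    d_obs = abs(wins - losses)    # test statistic: win-loss difference
    p_value = 0.0
    for w in range(n + 1):
        for l in range(n + 1 - w):
            if abs(w - l) >= d_obs:   # outcome at least as extreme as observed
                t = n - w - l
                p_value += comb(n, w) * comb(n - w, l) * p**w * p**l * p_tie**t
    return p_value
```

An even split of wins and losses yields a p-value of 1, while a one-sided sweep of wins is highly significant.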
Regarding research questions RQ1 and RQ2, i.e., whether the responses generated by our model are in line with the underlying information need while being coherent and fluent, we conclude that the simulated user performs satisfactorily. The generated answers to clarifying questions are able to compete with answers produced by humans, both in terms of naturalness and usefulness. Moreover, the strong performance of USi over Transformer-seq2seq additionally motivates the use of large-scale pre-trained language models, such as GPT-2, for the task. These results make a strong case for using a user simulator for mixed-initiative conversational search system evaluation.

Single-Turn Analysis.
In this section, we analyse several sample conversations of our user simulator with a hypothetical conversational search system. Table 7 shows five interaction examples. The user simulator USi is initialised with the information need description text. Given an initial query (omitted in the table for space), the conversational search system asks a clarifying question to elucidate USi's intent. Then, USi generates the answer to the prompted question. The information needs and questions for these examples are taken from the ClariQ development set. Most TREC-style datasets contain an information need (facet/topic) description alongside the initial query. Thus, our simulated user can help evaluate conversational search systems on any such dataset, as it only requires the description for initialisation. The system we aim to evaluate can then produce clarifying questions and receive answers from USi.
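The initialisation described above amounts to serialising the information need description, the initial query, and the conversation so far into a single model input. A sketch of such serialisation; the separator tokens are our assumption, not the exact format used by USi:

```python
def build_simulator_input(info_need, query, history, question):
    """Serialise the simulation context into one input string for a
    fine-tuned generative LM. `history` is a list of (question, answer)
    pairs from earlier turns; the model generates text after [USER]."""
    turns = "".join(f" [SYS] {q} [USER] {a}" for q, a in history)
    return f"[NEED] {info_need} [QUERY] {query}{turns} [SYS] {question} [USER]"
```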
The first two examples in Table 7 initialise USi with different information needs. However, given the same initial query "How to cure angular cheilitis" and the same prompted clarifying question, USi answers differently, in line with the underlying information need in each case. In the table's last three rows, we have different information needs under one broad topic of hobby stores. Given the initial query "I'm looking for information on hobby stores", USi again answers questions in line with the underlying information need. We notice that the text produced by our GPT-2-based user simulator is coherent and fluent and, in the given examples, indeed in line with the underlying information need. Moreover, USi is not bound to answering the question in a "yes" or "no" fashion. Instead, it can produce various answers and even express its uncertainty (e.g., "I don't know"). Table 8 shows the prevalence of the aforementioned types of utterance reformulations on the ClariQ development set. We expertly annotated 150 answers generated by both generative approaches, as well as human answers taken directly from the ClariQ dataset. As indicated in Table 8, USi hallucinates in 7% of analysed cases. The hallucination accounted for in the table is limited to cases in which a long answer is generated. However, we observed that USi often produces a faulty short answer. For example, with an information need related to finding a list of dinosaurs with pictures, when prompted with the clarifying question "Are you looking for pictures of dinosaurs?", USi answers "No". Such short answers are mapped under Other in Table 8, as the focus of the analysis was to capture the extent of short versus long answers. Moreover, we observe the hallucination phenomenon in several answers taken from ClariQ, constructed by crowd workers. We attribute this to the potentially swift manner in which the answers were written rather than to crowd workers not understanding that their answers were not in line with the given information
need. In contrast, the prompting-based, GPT-3-based ConvSim method does not suffer from this issue.
The prevalence of different utterance reformulations differs between human-generated answers and the answers generated by USi and ConvSim. Specifically, we observe a greater frequency of short answers (e.g., "yes", "no") among answers generated by the GPT-2-based USi. On the other hand, the GPT-3-based ConvSim tends to refine and clarify the given information need in the majority of cases. While both long and short answers to clarifying questions are acceptable, as long as they are in line with the information need, certain users have a slight preference towards one or the other. Thus, as discussed in the last section, as a step towards more realistic user simulators, we aim to model users according to their cooperativeness level. In other words, the simulator would generate either concise or long and elaborate answers depending on the cooperativeness parameter of a specific underlying user model.

Multi-Turn Analysis.
We perform an initial case study on the multi-turn variant of USi. While the initial analysis of multi-turn conversations suggests that the usefulness and naturalness of single-turn interactions transfer to the multi-turn setting, additional evaluation is needed to strongly support that claim. Thus, future work includes a pair-wise comparison of multi-turn conversations inspired by ACUTE-Eval [36].
We also aim to observe user simulator behaviour in unexpected, edge-case scenarios. For example, initial analysis of the created multi-turn dataset showed that humans tend to repeat their previous answers when the clarifying question is off-topic or repeated. Similarly, our multi-turn USi has been observed to generate answers such as "I already told you what I'm looking for" when prompted with a repeated question. However, such edge cases remain challenging for the multi-turn model, which exhibits a higher presence of hallucination than the single-turn variant. This means that the user simulator drifts off the topic of the conversation and starts generating answers outside the underlying information need. This effect is well documented in recent literature on text generation [22] and should be approached carefully. Although edge cases are also present in the acquired dataset, the GPT-2-based model needs additional mechanisms to simulate the behaviour of users in such cases. We leave a deeper analysis of the topic for future research.

DISCUSSION AND FUTURE WORK
In this section, we discuss the advantages and shortcomings of both simulation approaches, their applicability to evaluation, and topics for future work.

Performance versus Cost
While both the GPT-2- and GPT-3-based user simulators can generate natural and useful answers to clarifying questions, as demonstrated by our experiments in Section 7.2, GPT-3 is still significantly better. The difference in performance becomes wider in the multi-turn setting, indicating the overall superiority of the GPT-3-based simulator for the task. This was expected, as GPT-3 was trained on a significantly larger dataset (570 GB of text) than GPT-2 (40 GB of text) and is much larger in terms of parameters (175 billion for GPT-3 vs. 1.5 billion for GPT-2) [12]. However, the increase in performance comes with an increase in cost. The Davinci model used in our experiments costs $0.0200 per 1K tokens. 3 Moreover, the cost of deploying such models extends well beyond that of GPT-2-based methods: to run GPT-3, we need at least five 80-GB A100 GPUs, 4 whereas GPT-2 runs smoothly on a single 12-GB GPU. As such, it would be highly beneficial if smaller-scale LLMs could be used on specific tasks with reasonable success. Achieving such performance with smaller-scale LLMs could be a direction towards sustainable artificial intelligence (AI) [41].
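At the quoted price, the monetary cost of an API-based simulation run scales linearly with token usage. A back-of-the-envelope sketch; the conversation sizes in the example are illustrative assumptions, not measurements from our experiments:

```python
def api_cost_usd(conversations, turns_per_conv, tokens_per_turn,
                 usd_per_1k_tokens=0.02):
    """Estimate the API cost of a GPT-3-based simulation run at the
    quoted Davinci rate of $0.0200 per 1K tokens."""
    total_tokens = conversations * turns_per_conv * tokens_per_turn
    return total_tokens / 1000 * usd_per_1k_tokens

# e.g., 1,000 conversations of 3 turns, ~500 prompt+completion tokens per turn
```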

Beyond Answering Clarifying Questions
In this article, we focused on simulators for answering clarifying questions posed by the system. However, as indicated in Section 3, a user's role in conversational search system evaluation extends beyond answering clarifying questions. Thus, a well-designed user simulator should have additional properties, such as information-seeking behaviour and explicit feedback generation. We hypothesise that GPT-2- and GPT-3-based methods could be used to a reasonable extent for such purposes. However, an essential distinction between the two methods is that GPT-2 would require fine-tuning. This entails the need for appropriate datasets for each property we intend to include in the simulator. For example, if we aim to include an explicit feedback generation module in our simulator, that is, the simulator's ability to generate positive or negative feedback in natural language to the system's responses, we would require a substantially large dataset of such feedback, together with initial queries and the system's responses. Then, GPT-2 and similar models could be fine-tuned for the task.
On the other hand, the next generation of LLMs, such as GPT-3, allows the use of an in-context learning approach - prompting, explained in Section 4.2. This eliminates the need for task-specific datasets, as the prompt contains a brief task description and a few examples. The description and the examples of the task being carried out correctly can be designed directly by the researcher and typically do not require external annotation. Thus, such LLMs are highly adaptable to other simulator-related tasks, making extensions towards the aforementioned desired properties significantly easier. While some of these properties, namely providing explicit feedback, have recently been explored in ConvSim [44], future work includes expanding current simulators to the others. However, we note that prompt design is not a silver bullet compared with the fine-tuning approach, as minor adjustments to the prompt text can result in the generation of vastly different utterances.
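A prompting-based simulator of the kind described here can be driven by a few-shot prompt assembled entirely by the researcher. A sketch; the template wording is our assumption, not ConvSim's actual prompt:

```python
def few_shot_prompt(task_description, examples, info_need, question):
    """Build a few-shot prompt from a task description and worked examples.
    `examples` is a list of (information_need, question, answer) triples;
    the LLM is expected to complete the final 'Answer:' line."""
    shots = "\n\n".join(
        f"Information need: {n}\nQuestion: {q}\nAnswer: {a}"
        for n, q, a in examples
    )
    return (f"{task_description}\n\n{shots}\n\n"
            f"Information need: {info_need}\nQuestion: {question}\nAnswer:")
```

Swapping a single example or rewording the task description changes the prompt wholesale, which illustrates why prompt design is brittle compared with fine-tuning.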

Parameterised Simulator
We analysed the types of utterance formulations produced by our generative LLM-based simulators. As reported in Section 7.3, we observe certain patterns in the formulations and discuss specific differences between the GPT-2- and GPT-3-based models. Our line of thought is that a fine-tuned model exhibits the behaviours of the data it was trained on; as such, the distribution and variety of its answers resemble the training data. On the other hand, prompting allows easier tweaking of the reformulations we want the simulator to exhibit. However, a realistic user simulator should closely follow the behaviours of real users [7]. To achieve that, we need both an underlying user model to design our simulators by and more control over the types of utterances the simulators generate. A solution towards this goal lies in parameterised user simulators.
Parameterised user simulators would allow for adjustment towards certain types of users. For example, Salle et al. [56] model cooperativeness - how lengthy and informative the simulator's responses are - and patience - how many turns of answering clarifying questions the simulator is willing to partake in before giving up. Similarly, Owoicho et al. [44] model ConvSim's patience. However, many other parameters could be introduced, allowing for fine-grained user models. Such parameters include demandingness - how precise the system's response needs to be for the user to provide positive feedback - and chattiness - whether the user simulator simply answers questions and gives feedback or also includes certain conversational elements (e.g., "That's interesting", "I didn't know that"). Another important aspect of user simulators is their ability to develop and change their information need throughout the conversation, including shifting the topic of the conversation completely.
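The parameters discussed above can be made explicit in the simulator's interface. A minimal sketch with hypothetical parameter semantics; a real simulator would condition an LLM on these values rather than switch between canned strings:

```python
from dataclasses import dataclass
import random

@dataclass
class UserModel:
    cooperativeness: float  # 0.0 = terse answers, 1.0 = long, elaborate answers
    patience: int           # clarifying turns tolerated before giving up

def respond(model, turn, short_answer, long_answer, rng):
    """Choose a response style according to the user model's parameters."""
    if turn >= model.patience:
        # Patience exhausted: react as observed for repeated/faulty questions.
        return "I already told you what I'm looking for."
    if rng.random() < model.cooperativeness:
        return long_answer
    return short_answer
```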

Applications of the Simulator
The first application of our proposed simulators, discussed throughout this article, is the evaluation of conversational search systems. This can be done by letting the search system interact with the simulator, which assumes the user's role and exhibits the implemented behaviours. The conversational search system's primary goal is to satisfy the underlying information need, which can be evaluated using the standard Cranfield paradigm [38]. In other words, starting from the initial query, the system must provide a ranked list of documents, which is evaluated against query relevance judgements. Evaluation via simulation enables quicker, more cost-effective, and more robust comparisons of conversational search systems without a potential loss in quality.
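The evaluation loop just described can be sketched as a simple driver: the system clarifies against the simulator for a few turns, then its final ranking is scored against relevance judgements. All components below are hypothetical stubs standing in for a real system, simulator, and IR metric:

```python
def evaluate_with_simulator(ask, rank, answer, query, qrels, max_turns=3):
    """Simulator-in-the-loop, Cranfield-style evaluation sketch.

    ask(query, history)  -> clarifying question (system)
    answer(question)     -> natural-language answer (simulator)
    rank(query, history) -> ranked list of document ids (system)
    qrels                -> dict mapping document id to relevance grade
    Uses precision@3 as a stand-in for any standard IR metric.
    """
    history = []
    for _ in range(max_turns):
        question = ask(query, history)
        history.append((question, answer(question)))
    ranking = rank(query, history)
    k = min(3, len(ranking))
    return sum(qrels.get(doc, 0) > 0 for doc in ranking[:k]) / k
```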
Beyond evaluating conversational search systems, user simulators can be used to create synthetic data for downstream tasks [40]. Moreover, we can identify breaking points of search systems by purposely generating faulty utterances, thus probing the search system's robustness. For example, we might design a simulator that performs a sudden topic shift and indicates what kind of actions would be expected of the search system.

CONCLUSIONS
In this article, we have examined two recently proposed approaches for simulating users to alleviate the evaluation of mixed-initiative conversational search systems. More specifically, we demonstrated the feasibility of substituting expensive and time-consuming user studies with scalable and inexpensive user simulators. Through several experiments, including automated metrics and crowdsourcing studies, we showed the simulators' capabilities in generating fluent and accurate answers to clarifying questions prompted by the search system. A crowdsourcing study of answer usefulness and naturalness showed that answers generated by USi tied with human-generated answers in 51% and 45% of cases, respectively. Moreover, we confirmed the even stronger performance of the GPT-3-based simulator, ConvSim, especially in the multi-turn setting. We also performed a qualitative analysis of the utterance reformulation types, finding that the majority of the answers aim to clarify and refine the underlying information need. Based on all of the results and observations, we discussed further steps towards a more realistic simulator, introducing parameters that would allow for modelling different types of users, such as cooperativeness and patience. Moreover, we plan to investigate the application of ConvSim for proactive user simulation [1].

Table 1 .
Statistics for Qulac and ClariQ Datasets

Table 2 .
Multi-turn Dataset Acquired Through Crowdsourcing
Example - U: I'm looking for an online world atlas. S: Are you interested in satellite maps? U: No, I want an online world atlas. S: Which mountain ski resort around the Pocono area would you like information on? U: I am not interested in this topic.
400 sample conversations of depth three are omitted due to space limitations.

Table 3 .
Performance of Different Answer Generation Methods, Measured by Automated NLG Metrics on the ClariQ Development Set

Table 4 .
Identified Reformulation Patterns in Responses Generated by Our Proposed User Simulator(s)

Table 6 .
Results of a Crowdsourcing Study Assessing Naturalness and Usefulness of Generated Answers Between ConvSim, USi, and Human-Constructed Answers

Table 7 .
Qualitative Analysis of Answers Generated by User Simulator USi

Table 8 .
Prevalence of Utterance Reformulation Types for Answers to Clarifying Questions