Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions

Conversational question-answering (CQA) systems aim to create interactive search systems that effectively retrieve information by interacting with users. To replicate human-to-human conversations, existing work uses human annotators to play the roles of the questioner (student) and the answerer (teacher). Despite its effectiveness, challenges exist as human annotation is time-consuming, inconsistent, and not scalable. To address this issue and investigate the applicability of large language models (LLMs) in CQA simulation, we propose a simulation framework that employs zero-shot learner LLMs for simulating teacher-student interactions. Our framework involves two LLMs interacting on a specific topic, with the first LLM acting as a student, generating questions to explore a given search topic. The second LLM plays the role of a teacher by answering questions and is equipped with additional information, including a text on the given topic. We implement both the student and teacher by zero-shot prompting the GPT-4 model. To assess the effectiveness of LLMs in simulating CQA interactions and understand the disparities between LLM- and human-generated conversations, we evaluate the simulated data from various perspectives. We begin by evaluating the teacher's performance through both automatic and human assessment. Next, we evaluate the performance of the student, analyzing and comparing the disparities between questions generated by the LLM and those generated by humans. Furthermore, we conduct extensive analyses to thoroughly examine the LLM performance by benchmarking state-of-the-art reading comprehension models on both datasets. Our results reveal that the teacher LLM generates lengthier answers that tend to be more accurate and complete. The student LLM generates more diverse questions, covering more aspects of a given topic.


INTRODUCTION
Over the years, the information retrieval (IR) community has strived to create an interactive and iterative search system that effectively retrieves information [4,9,13,21]. Recent advancements in conversational question-answering (CQA) systems have been successful in achieving this goal by retrieving relevant information and engaging in back-and-forth interactions with users to fully understand their information needs [34,38]. In this setting, existing work captures the iterative dynamics of conversations by having a set of annotators play the roles of the questioner (student) and the answerer (teacher) over a pre-defined search topic [11,24,49].
Despite the effectiveness of previous efforts in this task, several drawbacks exist. One major challenge is maintaining a large team of annotators to generate a substantial number of conversations. This process can be time-consuming, resource-intensive, and expensive. Additionally, relying solely on human annotators may introduce variations in the quality and consistency of the generated conversations. Also, in many cases, a human student cannot effectively explore a given topic that is outside their background knowledge. For example, a person with expertise in geography can explore a related topic better than a person without such expertise. In contrast, large language models (LLMs) can leverage their vast background knowledge to effectively play the role of a geography expert in a conversation. Therefore, it is crucial to explore automated approaches that can generate simulated conversations, reducing the dependency on human annotators and making the process more efficient and scalable.
User simulation is an important emerging research frontier for conversational search development and evaluation [7,28], where the focus is mainly on simulating user behavior under a certain condition, such as responding to the system's actions [47], answering clarifying questions [42], and giving feedback on the system's answers [28]. The main drawback of existing research on user simulation is its reactive nature, where the simulated user merely responds passively to the system's utterances. In real-world scenarios, however, users' actions are a mix of proactive and reactive behaviors: users initiate and frequently guide conversations by posing questions that stem from their underlying information needs.
In this work, we aim to explore LLMs' effectiveness in simulating a proactive user who explores a pre-defined topic in a conversational setting. To this aim, we replicate the teacher-student conversational simulation adopted by Choi et al. [11] while replacing both human parties with LLMs, enabling us to effectively evaluate and compare the performance of LLMs with human annotators. This leads us to our first research question, RQ1: how can we employ LLMs to generate such simulated conversations effectively and automatically?
We answer this question by proposing a zero-shot LLM-to-LLM simulation framework where the student LLM aims to explore a topic by posing various questions and the teacher LLM's goal is to provide complete and correct answers to the questions.We implement both the student and teacher by zero-shot prompting GPT-4.
The usage of LLMs in this setting leads us to the next research questions, RQ2: how can we evaluate the role of LLMs in CQA simulation? and RQ3: how do LLM- and human-generated conversations compare? To address these questions: (i) We first conduct an extensive independent evaluation of the teacher, measuring its effectiveness in this task. To this aim, we conduct an extensive human evaluation task where the annotators compare LLM- and human-generated answers to the same questions side by side. (ii) We then evaluate the performance of the student. To this aim, we compare the patterns and question-asking behavior of the LLM and humans from various perspectives, discovering interesting patterns. For example, we find that LLM-generated questions lead to more topical coverage. (iii) Finally, we conduct extensive analyses to thoroughly examine the performance of the LLM by benchmarking state-of-the-art reading comprehension models on both datasets. We find that LLM-generated answers are generally lengthier and more comprehensive. They are also more consistent and fluent. Moreover, our human evaluation reveals that the LLM teacher is more accurate in providing correct answers. Upon benchmarking state-of-the-art reading comprehension models, we find that pre-trained models exhibit more effective performance on LLM-generated data. This efficacy may result from certain biases in the generated conversations and their enhanced consistency.
Overall, our contributions can be summarized as follows:
• We leverage LLMs to mimic human-to-human interaction in a CQA setting using zero-shot prompting. We prompt two LLMs to conduct a teacher-student simulation and propose an LLM-generated dataset, called SimQuAC.

Figure 1: A high-level view of the architecture of our simulation framework.

METHODOLOGY

2.1 Problem setting
Our experimental setup involves simulating an information-seeking conversation, where a student interacts with a teacher in a question-answering conversation. We adopt the setting established by the Question Answering in Context (QuAC) dataset [11], which serves as a widely recognized benchmark for evaluating the effectiveness of CQA models. The dataset revolves around discussions based on Wikipedia articles. It consists of conversations where a crowd-worker plays the role of a questioner (student) and engages in a conversation with another crowd-worker who acts as an answerer (teacher). Specifically, the teacher is given access to the entire Wikipedia section and aims to generate responses to questions posed by the student. To ensure fairness, the teacher's responses are limited to selecting the appropriate answer span from the article. In contrast, the student is only provided with the article's title and tries to use this limited information to ask relevant questions and explore the topic. As the student engages in the conversation, they explore the topic by asking questions, and the conversation unfolds accordingly.

Task formulation
As mentioned in Section 2.1, the conversation revolves around a Wikipedia article titled t. The student is only provided with limited access to information, including the section header h as the information need and the first paragraph of the main article b, which serves as the background information. The teacher, on the other hand, has access to additional information, including the full text of the section s. The conversation begins when the student raises an initial question q_0 and the teacher provides an answer, denoted as a_0. After receiving the answer from the teacher, the student continues to ask more questions until some stoppage criteria are met. Specifically, following previous work [11,37,49], instead of answering with free text, the teacher must select one or several contiguous spans from the text as the answer. Note that although limiting the LLM to selecting text spans restricts the teacher's ability to freely provide answers, it offers the advantage of simplified answer evaluation and prevents hallucination. This setting enables us to examine the proficiency of LLMs in tasks like CQA and reading comprehension (RC) by comparing their performance against existing methods.

Model framework overview
To better address RQ1, we propose an LLM-based framework. Figure 1 illustrates the overall architecture of our model, showcasing the interactions between the two LLMs. The entire process revolves around a Wikipedia page. The purple box on the right depicts the simulated student, named student Sim, while the orange box on the left depicts the simulated teacher, named teacher Sim. teacher Sim and student Sim contain several components to generate acceptable answers and questions, respectively. The process of generating the conversation starts with initializing student Sim with the instruction prompt Instruction S. The Instruction S prompt guides the student LLM in generating the first question q_0 in the question generation component. We then pass the generated question q_0 to the question validation component (σ_S). This component plays a critical role in ensuring the structural integrity of the generated questions. If it determines that the structure of the question is not acceptable, student Sim prompts the question generation component again to regenerate q_0. After that, we forward q_0 to teacher Sim, which concatenates it with the instruction prompt Instruction T, forming a combined input. This combined input is then fed to the answer generation component to generate the answer a_0. To ensure that the generated answers adhere to our defined setting (i.e., corresponding to one or multiple segments in the section text s), an answer validation component (σ_T) checks the validity of a_0. If a_0 is determined to be invalid, a teacher prompt selection component selects an appropriate prompt and passes it to the answer generation component to regenerate a_0. This cycle continues until a_0 is determined to be valid by σ_T, at which point it is passed back to student Sim.
Similarly, student Sim incorporates a prompt selection component to select the optimal prompt for generating the subsequent question. Once chosen, the prompt is transferred to the question generation module again, where it is employed to generate the next question q_i. This back-and-forth question-answering process continues until the stoppage criteria are met. In each turn, the generated question q_i and answer a_i are stored. Algorithm 1 shows the detailed simulation process of our model. In the following sections, we provide detailed explanations of each component to further elucidate its functionality.
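The back-and-forth loop described above (and detailed in Algorithm 1) can be sketched as follows. This is a minimal illustration only: the function names, the stoppage criterion, and the patience handling are assumptions standing in for the paper's LLM components, not its actual implementation.

```python
def simulate_conversation(ask, answer, valid_question, valid_answer,
                          select_student_prompt, max_turns=12, patience=4):
    """Sketch of the student/teacher simulation loop.

    ask(prompt) and answer(question, retry_prompt) are stand-ins for the
    student and teacher LLM calls, respectively.
    """
    conversation = []
    prompt = None  # first turn uses the base instruction only
    for _ in range(max_turns):
        # Student side: regenerate until the question is structurally valid.
        q = ask(prompt)
        while not valid_question(q):
            q = ask(prompt)
        # Teacher side: validate the span; retry up to `patience` times.
        a = answer(q, None)
        for _ in range(patience):
            if valid_answer(a):
                break
            a = answer(q, "Copy an exact span from the section text.")
        else:
            # Validation never succeeded: treat the question as unanswerable.
            a = "I cannot find the answer."
        conversation.append((q, a))
        # Possibly guide the next question based on the last answer.
        prompt = select_student_prompt(a)
    return conversation
```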

Teacher simulation
Answer generation. This component belongs to teacher Sim and is initialized with Instruction T in a zero-shot manner. Instruction T includes the instruction to copy exact spans from s to answer the given question, along with some information about the Wikipedia page, including the title t, background b, and section text s. We instruct teacher Sim to generate the sentence "I cannot find the answer." when s does not contain the answer. Additionally, to prevent the generation of excessively long answers that could potentially impede readability, we implement a two-step mechanism to control the length of the generated answers: (i) we specify in the prompt that the selected span should not exceed a maximum length; (ii) we include the statement "Remember that you should select the shortest possible span from the text," at the end of each question, making teacher Sim itself decide on the length of the answer within the maximum limit.
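The teacher-side input can be assembled roughly as below. Everything except the two quoted statements from the text is an assumption (the function name, wording, and the word limit are illustrative, not the paper's exact prompt).

```python
def build_teacher_input(title, background, section, question, max_words=40):
    # Hypothetical assembly of Instruction T plus the per-question
    # length reminder; only the quoted phrases come from the paper.
    instruction = (
        f"You are a teacher. Copy exact spans from the section text to "
        f"answer questions. If the section does not contain the answer, "
        f'reply "I cannot find the answer."\n'
        f"Title: {title}\nBackground: {background}\nSection: {section}\n"
        f"The selected span must not exceed {max_words} words."
    )
    reminder = ("Remember that you should select the shortest possible "
                "span from the text.")
    return f"{instruction}\n\nQuestion: {question}\n{reminder}"
```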

Answer validation & regeneration (σ_T). Rather than relying solely on the instruction prompt Instruction T for one-time answer generation, we adopt an iterative model σ_T to validate and refine the generated answers in succession, ensuring they comply with Instruction T. This component reminds the answer generation component of the validation criteria and prompts teacher Sim to generate an answer that aligns with the given section. We define that a valid answer a_i should include exact copies of contiguous spans in the section text s, or it should be the phrase "I cannot find the answer," if the question q_i cannot be answered from the text. Therefore, we verify an answer's validity based on two criteria: (i) whether a_i contains one or multiple exact copies of text spans in s or is "I cannot find the answer"; and (ii) whether a_i is copied from the section text s rather than the background b.
Table 1: The instruction prompt Instruction S given to the student: "In this task, I am a teacher and have a document, you are a curious student who wants to explore this document by asking questions. The main objective is to learn most of the documents that I have. I will explain to you the topic and background knowledge of the document. Then I will give you the title of the document and you should ask questions about this title one by one. When you ask a question, I give you the answer, and then you ask your next question. I'm only allowed to find the answer to your questions from this document, so if I cannot find the answer, I will say "I cannot find the answer, please ask your next question". You shouldn't ask questions that can be answered from my previous answers to your previous questions. You should sometimes ask follow-up questions from my previous answers."

We follow the steps below to address the two validation criteria. First, to address criterion (i), we conduct a simple text search to check whether a_i (or each sentence of a_i) appears in s. Notably, we notice that most of the time LLMs do not copy the text inside brackets and neglect extra white spaces within the text; we therefore normalize both the answer and the section text before matching. To address criterion (ii), we append the prompt "Please answer from the given section not the given background description," to remind σ_T of this criterion. We continue these steps of validation and regeneration until the generated answer satisfies both validation criteria. Finally, once the valid a_i has been confirmed by σ_T, it is passed on to student Sim, which utilizes a_i to formulate the next question q_{i+1} in the conversation. However, there are cases where the loop continues for an excessive number of iterations. We terminate the loop in such cases, assuming that teacher Sim fails to find the answer in s or that the question is not answerable. Similar to QuAC, we set the answer to such questions to "I cannot find the answer." This is necessary to prevent an infinite loop and to ensure that the system remains efficient and responsive.
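The span check for criterion (i) can be sketched as a normalized substring search. The exact normalization rules below (dropping bracketed content, collapsing white space, case folding) are assumptions motivated by the observation that LLMs skip bracketed text and extra spaces; the paper does not spell them out.

```python
import re

def normalize(text):
    # Assumed normalization: drop bracketed/parenthesized content and
    # collapse white space, since the LLM tends to skip both when copying.
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def is_valid_answer(answer, section):
    # Criterion (i): the answer is the refusal phrase, or every returned
    # span is an exact (normalized) copy of a contiguous span in s.
    # Matching against the section text only also enforces criterion (ii),
    # that the span is not taken from the background b.
    if answer.strip() == "I cannot find the answer.":
        return True
    spans = [normalize(s) for s in answer.split("\n") if s.strip()]
    sec = normalize(section)
    return bool(spans) and all(s in sec for s in spans)
```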

Student simulation
Question generation. To simulate the student, we prompt the question generation component of student Sim in a zero-shot manner. With Instruction S, we instruct student Sim to explore the given information (h and b) by posing questions, under the assumption that it does not possess knowledge of s. As shown in Table 1, we include the topic t and background b as well as the section header h in Instruction S to ensure that student Sim has some basic knowledge about the given topic.

Question validation (σ_S).
To ensure that an LLM-generated question q_i is structurally sound, we employ a validation step called σ_S. This component serves the purpose of verifying and validating the syntactic correctness and coherence of the generated question. We observe that while q_i is supposed to be exactly one question in our setting, the LLM sometimes generates multiple questions in one go. To address this issue, we consider a question valid if it adheres to the following criteria: (i) it should not exceed 25 words in length and (ii) it should not contain a newline character or enumerated items (e.g., 1, 2, 3). This simple yet effective validation helps filter out lengthy and intricate questions, including those containing multiple sub-questions.
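The two criteria translate directly into a small check. A sketch (the enumeration pattern is an assumed interpretation of "enumerated items"):

```python
import re

def is_valid_question(question):
    # A question is valid iff it is a single, reasonably short question:
    # (i) at most 25 words; (ii) no newline character and no enumerated
    # items such as "1." or "2)".
    if len(question.split()) > 25 or "\n" in question:
        return False
    return not re.search(r"(?:^|\s)\d+[.)]\s", question)
```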
Prompt selection for student. As the conversation progresses, there may be instances where the generated question q_i remains unanswered from the given text s despite being relevant to the information need h and topic t. For instance, students tend to ask very specific follow-up questions that cannot be answered from s (e.g., "Was Newsom's mayoralty generally well-received by the citizens of San Francisco?"). To address this issue, it is crucial to continuously assess the ability of the teacher simulator to answer the generated question q_i and make necessary adjustments to the student prompt p_i to enhance the quality of the questions. The refined p_i aids the question generation component in generating questions that can be answered from the given information s. For instance, if the response a_i is "I cannot find the answer," there is a higher chance that the subsequent question q_{i+1} might be overly specific and not directly answerable from s. To solve this issue, the prompt selection component randomly selects one of the following guiding prompts as p_i and passes it to the question generation component. These guiding prompts include: (i) Ask a general question and do not ask a too specific question; (ii) Ask a question starting with where, when, or who; (iii) Ask a question about what is interesting in this article; (iv) Ask a question about another aspect of the topic. By utilizing these guiding prompts, we can effectively prevent the generation of overly specific questions and guide student Sim by offering additional clues and information. This approach allows for more efficient exploration of the given information need h by student Sim, ultimately enhancing its overall understanding.
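The four guiding prompts lend themselves to a direct sketch. The trigger condition and function name below are assumptions; only the prompt strings come from the text.

```python
import random

GUIDING_PROMPTS = [
    "Ask a general question and do not ask a too specific question.",
    "Ask a question starting with where, when, or who.",
    "Ask a question about what is interesting in this article.",
    "Ask a question about another aspect of the topic.",
]

def select_student_prompt(last_answer):
    # After an unanswerable question, steer the student away from overly
    # specific follow-ups with a randomly chosen guiding prompt;
    # otherwise keep the base instruction unchanged.
    if last_answer.startswith("I cannot find the answer"):
        return random.choice(GUIDING_PROMPTS)
    return None
```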

TEACHER EVALUATION
In this section, we describe our experimental methodology for evaluating the performance of teacher Sim from various perspectives, addressing RQ2 and RQ3 from the teacher's perspective. First, we describe the data source used for teacher evaluation. We then introduce the human evaluation process for the teacher, with a particular focus on assessing the generated answers by comparing them against human-generated answers.

Experimental setup
Data for evaluating teacher Sim. To simulate the teacher and ensure a fair comparison between the LLM- and human-generated answers, we keep the conversation topics and questions consistent across the comparison. In detail, we randomly select 50 conversations from the training set of QuAC [11]. From each conversation in the sampled data, we borrow the topic information and all associated questions. We then pass the questions to our teacher Sim to generate answers and compare them with the original answers from QuAC.
Parameters. In our experiments, we adopt GPT-4 [27] as our base teacher and student LLM. In preliminary experiments, we explored using other LLMs, such as GPT-3.5 and LLaMA [48], as the teacher. However, we found that GPT-4 is the only LLM that can copy an exact segment of the text as an answer in a zero-shot manner (we later discuss this as a direction for future work in Section 6). Other models failed at this task by generating either broken or free-text sentences that did not satisfy our requirements. In our model, we set the patience parameter of σ_T to a fixed value of 4, meaning the teacher validation loop breaks after a maximum of 4 iterations.
Human evaluation. To evaluate the performance of the teacher in our task, we conduct a human evaluation on the professional crowdsourcing platform Prolific. We ask the crowd-workers to compare the answers generated by teacher Sim (i.e., answer Sim) with the answers of QuAC (i.e., answer QuAC) in terms of correctness, completeness, and naturalness. We explain each aspect in detail:
• Correctness aims to determine whether the selected text span accurately serves as a correct answer to the question, based on the context of the conversation.
• Naturalness measures the fluency and human-likeness of a text span. Although both QuAC and teacher Sim contain a selected text span as a response, we observe that in many QuAC cases, the selected spans are unnatural and do not form complete sentences.
• Completeness measures whether the provided answer is complete and comprehensive. It is important to note that an answer can be correct but incomplete. For example, if the question is about the albums of an artist, a more complete answer is one that lists more albums, if not all.
Additionally, we ask the crowd-workers to indicate which system (the human in QuAC vs. teacher Sim) they would prefer to interact with, aiming to capture the overall quality of the generated data in a conversation. We also ask them to provide a short statement justifying their preference.
Crowdsourcing task design. We design a crowdsourcing task for the assessment of two conversations. The annotators begin by comparing the responses from both systems for each question. We display the background information b and the section text s on the left side of the page. On the right side, we include each question along with the simulated answer (answer Sim) and the original QuAC answer (answer QuAC). For each annotation aspect, we ask the annotators to indicate which system is better by choosing from four options, namely, "System A," "System B," "Neither A nor B," and "Both A and B." The annotators can easily locate the selected text spans by clicking on the answers; the corresponding text is then highlighted, enabling them to compare the two spans efficiently and easily. Note that we do not ask the annotators to evaluate questions whose answers from answer QuAC and answer Sim are identical. However, we still include them in the interface as they contribute to the context of the conversation. Also, when one of the answers is "I cannot find the answer," we only ask the annotators to evaluate its correctness, as the other metrics cannot be evaluated in these cases.
Annotation and quality check. We randomly sampled 50 conversations from the two datasets and divided them into 10 batches, each containing five conversations for evaluation. To ensure reliable assessments, a minimum of three crowd-workers evaluate each conversation independently. We consider one system to have won over the other when the majority of the crowd-workers choose it. However, we acknowledge that there may be instances where the two systems perform equally well. In such cases, no system receives the majority vote, leading to a tie. In Table 2, we provide several cases where the LLM answer answer Sim wins over the human answer answer QuAC under different aspects.
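The majority-vote rule above can be sketched as follows (function name hypothetical; a system wins only with a strict majority, otherwise the comparison is a tie):

```python
from collections import Counter

def decide_winner(votes):
    # votes: per-annotator choices, e.g. ["A", "A", "B"].
    # A system wins only with a strict majority; otherwise it is a tie.
    system, top = Counter(votes).most_common(1)[0]
    return system if top > len(votes) / 2 else "tie"
```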
To avoid any position bias in the annotations, the QuAC and teacher Sim examples are randomly assigned to the positions of System A and System B for each conversation. Also, to ensure English proficiency, we made the task visible only to native English speakers. Additionally, before starting the annotation task, we asked the crowd-workers to complete an onboarding test consisting of questions about the task itself (e.g., (i) What does "System A is correct" mean? and (ii) Is a correct answer always natural?). We also provided around 10 sample annotations for the crowd-workers to refer to. Upon completion, we evaluated their responses, and only those who answered at least 75% of the onboarding questions correctly were allowed to start the main annotation task. This approach helps guarantee that crowd-workers are adequately prepared and knowledgeable before undertaking the annotation tasks. Moreover, we manually checked the consistency of the preference justifications with the labels by reading the open comments. We noticed that in some cases (7%) they did not match, so we removed those annotations from our dataset.

Experimental results
In this section, we evaluate the performance of teacher Sim .
Answer comparison: QuAC vs. teacher Sim. We report the performance on 359 questions extracted from the 50 sampled conversations. For 77 questions (21.4%), answer Sim is identical to answer QuAC. Furthermore, for 106 questions (29.5%), there is an overlap between answer Sim and answer QuAC, indicating that one is a substring of the other. For 176 questions (49.0%), answer QuAC and answer Sim do not overlap. Notably, for 41 questions, teacher Sim returns more than one segment from the text as the answer. The statistics comparing answer QuAC and answer Sim can be found in Table 3.
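The three buckets above (identical, substring overlap, no overlap) reduce to a simple string comparison. A sketch, assuming raw span strings; the paper may apply additional normalization before comparing:

```python
def classify_pair(sim_answer, quac_answer):
    # Bucket a pair of answers to the same question: identical spans,
    # one a substring of the other, or no overlap at all.
    a, b = sim_answer.strip(), quac_answer.strip()
    if a == b:
        return "identical"
    if a in b or b in a:
        return "overlap"
    return "disjoint"
```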
Answer-level human evaluation. We report the results of the teacher Sim human evaluation (Fleiss' κ = 0.4365) in Table 4. The results show that teacher Sim outperforms the human teacher of QuAC in terms of all question-based metrics by a large margin. Additionally, we see that the annotators prefer the answers provided by our teacher Sim over answer QuAC in 87.7% of the topics. teacher Sim answers exhibit enhanced accuracy and naturalness, owing to the significant number of incomplete answer spans in QuAC, which lead to grammatically incorrect sentences (e.g., "platinum.Thank U, the', ' "Tori Amos on the 5 and a Half Weeks"). It is also noteworthy that we allow teacher Sim to select multiple spans from the text to provide more complete answers to questions when necessary, something that is missing in the QuAC training set and only available in its test set [11]. Furthermore, as our task limits the LLM to answering questions from the given text, the risk of hallucination is greatly decreased and answers are verifiable, making LLMs more reliable. These findings are in line with Faggioli et al. [15], showing the potential of LLMs to replace crowdsourcing in annotation and simulation tasks, addressing research question RQ3.
Conversation-level human evaluation. We further compute Preference as reported by the annotators, who indicate which of the two systems they would prefer to interact with. We see in Table 4 that answer Sim is the winner in terms of conversation-level annotator preference, with 87.7% of annotators preferring answer Sim over answer QuAC. This indicates the promising potential of LLMs for engaging in a conversation, as long as they are sufficiently informed about the task and certain verification steps are employed. We follow Siro et al. [43] and cluster the open-ended justifications provided by the annotators into different categories to gain more insight into aspects of quality that may be overlooked in our human evaluation. In our analysis, we find that most of the comments mention the three aspects that we include in our annotation task (i.e., correctness, naturalness, and completeness). We also find comments that can be classified into seven new categories, namely, clarity, coherency, directness, comfort, trustworthiness, factuality, and conciseness. We see that many annotators found answer Sim answers more factual and the conversations more comfortable. Interestingly, even though answer Sim answers are lengthier on average, some annotators justified their preference by their conciseness.

SIMULATION EVALUATION
To provide a more comprehensive evaluation of RQ2 and RQ3, we assess the performance of the LLM simulation in comparison to human performance. We first introduce an LLM-based simulated dataset, SimQuAC. Furthermore, we report the results of state-of-the-art reading comprehension methods on the two datasets to shed light on the quality and difficulty of the simulated dataset.

SimQuAC dataset
We first introduce our dataset named SimQuAC for simulation evaluation, using the simulation framework described in Section 2.
To collect SimQuAC, we used GPT-4 to implement student Sim and teacher Sim. We randomly select 342 conversations from the training set of QuAC and simulate 334 conversations using the unique topics from this sample. SimQuAC consists of 4,005 questions with an average of 1.32 answer spans per question. The statistics of SimQuAC are presented in Table 5, alongside those of the original QuAC conversations.

Student evaluation
Due to the nature of a student's role in a conversation, which involves asking questions and exploring a topic, it becomes challenging to define an objective metric that determines which model is "better." Therefore, our emphasis lies in highlighting the distinctions between the behavior of two systems by contrasting their linguistic characteristics from various aspects.
Question comparison: QuAC vs. student Sim. Table 6 presents a sequential collection of questions from conversations on the same topic in the QuAC and SimQuAC datasets. We observe that GPT-4 tends to ask more detailed and lengthier questions than humans. Additionally, it is worth noting that the human student in QuAC ceases asking questions after the fourth one, while the simulated student in SimQuAC continues to pose additional queries.
Coverage. We assess the ability of the two students to explore a topic by comparing how much of the section text s is covered by the answers provided to the questions posed. We plot the distribution of coverage in Figure 2a. We observe that SimQuAC questions cover a significantly (two-tailed t-test; p-value < 0.001) larger portion of the text (mean = 0.365; std = 0.163) compared to QuAC (mean = 0.238; std = 0.122), suggesting that careful prompting of LLMs can lead to a diverse and comprehensive set of questions in a conversation.
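Coverage as described can be computed from the answer spans' character offsets into s. A sketch, assuming (start, end) offsets per span; the paper does not state whether it counts characters or tokens:

```python
def coverage(section_length, answer_spans):
    # Fraction of section characters covered by at least one answer span;
    # overlapping spans are merged by collecting covered positions.
    covered = set()
    for start, end in answer_spans:
        covered.update(range(start, min(end, section_length)))
    return len(covered) / section_length
```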
Conversation flow. Next, we compare the questions posed in terms of how they shape the flow of the conversation. Our objective in this experiment is to evaluate the naturalness of the conversation flow and the smoothness of topic transitions. We hypothesize that a conversation that strictly follows the sequential order of the content in s is less natural. To measure this, we assign an order to the questions based on the positions of their corresponding answers in s. For instance, consider questions A, B, and C. To determine their order, we examine the text spans of their respective answers and sort them by the start position of each answer. Say question B's answer is at the earliest position of s, followed by A's and then C's; the question order would then be {B, A, C}. To assess the sequential nature of the conversation flow, we compare the order of the questions in the conversation to the order of their corresponding answers in the document. In our example, the conversation flow would be considered completely linear if the questions were posed in the order {B, A, C}, matching the order of their answers in the document.
To evaluate the degree of correlation between the question order in the conversation and the corresponding answer order in the document, we calculate the Kendall rank correlation coefficient (KRCC) [1] for each conversation. KRCC measures the agreement between two ranked lists, where a lower value indicates greater distance between them. In our case, the lower the value, the less sequential the conversation flow. Figure 2b plots the distribution of the two datasets in terms of KRCC. We can see that the average KRCC value is lower for SimQuAC than for QuAC, indicating that the student of QuAC poses questions in a more sequential order than student Sim. This suggests that student Sim tends to explore the topic by jumping from one part of the text to another. While there is no indication of which order is more natural, there is a clear difference in behavior. It is noteworthy that a more random question-posing behavior can lead to more challenging datasets, as it prevents models from learning such a biased student behavior.
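For the running example, with document order {B, A, C} and conversation order {A, B, C}, KRCC can be computed without external libraries as follows (a plain tau implementation without tie handling, which suffices here since every question has a distinct answer position):

```python
def kendall_tau(conversation_order, document_order):
    """Kendall rank correlation between two orderings of the same items.

    1.0 means identical order (a fully linear conversation flow);
    -1.0 means fully reversed.
    """
    pos = {item: i for i, item in enumerate(document_order)}
    ranks = [pos[item] for item in conversation_order]
    n = len(ranks)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ranks[i] < ranks[j]:
                concordant += 1
            elif ranks[i] > ranks[j]:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```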

Reading comprehension benchmarking
To gain a deeper understanding of the distinctions between the human-generated (QuAC) and the LLM-simulated (SimQuAC) data, we evaluate the teacher model with several pre-trained discriminative and generative reading comprehension baselines. These models are pre-trained on the SQuAD dataset [37], and we test them on both datasets directly, without further fine-tuning. To ensure a fair comparison with the QuAC subset, we limit SimQuAC to a maximum of 3 questions per conversation.

Table 7: Experimental results of reading comprehension models on QuAC and SimQuAC in terms of precision (Pre.), recall (Rec.), F1-measure (F1), and exact match (EM). '-b' refers to the '-base' variant of the models, while '-l' refers to the '-large' variants. All numbers are shown in percentages.

The results demonstrate that, in comparison to QuAC, most models exhibit superior overall performance when tested on SimQuAC, suggesting that the LLM-simulated data may provide a more favorable context for these models. It is noteworthy, however, that the EM score on SimQuAC is lower than on QuAC. This discrepancy can be attributed to the fact that the answers generated by the LLM in SimQuAC tend to cover longer spans than the answers in QuAC, making exact matching more challenging. Moreover, SimQuAC contains more questions with no answer; the pre-trained models always output a span rather than predicting no answer, which further lowers the EM score. Finally, our observations reveal the superior performance of generative methods, such as T5, compared to discriminative methods, such as BERT. This finding emphasizes the importance of utilizing generative LLMs for this particular task.
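For reference, the EM and token-level F1 metrics in this comparison follow the standard SQuAD convention of normalizing both strings (lower-casing, stripping punctuation, articles, and extra whitespace) before comparison. A self-contained sketch of that convention:

```python
import re
import string
from collections import Counter


def normalize(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1(pred: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Because longer predicted spans add extra tokens to the denominator of precision while an exact string match becomes much less likely, this normalization explains why lengthier SimQuAC answers can hold up on F1 yet score lower on EM.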

RELATED WORK
5.1 Conversational question answering
CQA requires the ability to correctly interpret a question in the context of previous conversation turns [53]. Under this context, modern CQA systems can be divided into two types: sequential knowledge-based question-answering (KB-QA) agents [12,20,39] and conversational machine reading comprehension (CMRC) systems [19,25,35,38,52]. In sequential KB-QA systems, agents need to search a database for the appropriate information to generate the answer. In this paper, we focus on the CMRC setting, where the conversation revolves around a given article and the answers are typically a span in the given resource. To this end, several datasets, such as CoQA [38] and FlowQA [19], have been proposed. Among the CMRC datasets, QuAC contains over 14K crowdsourced QA dialogs [11]. This dataset has a student pose a sequence of free-form questions to learn as much as possible about a hidden Wikipedia article, while a teacher is hired to find the answer to each question in the text. Following this line, tasks such as reading comprehension [19,32,35,55], answer ranking [34], and question generation [22,51] have been adopted to measure performance at both the student and teacher levels.

5.2 User simulation
While user simulators have been studied extensively in the information retrieval (IR) community [8,10,26], including applications such as simulating user satisfaction for the evaluation of task-oriented dialogue systems [44] and recommender systems [2,54], they are often limited to reacting to a system's action. The emergence of LLMs provides an opportunity to make user simulation more realistic. LLMs are well suited to human simulation due to their remarkable ability to process text in natural-language format, and they can generate coherent and contextually appropriate language that closely resembles how humans communicate [29,30,56]. One such application is using LLMs as evaluators to mimic human evaluation, which has proven highly effective in various contexts [3,16,31,33,41,57]. For instance, Guo et al. [16] compare ChatGPT with human experts by collecting tens of thousands of comparison responses from both sources. Tan et al. [45] assess the performance of ChatGPT as a KB-QA system using its own knowledge. Another common application of LLMs in simulation is leveraging them as an annotator-free tool for data augmentation [5,6,17,18,23,46,50]. Sekulic et al. [42] employ GPT-2 and propose an evaluation framework based on mixed-initiative conversations. Owoicho et al. [28] go one step further and use GPT-3.5 to simulate a user that can also provide feedback on the relevance of a returned document in a conversational search setting. In the context of document re-ranking, Askari et al. [5] prompt LLMs to generate synthetic training data for cross-encoder re-rankers. Most recently, Hu et al. [17] adopt LLMs as user simulators in a task-oriented dialogue system. To the best of our knowledge, our work is the first to utilize LLMs as annotator-free teacher-student simulators in a CQA system, where the student takes a proactive role in exploring a topic.

CONCLUSIONS AND FUTURE WORK
We explore simulating human-to-human conversations using zero-shot prompting of LLMs in a question-answering setting. Our framework involves two GPT-4 models interacting on a topic: one acts as the student, generating questions based on background knowledge, and the other acts as the teacher, finding answers within a text on the given topic. To assess the system, we first evaluate the teacher's performance through both automated methods and human assessment. We then compare conversations generated by the LLM with those generated by humans for the student-level evaluation. In summary, our investigation highlights the potential of LLMs in facilitating interactive and informative retrieval experiences.
Despite the effectiveness of our framework, several limitations persist, which point to avenues for future research. First, according to our findings, only GPT-4 consistently follows the instructions to generate reasonable conversations, constraining the overall effectiveness of the pipeline. Moreover, language models can exhibit various biases when used for such simulation, so it is essential to develop methods to mitigate these biases. Finally, although we have devised prompting strategies to mimic human interaction, the manual construction of instructions can be time-consuming. Future work should explore more advanced and efficient automatic prompting strategies to enhance the system.
Our work revolves around utilizing LLMs to simulate users. Given the recent emergence of LLMs and the vast interest in using them for various research directions, we believe that pursuing such a direction is necessary, as it unveils the potential of LLMs while at the same time exhibiting their potential ethical considerations. Below we list some of these concerns that need to be considered and addressed in this research area:
• Bias and discrimination: LLMs are biased towards their training data. The simulated data could in turn carry the same biases and further propagate stereotypes and discrimination.
• Misrepresentation: using an LLM to simulate users introduces certain biases in the type of users being represented. The biases that exist in the data the LLM is trained on would be reflected in the simulated user.
• Transparency and accountability: the decision-making process within LLMs can be opaque, making it challenging to understand how or why a particular simulated conversation is generated. This lack of transparency can lead to ethical challenges, particularly in contexts where a clear justification for a decision is required.
• Environmental impact: the training and operation of LLMs consume significant computational resources, contributing to energy consumption and potentially having a negative environmental impact.
While simulating users using LLMs has various advantages, it must be approached with careful consideration of the potential ethical implications.

Example answers for the three evaluated aspects (content of Table 2, flattened from the original table layout):
Correctness — How old was he when he went on pilgrimage? answer QuAC: "In 1897,"; answer Sim: "He was twenty-eight, had been married ten years, and had an infant son with another child on the way."
Completeness — What shows did David Frost have? answer QuAC: "Sunday morning interview programme Breakfast"; answer Sim: "Sunday morning interview programme Breakfast; Through the Keyhole; Al Jazeera English."
Naturalness — Did he perform in the later 30s? answer QuAC: "Agency. Mills though continued to record Ellington"; answer Sim: "In 1937, Ellington returned to the Cotton Club which had relocated to the mid-town Theater District."

Figure 2: Comparison between the students of QuAC and SimQuAC in terms of (a) topic coverage and (b) conversation flow.

Table 1: The template for constructing Instruction T (left side) and Instruction S (right side). The variables inside "[ ]" are filled based on the input Wikipedia page. Instruction T reads: "In this task, you will be given a text about the topic explained above. You will answer my questions from this text. Please remember that you cannot generate the answer on your own but should only copy a continuous span from the original text, and the copied answer should not exceed 40 tokens. If you cannot find the answer in the text, please generate 'I cannot find the answer'."
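The bracketed variables in such a template can be filled with simple string substitution. A hypothetical sketch, where the placeholder name, helper name, and topic value are illustrative rather than the paper's exact identifiers:

```python
def fill_template(template: str, values: dict) -> str:
    # Replace each "[NAME]" placeholder with its value from the Wikipedia page.
    for name, value in values.items():
        template = template.replace(f"[{name}]", value)
    return template


# Illustrative placeholder and value; the paper's templates may differ.
instruction_t = fill_template(
    "The topic is [TOPIC]. In this task, you will be given a text about "
    "the topic explained above. You will answer my questions from this text.",
    {"TOPIC": "Virginia Woolf"},
)
```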

Table 2: Examples of cases where answers generated by teacher Sim win over the original QuAC answers in each aspect.

Table 3: Statistics of the comparison between answer QuAC and answer Sim under different conditions, based on answer span and type. "I cannot find the answer" responses are denoted 'None', and the single-span answers of answer Sim are denoted 'single'.

Table 4: Results of the pairwise human evaluation of answers generated by the simulated teacher (answer Sim) compared to the original QuAC answers (answer QuAC). Each cell reports the percentage of cases where the three human annotators agreed that either answer QuAC or answer Sim wins. We also report the percentage of ties, where the annotators disagreed on a winner.

Table 5: Statistics of the dataset collected by simulating conversations between teacher Sim and student Sim.

Table 6: The questions from QuAC on the "Talland House (1882-1894)" section of the "Virginia Woolf" topic, together with the questions generated by student Sim:
1) Where is Talland House located?
2) Did Virginia Woolf live in Talland House during the period of 1882-1894?
3) Who owned Talland House during this period of 1882-1894?
4) What is the architectural style of Talland House?
5) Did any notable events take place in Talland House during the period between 1882-1894?
6) What impact did living in Talland House have on Woolf's later work?