Cooking with Conversation: Enhancing User Engagement and Learning with a Knowledge-Enhancing Assistant

We present two empirical studies to investigate users’ expectations and behaviours when using digital assistants, such as Alexa and Google Home, in a kitchen context: First, a survey (N = 200) queries participants on their expectations for the kinds of information that such systems should be able to provide. While consensus exists on expecting information about cooking steps and processes, younger participants who enjoy cooking express a higher likelihood of expecting details on food history or the science of cooking. In a follow-up Wizard-of-Oz study (N = 48), users were guided through the steps of a recipe either by an active wizard that alerted participants to information it could provide or a passive wizard that only answered questions posed by the user. The active policy led to almost double the number of conversational utterances and 1.5 times more knowledge-related user questions compared to the passive policy. It also resulted in 1.7 times more knowledge being communicated than the passive policy. We discuss the findings in the context of related work and reveal implications for the design and use of such assistants for cooking and other purposes such as DIY and craft tasks, as well as the lessons we learned for evaluating such systems.

Overall, our study provides novel findings for knowledge-grounded conversational information assistants. It suggests that users are interested in an agent capable of providing background knowledge. We also find that an active policy is an important design factor for an assistant, leading to longer conversations and significantly increasing the amount of knowledge transferred.
This research benefits developers and researchers working on conversational agents, offering insights into user expectations and the impact of interaction policies in the context of cooking assistance. Smart device manufacturers and designers of cooking assistants can find valuable guidance for enhancing user experience and knowledge transfer in this domain. Additionally, academics in natural language processing and human-computer interaction can leverage the study's methodology and findings for further research.

RELATED WORK
The presented work builds on and is influenced by a growing body of related research. We summarise this in two main subsections. The first reviews how humans interact with conversational assistants and what is known to influence their behaviour; the second focuses on the kitchen domain by reviewing research contributions offering assistance in this context.

Interacting with Conversational Assistants
Extensive research has explored the dynamics of human interaction with conversational assistants and the many factors that shape these interactions. Investigations span diverse contexts and tasks, including practical ones such as booking flights [11] or holidays [40], as well as informational tasks such as finding accounts of heroism or evaluating the pros and cons of medical treatments [42]. A crucial aspect of successful user-assistant interaction is the accurate comprehension of the user's intentions. In the literature, intent prediction has been extensively studied, with researchers such as Qu et al. [36] and Ghosh et al. [20] focusing on predicting dialogue acts and speech acts to understand user intent from a linguistic standpoint. Additionally, there are studies that concentrate on task-specific intents and information requirements [19]. Some approaches combine both linguistic and task-specific intents, as demonstrated by Shiga and colleagues in their modelling efforts [40].
The way people converse with an agent to complete a task has been shown to vary based on numerous variables, including the complexity [44] and difficulty [42] of the task. The characteristics of the agent are also important. Users tend to prefer agents whose conversational style matches their own [39]. Thomas et al., who investigated the effects of an agent's style in detail, did not find any single "best" style but reported several effects on aspects such as perceived effort, engagement, and the feeling of being understood by the agent [42]. When agents make reference to previous utterances, this leads to greater user satisfaction and a lower cognitive load [13]. Thus, the way an agent communicates can have a strong influence on the user experience.
Several scholars have examined the effect of the agent strategy or initiative (i.e., the interlocutor driving the conversation). Researchers refer to a mixed-initiative spectrum where an active or passive agent influences the characteristics of conversations and variables such as workload, user satisfaction, and learning. Active agents tend to result in more but shorter conversational turns, including more follow-up questions [16], whereas passive agents produce fewer turns that are often longer [16]. Active assistants can improve task performance [11] and be considered more engaging and truthful by users, especially in goal-oriented tasks [14]; active strategies have also been shown to foster engagement in social-bot contexts [22]. In some settings, such as tutoring, active systems tend to facilitate the learning process [12]. Passive agents, however, have been suggested to be better suited to simpler tasks, where users do not need assistance in describing what they want [37,44].
Not only does it seem that different tasks are suited to different interaction modes, but the evidence suggests that, depending on the initiative strategy employed, users will need divergent support [1,3]. Moreover, these opposing interaction modes come with their own challenges. Whereas passive agents need to decipher both needs and context from a user utterance [19], active agents are required to say the right thing at the right time: even a few seconds of silence after submitting a query to the agent can be perceived as an indicator of errors [35], and people may avoid using an active agent when an intervention is mistimed [2].

Assistance in the Kitchen
Assistance in a kitchen can take many forms. It is well documented, for example, that conversational assistants have functionalities that can be useful for cooking, such as setting timers or adding items to shopping lists [18,21]. Assistance can be provided via recommendations for meals that a user may like to cook [17]. Accomplishing this goal can involve leveraging the preferences of users who have similar profiles to the target user [24], as well as considering the inherent properties of the food itself [15]. Furthermore, such personalisation can be adapted to address specific dietary needs, such as weight loss objectives [43]. The generation of recommendations can be carried out through conversational interactions, with minimal differences in the interaction patterns regardless of whether the user interacts with the system by typing or speaking [4]. Other systems aim to provide assistance during the cooking process. For example, the AskChef system [30] provides a recommendation for every step in a recipe utilising either a smart speaker or a laptop screen, depending on the context and support needed.
A notable body of work has focused on how users interact to gain assistance, for example, when interacting with a human mimicking the "perfect conversational assistant" [19] or via WoZ studies [4,47]. Frummet provides a taxonomy of information needs that people might express to an assistant while cooking. This is an important step in understanding user needs; however, the empirical setup restricted the sample to a small and relatively homogeneous group. Inspired by other conversational search research, Vtyurina and Fourney explored initiative in the cooking domain, discovering that unrestricted communication environments exhibit many signals that could be processed by future assistants [47]. For example, implicit cues, such as "okay", express the intent to move to the next step, something that is not captured well by current systems [48]. This finding has since been confirmed in further studies [19,33] where users make extensive use of such cues when they communicate with an assistant. Studies in this space have employed varying initiative strategies. The human agent in Reference [19] acted as a passive collaborator, whereas the wizard in Reference [4] played a proactive role by explicitly prompting for information and asking clarifying questions.
Two primary points can be extracted from the related work: (1) a broad variety of assistance is possible, but we do not yet know what people need or expect from a digital kitchen assistant. To date, only small-scale studies with homogeneous samples exist [19]. Therefore, we aim to learn about the information users expect to attain from digital assistants in a cooking context (RQ1).
(2) While we know that agent interaction strategy influences user behaviour, we do not yet know what impact this has on the kinds of questions asked in the cooking context and the assistance received (or knowledge transferred) as a result (RQ2).

Results
The distribution of responses to each question is shown in Figure 1. The main takeaway from the graphic is that the participants can envisage all of the suggested functionality being provided by conversational kitchen assistants. The median response for all of the questions is higher than the mid-point of the Likert scale. A second observation is that the median responses are higher for some features than others. For example, features revolving around the cooking process (e.g., ingredients and their quantities, cooking temperature, cooking time) were scored consistently high, whereas the knowledge-grounded features, which did not appear in Frummet's study, were scored less highly. The knowledge-grounded questions also exhibited the largest variance in responses.
To dig deeper into this, we examined the data to determine whether demographic information influenced how participants responded. The findings suggest that different demographic groups have varying expectations regarding the types of assistance that should be provided. Visualising the data revealed that older participants (>45, n = 60) rated knowledge-grounded features lower than younger participants (≤45, n = 140). More precisely, younger people rated recipe history higher; this is confirmed statistically.2 Younger participants are significantly more interested in recipe history (U = 5679.0; p < .001) and science questions (U = 5061.5; p = .020) compared to those in the older participant group. Slight correlations were observed between reporting enjoying cooking and wanting to learn more about recipe history (r = .13; p = .062) and science (r = .17; p = .014).
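Between-group comparisons of ordinal Likert ratings such as these are typically made with the Mann-Whitney U test. The sketch below shows how the U statistic is formed; the ratings are invented for illustration and are not the study's data:

```python
from itertools import product

def mann_whitney_u(group_a, group_b):
    """Mann-Whitney U statistic: over all cross-group pairs, count how
    often a value from group_a exceeds one from group_b (ties = 0.5)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a, b in product(group_a, group_b))

# Hypothetical 5-point Likert ratings for the "recipe history" feature
younger = [5, 4, 4, 5, 3, 4, 5]
older = [2, 3, 2, 4, 1, 3]

u = mann_whitney_u(younger, older)
# U near len(younger) * len(older) means younger ratings dominate;
# a p-value would then come from the U sampling distribution.
print(u)
```

In practice a statistics library (e.g., scipy.stats.mannwhitneyu) handles the tie corrections and p-value; the direct pair-counting above is the underlying idea.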

METHODOLOGY: THE WIZARD-OF-OZ STUDY
The expectation for an assistant to not just handle procedural steps but also handle knowledge-related questions is the motivation for focusing on this previously unexplored aspect in the second study. We perform a user study simulating a cooking scenario where participants work through the steps of a recipe and are encouraged to converse with an agent and ask knowledge-grounded questions that come up along the way. It is important to note that, for practical reasons, participants did not physically cook the recipes during the study. The logistics of controlling this process would have been highly intricate, potentially prolonging the experiments and complicating an already challenging recruitment process. As detailed later, we made efforts to simulate the cooking process to closely resemble a naturalistic experience. We address the implications of this in the Limitations section.
Inspired by the literature, participants were randomly assigned to one of two conditions: passive, where the wizard simply responded to messages and questions from the participant; and active, where the wizard proactively interacted with participants, indicating that it possessed knowledge about the cooking steps and asking whether they were interested in learning more. Participants in both conditions were free to ask whatever questions they wanted, but there was no explicit need for them to ask questions.
The experimental condition (active vs. passive) was balanced such that each of the six recipes used featured in each condition with one of two wizards. That is, the two wizards had the same coverage in terms of recipes and conditions, totalling 24 conversations for each wizard. Participants were unaware they were interacting with a human until the post-experiment debrief. In the following subsections, we explain the methodology in detail, outlining the procedure (Section 4.1), participants (Section 4.2), and the justification for studying the recipes we did (Section 4.3). We then detail the Wizard-of-Oz setup (Section 4.4), including an extensive piloting process used to establish guidelines for wizard behaviour (Section 4.5). Finally, in Section 4.6, we outline the measures we used to quantify the characteristics of conversations, as well as the knowledge transferred from wizard to participant.

Procedure
Each experiment comprised the following steps: (1) Participants read the informed consent form. (2) If they agreed to the study conditions, then we provided them with specific task instructions and a short tutorial on how to use the chat interface. (3) After the tutorial, the participants opened the chat interface to receive a welcome message from the cooking assistant (wizard) and start the experiment.3 (4) After completing the task (i.e., working through the recipe), participants were thanked and debriefed before their participation was confirmed on the crowd-sourcing platform.
Participants could take as much or as little time as they wished for the study. The time taken ranged from 30 to 62 minutes.

Participants
As with the survey, participants were recruited via Prolific based on our experience that the portal provides access to heterogeneous and motivated participants who deliver high-quality data. We applied the same restrictions in our sampling process as the survey, accepting only native English speakers from the US or UK. This makes it possible to compare the findings and was necessary, since the recipes contain complex cooking procedures in English. Participants received between 9 and 11 GBP as compensation for their time. This ensured that participants received at least the minimum remuneration as defined by Prolific, and the majority received considerably more.
Forty-eight participants were recruited, 65% of whom identified as female, 33% as male, and 2% as non-binary, similar to the distribution in the survey. Most participants were aged between 45 and 54 (n = 16; 33.3%), followed by 10 (20.8%) between 25 and 34, 9 (18.9%) between 35 and 44, 5 (10.4%) between 18 and 24, and 5 (10.4%) between 55 and 64. Three participants (6.3%) were aged between 65 and 74 years old. Unsurprisingly, given the recruitment procedure and the subject matter and type of study, the sample was biased towards individuals who cook or have an interest in cooking.

Recipe Selection
The sessions were based around the steps involved in cooking six recipes from the SeriousEats4 website. These are ideal for our experiment, since they feature interesting ingredients and cooking procedures and are complemented by associated how-to resources, including the history behind the recipe and the science behind its methods. The recipes were chosen such that they contained a minimum of five steps and had at least one picture illustrating the steps. We assumed pictures would help with the simulation of the cooking process by allowing participants to imagine what the outcome of the steps would have looked like. The recipes were diverse in that they included both main meals and desserts, as well as omnivorous and plant-based dishes. We wanted the recipes to be appealing to a broad set of participants.

Wizard-of-Oz Setup
We utilised TaskMAD [41] as a platform for our experiments. The interface used by the participants provided a simple chat interface as well as contextual recipe information presented as sequential steps (see Figure 2). The wizard interface (Figure 3) presented wizards with information about the recipe and how-to in a structured manner. The interface also provided the ability to perform federated searches over custom external data sources. The resources included the SeriousEats website (how-tos), StackExchange Cooking5 (questions and answers), and Wikipedia (KILT) [34]. Pilot sessions demonstrated that these datasets would be sufficient to answer most questions.
To share the workload and minimise personality and learning effects, the experiments were conducted by two wizards. Extensive piloting (28 experiments and over 21 hours of conversation) familiarised the wizards with the interface and allowed behavioural guidelines to be established. After every test run, the wizards discussed their experiences and how they reacted, leading to a consistent response framework for both conditions. Numerous pilot sessions were necessary to ensure consistent interaction strategies and to establish saturation regarding the types of questions participants might ask and how to respond to them. In the following section, we outline how the pilot sessions informed the final study.

Lessons Learned from Pilot Conversations
The pilot sessions led to formalised procedures for the active condition that reflected the experiences and lessons learned in terms of how and when wizard interventions should be formulated. The active agent condition, which has the aim of provoking curiosity during a task-based context, blends a task-oriented setting [23,25,31] with more social chatbot settings where the aim is to promote user initiative and engagement [5,22]. There are differences between our setting and a socialbot. In socialbot contexts, the questions focus on eliciting a user's preferences or topic interest, whereas, here, the task and topic are fixed and questions aim to invoke users' interest.
The wizards took into account the factors that had sparked interest and interaction from the participants during the pilot conversations when formulating the guidelines. Specifically, they deliberated on how to create effective prompts and when it was suitable to present them. This resulted in two main strategies that encourage participants to reflect on the background knowledge necessary for each recipe step. The first was to make statements emphasising the importance of an aspect of the process, e.g., "Don't forget to add X as this is crucial to make the perfect Y" or "Notice that X is important here." The second strategy was to formulate a question that relates to required knowledge, e.g., "Why do you think apples should be put in a gallon-sized zip-top bag?" Both of these strategies are similar to those that have been shown to be effective in increasing user initiative in social chatbot settings [22].
In terms of when to actively intervene, the wizards agreed that an appropriate moment was when the participant transferred to the next recipe step. It was determined that an intervention would be especially beneficial when the recipe step description provided limited information in comparison to the corresponding how-to instructions for that particular action. This was determined based on a mapping illustrated in Figure 4. Wizards informed the participants about this additional knowledge in the following ways: "Having a hot dough has significant effects on the shape of the parisian gnocchi." (conv. 2, active cond.) or "Before going next, be aware of the impact cheese has on the gnocchi." (conv. 2, active cond.)

In the passive condition, the primary guideline was that wizards were only to react when explicitly asked a question, as was the case in Reference [19]. The wizards behaved intentionally personably in both conditions. According to existing research, for a bot to engage in social communication effectively, it needs to possess qualities such as empathy, supportiveness, and a genuine interest in the thoughts and ideas of its human interlocutor [29,38]. This included being friendly, positive, and polite. In the active condition, when participants correctly answered a question, this was intentionally praised to build confidence and encourage further questions to be asked. See, for example, "Oh wow, that's correct!! You really are an expert! [...]" (conv. 7, active cond.) or "Wow, seems that you're already an expert :-) You're right: [the wizard continued to provide more extensive details]" (conv. 14, active cond.). The wizards were also encouraged to express their own personal interest in the recipe and relevant facts, as in "The zip-top bag trick is great, right?" (conv. 15, active cond.) or "[the wizard provides background knowledge] So this is why you need to add flour directly all at once to the saucepan. Interesting, right? :-)" (conv. 14, active cond.). This approach pertains to personal disclosure and the concept of the "disclosure-reciprocity effect" [9], which has been observed in chatbots to result in users sharing more information than they would typically [28]. We anticipated a comparable transfer of curiosity in our study.
By the time the full study started, both wizards were intimately familiar with the recipes and were prepared for potential questions about techniques, chemical processes, and related information associated with recipe steps.

Measuring Conversations
To understand the impact of wizard behaviour on the resulting conversations, it was imperative to establish robust quantitative metrics for the conversational characteristics. In this section, we outline the metrics and justify our choices. Following the definition in Zamani et al. [49, p. 4], we define an utterance as a message that has been sent by either the wizard or the participant. An utterance can consist of multiple sentences. The resulting conversations comprise 1,396 utterances in total. To examine the kinds of questions asked by participants, we annotate utterances using an appropriate information needs taxonomy and derived a process to quantify the knowledge transferred by the agent as a result. The following subsections describe the annotation processes in detail.

4.6.1 Annotating Questions. To establish the kinds of questions users tended to ask, the utterances were annotated using the information needs taxonomy defined by Frummet et al. [19]. This taxonomy is useful, as it offers a means to describe the information needs that occur during real-life cooking sessions. Only utterances determined to be questions were classified. Each utterance (question) received one information need label from level 1 of the taxonomy, out of the 12 labels that distinguish between questions on ingredients, their quantities, the cooking process, and so on. The annotation was conducted by the lead author of this article, who derived the taxonomy in prior work. He is an information scientist, experienced in annotating both spoken and written conversational data using qualitative methods. We did not feel it necessary to annotate with more than one coder, since the reliability of the coding scheme and the coder's consistency in its application with other annotators were assessed in prior research. In that study, 10% of a comparable, albeit naturalistic, dataset was annotated by the same coder and another annotator. Together, they attained a Cohen's κ score of 0.75 [19]. To ensure consistency and reliable annotation in the current dataset, however, a subset of 50 randomly selected utterances was relabelled by the same annotator, resulting in a Cohen's κ score of 0.87, which is considered almost perfect agreement according to Landis & Koch [27]. It became clear very early in the annotation process that the categories at level 1 of the taxonomy were appropriate for our purpose, since the questions in our dataset and the phraseology employed by participants were very similar to those reported in the previous work. For transparency, we disclose that the annotator served as one of the wizards. The annotation was conducted without knowledge of which experimental condition corresponded to the transcripts. The process was based purely on the user utterances alone and took place several weeks after data collection. Consequently, we do not believe this introduces any bias to the process or the findings.
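The agreement figures above are Cohen's κ scores, which correct raw label agreement between two annotators for agreement expected by chance. A minimal sketch of the computation (the labels below are invented examples, not the study's annotations):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

coder_1 = ["Process", "Knowledge", "Process", "Amount", "Knowledge", "Process"]
coder_2 = ["Process", "Knowledge", "Knowledge", "Amount", "Knowledge", "Process"]
print(round(cohens_kappa(coder_1, coder_2), 2))  # 5/6 raw agreement, corrected down
```

Library implementations (e.g., sklearn.metrics.cohen_kappa_score) compute the same quantity; the direct form shows why κ is lower than raw agreement.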
The labels illustrated in Table 2 focus on two types of labels: Process and Knowledge. Process questions relate to the actions participants needed to take and, in our data, were typically phrased as what questions, for example: "Ok, what's the first step?" (part. 0) or "What do I do after I've made the pesto?" (part. 3). We did observe other kinds of formulations for process questions, such as "Can I use a knife?" (part. 20, passive) and "Does it [strainer] have to be a metal one?" (part. 21, passive). Knowledge questions, in contrast, sought knowledge in the form of explanations. These were mostly phrased as why or how questions. For example: "Why [should I use] Dijon mustard?" (part. 1) or "Agent: Egg yolks play an important role in this step. Part.: How?" (part. 8).
To better understand what kind of Knowledge questions were asked, we further annotated these questions with an additional label not present in the original taxonomy, which describes the type of knowledge being sought. These labels were Science, when participants wanted to know about the underlying mechanisms. For example: "What does blanching do?" (part. 5) or "How does reducing the temperature affect the duck instead of keeping it at a normal temperature and cooking for a shorter time?" (part. 13). These questions prompt the agent to provide knowledge that explains underlying scientific processes. History questions were different, for example: "What's the origin of a soufflé?" (part. 8) or "Is [the dish] French?" (part. 5). These questions triggered information about the origins of recipes and meals. Questions about Step Importance were phrased, for example, as follows: "Why should I put cream in my egg wash?" (part. 15). When this kind of question was asked, participants expected suitable explanations from the agent, which, in some cases, also involved providing scientific knowledge. The distinguishing criteria

Table 1. Examples of Knowledge questions by type.
History: Where is the dish from? | Where is cayenne pepper from? | What other types of soufflé are there? | Can you tell me the history of this chicken recipe. | When was bechamel sauce invented?
Step Importance: Why do we use cream of tartar? | Why do I flip the bag? | Why did you add tapioca starch? | Why coat the chicken in flour mixture? | Why seasoning lightly with salt before it goes into the oven?
Science: What does blanching do? | Will frying a second time change the nutritional value of the recipe? | What does macerate mean? | What does putting 3 tablespoons of marinade into the flour do?
between Science and Step Importance questions are that in Step Importance questions, participants explicitly ask why there is a need to perform a certain action, for example: "Why should I put cream in my egg wash?" This is not the case for Science questions, where participants do not ask for the reason a specific step or action needs to be performed. To enhance the reader's understanding of the types of questions that were asked and how these were phrased, further illustrative examples can be found in Table 1.

4.6.2 Quantity of Knowledge Communicated.
To establish the quantity of knowledge communicated by agents in the conversations, we counted what we refer to as information nuggets in wizard utterances. An information nugget is an atomic piece of information that the assessor considered useful or interesting and that had not previously featured in the same utterance [46]. To avoid bias, the utterances were annotated independently (i.e., free from conversational context), in a random order, and without any indication of the experimental condition.
The following example illustrates that this can be relatively straightforward, with all of the annotators (we employed three) agreeing that the utterance contains two nuggets of information: "That's a great question! Chilling ensures the dough is cold to start [...]". As a general rule, the annotators agreed to treat multiple adjectives with a separator, as in nugget 2, as a single nugget. Other utterances were less straightforward to define. In the next example, beginning "Tapioca is a starch, [...]", all three annotators counted differently.

To establish the consistency of annotation, the annotators each labelled 638 agent utterances with the number of information nuggets they believed each utterance to contain. Despite the differences in annotating granularity, the mean pairwise (pairs of annotators) Pearson's correlation between the counts of the three annotators was very high (r = 91.89%; max = 94.33%; min = 89.31%). To determine if it is possible to replicate the human annotations and automate the process, such that we can scale to the entire dataset, we provided a suitable prompt to GPT-3.6 This resulted in an average correlation of r = 78.83% with the human-applied counts and a mean absolute error of 0.71 (SD = 1.37) when the GPT-3 counts are compared to the average of the three human annotators. We conclude that this is a sufficient signal to use GPT-3 counts to test the amount of knowledge conveyed in utterances in the experimental conditions. Consequently, we used the counts provided for each utterance by GPT-3 on the full dataset as the basis for our analyses below.
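The consistency checks above reduce to two computations: the mean pairwise Pearson correlation between annotators' nugget counts, and the mean absolute error of the automatic counts against the per-utterance human average. A self-contained sketch with invented counts (not the study's data):

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical nugget counts for six utterances from three annotators
counts = {
    "ann1": [2, 0, 1, 3, 1, 2],
    "ann2": [2, 0, 2, 3, 1, 2],
    "ann3": [1, 0, 1, 4, 1, 2],
}
mean_r = sum(pearson(counts[a], counts[b])
             for a, b in combinations(counts, 2)) / 3

# MAE of hypothetical automatic counts vs. the human average per utterance
auto = [2, 0, 1, 3, 2, 2]
human_avg = [sum(v[i] for v in counts.values()) / 3 for i in range(6)]
mae = sum(abs(a - h) for a, h in zip(auto, human_avg)) / len(auto)
print(round(mean_r, 2), round(mae, 2))
```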

EXPERIMENTAL RESULTS
This section addresses RQ2. Our objective is to investigate how participants interacted with the wizard and whether the wizard's mode of interaction influenced user behaviour and the information transferred. We present our findings in four sections: In Section 5.1, we outline the general conversation statistics. Section 5.2 explores the types of questions that were asked during the conversations. Subsequently, we evaluate the wizard's responses and the extent of knowledge shared with the participants in Section 5.3. Finally, we examine the knowledge sources utilised to answer participant questions in Section 5.4.

Conversation Characteristics
In the first step, we focus on evaluating the overall characteristics of the conversations. Specifically, we analyse whether there were differences between the two interaction modes concerning the quantity and length of utterances, as well as the interactions between the user and the wizard.
The active condition yielded more utterances compared to the passive one: the agent and participants issued 1,005 and 923 utterances, respectively. In the passive condition, however, 989 utterances were gathered: 409 by the agent and 580 by the participants. In both conditions, agent utterances were significantly longer than participant utterances (active: U = 742,794). Thus, the active condition led to longer conversations overall; however, the number of user utterances relative to agent utterances and the length of user utterances remained similar regardless of the condition.
The transition graphs in Figures 5 and 6 visually depict the disparities in user and wizard interactions across the conditions. Modelling the dialogues in this way illustrates how the conversations moved from one type of user/agent utterance to another. The active condition is generally more connected, with several transitions (shown in red) that were not present in the passive condition.
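Transition graphs like these can be derived by counting, for each utterance label, which label follows it, then normalising the counts into probabilities. A sketch of that estimation over a toy, hand-labelled dialogue (the labels are illustrative, not the study's annotation scheme):

```python
from collections import Counter, defaultdict

def transition_probabilities(labels):
    """Estimate P(next label | current label) from one labelled dialogue."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(labels, labels[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

# Toy sequence of utterance types (A = agent, U = user)
dialogue = ["A:prompt", "U:knowledge-question", "A:answer",
            "U:process-question", "A:answer", "U:ack", "A:prompt"]
probs = transition_probabilities(dialogue)
print(probs["A:answer"])  # where conversations tend to go after an agent answer
```

Edges with non-zero probability in one condition but not the other correspond to the red transitions highlighted in the figures.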

Distribution of Information Need Types
Here, we examine the annotations based on Frummet et al.'s taxonomy [19]. As Table 2 shows, participants asked questions in both the active and passive conditions. Overall, however, 1.5 times more questions were asked in the active condition [309 vs. 211, χ² = 13.95, df = 1, p < .001], and information needs were 1.5 times more likely to be knowledge-related in the active than in the passive condition [37.86% vs. 25.12%, χ² = 78.59]. These findings suggest that when the agent hints that it possesses additional information, it effectively prompts users to ask questions. However, we wish to emphasise that agent interventions were far from guaranteed to result in questions. This is demonstrated by cases where participants were clearly not interested in discovering what the agent knew and replied that they wished to move to the next recipe step: "Agent: There are three possible ways of making [the sauce]. Part.: Next" (part. 5) or "Agent: The way in which you cut apples here is a crucial step when making apple pie. Part.: Next" (part. 12).

Since Knowledge questions were the most commonly asked question type, we examined these more closely to establish the types of knowledge-related questions participants asked. The bottom part of Table 2 illustrates that, in both conditions, most of the questions asked related to science, step explanations (= step importance), and history aspects. Whereas questions about the history of the recipe or ingredients were balanced across conditions, science questions were nearly five times and step explanations nearly two times more common in the active condition.
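Comparisons of condition-level question counts like the one above belong to the chi-square family of tests. The pure-Python sketch below assumes uniform expected frequencies across the two conditions, which is why its statistic (18.47) differs from the 13.95 reported for the paper's own test; it illustrates only the general shape of the computation.

```python
def chi2_goodness_of_fit(observed):
    """Chi-square statistic against a uniform expected distribution."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Question counts in the active vs. passive condition (cf. Table 2).
stat = chi2_goodness_of_fit([309, 211])
# With df = 1, the critical value at p = .001 is 10.83, so an imbalance
# of this size between conditions is significant at that level.
print(f"chi2 = {stat:.2f}")  # chi2 = 18.47
```

In practice, `scipy.stats.chisquare` or `chi2_contingency` would be used to obtain exact p-values.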
It is important to note that participants did not need to explicitly ask knowledge questions to receive knowledge from the wizard. Process-related questions also led to knowledge transfer. An example from our data: "Part.: How small should the additions of vinegar be? - Agent: Very small additions. It will boil and bubble violently, so take your time to avoid a boil-over." (amount, part. 13)

Knowledge Communicated in Conversations
In this section, we analyse the quantity of information imparted in conversations to assess whether this varied across conditions. A total of 309 out of 1,415 agent utterances included information nuggets. On average, conversations in the active condition contained 15.88 information nuggets (x̃_info_nuggets = 14.50; IQR = 10-21), which was significantly higher (U = 144.5; p = .002) than in the passive condition, where a conversation contained an average of 9.13 (x̃_info_nuggets = 9.5; IQR = 3.75-12.25).
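The per-conversation comparisons above rely on the Mann-Whitney U test, whose statistic can be computed directly by comparing all pairs across the two samples. The sketch below uses invented nugget counts, not the study's raw data, and omits the normal approximation needed for a p-value.

```python
def mann_whitney_u(a, b):
    """U statistic for sample `a` versus `b` by direct pair comparison.

    O(len(a) * len(b)) and without tie correction, so suited only to
    small samples such as per-conversation nugget counts; in practice
    scipy.stats.mannwhitneyu would also supply the p-value.
    """
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

# Hypothetical per-conversation nugget counts for illustration only.
active = [10, 12, 14, 15, 17, 21, 23]
passive = [3, 4, 9, 10, 12, 12]
print(mann_whitney_u(active, passive))
```

A U value near the maximum (here len(a) * len(b) = 42) indicates that the active sample stochastically dominates the passive one.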
From the analyses in Section 5.2, we know that knowledge was not only transferred in response to Knowledge questions, but also via answers to Process questions, including those classified as Ingredient, Equipment, Time, or Preparation. To assess the quantity of knowledge conveyed in the agent's responses, we calculated and compared the amount of knowledge transferred across the various information needs.
For Knowledge and Process questions, we observed that the wizard tended to provide more extensive responses to knowledge-oriented than to process-oriented inquiries. On average, a response to a Knowledge question included 1.58 (SD = 1.78) pieces of information, more than double the 0.79 (SD = 1.39) pieces of information for Process questions. This difference is significant (U = 19,396.5, p < .0001).
Our results indicate that, on average, a greater amount of knowledge is conveyed in a conversation in the active condition. This suggests that conversational assistants that are actively engaged can help users acquire more background information while they are cooking.

Sources of Knowledge Used
As described in Section 4.4, wizards had access to various sources of information, including recipes and accompanying how-tos from Serious Eats, StackExchange Cooking, and Wikipedia. They also had access to other Serious Eats pages not directly related to the recipes used in our study. Table 3 displays the frequency with which each knowledge source was utilised to respond to specific types of questions, grouped by the corresponding information need.
While Wikipedia, recipe directions, and associated recipe how-tos were the most frequently utilised sources overall, the findings emphasise that different sources were employed depending on the type of question asked. The information in a recipe was primarily utilised to address questions related to the process, although, for such needs, how-tos, Wikipedia, and other sources were more frequently employed.
How-tos of the recipe served as the primary source of information for answering knowledge-based queries, particularly those pertaining to the significance of steps and scientific aspects.
In contrast, queries about historical context were largely resolved by referring to Wikipedia as a source of information. These results offer valuable insights for replicating similar systems in practice. Depending on the particular information need a user has, different knowledge sources should be utilised by the agent, and this could be used to weight silos in an information retrieval setup.

Understanding the Influence of Intervention
As described in Section 4.5, wizards applied two different kinds of intervention in the active condition: statements and questions. There was no guideline as to when and how these should be used; this was left to the wizards themselves to decide. Here, we analyse whether there is any post-experiment evidence that the tactic employed influenced outcomes.
We counted and analysed the questions and information nuggets that occurred in the conversation between a wizard intervention and either the subsequent intervention or the user proceeding to the following stage in the recipe. This is illustrated in more detail in Figure 7. Conversation (a) shows which turns we examined between two wizard interventions. Conversation (b) shows the case where a wizard intervention occurs prior to proceeding to the next step in the recipe; here, we used the turns between the wizard intervention and the user's next-step message for our analysis.
Where questions were used as an intervention (N = 49), this resulted in an average of 2.67 questions (x̃ = 2.0, IQR = 2-4) being asked, more than for statement interventions (N = 71, x̃ = 2.0, IQR = 1-3; U = 1,275.0, p < .006). We did not find any significant differences in the distribution of the kinds of questions asked. The fact that more questions were asked did not, however, result in more knowledge being transferred: we found no significant difference in the counted nuggets (x̃ = 1.8 nuggets after statements, IQR = 0-3; x̃ = 2.0 after questions, IQR = 0-3).
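The segmentation underlying this analysis can be expressed as a small pass over the turn sequence: collect the turns after each intervention until the next intervention or the user's next-step message. This is only a sketch of that bookkeeping; the speaker and kind labels are illustrative placeholders.

```python
def segments_after_interventions(turns):
    """Collect the turns between each wizard intervention and either the
    next intervention or the user's next-step message (cf. Figure 7).

    `turns` is a list of (speaker, kind) pairs; the kind labels are
    illustrative, not the study's annotation scheme.
    """
    segments, current = [], None
    for speaker, kind in turns:
        if speaker == "wizard" and kind == "intervention":
            if current is not None:
                segments.append(current)  # case (a): next intervention closes it
            current = []
        elif speaker == "user" and kind == "next_step":
            if current is not None:
                segments.append(current)  # case (b): next-step message closes it
                current = None
        elif current is not None:
            current.append((speaker, kind))
    if current is not None:
        segments.append(current)
    return segments

turns = [
    ("wizard", "intervention"),
    ("user", "question"),
    ("wizard", "answer"),
    ("user", "next_step"),
    ("wizard", "intervention"),
    ("user", "question"),
]
print([len(s) for s in segments_after_interventions(turns)])  # [2, 1]
```

Counting questions and nuggets per returned segment then yields the per-intervention statistics reported above.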

DISCUSSION
This section integrates the main findings from the survey, the differences in outcomes observed between the active and passive conditions in the WoZ study, and the insights gained from the 28 pilot experiments that led to the strategy guidelines for the wizards.
The key outcomes of the work are:
- The expansion of an existing taxonomy from Frummet and colleagues to include more precise and detailed descriptors for knowledge-based information needs. This not only has theoretical significance but also practical implications, as it enabled us (and would enable others) to conduct a more detailed analysis of the types of questions that were asked.
- Through our survey, we learned what users want from conversational assistants in a kitchen context. We discovered that individuals are interested in gaining knowledge about the food they cook, such as the science and history of the dish, the ingredients used, and how they are prepared. The survey results also revealed demographic patterns, with younger individuals who are more passionate about cooking showing a greater expectation for conversational assistants to offer such information.
- Despite a sample with similar characteristics, we did not find these demographic trends in the WoZ study. There were no differences in the number or type of questions asked, nor in the amount of knowledge imparted across groups.
- The WoZ study revealed that agents implementing an active policy resulted in increased interactivity, a higher number of questions asked, especially knowledge-related questions, and a greater amount of knowledge being conveyed.
- The pilot studies enabled the wizards to test and establish guidelines for the active condition, which have implications for the design of future systems and introduce new research challenges.
- The conversations are available for researchers to further analyse and experiment with. We anticipate that the datasets will be valuable for testing retrieval and question-answering algorithms within the context of conversational interactions.

What sets our dataset apart is not only its focus on knowledge-based questions and answers in the domain of cooking, but that the information needs associated with the questions are grounded within the context of completing a particular task. The answers to these questions are located across different information silos. Our dataset comprises user-generated questions, contextual information about the associated recipe and recipe step, responses formulated by the wizard, and comprehensive provenance information indicating the sources and combinations of the answers. This makes our data different from existing resources such as CookDial [26], TREC CAsT [10], and Wizard of Tasks [7]. The CookDial corpus [26] consists of questions that are based on the recipe document without incorporating external knowledge resources. The TREC CAsT dataset [10], in contrast, offers comprehensive provenance information but lacks representation of task-oriented dialogues. In the Wizard of Tasks dataset [7], conversations were generated through crowdsourcing and revolve around questions and answers tied to the recipe (or DIY) document. Wizards were allowed to use external knowledge sources and provide URL links as references; however, the specific passages that served as the grounding for these external resources are not easily traceable.
In the upcoming sections, we discuss the limitations of our studies (Section 6.1) before continuing to interpret the findings in relation to the existing literature (Section 6.2) and discuss their implications for the practical design of conversational cooking assistants (Section 6.3) and beyond (Sections 6.4 and 6.5).

Limitations
Before interpreting what the contributions mean for the conversational assistance literature and the design of these systems, it is important to acknowledge the limitations of our approach and how it was implemented.
In our simulated study, participants did not physically engage in cooking the meals; rather, they followed the outlined steps and posed questions that naturally arose as they envisioned carrying out each step. The choice to conduct a simulated, or "hypothetical," cooking process raises important implications and limitations that warrant discussion. We did not ask participants to physically cook the recipes for several reasons. First, recruitment for this study was already challenging and was compounded by the need for English-speaking participants across various demographics, particularly since some of the authors are situated in non-English-speaking countries. Using Prolific enabled us to access suitable participants.
Mandating participants to physically engage in cooking would have added significant complexity and expense to the experiments. Verifying whether participants were genuinely cooking would have posed a challenge. Additionally, this approach would likely introduce numerous additional variables related to the environment, equipment, and other factors, making it impractical to maintain control. Moreover, it would have made experiments much more time-consuming and involved covering the costs of ingredients, cleaning, and additional travel. Consequently, our experiments cannot be considered naturalistic. However, there is precedence for this approach in two of the primary datasets for digital assistants in a cooking scenario [7, 26]. To increase the simulation's validity, we provided participants with images depicting the completed steps as a means of helping them imagine the process.
We acknowledge that our study assumes an optimal cooking process, devoid of external factors such as something burning or the presence of children or other distractions that could potentially hinder the success of the cooking endeavour or hamper curiosity. While this approach allows for a more controlled examination of the conversational assistance, it also limits the generalisability of our findings to real-world, unpredictable cooking scenarios. That being said, the learning context that we envisage would be one where users have the time and curiosity to ask questions as a deliberate means to further their knowledge, rather than a situation where the aim is to cook a meal as quickly as possible or where children or friends are competing for the cook's attention. We argue that this makes our "distraction-free" scenario suitable for our research aims.
Similarly, we must acknowledge the limitations of our participant sample. Although we applied the same rules within Prolific to define the sampling strategy, the samples drawn had slightly different characteristics. We believe that the differences in the types of study, such as the interactive nature, the time taken to participate, and the scheduling of experiments, led to unavoidable sampling bias, as participation was less appealing to some groups. This may go some way toward explaining why the differences in expectations between younger and older participants in the survey were not reflected in the behaviours exhibited in the WoZ study. We discuss this further below.
Both samples, however, exhibited a bias towards younger individuals and towards older adults with a greater affinity for technology. We contend that this sample selection aligns with our research objectives: it is reasonable to anticipate that only older adults who are either technologically proficient or genuinely curious about technology would utilise such a system in practical scenarios. We can only speculate as to why younger participants reported a higher expectation for systems to provide information on science and history. We tend to believe it has less to do with a desire for learning and more to do with their understanding and expectations of the technology, but testing this would require additional investigation.
Finally, we wish to acknowledge that the behaviour of the participants in our studies, both the survey and the WoZ study, will have been shaped by past experience of conversational systems, which currently do not support the answering of knowledge-related questions. This may have naturally prevented knowledge questions from being asked. We argue that, regardless of past experience, this does not detract from the findings that, in the active condition, significantly more knowledge questions were asked and more knowledge was communicated by the agent, as we discuss in greater detail below.

Interpreting the Findings
Our goal in experimenting with initiative was to create a more human-like and engaging experience that would foster curiosity and make participants feel comfortable asking questions. All of the metrics we studied (number of utterances, number and types of questions asked, and amount of knowledge conveyed) increased in the active wizard condition. These findings suggest that when users are made aware that an agent has more information to share, they tend to proactively ask questions to obtain that knowledge rather than remaining passive. This is consistent with the survey results, which suggest that a significant number of users want agents to offer this information.
We see parallels with the literature on social bots, which has experimented with similar initiative strategies (making statements and asking questions) to make conversations more human-like and to encourage users to take control [22] and share information [28]. The findings are also consistent with previous research indicating that active tutoring systems can enhance the learning process for users [12]. Furthermore, these findings help explain the limited number of knowledge-oriented inquiries (1.89% of all inquiries) in Frummet and colleagues' naturalistic investigation. Given our results, this is reasonable, because the human agent employed a passive approach in their research. Considering the totality of the evidence (Frummet and colleagues' observation of low knowledge needs, the subsequent increase in those needs under an active condition, and the survey results), it appears that users typically do not spontaneously ask questions related to knowledge acquisition when not explicitly prompted to do so. However, they do demonstrate an interest in acquiring knowledge when they are prompted.
To put our results in the context of the literature, Table 4 presents examples of knowledge information needs identified in Frummet et al.'s naturalistic study. These examples highlight the variations in knowledge information needs within a naturalistic setting. Except for a few inquiries about chemical elements, which have a scientific focus, the majority of questions centred around practical and general knowledge and "did not relate to the implementation of [the current] cooking step" [19, p. 12]. The acquired knowledge, such as standardised ingredient quantities and their desired characteristics, can be applied in future cooking sessions. In contrast to Frummet et al.'s naturalistic study, the knowledge information needs uncovered in our study delve into more specialised and detailed historical and scientific aspects of cooking. Nonetheless, the information needs related to the importance of cooking steps maintain a strong practical relevance.

Design Implications for Conversational Cooking Assistants
The findings of our study yield several design implications for conversational assistants in the context of a kitchen. We outline these implications in three sections: the first focuses on the initiative strategy, the second discusses the implementation of an active strategy, and the final section highlights the technical challenges associated with implementing these strategies, specifically from a retrieval perspective.
Initiative Strategy: Assistants should adapt the degree of initiative to the user.
Users wishing to acquire background knowledge may choose for the system to behave more actively. This background knowledge proves particularly beneficial for individuals seeking to enhance their cooking abilities and expand their culinary knowledge. The presence of an active tutoring assistant aids the learning process, as highlighted by Dubiel et al. [12]. Specifically, questions relating to the significance of specific steps, such as the purpose of coating chicken in a flour mixture or the scientific principles underlying phenomena like soufflé deflation, contribute to users' comprehension of fundamental cooking processes. This understanding can then be applied in subsequent cooking sessions, facilitating further improvement of their culinary skills.
There are scenarios, however, where this kind of active strategy would be less appropriate. The survey results indicate that some users do not believe that assistants should provide such knowledge. Moreover, users who wish to cook quickly or minimise interventions (e.g., they are in a rush or are entertaining guests or children) may prefer to interact with an agent using a passive strategy, which our results show leads to more streamlined conversations with fewer utterances by both the user and the agent.
Providing a means for users to switch between initiative modes depending on their preference and context may be a desirable feature. There are various potential approaches for implementing it. One option is to allow users to initiate the learning mode by instructing the system with a command like "Enter learn mode." Alternatively, the system could proactively inquire about the user's preferred mode, for example: "Research indicates that I share more cooking knowledge when I actively suggest information at different steps. Would you prefer me to adopt this approach, or should I stick to simply answering the questions you ask, which would lead to a faster cooking process?" Further research could explore and compare these approaches to determine their effectiveness in terms of user experience.
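The command-driven variant of this mode switch can be sketched as a small state toggle in the dialogue manager. The class name, command phrases, and responses below are illustrative assumptions, not part of the study's system.

```python
class InitiativeModeAssistant:
    """Minimal sketch of a user-controlled initiative toggle; the command
    phrases and responses are illustrative, not from the study."""

    def __init__(self):
        self.mode = "passive"

    def handle(self, utterance):
        text = utterance.strip().lower()
        if text == "enter learn mode":
            self.mode = "active"
            return "Learn mode on: I will suggest background knowledge at each step."
        if text == "exit learn mode":
            self.mode = "passive"
            return "Learn mode off: I will only answer the questions you ask."
        return None  # defer to normal recipe guidance / question answering

assistant = InitiativeModeAssistant()
assistant.handle("Enter learn mode")
print(assistant.mode)  # active
```

A proactive variant would instead ask the mode question at session start and set `self.mode` from the user's reply.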

Implementing an Active Strategy:
The WoZ guidelines provide insight into the functionality that needs to be implemented to successfully recreate the wizard behaviour automatically.
First, on the basis of wizard experience, we recommend that interventions be personable, enthusiastic, and empathetic. As illustrated in Figure 7(a), participants appreciated empathetic statements such as "I'm happy to tell you this." or "Interesting, right? :-D", responding with "That's interesting. Thank you." This aligns with previous research finding that users appreciate engaging assistants [42].
Second, interventions can take the form of questions derived from the knowledge associated with completing the step, or the form of statements. Our findings showed that users ask slightly (but significantly) more questions when the system prompts with a question, but there was no evidence of more knowledge being communicated as a result. Therefore, systems should be designed to use either approach in an active strategy.
Finally, in our extensive piloting, where we tested various strategies, moving to the next step in a recipe was found to be an appropriate time to make interventions. This would be one way of getting around the difficulty of timing interventions reported in the literature [2]. The wizards determined the timing to be especially appropriate when extensive knowledge is available for a step beyond that communicated in the recipe instructions. However, determining this would require systems to automatically derive the mappings (see Figure 4) used by the wizards, which we curated by hand. This is a further open problem for future research.

Federated Search Problem:
To successfully answer questions, an agent must generate utterances using various sources of knowledge, regardless of the initiative strategy employed. We imagine a testing framework for retrieving and formulating these utterances with a setup similar to that used in TREC CAsT [10]. The data collected in this study could form the basis for experimentation of this sort, as it includes user questions, conversational context, and the answers provided by the wizard, as well as the silo from which each answer was sourced. For researchers interested in performing such experiments, we have made the data available.⁷ Through our research, we have discovered that different types of questions are typically associated with distinct categories of information. This insight could inform the weighting of different silos in retrieval experiments, which in turn relates to the classification of information needs, such as that proposed by Frummet et al. and other similar schemes.
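Need-dependent silo weighting of this kind can be sketched as a lookup applied to raw retrieval scores before merging result lists. The silo names below mirror the source types in Table 3, but the need types and weight values are invented for illustration.

```python
# Illustrative need-type -> silo weights; the silo names mirror the
# sources in Table 3, but the weights themselves are invented.
SILO_WEIGHTS = {
    "history": {"wikipedia": 0.6, "howto": 0.2, "recipe": 0.1, "other": 0.1},
    "science": {"howto": 0.5, "wikipedia": 0.3, "recipe": 0.1, "other": 0.1},
    "process": {"recipe": 0.6, "howto": 0.2, "wikipedia": 0.1, "other": 0.1},
}

def weighted_silo_scores(need_type, silo_scores):
    """Scale each silo's raw retrieval score by its weight for the
    classified information need before merging result lists."""
    weights = SILO_WEIGHTS.get(need_type, {})
    return {silo: score * weights.get(silo, 0.0)
            for silo, score in silo_scores.items()}

scores = weighted_silo_scores("history", {"wikipedia": 0.8, "recipe": 0.9})
print(scores)  # Wikipedia outranks the recipe despite a lower raw score
```

In a full federated setup, the weights would be learned from data such as the source-usage frequencies in Table 3 rather than set by hand.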

Beyond the Cooking Domain
While we exercise caution in avoiding over-interpretation of our findings, which are derived from a study specifically focused on supporting knowledge acquisition in a cooking scenario, we find no compelling reason why many of the insights uncovered in our research would not extend to other task-based conversational assistance contexts. For instance, it seems plausible that employing an active strategy could facilitate the communication of more knowledge in analogous task-based scenarios, such as DIY projects or the repair of items such as coffee machines, bikes, or cars.
The evidence indicates that users engaged in these types of tasks are inclined to broaden their knowledge rather than solely focusing on accomplishing the primary goal of task completion. Choi et al. [6] investigated procedural tasks, including activities such as cooking, DIY projects, and learning new skills. Their findings revealed that, while the majority of individuals sought step-by-step instructions, approximately 10% expressed a desire for additional background knowledge that was not strictly necessary for completing the task. We find it plausible that if users want such supplementary information and agents make it apparent via an active strategy that they can provide it, then users are more likely to seek and engage with it, as was the case in our cooking investigation.

Additional factors reinforcing our confidence in the generalisability of our findings include parallels with other learning contexts, such as tutoring, where active strategies have demonstrated support for the learning process, as well as insights from the social-bots literature indicating that personal disclosure, coupled with the "disclosure-reciprocity effect" [9], leads to users sharing more information than they normally would [28]. These aspects, taken together with the evidence that active assistants can enhance task performance in search tasks like booking flights [11], paint a consistent picture. Our interests are focused on the cooking domain, but we would encourage other scholars to verify our suspicions empirically. It is clear, however, that, regardless of the domain, there are situations where people are more open to learning and others where users want to focus on completing the task at hand. We, as a research community, know little about this and its decisive factors, and this represents a challenging but important research direction for the future.

Studying Conversational Interaction-Lessons Learned
Through our studies and extensive piloting, we gleaned valuable insights that have implications for the future of Conversational User Interface (CUI) testing. While the concept of Wizard of Oz (WoZ) testing is not new, our approach introduced novel aspects: we utilised Prolific to recruit participants for offline experiments, employed TaskMAD to facilitate and regulate wizard interaction, and incorporated images to enhance the simulation. Our takeaways from this process could prove useful to other scholars, extending beyond the realm of cooking studies. We summarise these below.
- Employing two wizards proved beneficial, since it shared the workload and facilitated discussion and shared understanding between the wizards. This necessitated a clear strategy not only for intervention methods but also for determining optimal timing. Conducting numerous pilot experiments was essential to derive robust wizard behaviour and shorten response times.
- The use of TaskMAD as a platform offered advantages. Its button-based interface streamlined social interaction, reducing wizard response times and promoting smoother engagement. The search interface allowed quick access to background knowledge from multiple silos and, since frequently used information could be represented by buttons, this reduced response time significantly.
- It is our impression that the use of images and video is a great means of improving both interaction and realism in chat simulations. We are unaware of any other study that has done this.
- This study represented our first experience of using Prolific to recruit for a study of this type, i.e., one where interaction with the experimenter is necessary, and we are unaware of any other study that has done this. Prolific is certainly not designed for this purpose, but it offered us access to a heterogeneous pool of suitable participants. Participant recruitment and retention were far lower than for previous Prolific studies, which led to data collection stretching over several weeks.
Despite these insights, running Wizard-of-Oz studies remains difficult and time-consuming, and requires expert task knowledge. It may be interesting to consider how LLMs with task knowledge might be used to augment or support wizards in future studies.

CONCLUSIONS AND FUTURE WORK
In this article, we have presented the results of two empirical studies aimed at shedding light on how users interact with digital assistants in a kitchen context. Our first study, a survey of 200 participants, revealed that users generally expect assistants to provide information on cooking steps and processes. However, we found that younger participants who enjoy cooking were more likely to expect assistants to provide information on the history of food or the science behind cooking processes.
Our second study was a follow-up Wizard-of-Oz experiment with 48 participants, in which we compared the effectiveness of an active wizard policy versus a passive wizard policy. We found that the active policy led to almost double the number of conversational utterances, 1.5 times more knowledge-related user questions, and 1.7 times more knowledge communicated than the passive policy. These findings suggest that providing users with proactive guidance and information can lead to a more engaging and productive interaction with digital assistants in the kitchen.
Overall, our results have important implications for the design and use of digital assistants in a kitchen context. Specifically, our findings suggest that assistants should be designed to offer proactive guidance and information to users, especially younger users who are more interested in the science and history of food. We believe the data collected in our study provide a solid basis for future work. In a first step, we plan to study how existing QA and passage retrieval approaches are able to recreate the answers given by the wizards. We hope to build on these baselines by developing new approaches based on the insights from our experience as wizards. Moreover, we hope to examine ways of automating the interventions of wizards in the active condition, which includes the creation of suitable questions and statements, as well as the automated creation of the how-to mappings that proved valuable for determining the timing of interventions.

A.1 Survey
What information should a smart cooking assistant provide? Imagine a digital assistant (e.g., Siri, Alexa, or other) that can provide you with information while you are cooking. We are interested in learning what information would be desirable to you in such a situation.
- Help learn about the origin of the recipe and its development (e.g., "Where does Duck à l'Orange originate?") Strongly disagree 1 2 3 4 5 6 7 Strongly agree
- Help learn about the science behind cooking processes (e.g., "What happens to sour cream when it is heated?") Strongly disagree 1 2 3 4 5 6 7 Strongly agree
- Help to adapt a recipe to my (dietary) needs and preferences. Strongly disagree 1 2 3 4 5 6 7 Strongly agree
- Explain how and why a step in the recipe is important. Strongly disagree 1 2 3 4 5 6 7 Strongly agree
- Inform about ingredients and quantities needed for a recipe (e.g., "Which ingredients do I need?" "How many potatoes should I use?") Strongly disagree 1 2 3 4 5 6 7 Strongly agree
- Inform about the equipment/cooking utensils to use (e.g., "Can I use a pot for this?") Strongly disagree 1 2 3 4 5 6 7 Strongly agree
- Inform about the temperature at which ingredients/meals should be cooked (e.g., "At which temperature?" "Do I need to preheat the oven?") Strongly disagree 1 2 3 4 5 6 7 Strongly agree
- Inform about the time required until the meal is prepared (e.g., "How long does it take? 10 minutes or 20 minutes?") Strongly disagree 1 2 3 4 5 6 7 Strongly agree
- Help learn the cooking techniques required by the recipe (e.g., "OK how do you prepare potatoes properly?") Strongly disagree 1 2 3 4 5 6 7 Strongly agree
- Guide through the process of preparing the recipe (e.g., "What should I do next?") Strongly disagree 1 2 3 4 5 6 7 Strongly agree
- Provide suggestions about complementary dishes (e.g., "Which desserts go with chili?") Strongly disagree 1 2 3 4 5 6 7 Strongly agree

A few questions to help us understand who has answered our survey:
- How confident do you feel about being able to cook from raw or basic ingredients? Extremely confident 1 2 3 4 5 6 7 Not confident at all
- How confident do you feel about following a simple recipe? Extremely confident 1 2 3 4 5 6 7 Not confident at all
- How confident do you feel about preparing and cooking new foods and recipes? Extremely confident 1 2 3 4 5 6 7 Not confident at all
- How often do you prepare and cook a main meal using raw ingredients (for example, cooking soup using fresh vegetables, or cooking chili using raw meat and fresh vegetables)?

Fig. 4. Example of a mapping used by the wizards in the experiments.

As shown in the example, Annotator 1 counted three information nuggets, Annotator 2 counted two, and Annotator 3 counted only one piece of information.

Fig. 7. Intervention analysis. Utterances on the left are from the wizard; utterances on the right are from the user.

Table 1. Example Questions from Knowledge Information Need Subtypes History, Step Importance, and Science

Table 2. Information Need Distribution in the Active and Passive Conditions

Table 3. Information Source Types Employed by the Agent Based on Information Need Type

Daily / 4-6 times a week / 2-3 times a week / Once a week / Less than once a week / Never
- How would you describe your current employment status? XXXX
- What is the highest degree or level of education you have completed? Less than high school / High school graduate (includes equivalency) / Bachelor's degree / Master's degree / Ph.D. or higher / Vocational Education
- How often do you use a smart assistant such as Alexa, Siri, Google Home, or other? Daily / 4-6 times a week / 2-3 times a week / Once a week / Less than once a week / Never

Table 5. List of Recipes Used in Our Experiments