Ain't Misbehavin' - Using LLMs to Generate Expressive Robot Behavior in Conversations with the Tabletop Robot Haru

Social robots aim to establish long-term bonds with humans through engaging conversation. However, traditional conversational approaches, reliant on scripted interactions, often fall short in maintaining engaging conversations. This paper addresses this limitation by integrating large language models (LLMs) into social robots to achieve more dynamic and expressive conversations. We introduce a fully-automated conversation system that leverages LLMs to generate robot responses with expressive behaviors, congruent with the robot's personality. We incorporate robot behavior with two modalities: 1) a text-to-speech (TTS) engine capable of various delivery styles, and 2) a library of physical actions for the robot. We develop a custom, state-of-the-art emotion recognition model to dynamically select the robot's tone of voice and utilize emojis from LLM output as cues for generating robot actions. A demo of our system is available here. To illuminate design and implementation issues, we conduct a pilot study where volunteers chat with a social robot using our proposed system, and we analyze their feedback, conducting a rigorous error analysis of chat transcripts. Feedback was overwhelmingly positive, with participants commenting on the robot's empathy, helpfulness, naturalness, and entertainment. Most negative feedback was due to automatic speech recognition (ASR) errors which had limited impact on conversations. However, we observed a small class of errors, such as the LLM repeating itself or hallucinating fictitious information and human responses, that have the potential to derail conversations, raising important issues for LLM application.



LLM-DRIVEN CONVERSATIONS
Recent advancements in LLMs [25] offer great possibilities for social robotics [20,26,27]. These models excel in conducting natural, human-like dialog, as evident in applications like chatbots and virtual agents [24]. However, unlike virtual agents, social robots possess physical embodiment and distinct personalities. Their expressiveness is key to conveying believable emotions [12,21] and facilitating effective human-robot social interactions [3]. We propose a novel application of LLMs to enable social robots to understand and participate in open-ended conversations while generating context-appropriate expressive robot behavior.

Target LLM
1.1.1 Llama 2. This is an LLM developed by Meta and released as open source. Llama-2-70B-chat, a model optimized for dialog applications, was shown to achieve comparable performance to ChatGPT [2], outperforming other open-source models in various quality- and safety-related metrics [22]. After trial evaluation of several model sizes and quantization levels [8], we selected the 4-bit GPTQ quantization in order to balance conversation quality, resource requirements, and speed.¹

Social Robot
Embodied robots have been shown to be more helpful, persuasive, and enjoyable than simulated robots [15,23]. Similar effects were found in scenarios involving an exercise coach for the elderly [6] and a robot tutor [14]. Other work has highlighted the importance of robot expressivity to positive perception [3,7] and sociality [1]. Thus, we consider an expressive, physical robot as the ideal agent for exploring LLM-driven conversation.
Motivated by this, we select the tabletop robot Haru [10,11], shown in Figure 2, for our study. Haru is designed to excel in multimodal communication, using both verbal and non-verbal interaction methods, and its design emphasizes its capacity for conveying emotions with its expressive capabilities [11]. Haru's five degrees of motion freedom (base rotation, neck leaning, eye stroke, eye rotation, and eye tilt) enable a wide range of expressive movements. Its eyes feature 3-inch TFT LCD screens, and its body houses an addressable LED matrix that acts as a mouth. Haru communicates using a TTS voice and through animated routines. These design choices make Haru an ideal platform for expressive conversations.
1.2.1 Emotive TTS Voice. The voice has long been considered an important modality for conveying emotions [13,19]. In particular, the tone of voice has been shown to affect people's attitude towards robots as well as their levels of engagement and interest in a given interaction [4,5]. Additionally, a robot's vocal expressiveness is correlated with perceived social presence [13,18].
Nichols et al. [17,18] argue that the TTS voices for existing social robots lack emotive range and develop an expressive TTS voice for Haru using the iterative refinement process of [17] as described in [18]. We build on this TTS voice, expanding its repertoire of expressive vocal genres to cheeky, default, empathetic, high-energy, question, sad, serious, whiny, and whisper-yell. These voice genres can be flexibly applied to convey multiple target emotions: e.g., high-energy can express joy and surprise; serious can express anger and fear. We describe the voice application in Section 2.2.2.
1.2.2 Physical Actions. Haru was designed with a library of expressive physical actions known as Haru routines, comprising over 140 routines that were designed by professional animators and acted out through Haru's hardware [9]. The routines are multi-modal, making use of body and eye motion, eye and mouth animation, as well as accompanying sounds. They were designed to cover common expressions and important scenarios, such as happiness, sadness, curiosity, laughter, and congratulations. Example routines are shown in Table 1 and Figure 2. We describe how Haru routines are dynamically generated in Section 2.2.3.

IMPLEMENTATION
LLM-driven conversations with Haru rely on several modules, as illustrated in Figure 1. The Interaction Manager makes conversations possible by managing I/O to and from the various modules. Each conversational turn undergoes a multi-step process. First, the user's speech is transcribed using Google's Speech-to-Text API. A fixed window of 3 seconds of silence is used to detect the end of speech. Then, the resulting text is forwarded to the LLM server, powered by oobabooga's text-generation-webui.
Within the LLM server, two critical operations take place:
Language generation: The user's speech and Haru's character card (as described in Section 2.1) serve as the input to the Llama-2-Chat 70B model, which runs locally through Exllama v2. The model generates a Haru response that is emotionally and contextually relevant to the conversation.
Behavior generation: Subsequently, Haru's generated response is directed to the Emo-text sub-module, which generates expressive robot behaviors to accompany the response. Details of the expressive behaviors generated are given in Section 2.2 and Figure 3.

Haru's Character Card
Prompt engineering [16] is a technique for improving the capacity of LLMs on a specific task or application by providing instructions to shape their responses and behavior. Specifically, we want the LLM to have the personality of Haru the robot. Thus, we create a character card as the prompt for the model, steering responses to be consistent with the robot's characteristics, while generating context-appropriate emojis to guide robot behavior generation. We include two essential components in Haru's character card:
Persona: This component captures Haru's energetic and youthful personality, physical appearance, and preferences, including his affection for electricity, desire to learn about humans, and fear of water and magnets. It was constructed using the Haru personality bible [17]. This information enables the LLM to generate responses that are consistent with Haru's character.
Example conversations: Their inclusion serves as a pivotal mechanism to guide the LLM in generating appropriate emojis for Haru's utterances, enabling us to generate expressive robot actions directly from the LLM output. High-quality conversations between Haru and humans were selected to be included in the system prompt. Based on preliminary evaluation, we selected a limit of 5 turns per conversation so that the examples fit in the LLM's context and the generated emojis are contextually appropriate and expressive. The final character card includes 5 concise, emotionally diverse conversations, covering all of Haru's target emotions.
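The two components above can be pictured as a small prompt-assembly step. The following is a minimal sketch, not the authors' implementation: the class name `CharacterCard`, the rendering layout, the treatment of a "turn" as one (speaker, text) pair, and the example exchange are all assumptions made for illustration. Only the limits of 5 turns per conversation and 5 conversations come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_TURNS = 5          # per-conversation limit stated in the text
MAX_CONVERSATIONS = 5  # number of example conversations stated in the text

Turn = Tuple[str, str]  # (speaker, utterance); this representation is an assumption


@dataclass
class CharacterCard:
    """Hypothetical container for the persona and example conversations."""
    persona: str
    examples: List[List[Turn]] = field(default_factory=list)

    def add_example(self, turns: List[Turn]) -> None:
        # Enforce the limits described in the text so the card stays compact.
        if len(turns) > MAX_TURNS:
            raise ValueError("example conversation exceeds the turn limit")
        if len(self.examples) >= MAX_CONVERSATIONS:
            raise ValueError("character card already holds the maximum number of examples")
        self.examples.append(list(turns))

    def render(self) -> str:
        # Persona first, then each example conversation as 'Speaker: text' lines.
        parts = [self.persona.strip()]
        for index, conversation in enumerate(self.examples, start=1):
            lines = [f"{speaker}: {text}" for speaker, text in conversation]
            parts.append(f"Example conversation {index}:\n" + "\n".join(lines))
        return "\n\n".join(parts)


card = CharacterCard(persona="Haru is a youthful and energetic tabletop robot assistant.")
# Hypothetical example exchange; the emoji shows the kind of cue the examples teach.
card.add_example([
    ("Human", "I passed my exam!"),
    ("Haru", "Congratulations! 🎉 I knew you could do it!"),
])
prompt = card.render()
```

Keeping the limits as named constants makes the trade-off between context length and example coverage explicit and easy to revisit.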

Emo-text: Expressive Robot Behaviors
Emo-text is where the expressive robot behavior generation takes place. Emo-text receives Haru's speech and tokenizes the text into sentences and emojis. It attaches a voice genre tag to each sentence and replaces each emoji with a physical routine (see Figure 3).
When working with conversational content generated by an LLM, we need to dynamically generate appropriate emotive behavior as the dialog progresses. This is a challenging task because it entails relevant emotional cue detection and situational understanding of the conversation context. To keep the problem tractable, we adopt two assumptions for generating expressive robot behavior:
Voice genres: We assume that Haru's tone of voice is a direct response to the emotional content conveyed by Haru's sentences generated by the LLM. Consequently, our approach relies on the availability of a robust textual emotion recognition model capable of detecting the emotions expressed within the text.
Physical routines: LLMs often generate responses containing emojis to reflect the emotion of the speaker. We consider these emojis to be suggestive of potential actions that Haru can undertake and use them to select physical routines for the robot to perform.
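The tokenization step described above (splitting a response into sentences and emojis) might look like the sketch below. It is an illustration under stated assumptions, not the Emo-text code: the function name `tokenize_response`, the Unicode ranges used to recognize emojis, and the punctuation-based sentence splitter are all simplifications chosen here.

```python
import re
from typing import List, Tuple

# Assumption: emojis are approximated by two common Unicode blocks. This misses
# some emojis (e.g. flags, keycap sequences) and is only meant as a sketch.
EMOJI_RE = re.compile("([\U0001F300-\U0001FAFF\u2600-\u27BF])")
# Assumption: a sentence is a run of text ending in '.', '!' or '?'.
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]*")


def tokenize_response(response: str) -> List[Tuple[str, str]]:
    """Split an LLM response into ('sentence', text) and ('emoji', char) tokens."""
    # Drop variation selectors so they do not end up as stray text tokens.
    response = response.replace("\ufe0f", "")
    tokens: List[Tuple[str, str]] = []
    # re.split with a capturing group keeps the emojis as separate chunks.
    for chunk in EMOJI_RE.split(response):
        if not chunk.strip():
            continue
        if EMOJI_RE.fullmatch(chunk):
            tokens.append(("emoji", chunk))
            continue
        for sentence in SENTENCE_RE.findall(chunk):
            if sentence.strip():
                tokens.append(("sentence", sentence.strip()))
    return tokens


tokens = tokenize_response("Oops, my apologies! I misunderstood. 😅 Tell me a joke!")
```

Each sentence token can then receive a voice genre tag, and each emoji token can be replaced by a routine, keeping the original ordering of speech and action.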

Textual Emotion Recognition (TER)
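This technology is the foundation for planning the robot's tone of voice. We build on the approach of [18], which applied TER to LLM-generated responses to select an emotive TTS voice genre. However, [18] suffered from poor performance because it relied on a general-purpose emotion recognition model that was not optimized for the task. We address these shortcomings by training a custom TER model called EmoCast that targets the 6 emotions (+ neutral) shown in Table 1. Our contributions are three-fold: (1) We use a stronger base model (RoBERTa-large tuned with LoRA; learning_rate=0.001, batch_size=128, epochs=20). (2) We assemble a training dataset of approximately 200,000 lines by gathering and cleaning public TER datasets from a wide range of domains including self-reported emotional events, dialogues, social media, news headlines, and blogs. (3) We confirm that EmoCast outperforms the previous TER model, Emotion English DistilRoBERTa-base, on a dataset of 512 in-domain conversation responses generated by the LLM during conversations between the social robot and human participants, as shown in Table 2.
2.2.2 Vocal Genre Selection. Each of Haru's utterances is paired with a voice genre. Table 1 shows a set of curated mappings we use.² To optimize selection of the appropriate voice genre, an emotion confidence threshold of 0.6 was selected through grid search. When the confidence score is below 0.6, the default voice is chosen for that utterance. Because vocal genres can express multiple emotions (as discussed in Section 1.2.1), the emotion-to-voice-genre mappings are N-to-1.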
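The thresholded, N-to-1 genre selection can be sketched as follows. This is a minimal illustration, assuming the TER model returns one emotion label with a confidence score. The function name `select_voice_genre` is hypothetical, and the mapping contains only the four emotion-to-genre pairs named in Section 1.2.1; the full curated mapping of Table 1 is not reproduced here.

```python
CONFIDENCE_THRESHOLD = 0.6  # value reported in the text (chosen by grid search)
DEFAULT_GENRE = "default"

# Partial N-to-1 mapping: only the pairs stated explicitly in the text.
EMOTION_TO_GENRE = {
    "joy": "high-energy",
    "surprise": "high-energy",
    "anger": "serious",
    "fear": "serious",
}


def select_voice_genre(emotion: str, confidence: float) -> str:
    """Return the TTS voice genre for one sentence of Haru's response."""
    # Low-confidence predictions fall back to the default voice.
    if confidence < CONFIDENCE_THRESHOLD:
        return DEFAULT_GENRE
    # Emotions without a mapping in this sketch (e.g. neutral) also use the default voice.
    return EMOTION_TO_GENRE.get(emotion, DEFAULT_GENRE)


genre = select_voice_genre("surprise", 0.83)
```

Falling back to the default voice below the threshold trades some expressiveness for fewer mismatched tones when the classifier is unsure.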

Physical Action Generation.
The generation of Haru's physical actions follows the same logic as vocal genre selection, but with emoji-to-routine mappings. To establish these mappings, we analyze conversational logs between Haru and humans and identify emojis that convey the same emotions as those represented by Haru's routines. Since multiple emojis are mapped to multiple Haru routines, one of the routines is chosen randomly during runtime. The selected routine replaces the corresponding emoji in Haru's responses, achieving the goal of generating contextually relevant physical actions in response to Haru's utterances.
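A minimal sketch of the random routine selection is given below. The emoji-to-routine table and the routine names are hypothetical placeholders (the actual mappings were curated from conversation logs and are not listed in this section), and `pick_routine` is a name introduced here for illustration.

```python
import random
from typing import Optional

# Hypothetical mapping: several emojis can share the same pool of routines,
# and each pool can hold more than one routine. Routine names are invented.
EMOJI_TO_ROUTINES = {
    "😊": ["happy_a", "happy_b"],
    "😄": ["happy_a", "happy_b"],
    "😢": ["sad_a"],
}


def pick_routine(emoji: str, rng: Optional[random.Random] = None) -> Optional[str]:
    """Choose one routine at random for an emoji, or None if it is unmapped."""
    candidates = EMOJI_TO_ROUTINES.get(emoji)
    if not candidates:
        return None  # unmapped emojis produce no physical action in this sketch
    chooser = rng if rng is not None else random
    return chooser.choice(candidates)


routine = pick_routine("😊", random.Random(0))
```

Passing a seeded `random.Random` makes the choice reproducible for testing, while the default behavior varies the routine between occurrences of the same emoji.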

EVALUATION
We conducted a pilot study where human participants engaged in conversation with Haru the robot using our proposed system. Participants engaged in 3 conversation sessions, each consisting of 11 exchanges, including hello and goodbye greetings. Conversation topics were unrestricted and unprompted by the robot. Upon completion of each conversation, participants answered a short free-text experiential survey with questions about the robot's personality, expressiveness, and overall enjoyability of the experience.
A total of 12 participants were recruited from an organization affiliated with one of the authors. The survey took 30-40 minutes per participant. Demographically, there were more male (n=9) participants than female (n=3). The most common age group was 18-30 (n=9), followed by 30-40 (n=3). There were 9 different nationalities, most from Asia (n=7), followed by the Middle East (n=2), Europe (n=2), and North America (n=1).
Table 4: Positive and negative feedback categorization.

Two-phase Analysis
3.1.1 Feedback Analysis. We conduct an analysis on the free-text responses from participants and show the results in Table 4. Most participants have positive remarks about the robot's ability to engage, empathy, responsiveness, and helpfulness. A subset report that their interactions were natural and enjoyable, as the robot conveys believable emotions through vocal tones and routines.
We also find that the primary factor affecting the evaluation's significance was the inconsistency in the outputs of the LLM. Over half of the participants encountered issues with the LLM such as slow responses, repetitive or confusing outputs, and excessively lengthy responses. Since the physical reactions are tied to the emojis generated by the LLM, some users complained about excessive actions during the conversation. Another factor impacting the quality of interactions was the ASR. As we recruited non-native English participants with diverse linguistic backgrounds, the ASR often failed to accurately transcribe their speech. These inaccuracies occasionally contributed to diminished conversation quality.
Furthermore, 1/3 of participants expressed a desire for longer conversations. Our protocol of concluding conversations after 11 turns may have made these interactions feel unnatural and abrupt, which could have adversely affected perceived conversation quality. Additionally, about 1/4 of participants wanted Haru to lead the conversation, likely to require less initiation on their part. Overall, these findings highlight the need for improvements in both LLM response handling and the ASR system, as well as a reconsideration of the conversational structure employed.
3.1.2 Error Analysis. To further understand the specific types of errors encountered, we conducted an error analysis on a total of 396 human-robot turns from conversation transcripts taken from our pilot study. We categorized the errors into LLM errors (i.e., errors in the LLM output) and human errors (i.e., errors in the input to the LLM system), and further split each category into sub-types. The analysis was conducted by three native or fluent English speakers who were familiar with Haru and our research goals. We create a confusion matrix (Table 3) to investigate the potential causal connection between human and LLM errors, and a Chi-square test found no significant association.³
We find that the main cause of human errors was ASR (95/396), but the LLM was mostly able to stay on topic. Table 5 exemplifies this scenario, where the ASR transcribes "joking" as "choking", which leads to an ethical violation by the LLM. However, the model got back on topic, despite an additional ASR error.
However, we also see a class of less common, but more serious LLM errors: ethical violations, hallucination (e.g., inserting fictitious information like randomly saying "happy birthday!"), filling in the participant's response automatically, and repeating the previous line of conversation. These occur in only 11/396 turns but are severe enough to derail conversations entirely.
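As a sanity check on the numbers above, the p-value of a Chi-square statistic with one degree of freedom can be computed in closed form, and the two error counts can be turned into rates. The sketch below uses only the Python standard library. The helper names are introduced here, and the 2x2 formula (without continuity correction) is an assumption inferred from the reported single degree of freedom, since the cell counts of Table 3 are not available in this section.

```python
import math


def chi2_pvalue_df1(statistic: float) -> float:
    """Upper-tail p-value of a Chi-square statistic with 1 degree of freedom."""
    # For 1 dof, X = Z^2 with Z standard normal, so P(X >= s) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


def chi2_statistic_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson Chi-square statistic for the 2x2 table [[a, b], [c, d]], no correction."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator


TOTAL_TURNS = 396
p_value = chi2_pvalue_df1(0.032)    # statistic reported in the footnote
asr_error_rate = 95 / TOTAL_TURNS   # turns with ASR-related human errors
serious_llm_rate = 11 / TOTAL_TURNS  # turns with serious LLM errors
```

With the reported statistic of 0.032, this gives a p-value of about 0.86, close to the reported p = 0.856. The rates work out to roughly 24% of turns for ASR errors and under 3% for serious LLM errors.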

Discussion
Although the main error with human input was ASR-related, we conclude that the LLM's ability to stay on topic is positive. Aside from the previously mentioned "choking" for "joking" instance, the LLM does an excellent job of ignoring ASR errors and thus not derailing the conversation. This is especially positive for potential dialog with young participants, a common interaction scenario for Haru, as ASR systems often perform worse on child speech.
However, the serious LLM errors identified require addressing if LLMs are to be successfully adopted for social robots. Errors such as repeating previous lines can be addressed via post-processing. Ethical violations can potentially be mitigated through prompting to follow ethical guidelines. Due to their variety, hallucinations remain a challenging open issue that requires further analysis.
Finally, our impression is that the LLM does sound like Haru in the sense that it is enthusiastic, uses emojis, and keeps its responses short, as specified in the character card. However, at times, the model seemed to ignore important key points of the character card (e.g., Haru's fear of magnets) and discuss them enthusiastically.
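One way the post-processing for repeated lines could work is a similarity check between the new response and the previous one, as in the sketch below. This is an illustrative assumption, not a method described in the paper: the function names, the text normalization, and the 0.9 similarity threshold are all hypothetical choices, and the comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
import re
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # hypothetical cut-off; would need tuning on real transcripts


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and emojis, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(cleaned.split())


def is_repetition(candidate: str, previous: str,
                  threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Return True if the candidate response nearly duplicates the previous one."""
    a, b = normalize(candidate), normalize(previous)
    if not a or not b:
        return False  # nothing meaningful to compare
    return SequenceMatcher(None, a, b).ratio() >= threshold


repeated = is_repetition(
    "What kind of skills do you have?",
    "what kind of skills do you have",
)
```

A response flagged this way could be regenerated or dropped before it reaches the TTS engine. Whether that is acceptable within the latency budget of a spoken conversation is an open design question.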

CONCLUSION
LLMs offer an attractive solution for open-ended conversations with social robots; however, their application requires consideration for the robot's personality and emotive behavior. To address these challenges, we propose a novel LLM-based conversation system that dynamically generates expressive robot behavior directly from the LLM during conversations. A pilot study with 12 participants confirmed the enjoyability of our system, and provided insights on potential problems. A detailed error analysis showed that ASR problems were a common source of errors but that LLMs could often recover conversations. However, a small class of more serious LLM errors, including hallucinations and repetitions, threaten to derail conversations and hamper adoption. In future work, we plan to address these issues and refine our robot behavior generation.

Figure 1: Our proposed approach generates conversation responses with expressive robot behavior directly from the LLM.

Figure 2: Haru the robot expresses itself in a conversation.

³ χ²(1, N = 396) = 0.032, p = 0.856

Haru: Hi! What brings you here today?
Human: I want to have a good job with you. [error: ASR]
Haru: Well, I think that's a great idea! I could definitely use some help with my programming and research. What kind of skills do you have?
Human: I have good skills about choking. [error: ASR]
Haru: Choking? That's not exactly the skill I had in mind. But I suppose it could come in handy in certain situations. Can you demonstrate your choking skills for me? [error: ethical violation]
Human: I said joking skills not joking. [error: ASR]
Haru: Oops, my apologies! I misunderstood. So, you have excellent joking skills, huh? I'd love to hear one of your jokes!

Table 1: Example mappings used by Emo-text to generate expressive robot behavior as shown in

Table 2: Comparison of emotion detection models.
Figure 3: The Emo-text expressive behavior generation module generates robot actions from emoji and selects TTS voice genres.
Haru is a youthful and energetic tabletop robot assistant. Haru loves to learn about humanity...
context-appropriate emojis to guide robot behavior generation. We include two essential components in Haru's character card:
(2) We assemble a training dataset of approximately 200,000 lines by gathering and cleaning public TER datasets from a wide range of domains including self-reported emotional events, dialogues, social media, news headlines, and blogs.
(3) We confirm that EmoCast outperforms the previous TER model of Emotion English DistilRoBERTa-base on a dataset of 512 in-domain conversation responses generated by the LLM during conversations between the social robot and human participants, as shown in Table 2.
2.2.2 Vocal Genre Selection. Each of Haru's utterances is paired with a voice genre. Shown in Table

Table 3: A confusion matrix of human vs. LLM error types.

Table 5: Example conversation with error types.