Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors

This paper tackles the challenging task of evaluating socially-situated conversational robots and presents a novel objective evaluation approach that relies on multimodal user behaviors. In this study, we focus on the human-likeness of the robot as the primary evaluation metric. While previous research has often relied on subjective evaluations from users, our approach evaluates the robot's human-likeness indirectly, based on observable user behaviors, thus enhancing objectivity and reproducibility. To begin, we created a dataset annotated with human-likeness scores, utilizing user behaviors found in an attentive listening dialogue corpus. We then analyzed the correlation between multimodal user behaviors and human-likeness scores, demonstrating the feasibility of our proposed behavior-based evaluation method.


INTRODUCTION
One of the research challenges in the field of conversational robots and dialogue systems is the establishment of evaluation methods [1,3,16,21,25]. Thanks to the emergence of large language models (LLMs), recent chatbots have become capable of carrying out highly sophisticated conversations. The realization of such systems has been made possible by the creation of extensive text datasets and the implementation of evaluation methods. Since objective evaluation methods alone cannot encompass all phenomena, a series of studies, including human subjective evaluations, have been conducted. In the case of task-oriented dialogues, such as restaurant searches, the goal of the conversation is clear and objective, so it is straightforward to define objective evaluation metrics such as task achievement rate and the number of turns, and research and development efforts have been based on these objective indicators. However, this is not the case when the target conversations are more realistic and sophisticated, such as the human-human conversations that take place in our society.
Recent advancements have led to the development of socially-situated conversational robots (SCRs), which are specifically designed for social contexts. Socially-situated conversations encompass a wide range of interactions, from brief exchanges such as reception and information guidance [7,22] to more extended conversations such as counseling [4,13,18,20] and interviews [6,8,11,24]. It is crucial to invest effort in the development of SCRs to enhance their practicality, address diverse social issues, and promote harmonious symbiosis with society.
This study addresses objective evaluation methods for SCRs. Because the conversation has no clearly defined goal, traditional studies on SCRs have frequently depended on subjective evaluation of items such as "satisfaction" and "effectiveness" [2] or of the actual system utterances. However, relying solely on subjective evaluation diminishes research reproducibility and constrains the growth of the research community. In this study, as an initial step toward an objective and general evaluation method for SCRs, we introduce an evaluation method based on observable and multimodal user behaviors (Figure 1) and report our initial trial in an attentive listening dialogue task.
The contributions of this paper are twofold:
• We propose a new evaluation approach for SCRs based on observable and multimodal user behaviors.
• As a target metric, we focus on the human-likeness of the robot and create a dataset with human-likeness scores annotated based on the user behaviors.

Figure 1: Overview of the proposed evaluation scheme

PROPOSED EVALUATION METHOD
In the current study, we focus on the concept of human-likeness as a target evaluation metric that may be shared with other studies [5,12,14]. The notion of "naturalness" has been adopted as an indicator in numerous studies, and "human-likeness" represents a more concrete manifestation of this concept. While previous studies have addressed aspects such as user satisfaction [23] and miscommunication [15] in the context of automatic evaluation of conversational systems, the concept of human-likeness pursued in this study differs in that it aspires to dialogue as natural as that between humans, which necessitates more advanced conversational abilities. SCRs inherently strive for human-to-human social interaction, which makes it important to evaluate human-likeness rather than naturalness or satisfaction. Note that the key point of this study is to evaluate SCRs based on multimodal user behaviors, so the proposed evaluation framework can also be applied to other evaluation metrics.
The distinguishing feature of the proposed evaluation method is its focus on observable user behaviors. Conventional evaluation metrics have primarily emphasized system utterances, which has contributed to heightened subjectivity. By conducting evaluations based on observable user behavior, we create indicators that are as objective as possible. User behavior encompasses a wide range of multimodal aspects. For instance, in addition to speech and linguistic features such as total utterance time and word count, it includes dialogue-specific features such as backchannels, fillers, and switching pause length (turn-taking gap), as well as non-verbal features such as eye gaze, which is specific to embodied conversational robots.
Intuitively, these user behaviors differ depending on the human-likeness of the robot. This can be better understood by comparing human-robot conversations with human-human ones. For example, in human-robot conversations, if we consider the number of uttered words, the user might tend to speak clearly with a simple and limited vocabulary. Additionally, in terms of spoken dialogue-specific behaviors, empirical observations indicate that users tend to provide fewer backchannels and have longer turn-taking pauses when interacting with systems that are perceived as non-humanlike. Conversely, in human-human conversations, there tends to be a proclivity for fluent utterance of a variety of words with smooth turn-taking. Hence, we can infer that as the number of uttered words increases, the conversation comes closer to human-human dialogue, and the human-likeness score increases accordingly. In this study, we empirically explore multimodal user behaviors that relate to the human-likeness of the robot.
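As an illustration of how such behaviors could be quantified, the following minimal sketch computes two of the features named above from a list of time-stamped turns. The turn format, the function name `user_behavior_features`, and the definition of the switching pause as the gap between the end of a system turn and the start of the following user turn are assumptions made for this example, not the procedure used in this study.

```python
from statistics import mean

def user_behavior_features(turns):
    """Compute illustrative user-behavior features.

    turns: list of (speaker, start_sec, end_sec) tuples in temporal order,
    where speaker is "user" or "system" (hypothetical format).
    """
    # Total utterance time: summed duration of all user turns.
    total_time = sum(end - start for speaker, start, end in turns if speaker == "user")
    # Switching pause (assumed definition): silence between the end of a
    # system turn and the start of the user turn that directly follows it.
    pauses = [
        nxt[1] - prev[2]
        for prev, nxt in zip(turns, turns[1:])
        if prev[0] == "system" and nxt[0] == "user"
    ]
    return {
        "total_utterance_time": total_time,
        "avg_switching_pause": mean(pauses) if pauses else None,
    }

# Hypothetical turn timings in seconds.
turns = [
    ("system", 0.0, 2.0),
    ("user", 2.5, 6.5),
    ("system", 7.0, 8.0),
    ("user", 9.0, 12.0),
]
features = user_behavior_features(turns)
```

A negative pause in this sketch would correspond to overlapping speech; whether such cases should be clipped or kept is a design choice left open here.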

DATASET CONSTRUCTION
In this study, we explore the potential of the proposed evaluation framework by utilizing an attentive listening dialogue corpus. Third-party annotators subsequently annotated the corpus with human-likeness scores, using a simple approach that refers to multimodal user behaviors.

Attentive listening dialogue corpus
An attentive listening dialogue corpus was used in this study. In this dialogue, the task is to attentively listen to a user's talk, and the system needs to utter listener responses, such as backchannels and questions. Several attentive listening systems have been proposed so far [9,17,19]. We employed an existing system [9] for this data collection. The interface of the system was an android robot [10] whose appearance is similar to that of human beings. The role of the user was assigned to a university student, who was asked to speak for eight minutes on the topic of "challenges faced during the COVID-19 pandemic." These dialogues were conducted in Japanese. Note that this configuration represents only one of the potential setups, and it is desirable to explore various types of dialogues, systems, and interfaces in future investigations.
In order to vary the human-likeness of the robot, two scenarios were prepared for this data collection. The first scenario entails interacting with the aforementioned pre-existing autonomous system. The second scenario involves an operator in a separate room engaging in direct conversation on behalf of the system, the so-called Wizard-of-Oz (WOZ) setting. In this case, the operator's spoken voice was played back directly through the android's speaker, and nonverbal expressions, such as the android's gaze and gestures, were controlled by the operator using a handheld controller. There were two operators in total, with one of them participating in each dialogue. With these two configurations, 20 dialogues were recorded using the autonomous system, and 49 dialogues were recorded with operator involvement. Thus, a total of 69 university students acted as users. After each dialogue concluded, the participants were asked to answer a 19-item questionnaire created in a previous study [9].

Annotation of human-likeness scores
Using the aforementioned dialogue data, we annotated labels to assess the human-likeness of the system. First, we extracted segments of dialogues and removed the system's visual and auditory components, leaving only the user's video and audio. Third-party annotators were then assigned a binary classification task: to determine whether the dialogue partner (the system) was a human or an autonomous system. Figure 2 illustrates examples of the visual stimuli presented to the annotators. In other words, the annotators indirectly inferred whether the dialogue partner was a human or a system by focusing solely on the user's behaviors, which are the main focal point of this study. By gathering judgments from multiple annotators, we calculated the ratio of "human" judgments as a measure of "human-likeness." The dialogue segments were extracted with a duration of one minute, resulting in a total of 924 samples from the aforementioned 69 dialogues. Each sample was assessed by five independent annotators. In total, there were 78 annotators, each of whom was randomly allocated 50 to 70 samples. Table 1 presents the results of the annotation. The table reports the number of samples for each human-likeness value, aggregated over the judgments of the aforementioned annotators. First, looking at all samples (column "Total"), it is evident that the human-likeness values vary. Next, comparing the two system types, the autonomous system tends to receive lower human-likeness values, whereas WOZ tends to receive higher ones. In other words, the annotated scores reasonably reflect the fact that, at present, the autonomous system is inferior to WOZ.
In this study, to conduct the evaluation at the dialogue level, we calculated the average human-likeness score for each dialogue. The distribution of the averaged scores is illustrated in Figure 3. Even after averaging, it is evident that the scores vary. In the subsequent analysis, which is described in the following section, we use these averaged scores as the target variables.
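The score definition above (the ratio of "human" judgments among the annotators of a sample) can be sketched as follows. The sample names and judgments are hypothetical and serve only to illustrate the computation; with five annotators per sample, the resulting scores fall on a grid of 0.2 increments.

```python
from collections import Counter

def human_likeness(judgments):
    """Return the fraction of annotators who judged the partner to be human."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(1 for j in judgments if j == "human") / len(judgments)

# Hypothetical judgments for three one-minute samples, five annotators each.
samples = {
    "sample_a": ["human", "human", "system", "human", "human"],
    "sample_b": ["system", "system", "system", "human", "system"],
    "sample_c": ["human", "human", "human", "human", "human"],
}

# Per-sample human-likeness scores.
scores = {name: human_likeness(votes) for name, votes in samples.items()}

# Number of samples per score value, analogous to one column of a count table.
histogram = Counter(scores.values())
```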
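The per-dialogue averaging can be sketched as below. The dialogue identifiers and sample scores are hypothetical, and the function name `average_per_dialogue` is introduced only for this illustration.

```python
from collections import defaultdict
from statistics import mean

def average_per_dialogue(sample_scores):
    """Average sample-level scores within each dialogue.

    sample_scores: iterable of (dialogue_id, human_likeness_score) pairs.
    """
    grouped = defaultdict(list)
    for dialogue_id, score in sample_scores:
        grouped[dialogue_id].append(score)
    return {dialogue_id: mean(values) for dialogue_id, values in grouped.items()}

# Hypothetical sample-level scores for two dialogues.
sample_scores = [
    ("d01", 0.8), ("d01", 0.6), ("d01", 1.0),
    ("d02", 0.2), ("d02", 0.4),
]
dialogue_scores = average_per_dialogue(sample_scores)
```

Averaging several samples means that the dialogue-level scores are no longer restricted to the 0.2 grid of the sample-level scores.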

ANALYSIS
We then examined the feasibility of the proposed evaluation method by investigating the relationship between the annotated human-likeness scores and the multimodal user behaviors.

Evaluation of the human-likeness scores from multimodal user behaviors
The aim of this study is to estimate the human-likeness scores of SCRs from multimodal user behaviors. Here, we examined the correlation between the multimodal user behaviors listed in Table 2 and the human-likeness scores. These behaviors can be categorized into four groups: voice activity, linguistic, gaze, and dialogue. They are based on manually annotated data, although some, such as voice activity, can be extracted automatically. Content words in the linguistic category are defined as nouns, verbs, adjectives, adverbs, and conjunctions. The numerical value of each behavior was calculated by averaging its values across the multiple dialogue segments used in the previous section. By investigating Spearman's rank correlation coefficients, we observed weak correlations for several behaviors. Figure 4 illustrates the correlations for the top-4 user behaviors. Total utterance time and the number of unique uttered words showed comparatively high correlation coefficients. Likewise, the number of gaze shifts and the average switching pause length showed comparatively strong correlations.
Therefore, we explored the extent to which the aforementioned behaviors can estimate the human-likeness scores. We conducted leave-one-out cross-validation using support vector regression. The target variable was the human-likeness score, and the explanatory variables were the numerical values of the user behaviors listed in Table 2. The evaluation metric was the mean absolute error (MAE). The resulting average MAE was 0.146. Given that the annotated scores in the current dataset take values in increments of 0.2, this demonstrates that it is feasible to estimate the human-likeness score with an error of no more than one increment.
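The two analysis steps can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the analysis code of this study: the data values are hypothetical, and the regressor is a training-mean baseline used as a stand-in for support vector regression so that the example stays self-contained. Any model with the same `fit`/`predict` shape could be substituted.

```python
from statistics import mean

def average_ranks(values):
    """Rank values starting from 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def loo_mae(features, targets, fit, predict):
    """Leave-one-out cross-validation returning the mean absolute error."""
    errors = []
    for i in range(len(targets)):
        train_x = features[:i] + features[i + 1:]
        train_y = targets[:i] + targets[i + 1:]
        model = fit(train_x, train_y)  # train without the held-out dialogue
        errors.append(abs(predict(model, features[i]) - targets[i]))
    return mean(errors)

# Stand-in regressor (assumption): always predicts the training-set mean.
def fit_mean(train_x, train_y):
    return mean(train_y)

def predict_mean(model, x):
    return model

# Hypothetical per-dialogue data: total utterance time [sec] and score.
utterance_time = [12.0, 30.5, 25.0, 41.2, 18.3, 36.0]
likeness = [0.2, 0.6, 0.4, 0.8, 0.4, 0.6]

rho = spearman(utterance_time, likeness)
mae = loo_mae([[t] for t in utterance_time], likeness, fit_mean, predict_mean)
```

The mean baseline ignores the features entirely, so its MAE is best read as a reference point against which a feature-based regressor can be compared.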

Relationship with subjective evaluation
To verify the generalizability and practicality of the human-likeness scores used in this study, we also investigated their relationship with the conventional subjective evaluation scores obtained in the attentive listening dialogue. Note that the subjective evaluation items were created in the previous study [9]. Among the 19 evaluation items, five exhibited weak correlations, as listed in Table 3. For example, the correlated items included "The robot understood the talk" (correlation coefficient of 0.39) and "I was satisfied with the dialogue" (correlation coefficient of 0.21), which are important factors in the attentive listening task. These results show that the human-likeness scores correlate to a certain degree with some subjective evaluations, supporting their generalizability and practicality.

CONCLUSION
In this paper, we proposed a method to evaluate socially-situated conversational robots based on observable and multimodal user behaviors. We utilized attentive listening dialogue data for annotation of the human-likeness of the robot, revealing a correlation between the user behaviors and the human-likeness scores. Additionally, we demonstrated the ability to predict the scores with an average MAE of 0.146. Future work will involve examining the proposed evaluation method in a wider range of social situations to confirm its generalizability. For example, we are now extending this work to job interviews and first-time meeting scenarios where the role of the robots is different from the one in the attentive listening scenario.

Figure 2: Sample video clip viewed by annotators

Figure 3: Distribution of human-likeness scores averaged per dialogue

Figure 4: Relationship between human-likeness scores and top-4 user behaviors

Table 3: Correlation coefficients between human-likeness labels and subjective evaluation scores