A user study on the relationship between empathy and facial-based emotion simulation in Virtual Reality

In the contemporary metaverse landscape, comprehending the intricacies of human interaction is imperative for enhancing communication within Virtual Reality (VR) experiences. At the core of meaningful social relationships lie empathy and trust, pivotal elements nurtured by the capacity to comprehend both one's own and others' thoughts and intentions. Conventional face-to-face interactions rely heavily on non-verbal cues, such as body language and facial expressions, to convey messages and display empathy. To investigate the relationship between emotion simulation in VR and empathic skills, in this paper we conducted a user study involving 37 participants, who were asked to simulate facial expressions in a virtual environment. To capture their facial behavior, we employed the Meta Quest Pro, a virtual reality headset equipped with accurate built-in sensors that capture 63 micro-expressions, according to the Facial Action Coding System (FACS) [11]. Furthermore, the Interpersonal Reactivity Index (IRI) [9] questionnaire was used to assess the participants' empathic abilities. The results of this study underscore a statistically significant correlation between participants' empathic skills and their capacity to simulate emotions through facial expressions within VR scenarios. Additionally, this research offers valuable insights into the prevalence of human micro-expressions during the simulation of seven distinct emotions. These findings lay the foundation for potential applications in the field of mental health and emotional well-being within the context of the Metaverse.


INTRODUCTION
The Metaverse refers to a world beyond reality where users can engage in a continuous and shared virtual experience that combines elements of the physical world with digital virtuality. The concept is rooted in the integration of technologies that facilitate multi-sensory interactions with virtual environments, digital items, and individuals, such as Virtual Reality (VR) and Augmented Reality (AR). Therefore, the Metaverse can be described as a networked system of immersive, interconnected worlds that exist on persistent platforms where multiple users can interact. It allows for smooth and integrated user communication in real time and dynamic interactions with digital objects. The initial version consisted of a network of virtual environments, allowing avatars to seamlessly travel between them [21]. The current version of the Metaverse includes virtual reality platforms that are both social and immersive. These platforms are designed to be compatible with large-scale multiplayer online video games, open game worlds, and augmented reality collaboration spaces. Users have the ability to generate customized avatars, thus augmenting their social visibility and individuality inside the Metaverse. This integration of social media enables instantaneous communication over diverse channels such as text, voice, and video, allowing for smooth interactions between users [1]. In this interconnected environment, understanding human interaction dynamics becomes crucial to fully exploit the communicative potential offered by the Metaverse. Human social interactions are deeply rooted in mentalization, the capacity to comprehend both one's own and others' thoughts and intentions, fostering empathy and trust [6, 13]. In traditional face-to-face interactions, non-verbal cues such as body language and facial expressions play a substantial role in conveying information and developing empathy [32]. There are numerous definitions of the term empathy in the literature that focus on different aspects of the construct. One approach that attempts to reconcile multiple components of empathy is the definition proposed by Cuff et al. [8], which characterizes empathy as an affective response shaped by both intrinsic qualities and contextual elements. Although this reaction occurs reflexively, it is also capable of being consciously regulated. The emotion that results from an empathic reaction to a stimulus reflects what is perceived or understood through direct experience or imagination. This definition integrates cognitive empathy, which involves the ability to understand the emotions of others, with the awareness that these emotions are triggered by an external factor. Although there is no universally accepted definition of empathy, several studies suggest that imitation may play a role in its development. Indeed, the act of replicating the facial expressions of others has a dual effect: it affects not only our affective responses, but also our ability to develop empathy toward the people we are observing [27]. Mimicry has an impact on emotional empathy, as it allows us to discern another individual's emotions through facial feedback. In fact, facial muscles provide feedback to the brain, resulting in the corresponding emotional experience [14, 29]. Stel et al.
[28] used experimental materials, such as movies, to investigate the phenomenon of imitation in real-life situations. Their study demonstrated a causal relationship between imitation and empathy. The results show that imitating emotional expressions facilitates the assimilation of an emotional experience that aligns with the imitated person and increases the ability to empathize with his or her perspective. Therefore, imitation facilitates the development of both cognitive (perspective taking) and affective (emotional contagion) empathic reactions. Given the correlation identified in the existing literature between the proficiency in replicating an individual's emotional expressions and the capacity to adopt their emotional perspective, our inquiry seeks to extend this paradigm to the domain of virtual interactions. Specifically, we investigate whether the skill of mimicking an avatar's emotional expressions, within the context of user-avatar interactions, aligns with an individual's empathic abilities. This exploration is framed in the context of established correlations observed in human-to-human interactions, aiming to discern whether similar dynamics manifest within virtual interaction. For these reasons, the main objectives of our investigation are twofold: (1) RQ1: verify the potential correlation between proficiency in mimicking the emotional expressions of an avatar and an individual's empathic abilities; (2) RQ2: analyze the most recurrent facial micro-expressions while simulating emotional expressions, focusing on the basic emotions identified by Ekman et al.'s model [12].
The first objective, related to RQ1, will allow us to determine whether the relationship between imitation and empathy previously observed in human interactions can be extended to virtual interactions where the user engages with an avatar. The second objective, related to RQ2, is to collect data on the most frequent micro-expressions while imitating basic emotional expressions. These data will contribute to creating a dataset on the micro-expressions involved in simulating basic emotions, providing preliminary and complementary groundwork for future studies. Moreover, such data offer the possibility of enhancing research on non-spontaneous and spontaneous facial expression recognition with a kind of dataset that is novel with respect to the existing ones.
To address our research objectives, in the present study we developed a virtual environment designed for imitating emotional expressions. Additionally, we utilized a Meta Quest Pro headset, enabling us to capture participants' micro-expressions, along with the widely used Interpersonal Reactivity Index (IRI) [9] questionnaire for assessing empathy. Subsequent sections of this paper provide an overview of the current literature and delve into the procedures adopted in the present study and the corresponding results.
The paper is organized as follows: Section 2 reports the literature relevant to our work; in Section 3, the user study is described in detail; in Sections 4 and 5, respectively, the results are presented and discussed; finally, in Section 6 we conclude with an overview of the work and indications for future research directions.

RELATED WORK
The findings from [7] offer a first validation of the efficacy of VR in fostering empathy in individuals and diminishing their biases towards marginalized communities. Besides, the work by Louie et al. [18] claims that VR could represent a powerful tool for enhancing empathy in various scenarios, as through role-play in virtual worlds humans may gain a comprehensive grasp of another individual's position and actively engage in perspective-taking.
Martingano et al. [19] claim that VR has the potential to greatly enhance affective empathy, which involves experiencing and understanding another person's feelings. However, their findings highlight that it is more difficult to elicit cognitive empathy in VR contexts.
Shifting focus to non-spontaneous micro-expressions, an early dataset that specifically examined them was compiled by Polikovsky et al. [23]. Ten university students participated in the research while their facial expressions were captured at a rate of 200 frames per second (fps) and a resolution of 640×480 in a controlled laboratory setting. Although the demographic was moderately diverse, it was also relatively small: one Indian, five Asian, and four Caucasian participants.
To reduce the amount of shadow cast by the illumination, lights were deliberately placed above, to the left, and to the right of the participants. Besides, to enhance facial acquisition performance, the camera was rotated by 90 degrees, increasing the number of pixels available for the face.
The participants in this dataset intentionally exhibited micro-expressions while being instructed to perform the seven fundamental emotions using minimal muscle tension. This dataset is inaccessible to the public. USF-HD [25], which is comparable to the preceding dataset, consists of one hundred posed micro-expressions captured at 29.7 fps. The participants were given the freedom to reproduce a range of micro-expressions in any sequence they desired. When recording at nearly 30 frames per second, vital information regarding motion may be lost. In addition, micro-expressions in this dataset are defined as lasting no more than 660 milliseconds, which contradicts previously accepted definitions. This dataset is not accessible for public research purposes. In the course of conducting the York Deception Detection Test (YorkDDT) psychological investigation, Warren et al. [31] captured twenty video segments with a resolution of 320×240 and a frame rate of 25 fps. The participants described emotionally and non-emotionally classified film segments in a truthful or deceptive manner, while micro-expressions were observed. However, once more, the data is not available for public access.
For their investigation into micro-expression recognition, Pfister et al. [22] acquired authentic scenario recordings in which nine subjects (six females and three males) exhibited micro-expressions. For analysis, a grand total of eighteen micro-expressions were obtained: eleven from the non-emotional version and seven from the emotional scenario. In addition to the limited number of micro-expressions present in this dataset, the data is derived from a secondary source that does not provide comprehensive information regarding Action Units (AUs) or participant demographics.
The work in [2] introduces VT-KFER, the first publicly available dataset for Kinect-based facial expression recognition. It contains sequences of both scripted (non-spontaneous) and unscripted (spontaneous) expressions of the six fundamental emotions (along with the neutral expression) for each of the 32 participants who joined the study. VT-KFER comprises 1,956 sequences containing depth maps and RGB images representing the six facial expressions in three poses (frontal, right, and left). With the neutral expression included, the average number of frames per sequence is six, with a maximum of 61 frames and a minimum of 2 frames. The non-spontaneous expressions are recorded at three distinct intensities, corresponding to the degree of muscle contraction in comparison to a neutral or relaxed facial expression. In total, 12,317 frames are scripted to represent the six expressions (plus neutral), encompassing all three poses.

USER STUDY
After reviewing the related work on empathic capabilities and micro-expression simulation in VR, we conducted a user study, as shown in Figure 1, with the aim of answering our two RQs.

Participants
The study comprised 37 participants: 22 males, 14 females, and 1 participant who chose not to specify their gender. Regarding age distribution, 28 participants fell within the 18-24 age range, while the remaining 9 belonged to the 24-30 age range. All participants were of Italian nationality, except for one participant who held Romanian nationality. In terms of prior experience with virtual reality, only 13 participants had previously used a virtual reality headset, while the majority (24) had no prior exposure to the metaverse. It is noteworthy that all participants possessed normal or corrected-to-normal vision.

Table 1: The IRI scale (excerpt; the full questionnaire comprises 28 items).

Category | Item | Statement
Fantasy (FS) | 1 | I daydream and fantasize, with some regularity, about things that might happen to me.
Fantasy (FS) | 5 | I really get involved with the feelings of the characters in a novel.
Perspective Taking (PT) | 21 | I believe that there are two sides to every question and try to look at them both.
Perspective Taking (PT) | 25 | When I'm upset at someone, I usually try to "put myself in his shoes" for a while.
Perspective Taking (PT) | 28 | Before criticizing somebody, I try to imagine how I would feel if I were in their place.
Personal Distress (PD) | 6 | In emergency situations, I feel apprehensive and ill-at-ease.
Personal Distress (PD) | 10 | I sometimes feel helpless when I am in the middle of a very emotional situation.
Personal Distress (PD) | 13 | When I see someone get hurt, I tend to remain calm.
Personal Distress (PD) | 17 | Being in a tense emotional situation scares me.
Personal Distress (PD) | 19 | I am usually pretty effective in dealing with emergencies.
Personal Distress (PD) | 24 | I tend to lose control during emergencies.
Personal Distress (PD) | 27 | When I see someone who badly needs help in an emergency, I go to pieces.
Participants were informed about the objectives of the study and provided informed consent. Moreover, it was explicitly communicated to each participant that they had the option to discontinue their participation in the experiment at any given moment.

3.2.1 The IRI Questionnaire.
The Interpersonal Reactivity Index (IRI) [9] is a questionnaire that measures affective and cognitive components simultaneously to determine an individual's level of empathic responsiveness. Two of these components deal with the person's emotional response, which might be focused on comprehending one's own feelings of worry and anxiety in interpersonal settings (personal distress) or on sharing the experiences of others (empathic concern). The other two are perspective taking, the capacity to take on another person's point of view, and fantasy, the propensity to see oneself in made-up scenarios. The IRI comprises 28 items, as shown in Table 1, arranged into four subscales:
• Fantasy (FS): examines the inclination to identify with fictional characters from literature, film, or theater;
• Perspective Taking (PT): measures the capacity to take on the perspective of others;
• Empathic Concern (EC): measures the propensity to feel warmth, compassion, and empathy toward others going through difficult circumstances;
• Personal Distress (PD): describes situations in which people feel uncomfortable and anxious when they witness unpleasant things happening to others.
The questions are assertions that the respondent must rate on a 5-point Likert scale (1 being "never true" to 5 being "always true").
Consequently, the maximum score for each subscale is 35 points.

3.2.2 Headset Acquisition.
We developed a custom virtual environment in Unity, where participants interacted with a button panel displaying a range of emotions, as shown in Figure 2. In this virtual environment, participants were asked to imitate facial expressions corresponding to seven distinct emotions, detailed in Table 2. These emotions were selected based on the model of Ekman et al. [12]. We paid particular attention to the acquisition of micro-expressions through the sensors present on the Meta Quest Pro. The extracted values vary between 0 and 1 and reflect the intensity or amplitude of the variations in facial expressions detected by the Meta Quest Pro sensors. To obtain reliable and meaningful data, it was necessary to establish an appropriate threshold for the acquisition of the various Action Units (AUs) representing the activation of the muscles involved in micro-expressions.
We conducted several attempts by adjusting the AU detection threshold, initially setting it to T=0.7. However, during the initial stages of the experiment, we found that this threshold was too high and caused many meaningful micro-expressions to be missed. Subsequently, we reduced the threshold to T=0.5 to improve the sensitivity of the acquisition system. Despite this, we continued to observe a loss of relevant data, as some micro-expressions were not optimally captured.
After careful evaluation and numerous trials, we determined that the optimal threshold for acquiring micro-expression AUs should be set at T=0.30. This value proved to be the ideal compromise between detection sensitivity and the minimization of false positives and noise in the dataset. The choice of the threshold T=0.30 was supported by an in-depth evaluation of the collected data, as shown in Section 4.1. Furthermore, considering that the typical duration of a human emotional expression is between 0.5 and 4 seconds [10, 15], we decided to record the result 0.2 milliseconds after this threshold was exceeded. This choice was motivated by the desire to record not the onset but the maximum intensity of the expression itself.
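To make the acquisition logic concrete, the following minimal Python sketch illustrates the threshold-and-delay scheme described above; the sensor-reading function, the AU names, and the sampling loop are hypothetical stand-ins, since the actual acquisition ran inside our Unity application via the headset's face-tracking interface.

```python
import random
import time

THRESHOLD = 0.30  # AU activation threshold T selected after our trials
DELAY_S = 0.0002  # delay after crossing T before sampling (0.2 ms, as stated above)

AU_NAMES = ["CheekRaiserL", "CheekRaiserR", "UpperLipRaiserL"]  # illustrative subset of the 63 AUs

def read_au_intensities():
    # Hypothetical stand-in for the headset's face-tracking API:
    # returns one intensity value in [0, 1] per Action Unit.
    return {name: random.random() for name in AU_NAMES}

def capture_expression():
    """Wait until any AU exceeds T, then snapshot all AU intensities
    shortly afterwards, recording the expression near its peak rather than its onset."""
    while True:
        sample = read_au_intensities()
        if any(value > THRESHOLD for value in sample.values()):
            time.sleep(DELAY_S)           # let the expression approach its apex
            return read_au_intensities()  # values stored for the dataset

if __name__ == "__main__":
    print(capture_expression())
```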

Data pre-processing
3.3.1 Questionnaire.
Once participants had completed the IRI questionnaire, which assessed their empathic abilities, we collected the scores for the various dimensions of empathy, including cognitive and affective components. Before moving on to the analyses, the scores of items 3, 4, 7, 12, 13, 14, 15, 18, and 19 were reversed, because these items are worded negatively in relation to the overall meaning of the subscale for which they were intended. Specifically, to ensure consistency in the interpretation of the data, these scores were inverted so as to be in line with the other items and therefore correctly reflect the measurement of empathic abilities in the specific subscale considered.
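As a concrete illustration of this scoring step, the sketch below shows how the reversal and subscale aggregation could be implemented; it is a minimal Python example (the actual scoring was done in RStudio), with the item-to-subscale grouping following the standard IRI assignment.

```python
# Items worded negatively with respect to their subscale: reversed before scoring.
REVERSED = {3, 4, 7, 12, 13, 14, 15, 18, 19}

# Standard IRI item-to-subscale assignment; each subscale has 7 items,
# so on the 1-5 Likert scale used here scores range from 7 to 35.
SUBSCALES = {
    "FS": [1, 5, 7, 12, 16, 23, 26],
    "PT": [3, 8, 11, 15, 21, 25, 28],
    "EC": [2, 4, 9, 14, 18, 20, 22],
    "PD": [6, 10, 13, 17, 19, 24, 27],
}

def score_iri(answers: dict) -> dict:
    """answers maps item number (1..28) to a Likert rating in 1..5."""
    def value(item):
        raw = answers[item]
        return 6 - raw if item in REVERSED else raw  # flips 1<->5, 2<->4
    return {name: sum(value(i) for i in items) for name, items in SUBSCALES.items()}
```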

3.3.2 Headset Acquisition.
The data collected during the experiment consists of 37 observations, to which a further 7 cases were added. The latter participants performed the experiment exclusively with the headset, without completing the associated questionnaire. The entire dataset underwent pre-processing to facilitate subsequent analysis. This process involved three main steps: first, the data was converted from JSON to CSV format, which allowed for a more structured and accessible representation; next, a cleaning operation was performed to remove duplicate and inconsistent data; finally, a sorting operation was performed to organize the data by emotion.
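The three steps can be sketched in a few lines of Python with pandas; the file and column names here are hypothetical, as the actual scripts are not reproduced in the paper.

```python
import json
import pandas as pd

# Step 1: convert the raw JSON recordings into a flat, CSV-friendly table.
with open("recordings.json") as f:   # hypothetical file name
    records = json.load(f)           # list of per-capture dictionaries
df = pd.json_normalize(records)

# Step 2: clean, removing exact duplicates and inconsistent (missing) entries.
df = df.drop_duplicates().dropna()

# Step 3: sort the observations by emotion for the per-emotion analyses.
df = df.sort_values("emotion")       # hypothetical column name

df.to_csv("recordings_clean.csv", index=False)
```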

Data Analysis
The collected data, including facial micro-expressions and questionnaire responses, were subjected to statistical analysis. Correlations between participants' empathic abilities and their ability to imitate emotions were examined.
To answer RQ1, we used the Pearson product-moment correlation coefficient (r), a statistical measure assessing the strength and direction of a linear relationship between two continuous variables. Widely used in various fields, it helps analyze associations between variables and is often applied in hypothesis testing. The null hypothesis (H0) states that there is no significant correlation (r = 0), while the alternative hypothesis (H1) implies a significant correlation (r ≠ 0).
A positive r value signifies a positive correlation, while a negative r value indicates a negative correlation. The test was conducted to assess the correlation between participants' ability to mimic emotions and their IRI scores.
Before correlating with the IRI scale results, the scores of each subscale were normalized within the [0, 1] continuous interval. Both the data processing and correlation tests were conducted using RStudio. Results, including correlation coefficients and p-values, are shown in the next section.
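Although the analysis itself was carried out in RStudio, the test can be sketched equivalently in Python with SciPy; the arrays and the min-max normalization below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-participant values: mimicry performance for one emotion
# and raw scores on one IRI subscale (range 7-35 on the 1-5 Likert scale).
mimicry = np.array([0.8, 1.0, 0.6, 0.9, 0.7])
fs_raw = np.array([22, 28, 18, 26, 21])

def normalize(x):
    """One plausible way to map subscale scores onto [0, 1] (min-max scaling)."""
    return (x - x.min()) / (x.max() - x.min())

r, p = pearsonr(mimicry, normalize(fs_raw))
print(f"r = {r:.3f}, p = {p:.3f}")  # reject H0 (r = 0) when p < alpha = 0.05
```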
In line with the objectives of RQ2, descriptive analyses were carried out for the evaluation of micro-expressions. The results of these analyses are shown in the next section.
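For instance, the per-emotion AU means later reported in Table 4 can be obtained with a single aggregation over the cleaned data; the file and column names are again hypothetical.

```python
import pandas as pd

df = pd.read_csv("recordings_clean.csv")                  # output of the pre-processing step
au_means = df.groupby("emotion").mean(numeric_only=True)  # mean intensity of each AU per emotion
print(au_means > 0.30)  # True where the mean exceeds T, i.e., the bold values in Table 4
```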

RESULTS
In this section, we present the outcomes of the data acquisition process and elucidate the conducted analysis.The initial subsection focuses on the data derived from the headset, offering a descriptive analysis of the most pertinent micro-expressions observed during the emulation of the seven investigated emotions.The subsequent subsection furnishes descriptive statistics pertaining to the outcomes of the IRI survey, concurrently exploring the correlation between participants' proficiency in mimicking facial expressions and their corresponding IRI scores.

Micro-expressions prevalence
Table 4 presents the mean values of facial micro-expression intensity (computed over the entire dataset) for the seven investigated emotions. Notably, the bold values in the table represent intensities surpassing the predefined threshold T, indicating a significant expression intensity during the imitation of facial expressions. Specifically, when mimicking disgust, the intensity of chin raising and upper lip raising was observed to be prominent. Regarding contempt, we observed a noteworthy increase in the intensity of the left lid tightener. The significance of this result may be attributed to the role of the lid tightener in conveying subtle cues of disdain, contributing to the overall perception of contemptuous facial expressions.
Moving on to happiness, our investigation revealed heightened intensity in the dimpler muscles on both the left and right sides of the face during the imitation of this emotion. This aligns with the association between the activation of the dimpler muscles and the expression of genuine happiness. Concerning anger, our results indicate increased intensity in the cheek raiser muscles on both sides of the face, with a slightly higher value detected for the left cheek. Finally, for sadness, our analysis revealed elevated intensity in various micro-expressions: chin raising at both the top and bottom, along with the closure of the eyes and increased intensity in the right lid tightener, were particularly notable. These findings align with the multifaceted nature of sadness expressions, involving changes in both the upper and lower facial regions.
Due to lack of space, the information regarding all 63 AUs could not be included in this paper. However, the complete dataset can be viewed at [3].

Correlation analysis
The main descriptive statistics of the IRI questionnaire are presented in Table 3. The overall average total score attained by all participants was 89.68 ± 9.67 points out of a maximum of 140. Notably, the EC subscale exhibited the highest mean score (24.86 ± 3.84), while the lowest mean score was observed for the PD subscale (17.43 ± 4.19). The Pearson product-moment correlation test was conducted to examine the relationship between participants' ability to mimic emotional states and their IRI scores. The number of total failures in mimicking the seven chosen emotions is illustrated in Figure 3, revealing that contempt posed the greatest challenge, resulting in 24 failures. Conversely, all participants successfully mimicked happiness, indicating no failures for this emotional state. The correlation test aimed to evaluate the presence of a statistically significant correlation between the proficiency in imitating a specific emotion and the empathic skills of the participants.
The resulting correlation coefficients and associated p-values are presented in Tables 6 and 5, respectively, providing insights into the strength and significance of the observed relationships. The significance level α was set to 0.05. The findings presented in Table 5 highlight significant correlations, as indicated by the bold values. Specifically, significant correlations were observed for the FS subscale in relation to the ability to mimic sadness (correlation coefficient of 0.356) and surprise (correlation coefficient of 0.378). Additionally, significant correlations were identified for the PD subscale in connection with the ability to imitate surprise (correlation coefficient of -0.361) and for the total score of the IRI scale in association with the ability to mimic sadness (correlation coefficient of 0.368). It is noteworthy that the p-values associated with the correlations of the EC subscale with sadness, as well as those of the PD subscale with disgust and anger, slightly exceeded the threshold of α = 0.05.

DISCUSSION
In this section, we discuss the outcomes of the performed analysis. The first subsection answers RQ1, verifying the potential correlation between proficiency in mimicking the emotional expressions of an avatar and an individual's empathic abilities. Subsequently, the second subsection delves into the typology of the acquired micro-expression dataset.

5.0.1 RQ1.
To answer RQ1, we correlated the number of errors made by participants in mimicking the emotional expressions of an avatar with their scores on the IRI. The results show that the emotional expression simulated correctly by all participants was happiness, while the most difficult to simulate turned out to be disgust. This result can be explained by the fact that a smile is most easily imitated in social interactions, as it has a positive effect on others' perception of our personality [20]. In contrast, the emotion of disgust turns out to be the least successfully imitated emotional expression, probably because it influences a wide range of social judgments, including moral condemnation and prejudice [16]. Data in the literature support that high imaginative abilities, related to fantasy, are associated with better Theory of Mind (ToM) [30] skills and thus a greater ability to take on the perspective of others. In fact, empathy and ToM are two separate constructs, but both are related to social understanding [5]. Presumably, participants with higher scores on the FS subscale are able to imitate less common and more complex emotions than happiness precisely because of a better ability to take on the perspective of another, even when interacting with an avatar rather than a human subject. Regarding the PD subscale, results showed that lower levels of personal discomfort were associated with a better ability to mimic the emotional expression of surprise. Personal distress is defined as the tendency to feel pain when exposed to the suffering of others [9]. Batson argued as early as 1991 that this construct, unlike the other spheres of empathy, contains an aversive and self-focused element [4]. Indeed, subsequent studies have shown that personal distress represents the negative side of emotional empathy and could block empathic interaction instead of enhancing it [17]. So, just as in human-to-human interactions, lower levels of personal discomfort in interaction with an avatar may also promote perspective-taking by enhancing users' ability to imitate emotional expressions more complex than happiness, such as surprise. Regarding the overall IRI score obtained by participants, results showed that higher levels of empathy correlated with a better ability to mimic the emotional expression of sadness. Data in the literature show a correlation between affective empathy and increased vulnerability to depression [33]. This means that people with high levels of empathy recognize and experience the emotion of sadness more frequently than people with lower levels of empathy. Greater experience of and sensitivity to stimuli that evoke sadness may explain the ability of empathic people to more readily mimic the emotional expression of sadness than other emotions, even in interaction with an avatar.
5.0.2 RQ2. We depicted the most relevant micro-expressions in mimicking the seven chosen emotions, as shown in Table 4. The bilaterally equal intensity of the dimpler muscle movement in happiness imitation and of the cheek raisers in anger imitation further underscores the symmetric nature of acted facial expressions compared to spontaneous ones, in accordance with the findings in [26]. However, a slightly greater intensity was detected for left cheek raising in anger imitation and for upper lip raising in happiness imitation compared with the corresponding right-sided AUs. This result may be due to the fact that, in general, emotions are expressed more intensely on the left side of the face, as stated by Sackeim et al. [24]. It is noteworthy that the usage of the Meta Quest Pro headset in data collection introduced a novel approach and enhanced the analysis of facial expressions, providing a more accurate depiction of AU intensity, whereas traditional approaches often rely on manual annotations or the application of computer vision techniques to static images and videos.

CONCLUSIONS
The results of our analysis lead us to hypothesize that empathy extends its influence beyond human-human interactions, impacting human-avatar interactions as well. More precisely, the empathic abilities of users play a role in their capacity to adopt an avatar's perspective and replicate its emotional expressions, mirroring the dynamics observed in human-human interactions. These encouraging findings pave the way for forthcoming studies in which various facets of empathy within user-avatar interactions can be investigated.
The adoption of the Meta Quest Pro headset enabled a high level of data granularity, which is particularly beneficial for researchers aiming to investigate the intricacies of non-spontaneous emotions, as it enhances the accuracy and reliability of the data. Besides, it offers a standardized and controlled experimental environment, reducing variability in data collection and ensuring consistency across participants and sessions.
The dataset, therefore, serves as a valuable resource for developing and refining emotion recognition systems, contributing to the improvement of their performance in real-world scenarios.
Future work is directed towards investigating potential correlations between users' empathy and trust in the context of interactions with VR avatars. Additionally, we plan to conduct a supplementary study involving the collection of data pertaining to emotion elicitation in VR. This approach aims to foster a more comprehensive understanding of the distinctions between spontaneously occurring and deliberately acted facial expressions.

Figure 1 :
Figure 1: Workflow of user study

Figure 2 :
Figure 2: The virtual environment with the button panel displaying the range of emotions

Figure 3 :
Figure 3: Histogram representing the total number of failures in mimicking emotions during the experiment.

Table 2 :
Avatar shown to participants representing one of the seven basic emotions, with the description of the related facial Action Units.

Table 3 :
Descriptive statistics for the IRI results (all of the participants).

Table 4 :
AU mean values across the collected data: values above the threshold T are highlighted in bold; *micro-expressions that were selected in advance to assess whether the emotions had been correctly mimicked or not.

Table 5 :
P-values of the correlation between the ability to mimic emotions and the empathy scores of the participants. The significant p-values (p < 0.05) are depicted in bold.

Table 6:
Pearson's product-moment correlation coefficients between the ability to mimic emotions and the empathy scores of the participants. For each emotional state, the highest correlation coefficient is highlighted in bold.