Investigating Effect of Altered Auditory Feedback on Self-Representation, Subjective Operator Experience, and Task Performance in Teleoperation of a Social Robot

Teleoperating social robots requires operators to "speak as the robot," since local users favor robots whose appearance and voice match. This study focuses on real-time altered auditory feedback (AAF), a method that transforms the acoustic traits of one's speech and feeds the result back to the speaker, as a way to shift the operator's self-representation toward "becoming the robot." To explore whether AAF with voice transformation (VT) matched to the robot's appearance can influence the operator's self-representation and ease the task, we compared three conditions: no VT (No-VT), VT only (VT-only), and VT with AAF (VT-AAF), in which participants teleoperated a robot to verbally serve real passersby at a bakery. The questionnaire results demonstrate that VT-AAF shifted the participants' self-representation toward the robot's character and improved their subjective teleoperating experience, while task performance and implicit measures of self-representation were not significantly affected. Notably, 87% of the participants preferred VT-AAF the most.


INTRODUCTION
As robots gain acceptance and integration into public spaces to provide various services (e.g., [26,33,72]), the demand for their social interaction capabilities, particularly verbal communication, increases. Such robots, characterized by their ability to engage in social interactions with humans, are called social robots [27,60]. However, despite technological advancements, fully autonomous social robots that can offer natural and compelling communication as humans do, such as exhibiting flexible use of domain knowledge, are yet to be realized practically [33,35]. Given these challenges, teleoperation, where a human operator remotely controls robots, is considered a promising solution. Although by definition teleoperation can encompass various types of information transfer (e.g., manipulation [68]), in the context of service-oriented social robots, the operator is primarily expected to interact with customers through voice-based communication (e.g., [6,11,16,18,33,35,67]), which is the type of teleoperation referred to in this paper. Previous empirical studies have shown that teleoperated social robots can successfully offer a variety of customer services in real fields, such as guidance [6,18,33,35], advertising [67], and waitering in a cafe [11,71].
However, despite the importance of teleoperation interface designs aimed at easing operator tasks and improving operator experience for social robot teleoperation to become truly widespread, existing studies have often focused on the interaction between the robot and local users (e.g., customers) [5,33,35,65,67]. By contrast, from the operators' perspective, one unique aspect of serving customers via robot teleoperation that differs from face-to-face situations is that the operator is expected to speak as the robot. In fact, it is agreed that local users who interact with the robot may feel uncomfortable and not accept the robot favorably when impressions of the robot's appearance and voice (e.g., acoustic features, linguistic content, and style of speech) do not match [62]. Thus, technological support for the "speak as the robot" aspect could enhance the quality of the service while making it subjectively easier for operators to serve. Nonetheless, fully automatic, natural, real-time conversion technology to realize "a robot-like voice and speaking style" remains limited, even with recent advances in voice conversion research [63].
Therefore, we considered an approach that elicits from operators the ability to play a character, as people with special skills (e.g., voice actors) do. The idea can be interpreted in the context of human augmentation research, where the use of avatars and robots has been shown to enable users not only to obtain physical assistance but also to augment their perceptual and cognitive abilities [46]. Indeed, previous studies have shown that a change in one's self-representation, typically elicited by avatar embodiment in immersive virtual reality (VR), can affect behavior, abilities, and even thinking [7-9, 53, 77]. For example, the virtual embodiment of a child avatar influences the user's perceptual judgements, emotional states, vocal production, and implicit attitudes toward becoming childlike [7,10,69].
Even outside of VR, which exploits visual feedback, real-time altered auditory feedback (AAF), a method that alters acoustic properties (e.g., pitch) of a voice input and feeds the altered voice back to the speaker, has been shown to affect the speaker's psychological state (e.g., emotional state [2,15,69] and implicit attitudes [1]); the altered speech characteristics are misattributed by speakers as being caused by their own psychological changes. In particular, Arakawa et al. [1] recently showed that AAF that converts the voice of a young participant into that of an elderly person can transform the participant's self-representation, as evidenced by a reduction in their implicit bias toward the elderly. In addition, they proposed the concept of "digital speech makeup (DSM)," the use of real-time AAF to encourage users to voluntarily modify their self-representation, and suggested the concept "to be used when one wants to act as a different persona in special circumstances, such as a performer being able to act as if they became their ideal actor and a live streamer immersing themselves in the role of their original character by hearing their speech converted to that of the character." Among the established methods for transforming self-representations, that is, visual feedback (e.g., VR) and auditory feedback (e.g., AAF), we focused on AAF. This is because voice transformation (VT) itself, which does not necessarily involve feedback to the operator, is a commonly used method in teleoperation (e.g., [5,67]). Furthermore, unlike VR-based methods, AAF can be easily incorporated without interfering with existing teleoperation interfaces. However, design guidelines for using VTs in teleoperation systems and their impact on the operating experience have not yet been explored.
Thus, this study aims to apply the DSM concept to the teleoperation of a social robot in service contexts, which essentially requires operators to "speak as the robot." We considered that AAF of a voice transformed to match the visual impression of the robot (e.g., a higher pitch for a child robot) can facilitate the transformation of the operator's self-representation to align with the robot's representation and, consequently, influence the operator's perception and behavior in performing conversational services as the robot. That is, the idea is to use AAF to help operators engage easily in social robot teleoperation, specifically reducing mental workload and difficulty while improving positive feelings, which could eventually result in an improvement in service quality. Therefore, we investigated the following three research questions. (RQ1) Does AAF transform the self-representation of the operator toward "becoming the robot?" (RQ2) Does AAF make it subjectively easier for the operator to perform the service? (RQ3) Does AAF objectively improve service performance?
Here, we assume the teleoperation of a social robot in a service context and the AAF of a voice transformed to match the robot's representation.
To this end, we conducted an experiment in which participants (N = 30) teleoperated a robot (i.e., spoke as a robot) located at a bakery storefront under three different configuration patterns of VT (i.e., No-VT, VT-only, and VT-AAF) (Figure 1). Our main contributions are: (1) to apply the DSM concept and AAF to a real-world practical application, i.e., the teleoperation of a social robot, (2) to demonstrate in what aspects the use of AAF benefits the operator, and (3) to discuss design implications for the teleoperation interface.

RELATED WORK

2.1 Teleoperation of Social Robots in Service Contexts
Teleoperated social robots have been shown to offer a variety of customer services successfully in real fields by exploiting the advantage of natural communication compared to autonomous robots [5,16,33,35,65,67]. In addition, teleoperation can realize more flexible and efficient services; for example, an operator can be aided by semi-autonomous control [33,35], and a single operator can operate multiple robots [11,16] or vice versa [45]. Furthermore, it also has advantages for the operator; people who have given up physical work, such as the elderly and disabled, can engage in physical and social work by teleoperating a robot [11,18,71]. In fact, Takeuchi et al. [71] empirically showed that people with disabilities can work in a cafe through robot teleoperation and that this work experience facilitated their mental fulfillment. Teleoperation systems have two types of users: local users and operators. Therefore, a research approach from both perspectives is required to improve service quality and motivate people working as operators. Previous studies on teleoperating social robots, conducted mainly in the HRI field, have explored customers' perception (e.g., acceptance [33,35] and satisfaction [5,16]), attitudes [65], and behavior (e.g., decision making [65] and conversation [67]). Nonetheless, only a few studies have focused on the operator's experience when teleoperating a social robot, which also contrasts with the abundant research that examines the relationship between the operator's mental workload and performance when teleoperating a robot's movements, such as carrying or grasping (e.g., [39,57,58]).
Among the few studies focusing on the operator's experience when teleoperating a social robot, Baba et al. [6] showed that the perceived workload of workers, particularly in terms of physical and temporal demands, is lower in teleoperation than in working onsite. Moreover, the subjective experience (e.g., mental fulfillment) of teleoperation targeting specific user groups, such as people with disabilities, has been explored [71]. By contrast, to explore teleoperation interface designs that can improve task performance while minimizing workload for general users, Glas et al. [17] focused on the operator's temporal awareness. They found that assistance systems, rather than a display of a clock on the interface, could improve the operator's temporal awareness and task performance without increasing mental workload. However, in their study, although the operators teleoperated a social robot, they performed button and text input, not verbal interaction. Thus, there remains a research gap in the exploration of interface designs that can help operators verbally interact with local users through a robot, despite its potential socio-economic impact.
Therefore, this study explores the subjective experience of the operator, as well as task performance, focusing on the unique and essential aspect of social robots, namely voice interaction. Although voice is an important aspect of teleoperated social robots, few studies have focused on this aspect, in contrast to the extensive studies on voice in autonomous robots presented in the next subsection. The only study on the voice of a teleoperated robot, to the best of our knowledge, showed that the user's perception of the robot's voice, which was either the operator's raw voice or a transformed high-pitched voice, varied according to the user's gender [65]. Specifically, a higher-pitched voice negatively influenced male users' perception of the robot's persuasiveness and dependability, while this was not the case with female users. Nonetheless, although such VT is commonly used in teleoperating social robots (e.g., [5,6,67]), no studies have examined the effect of VT on the operators themselves, which this study addresses.

Voice in Human-Robot Interaction
Although the importance of voice in robot design and research had been relatively underestimated compared to that of visual appearance, it has recently been highlighted in the broad field of agent/avatar research, including social agents such as robots [62], smart devices [14], autonomous robots [42], teleoperated robots [65], and virtual avatars [34]. In fact, the acoustic characteristics of a robot's voice influence the user's impression of the robot and the subsequent social interaction between the user and the robot. This is plausible, given that speech conveys considerable information about the speaker, such as their age [64], gender [64], size [64], personality [41], and emotion [61], beyond the mere linguistic content. In humans, manipulating the acoustic characteristics of speech changes the perceived characteristics of the voice and speaker (e.g., pitch and formant on gender [29]), and evidence suggests that, even for a robot, the acoustic characteristics of the voice affect the user's perception of the robot (e.g., high pitch and perceived extraversion [47,48]; see [62] for a review).
Among the various acoustic features of voice, pitch and formants are major features that relate to the frequency of the signal; pitch depends on the fundamental frequency (F0), and formants are defined as distinctive frequency components of the acoustic signal. In the human voice, they usually change depending on the speaker's glottal-pulse rate (GPR) and vocal-tract length (VTL), anatomical features that relate to the speaker's size, gender, and age [64]. Therefore, listeners can judge the age and gender of a speaker simply by hearing their voice; conversely, manipulating the pitch and formants of a voice is known to affect perceptual judgments of these aspects [29]. Notably, although the effect of pitch and formants on perceived gender and age has been robustly confirmed, the relationship between the perceived personality of a speaker and the acoustic features of the voice is considered more complex. As for robots, although some studies have suggested that a social robot with a high-pitched voice tends to be perceived as extraverted [47,48], conclusive evidence has not been obtained. Generally, higher-order cognitive judgments are influenced by multiple factors in a complex manner. Indeed, although a social robot with a higher-pitched voice tends to be preferred, this conclusion is affected by other factors such as robot size [78], the robot's chronological background [62] and personality [48], and the gender of robots and users [36,65].
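As a concrete illustration of the F0 relationships above, pitch shifts are conveniently expressed in semitones, where each semitone multiplies the frequency by 2^(1/12). The following minimal sketch converts between a source and a target F0; the example Hz values are illustrative assumptions, not measurements from this study:

```python
import math

def semitone_shift(f0_src: float, f0_target: float) -> float:
    """Semitones required to move f0_src to f0_target (12 semitones per octave)."""
    return 12 * math.log2(f0_target / f0_src)

def apply_shift(f0: float, semitones: float) -> float:
    """Resulting F0 after a pitch shift of the given number of semitones."""
    return f0 * 2 ** (semitones / 12)

# Illustrative values: raising a typical adult male F0 (~120 Hz)
# to a child-like F0 (~240 Hz) is one octave, i.e., +12 semitones.
print(semitone_shift(120, 240))  # 12.0
```

Formant shifts are often specified on the same logarithmic scale, since VTL differences scale the formant frequencies multiplicatively as well.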
In this manner, voice commonly affects the speaker's impression, whether the speaker is a human or a robot. Nonetheless, one of the unique differences between humans and robots is the importance of matching the impression of a voice with that of the appearance [42,62]. This is because while human speakers and their voice characteristics tend to match naturally, those of robots do not, unless intentionally designed to match. In particular, a mismatch between appearance and voice in terms of anthropomorphism (i.e., human- or robot-like) has been shown to induce a sense of eeriness [43]. It has also been shown that judgments of voice anthropomorphism become faster when the anthropomorphism of the voice and the accompanying image are congruent [59]. In addition to anthropomorphism, high-pitched voices are more likely to be associated with a small child-like humanoid robot than with a large elephant-like robot [78]. Furthermore, a study showed that robot voices used in existing HRI studies rarely matched the appearance of the robot in terms of naturalness, gender, and accent [42].
Considering these studies on voice with humans and autonomous robots, transforming the operator's voice to match the appearance of the teleoperated robot, such as in terms of anthropomorphism, gender, and age, or even character, is a reasonable approach to improve the user's impression of the robot and lead to successful social interactions between the robot and the user.
However, the features contained in a voice are vast, and the relationships between them and the perceived impression are still not fully understood [29,62]. Thus, although voice conversion has been advancing rapidly as a technology to convert as many of these characteristics as possible [63], it is still technically difficult to convert speech in real time without compromising natural social interaction. Therefore, we focused on another approach: encouraging operators' ability and engagement to speak as the robot by transforming minimal acoustic features (i.e., pitch and formants) of their voice and providing them with AAF in real time to affect their self-representation.

Multisensory Feedback on Changes in Self-Representation

In the teleoperation of a robot, the robot becomes the avatar of the operator; in fact, users feel a sense of embodiment over the robot [3,4]. In the field of VR, which visually transforms one's body representation, embodying an avatar has been shown to transform the user's self-representation, resulting in perceptual, cognitive, and behavioral changes. Specifically, the characteristics of the avatar that the user embodies, such as gender [53], skin color [8], attractiveness [77], and age [7,69], or the avatar of a specific person such as Einstein [9], lead to such changes in the user through the stereotypes or memories automatically associated with the avatar. For example, the embodiment of a virtual avatar that resembles Albert Einstein, who is associated with high cognitive ability, has been shown to increase participants' cognitive task performance [9]. Furthermore, the embodiment of a virtual child avatar has been shown to influence the user's perceptual judgements (i.e., overestimation of object sizes), vocal production, and implicit attitudes toward becoming childlike [7,10,69]. Beyond VR, for teleoperated android robots, when the operator can effectively manipulate the robot (i.e., feeling a sense of agency), changes in the robot's facial expressions can significantly change the emotional state of the participants [49].

Auditory Transformation.
Compared to visual transformations, few studies have aimed to transform self-representations using auditory feedback. Nevertheless, phenomena similar to those shown in studies on visual feedback have also been confirmed for auditory feedback. The focus can be categorized into two types: sounds generated by bodily movements and vocalizations. Regarding the former, Tajadura-Jiménez et al. [70] proposed a shoe-based system that senses the footsteps of a user and alters the frequency spectra of the walking sound in real time so that it sounds like that of a lighter or heavier body. The system could change users' perceived body weight, leading to a related gait pattern, and improve their self-esteem and motivation for physical activity. In addition, a method has been proposed to create "robot-like" bodily sensations for entertainment applications [38]. By combining visual, auditory (i.e., creaking sound), and vibrotactile feedback to simulate how a robot bends its arm, users could feel as if they had a robotic arm. By contrast, for vocalization, in the same manner that an avatar that visually moves in sync with user movements is perceived as the user's body, the voice of a stranger heard as the auditory concomitant of the user's vocalizations can be perceived as their own voice [79]. Furthermore, by altering the acoustic characteristics of self-produced speech sounds and immediately feeding them back to the speaker (i.e., AAF), speakers misattribute the alterations as caused by their own emotional changes (e.g., happiness, anxiety, and anger), without awareness of the manipulation [2]. Costa et al. [15] applied this finding to reduce stress during dyadic conversation. They showed that when one member of a dyad received AAF with a calmer tone, both the speaker and the other participant experienced less anxiety.
Although these studies used AAF to manipulate emotions, some studies have used AAF to alter self-representations. Tajadura-Jiménez et al. [69] explored whether the AAF of a child-like voice, in addition to the use of a virtual child avatar, can enhance previously confirmed effects of visual transformations, that is, on object size perception, self-representation, and subsequent real speaking. Although in their study the AAF did not strengthen these effects, that is, no contribution of auditory cues beyond visual cues in immersive VR was found, the sense of embodiment was weakened when visual and auditory cues were incongruent in terms of the age conveyed. This result highlights that auditory cues are also important and that semantic congruency between multisensory cues must be maintained. Furthermore, vocal production is also affected; the F0 of the participants' actual speech (i.e., pitch) shifted to align with the altered voice when they felt ownership over the altered voice, in line with other research [79]. Nonetheless, contrary to their results showing less impact of auditory cues, Arakawa et al. [1] showed that AAF alone can affect self-representation without visual feedback. By contrast, our study aims to apply this concept to a real-world application scenario: the teleoperation of a social robot in service contexts. Therefore, instead of a complete laboratory experiment, we placed the robot in the field (i.e., a bakery) to simulate a real customer service scenario.
Regardless of the purpose of using AAF, a unique but important consideration for real-time feedback of one's speech is the inhibition of speech caused by delayed auditory feedback (DAF) [37,75]. Although the minimal delay that triggers this speech inhibition is unclear, a delay of 200 ms was found to be effective in inhibiting adults' speech [40], and the effective range is estimated to be between 30 ms and 300 ms [1]. Minimizing the latency of the system is therefore important to exploit the effect of AAF. Accordingly, previous research on emotional manipulation via AAF [2,15] used software specially developed to keep latency within 15 ms [56], or adopted other approaches to mitigate the influence of DAF, such as using bone-conduction headphones to allow users to hear their original voice and the AAF simultaneously [1].
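A practical implication of these DAF thresholds is that feedback latency is dominated by audio buffering. A rough back-of-the-envelope model, which is a simplification we introduce here (it ignores driver, DSP, and transducer overhead), is that each buffered block traversing the input and output path adds block_size / sample_rate seconds:

```python
def io_latency_ms(block_size: int, sample_rate: int, n_blocks: int = 2) -> float:
    """Approximate end-to-end audio latency in milliseconds, assuming
    n_blocks buffers (e.g., one input and one output) traverse the path."""
    return 1000.0 * n_blocks * block_size / sample_rate

# A small 256-frame buffer at 48 kHz stays well under the ~30 ms lower bound
# of the DAF-effective range, while a 1024-frame buffer already exceeds it.
print(io_latency_ms(256, 48000))   # ~10.7 ms
print(io_latency_ms(1024, 48000))  # ~42.7 ms
```

This is why hardware processors with sample-level pipelines (as used in this study) can achieve latencies that general-purpose software voice changers typically cannot.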

EXPERIMENT

3.1 Overview
In the experiment, participants repeated 15-min customer service tasks by teleoperating a robot installed outside the entrance of a bakery, including speaking to passersby, introducing items, and talking to people who stopped in front of the robot. There were three within-subject voice conditions: the participants' voices were either not transformed (No-VT), transformed to match the appearance of the robot in terms of gender and age but not fed back to them (VT-only), or transformed and fed back in real time (VT-AAF). We analyzed the data as a two-factor mixed design that considers a between-subject factor of the participants' gender and a within-subject factor of voice. This is because there should be an average gender difference in the magnitude of VT (i.e., pitch and formant shifts) due to inherent biological gender differences in the original voice features when the target voice imitating the robot's character is the same. Thus, gender was taken into account to explore the scope of the effectiveness of the AAF.

Hypotheses
We hypothesized that in the VT-AAF condition, compared to the No-VT and VT-only conditions, the following results would be shown, corresponding to each RQ.

(RQ1) Does AAF transform the self-representation of the operator toward "becoming the robot?"

• The implicit association test (IAT) score would increase, indicating stronger associations between the self and a robot child.
• The robot embodiment and the change in self-representation scores in the questionnaire would increase.
• The F0 of the participants' speech would increase, shifting toward the F0 of the AAF.
(RQ2) Does AAF make it subjectively easier for the operator to perform the service?
• The NASA Task Load Index (NASA-TLX) score would decrease, indicating a lower overall mental workload.
• The task evaluation scores in the questionnaire would increase, indicating an improvement in subjective operator experience.
• The selected percentage in terms of general preference would be the highest.
(RQ3) Does AAF objectively improve service performance?

• The amount of conversation and speech would increase, indicating an improvement in task performance.
For RQ1, we hypothesized that self-representation would change in a direction closer to the robot used in the experiment, according to measurements used in previous studies: the IAT [1,7,12,69], questionnaires ([69] in particular, but also [10,19,28,50,52,79]), and changes in vocal production [69,79]. The IAT has been used to quantitatively measure the cognitive association of two conceptual dimensions [21,22]. It has been widely used to measure implicit stereotypes and biases, typically by estimating the strength of association between a target category (e.g., racial groups) and an attribute category (e.g., positive/negative words) through reaction time. Using the IAT as a measurement, extensive research has been conducted on the virtual embodiment of avatars of negatively biased groups (e.g., those biased in terms of age [76] and race [8,23,54]) to reduce implicit biases. The IAT has also been used to evaluate whether the user's self-representation has changed (e.g., [7,69]), based on the notion that when a user embodies an avatar in an out-group category (e.g., an adult using a child's avatar), the association between that category and self-attribution becomes stronger [12]. In fact, Arakawa et al. [1] verified the concept of DSM by the change in IAT results and subjective speech ownership before and after the participants used the AAF of an elderly voice. Thus, our study used the IAT as an objective measurement of the change in self-representation. In addition to the IAT, a questionnaire was used to measure subjective aspects, and we also expected that the F0 (i.e., pitch) of the participants' actual speech would increase, indicating a change toward the transformed (i.e., higher-pitched) voice, similar to [69,79].
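For readers unfamiliar with IAT scoring, the sketch below shows a simplified version of the commonly used D-score: the latency difference between incompatible and compatible blocks, divided by the standard deviation of all retained trials. The full improved algorithm of Greenwald and colleagues also handles error trials, practice versus test blocks, and fast-response exclusions, all of which are omitted here:

```python
from statistics import mean, stdev

def iat_d_score(compatible_rts, incompatible_rts, cutoff_ms=10_000):
    """Simplified IAT D-score: mean latency difference between the
    incompatible and compatible blocks, divided by the pooled standard
    deviation of all retained trials; latencies above cutoff_ms are dropped."""
    comp = [rt for rt in compatible_rts if rt <= cutoff_ms]
    incomp = [rt for rt in incompatible_rts if rt <= cutoff_ms]
    pooled_sd = stdev(comp + incomp)
    return (mean(incomp) - mean(comp)) / pooled_sd
```

A positive score indicates slower responses in the incompatible pairing, i.e., a stronger association in the compatible one.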
For RQ2, we hypothesized that AAF would improve the experience of the participants as operators, measured by the NASA-TLX, a widely used measure of workload (for teleoperation, [6,17]), a custom multidimensional task evaluation questionnaire, and a general preference evaluation. As "speaking as the robot" is the aspect essentially required of the operator in the teleoperation of a social robot, we hypothesized that AAF would help the operator accomplish this aspect and thereby make the task subjectively easier for the operators.
For RQ3, as measurements of task performance that can be quantitatively evaluated, we used the total duration of conversation between the operator and local users and the amount of speech that the operator made. The duration of conversations corresponds to how much the operator's speech could attract local users' attention, and the amount of speech corresponds to how much the operator tried to speak. We hypothesized that AAF would help the operator not only in subjective aspects (i.e., RQ2) but also in objective aspects, considering that the change in self-representation could influence the way they speak. That is, the use of AAF (VT-AAF) could improve task performance by changing the manner and content of the operator's own speech, compared with the simple use of VT (VT-only), despite the fact that the applied VT is the same for VT-only and VT-AAF. Additionally, we expected that at least the use of VT, regardless of AAF, would improve task performance, especially in terms of the amount of conversation, based on existing research showing that matching the appearance of an autonomous robot with the impression of its voice can help the robot be perceived favorably [42,43,59,62].
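The "amount of speech" metric can be operationalized in several ways; one minimal sketch, which is our illustrative assumption rather than the exact pipeline used in the study, is an energy-threshold voice activity detector that counts analysis frames whose energy exceeds a silence threshold:

```python
def speech_seconds(frame_energies, threshold, frame_ms=30.0):
    """Estimate total speaking time from per-frame energies: frames whose
    energy exceeds the silence threshold are counted as voiced."""
    voiced = sum(1 for e in frame_energies if e > threshold)
    return voiced * frame_ms / 1000.0

# Five 30-ms frames, three of them above the threshold -> 0.09 s of speech.
print(speech_seconds([0.10, 0.50, 0.60, 0.05, 0.70], threshold=0.2))
```

Conversation duration can be derived similarly by merging voiced intervals from both the operator and local-user channels.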

Participants and Ethics
We conducted the experiment with 30 participants (15 males and 15 females; age 38.00 ± 13.19 (SD), range 21 to 58 years old), all native Japanese speakers recruited through a temporary employment agency. The participants were assigned to one of six (= 3!) groups that differed in the order of the three voice conditions, ensuring that the groups had maximally similar distributions in terms of gender and age. The experimenters and participants conducted the experiment in a room at Osaka University. The participants signed an approved consent statement and were compensated with approximately $75 for 5 hours of participation.
To record the video footage of the site, we notifed passersby and customers through a bulletin board that the experiment was being conducted and that the footage was being recorded based on an opt-out method.We explicitly stated that users who wished to be removed from the recordings had the legitimate right to request removal.The experiment was approved by the local facility authorities and the Research Ethics Committee of Osaka University.

Apparatus
The experiment apparatus included a computer (2017 MacBook Pro with a 2.3 GHz dual-core Intel Core i5, running macOS Ventura 13.2.1) and a set of audio equipment. The computer was used to run the teleoperation interface, input the voice of the participants, record their speech, and conduct the browser-based questionnaire (using Google Forms and the NASA-TLX test form) and IAT (using psychexp, https://psychexp.com/). The audio equipment included a condenser microphone (AKG P420), a microphone guard (AudioTechnica AT-PF2), a tabletop microphone stand (AudioTechnica AT8652), an audio output switcher (LiNKFOR), wired headphones (Sony MDR-H600A), and an audio effect processor (ROLAND VT-4).
In addition, we developed and used a teleoperated robot system that allows operators to remotely control the robot using a browser-based video telecommunication application. The system has two main components: a robot controller system and an operator interface. These components communicated through a web server. As shown in Figure 2, the robot controller system comprised a small humanoid robot (Sota; Vstone Co., Ltd.), a computer (Skynew K6, with a Core i7-8565U running Windows 11), a unidirectional microphone (Sanwa Supply MM-MCU04BK), a speaker (Sanwa Supply MMSPL-19UBK), and a 180° fish-eye web camera (ELP USBFHD06H-L180-J). The robot had eight motors enabling body rotation, head movement, hand gestures, and flashing LED eyes.
The web camera was mounted behind the robot to transmit local footage to the operator for real-time video communication; it captured people talking to the robot and passersby, as well as the robot itself from behind. Through the operator interface that ran in a web browser, the operator could monitor the video from the robot's camera and listen to audio from the robot's microphone in real time. The operator's speech was output from the robot through the interface. In the practice session and the No-VT condition, the input to the interface was output as is; in the VT-only and VT-AAF conditions, the acoustic features (i.e., pitch and formants) of the input were transformed and then output from the robot. The robot performed a gesture when an operator said certain preregistered words. Participants were not explicitly informed that the robot would gesture in this manner. However, they could see how the robot moved in response to the words they spoke in the real-time video on the interface.

Voice Transformation (VT)
For a comprehensive overview of the procedures and rationale with respect to this subsection, see Supplementary Material (SM).

System Configuration.
Although the original DSM concept [1] involves a voice conversion technology that impersonates a pretrained target voice, we decided to use a relatively simple VT (i.e., pitch and formant shifts) that works nearly in real time. This is because voice conversion or transformation systems, in principle, involve a trade-off between quality and latency. Although Arakawa et al. [1] used a high-quality any-to-one voice conversion technique, they compromised on latency. However, as described in subsubsection 2.3.2, humans are sensitive to DAF; even a delay of 50 ms can make it difficult to speak normally, causing mental stress. In contrast to [1], our aim was to evaluate the usefulness of AAF in real field application scenarios. Furthermore, delay not only inhibits speech but also diminishes the sense of agency [73]; temporal synchrony is important so that the feedback voice is felt as one's own.
Therefore, we used a hardware audio-effect processor (ROLAND VT-4). It allowed the end-to-end latency for the participant (i.e., the time between speaking and hearing the AAF) to be approximately 5 ms, far less than the minimal delay we could obtain using a software voice changer (at least 50 ms) and the latency of the AAF system used by Arakawa et al. [1] (200 ms). In addition, we used closed-back wired headphones that block outside noise, including the speaker's original voice, so that the participants could hear only the AAF. With this configuration, we preliminarily confirmed that a speaker barely perceived the latency of our system when the AAF was administered and that speech impairment did not occur in any of the 20 speakers who tested the system in the pilot test.

Conditions. Three voice conditions were considered.
No-VT. The raw voice input of a participant was directly output from the robot.
VT-only. The voice input of the participant was output from the robot after its pitch and formants were shifted.
VT-AAF. In addition to the process applied under the VT-only condition, the participant could hear the transformed voice in real time, identical to the output from the robot.
The amount of pitch and formant shift was set for each participant according to the procedure shown in 3.5.3 and held constant throughout the experiment. Specifically, the transformations were exactly the same in the VT-only and VT-AAF conditions for each participant. Auditory feedback of the participants' voice was provided only in the VT-AAF condition; that is, in the No-VT and VT-only conditions, the participants could hear only local audio during the task, not their own (i.e., non-transformed) voice. Although the No-VT condition could have been designed to feed the participants their original voice, we wanted the No-VT and VT-only conditions to reflect the more common situations. Nevertheless, before performing the task in each voice condition, the participants were briefed on whether the output from the robot would be transformed under the current condition, and listened to a sample of their own converted speech if the condition was VT-only or VT-AAF (see 3.8 for further details). Thus, the No-VT and VT-only conditions clearly differed in terms of how the participants construed the situation.

Manipulation.
Based on the literature review in 2.2, creating a voice impression as close as possible to the appearance of the Sota robot was considered important for maximizing the effectiveness of the AAF. Hence, we defined a specific character setting in line with the impression of the appearance (i.e., a neutral-gendered child, based on the ROBO-GAP dataset [55]). In addition, we set the character of Sota as extraverted, because extraversion has been shown to be a suitable personality for customer service [30]. As mentioned in 2.2, although some studies have suggested a link between a high-pitched voice and extraversion [47,48], the relationship between extraversion and pitch is not obvious. However, we considered that even a trait not directly linked with the transformed voice features could be elicited as a result of a change in self-representation through an implicit association process. That is, considering the previous finding that virtual embodiment of an Einstein avatar, which is associated with high cognitive ability, increased participants' cognitive task performance [9], the AAF may facilitate the participants' roleplaying of Sota's character, including the extraverted personality, without directly manipulating the perceived extraversion of the voice impression.
The acoustic parameters we manipulated to transform the participants' voices to match the appearance of the robot were the pitch and formants. In advance of the experimental period, we asked a professional voice actor to speak as Sota according to the character setting for approximately one minute. The acoustic features (i.e., pitch and formants) of the recorded speech were used as target values for the VT. In line with the character, the target pitch of the transformation was set to 260 Hz (see SM for the rationale). The pitch parameter for each participant was selected so that the transformed pitch was closest to 260 Hz, among +5%, +25%, +50%, +75%, and +100% pitch shifts from the original voice. Note that we chose to increase the pitch for all participants, even when their original average pitch was estimated to be higher than 260 Hz, to avoid a confounding interaction of shift directions. Similarly for the formants, we determined the transformation parameter (+1 to +5) for each participant such that the formants of their transformed voice were as close as possible to those of the voice of Sota performed by the professional voice actor, using Resemblyzer, which can calculate the similarity of voices.
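For illustration, the pitch parameter selection described above amounts to a nearest-target search over the fixed candidate shifts. A minimal sketch, assuming a per-participant average F0 has already been estimated (function and constant names are ours, not the paper's):

```python
# Hypothetical sketch of the per-participant pitch-shift selection: pick the
# upward shift, from the fixed candidate set, whose result lands closest to
# the 260 Hz target derived from the voice actor's performance.
TARGET_F0_HZ = 260.0
CANDIDATE_SHIFTS = [0.05, 0.25, 0.50, 0.75, 1.00]  # +5% ... +100%

def choose_pitch_shift(mean_f0_hz: float) -> float:
    """Return the shift ratio whose transformed pitch is nearest the target."""
    return min(CANDIDATE_SHIFTS,
               key=lambda s: abs(mean_f0_hz * (1.0 + s) - TARGET_F0_HZ))
```

For example, a 130 Hz speaker would receive +100% (260 Hz exactly), a 208 Hz speaker +25%, and a speaker already above the target, such as the 277.1 Hz participant mentioned later, the minimal +5% shift, consistent with the upward-only design choice.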

Task
3.6.1 Customer Service Task (Main Task). The main task involved promoting and advertising items as the robot at the remote bakery storefront according to the Sota character. Participants were briefed on three basic ways to serve, namely, calling out to people passing by and attracting their interest (e.g., "Hello"), promoting bread (e.g., "The specialty is xxx."), and conversing with customers (e.g., "It's hot today."). They received in advance a list of the 20 different breads sold in the bakery with their names, prices, pictures, and descriptions, which they could refer to during the task.
To make the task easier for the participants and clarify the evaluation criteria, we instructed the participants that the following two aspects were important throughout the customer service task. The first was to maximize the variety and amount of their speech. Specifically, they were instructed to deliver as many variations of lines as possible while avoiding repetition, to keep speaking, and not to fall silent. The second was to speak as Sota according to Sota's character settings, namely an extraverted, neutral-gendered child. The task was carried out for 15 min per condition and was repeated four times in total, including the practice session.
3.6.2 Script Reading Task (Sub Task). In addition to the main customer service task, a script reading task was performed. This task was carried out before the main task in each voice condition to familiarize the participants with the voice condition. Furthermore, it was conducted to record speech of the exact same script across participants and conditions for a vocal production analysis, as in previous studies (e.g., [69]). The script had Sota serving customers at an airport instead of a bakery, because we did not want the participants to repeat the same speech in the following main task by memorizing the script. The script was prepared beforehand by transcribing a speech improvised by a professional voice actor for approximately 30 s. Participants wore headphones during the script reading task regardless of whether AAF was provided; in the No-VT and VT-only conditions, no audio came through the headphones.

Measurements
We measured the participants' mental workload using the NASA Task Load Index (NASA-TLX), implicit attitudes using the IAT, and other subjective scores using a questionnaire after each customer service task. In addition, we recorded the participants' speech, as well as video and audio of the local scene in the bakery.

Implicit Association Test (IAT).
To investigate RQ1 (i.e., the change in self-representation), as in previous studies (e.g., [1,7,69]), we used the IAT to examine how the participants' implicit association between a child robot (vs. an adult human) and the self (vs. others) changed after performing the task in each voice condition.
Following the IAT standard protocol [21,22], we selected 12 images in total, each representing a human adult or child robot such that the human represented the category of the participant population and the robot represented the category into which the self-representation changed.Participants were instructed to categorize an image of either an adult human or child robot and a pronoun that represented either the self (e.g., me) or others (e.g., s/he) by pressing a corresponding button as quickly and accurately as possible.
Reaction times and correct/incorrect data were used to compute the IAT score, which indicates the strength and direction of the association between categories. Categorization has been shown to be faster when images and words with strong associations are paired, compared with the opposite combination; therefore, the IAT score is considered to reflect the strength of the participant's implicit association between each pair of image and word categories. In our study, we expected that the participants' implicit association between the self and a child robot would become stronger after experiencing AAF. The IAT was conducted immediately after each customer service task. Additionally, participants completed the IAT before experiencing any voice condition or task, to obtain a baseline score and to practice the test.
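The Results section reports D-scores computed in SPSS with the improved scoring algorithm [22]. As an illustration of how reaction times and error data combine into such a score, here is a simplified, hypothetical single-block-pair sketch (the 10 s trial cutoff and the 600 ms error penalty follow the improved algorithm; function names and the data layout are ours, and the actual SPSS pipeline may differ in details such as multi-block averaging):

```python
import statistics

MAX_RT_MS = 10_000       # discard overly long trials
ERROR_PENALTY_MS = 600   # error trials: block mean of correct RTs + penalty

def iat_d_score(compatible, incompatible):
    """Simplified D-score for one pair of critical blocks.

    Each argument is a list of (rt_ms, correct) trial tuples. Positive values
    mean slower responses in the 'incompatible' pairing, i.e., a stronger
    association for the 'compatible' pairing.
    """
    def adjusted(block):
        block = [(rt, ok) for rt, ok in block if rt < MAX_RT_MS]
        mean_correct = statistics.mean([rt for rt, ok in block if ok])
        return [rt if ok else mean_correct + ERROR_PENALTY_MS
                for rt, ok in block]
    comp, incomp = adjusted(compatible), adjusted(incompatible)
    pooled_sd = statistics.stdev(comp + incomp)  # SD over all trials of both blocks
    return (statistics.mean(incomp) - statistics.mean(comp)) / pooled_sd
```

With "self + child robot" as the compatible pairing, a more positive score would correspond to a stronger self-robot association, matching the sign convention used in the Results.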
3.7.2 Questionnaire and Interviews. To investigate RQ1 (i.e., the change in self-representation) and RQ2 (i.e., the subjective operator experience), a questionnaire was used. At the beginning of the experiment, participants completed a pre-experiment questionnaire asking for demographic information (i.e., age and gender) and their personality according to the Japanese translation of the Ten Item Personality Inventory (TIPI-J) [20,51].
After the IAT following the task in each condition, the participants first completed the NASA-TLX [24,25], which consists of mental demand, physical demand, temporal demand, performance, effort, and frustration items, to measure the perceived workload of the main task.
Following the NASA-TLX, participants completed a mid-experiment questionnaire with 15 items (18 items for the VT-AAF condition) used to assess subjective experience during the task (see Table 1 for all items). Each response was scored on a seven-point Likert scale (−3 = strongly disagree, +3 = strongly agree). The questionnaire comprised items assessing robot embodiment (Ownership and Agency), change in self-representation (FeltChild, FeltRobot, FeltExtraverted, RobotExtraversion), task evaluation (Enjoyment, Motivation, Difficulty, Confidence, Satisfaction), and voice ownership and agency (OwnVoice, VoiceFeatures, and VoiceAgency). Except for the task evaluation items, the items were selected from a previous study [69] and modified to fit our study context, based on [19,52] for the robot embodiment items and [28] for the change in self-representation items. Items related to task evaluation were originally designed based on user comments obtained in our preliminary user tests. In particular, we asked about the overall enjoyment and motivation that the participants experienced during the task, and about the ease, satisfaction, and confidence for each of the two requirements of the teleoperation task (Roleplay: "to speak as if you were Sota" and Service: "to speak a lot in variety and amount"). Note that the items for voice ownership and agency, used in [10,50,69,79], were asked only in the VT-AAF condition, because no auditory feedback was given to the participants in the other conditions. These items were therefore used to check that the AAF functioned as intended, rather than to compare ratings between the voice conditions.
After the remainder of the experiment was completed, the participants filled out a post-experiment questionnaire. It asked about general preference with the following question: "If you were to perform robot customer service tasks on a daily basis in the future, which of the three voice conditions you experienced would you prefer to use?" In addition, we asked about their previous customer service and acting experience in an open-ended question.
Finally, the experimenter conducted a semi-structured interview in which participants were asked about the reasons for their ratings on each of the six subscales of the NASA-TLX and for their general preference. All interviews were recorded with the consent of the participants and transcribed after the experiment. The interview data were used to supplement and expand the discussion of the results of the quantitative analysis.

Audio and Video Recordings.
To investigate RQ1 (i.e., the change in self-representation) in terms of the change in vocal production and RQ3 (i.e., task performance), we analyzed the speech and video data. In both tasks, including the practice session, the participant's original speech and the transformed speech, which was identical to the output from the robot, were recorded in parallel using GarageBand at 44.1 kHz/24-bit in stereo. Note that for the No-VT condition, the original and transformed recordings were essentially the same. Simultaneously, in the main task, video footage of the site, captured by the webcam of the teleoperated robot, was recorded; it was identical to the video that the participants could see on the operator interface during the task.
For the analysis of vocal production corresponding to RQ1, similar to [69], we analyzed the average F0 of each original (i.e., non-transformed) voice recording to investigate whether an acoustic trait (i.e., pitch) of the participant's voice changed depending on the voice condition. As task performance measurements, we considered the amount of speech, which indicates how well the participant followed the instruction to speak as much as possible, and the duration of conversation, which indicates how well their speech attracted local users' attention. Based on observations in the preliminary experiment, we found that while the occurrence of a conversation was largely influenced by situational factors, the amount of individual speech before a conversation occurred depended largely on the operator's effort. Specifically, we assumed that the difficulty of the task (and of actual customer service scenarios) lay in continuing to speak alone during non-conversation periods. Hence, in addition to the total amount of speech during the customer service task, we also analyzed the amount of speech per unit time during the periods outside conversations.

Procedure
The procedure was divided into five sessions: (a) pre-experiment, (b) instruction, (c) practice, (d) main, and (e) post-experiment. Figure 3 overviews the procedure. The entire process took approximately 3.5 to 5.5 h to complete. (a) Pre-experiment session. Upon entering the room, the participants read and signed an experiment consent form. They then completed the pre-experiment questionnaire, followed by the baseline IAT. (b) Instruction session. The pitch and formant transformation parameters were determined in the range of +1 to +5 according to the procedure described in 3.5.3; these values were fixed throughout the experiment. The participants were then provided with printed instruction sheets regarding customer service and the robot teleoperation system, and a printed list of breads. They were also shown a robot in front of them, identical to the one installed at the bakery, so that they could understand its appearance and size. In addition, they watched a 15-minute instructional video, including a 1-minute example of Sota performing the task, operated by the voice actor. The printed materials remained available to the participants to view at any time thereafter and to annotate as needed.
(c) Practice session. Participants performed a script reading task followed by a customer service task. The primary purpose of the practice session was to familiarize the participants with each task. Hence, VT was not applied, and no instruction on VT was provided at this point. Technically, this was the same as the No-VT condition in the main session. However, whereas participants were explicitly instructed to use a falsetto voice under the later No-VT condition if they felt it necessary, in the practice session they were instructed to prioritize general task mastery. This was to prevent participants from becoming more habituated to the No-VT condition than to the other two.
(d) Main session.The main session comprised (1) instruction on voice condition, (2) script reading task, (3) customer service task and (4) test and questionnaire, for each voice condition.The order of the voice conditions was counterbalanced among the participants.
In (1) instruction on voice condition, participants were given a verbal description of the assigned voice condition, and the importance of speaking in such a manner that the voice output from the robot sounded like the robot was emphasized. In the VT-only condition, they listened to a VT-applied version of a recording of themselves reading the manuscript, which had been made in the preceding practice session, to better understand how VT would be applied, although they could not hear their transformed voice during the actual tasks. In the VT-AAF condition, to experience AAF before the actual task, participants spoke words of their choice until they understood how their speech was transformed and how the AAF sounded. In (2) script reading task, they were asked to read the printed manuscript aloud so that the voice that would be output from the robot would match Sota's character, even though their voice was not actually output from the robot in this task and they were aware of that. Reading the manuscript to the end took approximately 30 seconds. Because we wanted this task to be as uniform as possible except for the influence of the voice conditions for later analysis, if a mistake was made, we stopped the recording and restarted from the beginning. In (3) customer service task, the participants actually teleoperated the robot and served real passersby. The task duration of 15 min was strictly maintained across conditions, so participants were semi-forcibly disconnected from the robot even if they were in the middle of a conversation with a customer. In (4) test and questionnaire, the IAT was conducted immediately after the task to ensure that the test reflected the implicit change in attitude during the task to the extent possible. Then, when answering the NASA-TLX and the questionnaire, participants were instructed to recall and rate the experience of the preceding customer service task. (e) Post-experiment questionnaire and interview. Finally, participants completed the post-experiment questionnaire and were interviewed by the experimenter for 10-40 min.

RESULTS
During the main task for one participant in one condition, a network issue occurred, resulting in a task duration of 10 min instead of 15. Data from this participant were excluded from the analyses for which the difference in duration was considered critical (i.e., 4.3.2 Task Performance).
A two-way mixed analysis of variance (ANOVA) was performed for each measurement with a between-subject factor of participant gender (gender) and a within-subject factor of voice condition (voice). For nonparametric data, we first applied an aligned rank transform (ART) before performing the ANOVA [74]. We treated data as nonparametric if they were theoretically nonparametric (e.g., Likert scales); otherwise, the median and mean were plotted first, and the data were considered nonparametric if the discrepancy between them was visually judged to be large. All statistics are listed in the SM.
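The ART recipe is: for each effect of interest, strip the estimated contributions of the other effects from the responses, rank the aligned values, and then run a standard factorial ANOVA on the ranks, interpreting only the aligned-for effect. A minimal illustrative sketch of the alignment-and-rank step for one main effect in a two-way design, under the standard ART formulation [74] (pure Python; the function name and data layout are ours, and the actual analysis presumably used an existing ART implementation):

```python
from collections import defaultdict

def art_align_main_effect(rows):
    """rows: list of (a_level, b_level, y) observations in a two-way design.

    Returns 1-based ranks (average ranks for ties) of the responses aligned
    for the main effect of factor A: residuals about the cell means, plus the
    estimated A effect (marginal mean minus grand mean).
    """
    grand = sum(y for _, _, y in rows) / len(rows)
    cells, a_groups = defaultdict(list), defaultdict(list)
    for a, b, y in rows:
        cells[(a, b)].append(y)
        a_groups[a].append(y)
    cell_mean = {k: sum(v) / len(v) for k, v in cells.items()}
    a_effect = {k: sum(v) / len(v) - grand for k, v in a_groups.items()}
    aligned = [y - cell_mean[(a, b)] + a_effect[a] for a, b, y in rows]
    # Rank the aligned values (average ranks for ties).
    order = sorted(range(len(aligned)), key=lambda i: aligned[i])
    ranks = [0.0] * len(aligned)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and aligned[order[j + 1]] == aligned[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks  # run a standard factorial ANOVA on these ranks; interpret only A
```

The same alignment is repeated separately for each main effect and the interaction, so each ANOVA table is read only for its own aligned-for term.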

IAT
We used SPSS to calculate the IAT scores as D-scores, based on the improved algorithm suggested in [22], such that more positive scores reflect stronger associations for "self and a robot child" relative to "self and an adult human," which in turn corresponds to stronger associations for "others and an adult human" than for "others and a robot child." As shown in Figure 4, the two-way ANOVA with the ART did not show any significant effects (gender: F(1, 28) = 0.40, p = .53, η² < 0.01; voice: F(2, 56) = 0.18, p = .84, η² = 0.01; interaction: F(2, 56) = 0.24, p = .79, η² = 0.01).

Questionnaire
4.2.1 NASA-TLX (Figure 5). The overall workload scores were calculated by weighting the scores of each scale of the NASA-TLX.
Robot Embodiment (Figure 6). For Ownership, a two-way ANOVA with ART showed significant main effects for gender and voice (interaction: F(2, 56) = 0.03, p = .97, η² = 0.00). Post-hoc pairwise comparisons of the voice effect (Holm-corrected) showed that the score in the VT-AAF condition was significantly higher than those in the No-VT (p < .01) and VT-only (p < .05) conditions, meaning that perceived ownership of the robot was strongest in the VT-AAF condition. In addition, female participants perceived more ownership than males. For Agency, for two participants the gesture function of the robot, which normally works in response to the operator's speech, did not work well in at least one condition; therefore, their data were excluded from this analysis. The two-way ANOVA with ART did not show any significant effects (gender: F(1, 26) = 2.95, p = .10, η² = 0.10; voice: F(2, 52) = 1.49, p = .23, η² = 0.05; interaction: F(2, 52) = 0.18, p = .83).
Change in Self-Representation (Figure 7). Two items with respect to RobotExtraversion were averaged into a single score according to the original TIPI protocol [20]. For FeltChild, post-hoc pairwise comparisons (Holm-corrected) showed that the score in the VT-AAF condition was significantly higher than those in the No-VT and VT-only conditions (both p < .001). For FeltRobot, a two-way ANOVA with ART showed significant main effects for gender and voice (gender: F(1, 28) = 5.56, p < .05, η² = 0.17); post-hoc pairwise comparisons showed that the score in the VT-AAF condition was significantly higher than those in the No-VT and VT-only conditions (both p < .01). For RobotExtraversion, a two-way ANOVA with ART showed a significant main effect for voice (gender: F(1, 28) = 1.49, p = .23, η² = 0.05; voice: F(2, 56) = 6.05, p < .01, η² = 0.18; interaction: F(2, 56) = 1.21, p = .31, η² = 0.04). Post-hoc pairwise comparisons (Holm-corrected) showed that the score in the VT-AAF condition was significantly higher than those in the No-VT (p < .01) and VT-only (p < .05) conditions. In summary, the participants felt most like a child, a robot, and extraverted, and felt they could best roleplay the robot as extraverted, in the VT-AAF condition. Additionally, female participants felt more like a robot than males.
Task Evaluation (Figure 8). To summarize, in the VT-AAF condition, the participants enjoyed the task and were motivated the most, found it the easiest, felt most confident, and were most satisfied in speaking like Sota. Additionally, female participants enjoyed the task and were motivated more than males, and found it easier, felt more confident, and were more satisfied in speaking like Sota than males. By contrast, the participants felt more confident and were more satisfied in terms of speaking with a lot of variety and quantity in the VT-AAF condition only in comparison to the No-VT condition. The perceived ease of this aspect was not significantly affected by the voice conditions.
Voice Ownership and Agency (Figure 9). As these items were asked only in the VT-AAF condition, we performed a Wilcoxon rank sum test for each scale with respect to the gender factor. For all items, the scores did not differ significantly between male and female participants (OwnVoice: W = 127, p = .54, r = 0.065; VoiceFeatures: W = 101, p = .62, r = 0.052; VoiceAgency: W = 92, p = .39, r = 0.091).
4.2.3 Summary. In summary, all items related to task evaluation (Figure 5: NASA-TLX and Figure 8: Task evaluation) were rated significantly better in the VT-AAF condition than in the No-VT condition, except for ServiceEase, where a non-significant yet similar trend was observed. However, the comparison of the VT-AAF condition with the VT-only condition depended on the indices. Regarding the items that asked about general experience (enjoyment and motivation) and the roleplay aspect, the VT-AAF condition was rated significantly better than the VT-only condition. Moreover, no significant differences were observed between the VT-only and VT-AAF conditions for mental workload (NASA-TLX) and the items that asked about the service aspect (i.e., to speak a lot in variety and amount). By contrast, items related to self-representation (Figure 6: Robot embodiment and Figure 7: Change in self-representation) showed a consistent tendency for the VT-AAF condition to induce the strongest feeling, except for the sense-of-agency evaluation. Additionally, gender differences were observed in several aspects: Enjoyment, Motivation, RoleplayEase, RoleplayConfidence, RoleplaySatisfaction, Ownership, and FeltRobot were rated higher by female participants.

Speech and Task Performance Analysis

To investigate whether an acoustic trait (i.e., pitch) of the participant's actual speech changed depending on the voice conditions, the average F0 of the raw (i.e., non-transformed) speech data for each voice condition and task was computed using Praat, excluding sounds above 500 Hz or below 75 Hz (considered noise). As we were interested in the changes in F0 according to the voice condition, we calculated the increase rate of F0 relative to the average F0 of the script reading task in the practice session for each participant.
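As a sketch, assuming per-frame pitch tracks have already been extracted (e.g., by Praat), the noise filtering and increase-rate computation described above could look like the following (function names and the data layout are ours):

```python
import statistics

# Frames outside this range were treated as noise in the analysis.
F0_MIN_HZ, F0_MAX_HZ = 75.0, 500.0

def mean_f0(f0_track):
    """Mean F0 of a per-frame pitch track, ignoring out-of-range frames."""
    voiced = [f for f in f0_track if F0_MIN_HZ <= f <= F0_MAX_HZ]
    return statistics.mean(voiced)

def f0_increase_rate(condition_track, baseline_track):
    """Relative change of a condition's mean F0 against the participant's
    practice-session script-reading baseline (e.g., 0.10 = 10% higher)."""
    return mean_f0(condition_track) / mean_f0(baseline_track) - 1.0
```

Normalizing per participant in this way lets F0 changes be compared across speakers with very different baseline pitches.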

Task Performance.
Using the voice and video recordings of the customer service task, we analyzed the amount of speech and the duration of conversation to measure task performance. We used a transcription service (notta.ai) to convert the voice recordings into text, and four annotators checked and corrected the transcripts. Additionally, using the video recordings, two annotators noted the timing of when local users spoke to the robot. Specifically, we defined a conversation between the robot and the same person or group as a series of conversations, and calculated the duration of conversation as the time between the beginning of the first utterance and the end of the last utterance of the user(s) during that series. All annotation work was checked at least three times per file, and each file was checked by two or more annotators. To calculate the amount of speech, we counted the characters in the transcription of the participant's speech during the customer service task, because the linguistic nature of our native language (i.e., Japanese) makes each letter approximately correspond to a mora, a phonological unit of the language. As described in 3.7.3, we used two indices for the amount of speech: the total amount (characters) of speech in each 15-minute service task, and the amount of speech per unit time (i.e., per second) during the periods excluding conversation.
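A hypothetical sketch of the second index, assuming each operator utterance has been annotated with timings and a character count and conversation periods have been annotated from the video (the data layout and function name are ours, not the paper's):

```python
def speech_rate_outside_conversations(utterances, conversations, task_duration_s):
    """Characters spoken per second outside annotated conversation periods.

    utterances:    list of (start_s, end_s, n_chars) for the operator's speech
    conversations: list of (start_s, end_s) conversation periods
    """
    def overlaps_conversation(t0, t1):
        return any(t0 < c_end and t1 > c_start for c_start, c_end in conversations)
    chars = sum(n for s, e, n in utterances if not overlaps_conversation(s, e))
    free_time_s = task_duration_s - sum(e - s for s, e in conversations)
    return chars / free_time_s
```

Dividing by the non-conversation time rather than the full 15 minutes isolates the operator-effort component described above, since conversation occurrence depended largely on situational factors.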
For the duration of conversation, a two-way ANOVA with ART did not show significant effects (Figure 11). The duration of conversation and the amount of speech (Figure 11), which are objective measurements of task performance, did not differ significantly between the conditions.

General Preference
In the post-experiment questionnaire, 26 participants selected the VT-AAF condition, three selected the VT-only condition, and one selected the No-VT condition as the one they would most prefer to use.
In the interviews, many participants (14/30) mentioned advantages of hearing the same voice that is heard on the other side (i.e., from the robot). Some (5/14) specifically mentioned that they felt relieved in the VT-AAF condition or, conversely, anxious in the VT-only condition. Similarly, some (5/14) stated that in the VT-AAF condition it was easy to adjust their tone of voice and the content and manner of their speech, as they could hear the voice that would be heard. Additionally, some (9/30) made comments related to changes in self-representation; specifically, a feeling of not being themselves or of being (or speaking/communicating as) a robot, and the ease of roleplaying, were pointed out. A few (2/30) mentioned that their motivation and enjoyment increased as they felt like a robot, consistent with the questionnaire results. As advantages of VT (i.e., VT-only and VT-AAF) over No-VT, several (3/30) mentioned that they preferred their voice to be transformed because they do not like their own voice, and a few others (2/30) mentioned that they felt the customers responded better when their voices were transformed.
Of those who selected the VT-only condition as their preference, a majority (2/3) mentioned that they chose it over the VT-AAF condition because the AAF was unfamiliar and somewhat uncomfortable to hear at first, but both answered that they thought the VT-AAF condition would be the best if they could get used to it through longer use. The remaining participant's comment also concerned the AAF. The only participant who chose the No-VT condition as her preference explained that she simply felt most comfortable with it. She also mentioned that she felt a little strange and uncomfortable in the first 5 minutes of the VT-AAF condition. It is also noteworthy that the F0 of her original voice was measured at 277.1 Hz, higher than the target pitch of 260 Hz, and that she had 10 years of promotional work experience in the customer service industry.

DISCUSSION
The results are summarized as follows: 1) Change in Self-Representation. According to the results of the IAT and the mid-experiment questionnaire (Figures 4, 6, and 7), although the IAT scores and subjective agency over the robot did not change significantly among the voice conditions, subjective feelings of change in self-representation were strongest in the VT-AAF condition for all four questionnaire items, as was subjective ownership of the robot. In addition, subjective ownership and the feeling of "being like a robot" were more likely to be perceived by female participants. Furthermore, a vocal production analysis (Figure 10) revealed that the participants' F0 shifted in the direction of the feedback voice when AAF was applied.
2) Subjective Task Evaluation. According to the results of the mid-experiment questionnaire and NASA-TLX (Figures 5 and 8), although all indicators were rated better in the VT-AAF condition than in the No-VT condition, whether VT-AAF was more effective than VT-only depended on the perspective. In particular, the use of AAF in addition to VT (i.e., VT-AAF vs. VT-only) may improve enjoyment and motivation and decrease the psychological effort of "speaking as the robot." In contrast, the overall mental workload and the psychological effort of the "speaking as much as possible" aspect were not significantly influenced by the use of AAF.
3) Objective Task Performance. The speech analysis (Figure 11) showed that the voice conditions did not significantly affect the task performance measurements.

General Discussion
5.1.1 RQ1: Change in Self-Representation. The results corresponding to RQ1 demonstrate that when AAF is applied, the participants' self-representation changed significantly to align with the character of the teleoperated robot, as evidenced by explicit measurements (i.e., the questionnaire), but not by implicit measurements (i.e., the IAT). Among the explicit changes, the finding that the participants felt most extraverted in the VT-AAF condition is worth highlighting, considering that the transformed acoustic parameters and the extraverted personality are not directly linked; rather, extraversion is a linguistically defined characteristic of Sota's setting. In fact, a participant mentioned in the interview, "with AAF, it was easy to grasp the image of Sota and immerse myself in the character." The results of the vocal production analysis further support that the participants felt more like a child robot when AAF was applied. Generally, AAF has been demonstrated to shift the F0 of the speaker's voice in two opposite manners: following the feedback (e.g., [10,13,69,79]) and compensating for the discrepancy (e.g., [13,32]). However, it should be noted that in [69], AAF of a child-like voice increased the participants' F0 (i.e., they followed the feedback) when they embodied a child avatar, but decreased it (i.e., they compensated for the discrepancy) when they embodied an adult avatar. That is, the auditory traits of AAF alone do not determine the shift direction; rather, a top-down modulation process may be involved, as also evidenced by another study [10]. Importantly, in our study, while participants had to intentionally raise their F0 to speak like a child robot without the help of VT in the No-VT condition, they did not have to do so in the VT-AAF condition. Our finding that F0 still increased when AAF was applied suggests an unconscious adjustment process of the participants' vocal production through a sense of ownership over the feedback voice and the teleoperated robot.
Nonetheless, although we expected to observe differences in implicit measures of the change in self-representation as well, the IAT results did not show a significant effect of AAF. In fact, the effect of AAF on the IAT is not conclusive in previous studies. Of the two previous studies that examined the impact of AAF on the IAT (i.e., [1,69]), one examined the effect of AAF in addition to visual feedback (i.e., the use of a virtual avatar) in immersive VR and showed that child-like voice feedback, compared to adult (i.e., non-transformed) voice feedback, did not affect the IAT score, although the use of a child-like avatar significantly increased it [69]. By contrast, the other study [1] showed that AAF of an elderly voice alone can shift the IAT scores of young participants in the direction of reduced implicit bias toward the elderly. However, notable differences between their study and ours are that they measured the implicit stereotype (i.e., negative/positive category) toward the elderly, rather than the association with the self (i.e., self/others category), and that no direct comparison was made with/without AAF (i.e., all participants experienced AAF, and the comparison was made only between before and after its use). Therefore, the possibility that the negative bias toward the elderly was simply reduced by exposure to the concept of the elderly during the experiment cannot be ruled out. Given these considerations, our results are not substantially inconsistent with previous studies. Nevertheless, in the baseline IAT scores of our study, which were measured at the beginning of the experiment, some participants already exhibited stronger associations between "self" and a robot child than with adult humans, although the means and medians were negative, indicating that on average participants showed the expected stronger association between "self" and adult humans. This may indicate that our experimental stimuli were insufficiently representative of each category: we used pronouns to refer to the self/others categories, whereas previous studies [7,69] used words personalized for each participant (e.g., names, ages). Another explanation could be that, although we assumed that adult humans were perceived as an in-group category as opposed to child robots, images of a racial group differing from the participant population could have been perceived as an out-group associated with "other" rather than "self." Taken together, the subjective measurements and the change in F0 in vocal production indicate that the use of AAF fosters a change in the operator's self-representation toward the robot character. Yet, further validation through implicit measurements will be needed in future studies.
5.1.2 RQ2: Subjective Operator Experience and RQ3: Task Performance. The results corresponding to RQ2 demonstrate that the effect of AAF on the subjective experience of the teleoperation task depends on the evaluation perspective. Regarding the types of tasks for which AAF was effective: for Roleplay (i.e., speaking as the teleoperated robot), a unique aspect of social robot teleoperation, AAF assisted the operator better than a simple VT. However, for aspects of Service (i.e., speaking with variety and in quantity), which are neither robot-specific nor teleoperation-specific, AAF was useful to the operator only compared to the case where no VT is applied (i.e., where the operator's own voice is output from the robot as-is). Regarding the psychological aspects for which AAF was effective, mental workload (NASA-TLX) was lower in both VT-AAF and VT-only than in No-VT, but enjoyment and motivation were highest in VT-AAF compared to the other two. In addition, VT-AAF was preferred the most by 87% of the participants. In the interview item asking about positive experiences throughout the experiment, several participants (5/30) mentioned that they enjoyed the roleplay experience itself, for example, "I could say things that I normally wouldn't have said." Some (6/30) stated that they enjoyed being able to interact, as the robot, with people they would not normally interact with, especially children.
By contrast, the results corresponding to RQ3 demonstrate that the voice conditions did not significantly affect task performance as far as the measured indices, namely "speaking as much as possible" and "attracting the users' attention as much as possible." Among the few previous studies that have focused on the teleoperator's experience with social robots, Baba et al. [6] compared the perceived workload of workers and their performance in face-to-face customer service tasks with those in teleoperation, using a simple VT corresponding to our VT-only condition. They showed that teleoperation can reduce workload compared to working on-site, while overall performance was not significantly different. Our study, moreover, not only suggests a design implication that can further ease the workload, but also sheds light on a different perspective: the experience of interacting with people as someone else may itself be beneficial to the operator. This aspect may relate to a previous study showing that the teleoperation of a social robot by physically disabled people can promote social participation and increase their mental fulfillment [71]. Our study, in contrast, shows that the same would apply even to a general population, not only a specific group of users, and that the use of AAF, or the increased change in self-representation, would further improve the operator's subjective experience in terms of motivation and enjoyment. Lastly, Glas et al. [17] found that autonomous assistance systems that simplify an operator's task where temporal awareness is important can improve task performance without increasing mental workload. Although the task in their study was not verbal, assistance systems, such as ones complementing the content of speech, may also improve task performance in verbal interaction. Considering that our approach mainly benefits the operator's subjective experience, combining these approaches may contribute to interface designs that improve task performance while minimizing workload.
Nonetheless, there may be more room for discussion of our result that task performance, especially conversation duration, did not improve when VT was applied (i.e., VT-only and VT-AAF) compared to No-VT, although a match between appearance and voice has been shown to increase user acceptance of autonomous robots [42,43,59,62,78]. Since this is the first study to examine the impact of manipulating the operator's voice on task performance in a practical, commercial scenario with a teleoperated robot, it differs from previous studies in various respects, which may explain the results that diverged from expectations. One possible explanation is the gap between whether users perceive the robot favorably and whether they actually talk to it or have many conversations with it, especially given that this study was practically oriented and conducted in a real field setting. In fact, although few studies have measured customer behavior in practical service situations, in existing research perception and behavior results do not always match [67], and even multiple behavioral indicators are not always consistent within a study [66,67]. Another gap between our study and existing research is the difference between autonomous and teleoperated robots. Song et al. [65] discussed that the effect of voice manipulation (i.e., an increase in pitch) on the perceived impression of the robot could differ depending on whether the robot is autonomous or teleoperated, contrasting their results, where no main effects of voice manipulation were found, with previous studies demonstrating that an autonomous robot speaking with a higher pitch could be considered more attractive and achieve a higher quality of interaction [47,48]. This is presumably because the user's awareness of teleoperation influences the perception of the robot [5,65].

Effect of Gender.
Other findings include the effect of the participants' gender. We expected that there could be gender differences in the effect of AAF, because the average amount of applied VT should differ when the target voice is the same. Nonetheless, somewhat surprisingly, no interaction effect with the gender factor was observed in any of the analyses. In addition, according to the questionnaire items related to voice (Figure 9), participants did not perceive the AAF as being or strongly resembling their own voice (i.e., the medians are all 2 to 3), regardless of gender. Although there were no statistical differences in the representative values between male and female participants for any of the indicators, it is worth noting that for OwnVoice, the distributions appear to differ between them; we assume this is due to the gender differences in the magnitude of VT mentioned above. Nevertheless, agency over the AAF was moderately perceived by both genders. Interestingly, whether one feels the AAF as one's "own voice" and whether one feels "agency" over it seem to be independent of each other. The result that the subjective agency of the robot did not significantly decrease even when AAF was applied, regardless of gender, further supports that participants could feel a sense of agency even when they did not feel the transformed voice as their own. Thus, whether the feedback voice is perceived as one's own does not seem to have much effect on the effectiveness of AAF.

Design Implications for Teleoperating Interfaces
Although VT has often been used empirically when operators speak through teleoperated social robots (e.g., [5,6,67]), this study is the first to demonstrate its quantitative effectiveness on operators' subjective experiences. One advantage of using VT/AAF for teleoperation is that any voice-communication-based system can easily apply the findings regardless of the specific interface implementation, whereas the implementation of an intelligent assistance system may depend on the specific interface and task.
Based on the discussions in 5.1.1 and 5.1.2, our general recommendation is to apply VT at the very least, to reduce the mental workload of the operator of a social robot in service contexts. Additionally, to make it easier for the operator to speak as the robot and to induce positive emotions such as motivation and enjoyment, we suggest applying AAF in addition to the simple use of VT. Regarding individual differences, based on the discussion in 5.1.3, it can be assumed that the effects of VT and AAF do not depend on whether the magnitude of VT is trivial or dramatic. Therefore, we consider that no special consideration is required depending on the acoustic characteristics of the original voice. Yet, the interview comments in 4.4 indicate that a few participants (2/30) needed more time to become familiar with AAF. Further considering that one participant preferred No-VT, we recommend allowing users to adjust the volume of the AAF themselves, so that they can moderate it until they get used to it, or turn it off if they do not prefer it.
Finally, we emphasize the importance of low-latency systems in the application of AAF. Although we did not rigorously measure the influence of delay, in our preliminary user tests a delay of 50 ms significantly induced speech inhibition and mental stress, although there appear to be considerable individual differences in susceptibility. Although recent technological advances in voice conversion have achieved near-real-time conversion, a minimal delay of 50 ms is considered almost inevitable as a result of the recurrent processing of the algorithms [1]. Therefore, to leverage AAF, voice transformers should be selected based on the trade-off between quality and delay, depending on the application. For instance, in service contexts where speech inhibition can cause serious problems, hardware-based voice transformers, as used in our study, are recommended. In fact, no participant mentioned speech inhibition or latency in the interview, nor was either confirmed insofar as the authors heard and judged the speech data in our study. In contrast, some participants in [1], which used a voice conversion technique with 200 ms latency, mentioned latency. Otherwise, if the teleoperated robot is a famous character whose voice is well known, it would be better to use voice conversion techniques to mimic the voice while mitigating the influence of DAF by using bone-conduction headphones, as suggested by [1].
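The latency floor discussed above follows from simple buffer arithmetic: a block-based algorithm cannot emit its first output sample before an entire input block has arrived, so the block size alone sets a lower bound on delay. The helper below is a hypothetical illustration of this relationship, not a model of any particular voice transformer; block sizes and the sample rate are assumed values.

```python
def algorithmic_latency_ms(block_size, sample_rate, lookahead_blocks=0):
    """Minimum delay (ms) contributed by block-based processing alone:
    one full block must be buffered before processing can start, plus any
    look-ahead blocks the algorithm needs. Driver, converter, and network
    overheads come on top of this bound."""
    return (block_size * (1 + lookahead_blocks)) / sample_rate * 1000.0

# A 1024-sample block at 16 kHz already costs 64 ms before any DSP runs,
# which is why low-latency transformers keep their blocks small.
for block in (64, 256, 1024):
    print(block, round(algorithmic_latency_ms(block, 16000), 1))
```

This is one reason hardware transformers with tiny processing blocks can stay well under the 50 ms threshold, while neural voice conversion with longer analysis windows or recurrent look-ahead tends to sit at or above it.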

Limitation and Future Work
Despite the wealth of insights gained, there are aspects that were outside the scope of this study and would require further research. First, since we conducted this study as a first step in exploring the effect of AAF, we designed the experiment to eliminate potential confounding factors as much as possible. For this reason, we fixed the robot and its specific character settings, including voice features. Thus, further studies are required to determine whether the effects confirmed in this study apply to robots with different appearances, characters, and voice features. Future work also includes exploring how to optimize the voice and how to encourage optimal adjustment, for example, to determine whether the effect of AAF can be further enhanced if users themselves choose or design the target voice. It would also be necessary to consider the effects of using the system in the long term and in different service contexts. Indeed, the impact of a robot's voice is known to vary depending on the context of the interaction [42].

CONCLUSION
We explored whether the use of AAF in the teleoperation of a social robot in service contexts fosters the operator's transformation of self-representation to "become the robot," eases the operator's task subjectively, and improves task performance. The paper contributes novel knowledge on the influence of real-time AAF on the subjective experience and self-representation of the operator, while effects on overall performance were not found.
Finally, we believe that the findings can be further applied to a wide range of avatar-mediated communication, especially in commercial scenarios, from fully virtual VR shopping environments to situations integrated into everyday space close to our experimental scenario, such as displaying the 3DCG avatar of a salesperson on a monitor in real space. Although such service-oriented use of virtual avatars has long been studied [31,44], the auditory impact of avatars has received attention only recently [34,69]. Nonetheless, considering the strong effect of visual feedback compared to auditory feedback in VR shown in [69], and the fact that it is easier to change the appearance of a virtual avatar than for the operator to try to speak to match the appearance, which is not an option with a robot, we believe that the findings of this study are best applied to robots.

Figure 1: Overview of the system configuration comparing three voice conditions. The participants' voices were either not transformed (No-VT), transformed to match the appearance of the robot (VT-only), or transformed and fed back in real time (VT-AAF).

Figure 2: Teleoperated robot system installed in a bakery storefront.

Figure 4: Box plot of IAT scores according to participants' gender and voice conditions. Baseline scores measured at the beginning of the experiment are also shown as reference values. More positive scores reflect stronger associations for "self and a robot child" compared with "self and an adult human."

Figure 6: Box plot of scores for the Robot Embodiment category of the mid-experiment questionnaire according to participants' gender and voice conditions.

Figure 7: Box plot of scores for the Change in Self-Representation category of the mid-experiment questionnaire according to participants' gender and voice conditions.

4.3.3 Summary. In summary, the vocal production analysis (Figure 10), which supplements the evidence to validate RQ1, showed that the VT-AAF condition, as well as the No-VT condition, increased the average pitch of the participants' speech compared

Figure 8: Box plot of scores for the Task Evaluation category of the mid-experiment questionnaire according to participants' gender and voice conditions.

Figure 9: Box plot of scores for the three questionnaire items for Voice Ownership and Agency according to participants' gender.

Figure 10: Bar plot of the increase rate of F0 for each task according to participants' gender and voice conditions.

Figure 11: (Left) Box plot of conversation duration. (Center and Right) Bar plots of the total characters included in speech during the main task and the amount of speech (characters) per second, excluding conversation. All are plotted according to participants' gender and voice conditions.

Table 1: Mid-experiment questionnaire items used in the experiment. Each response was scored on a seven-point Likert scale (−3 = not at all, +3 = very much). *: RobotExtraversion2 was the control question. **: Items corresponding to Voice Ownership and Agency were asked only in the VT-AAF condition.
I could speak as Sota with confidence.
I could speak as Sota satisfactorily.
To speak a lot of variety and quantity was easy.
I could speak a lot of variety and quantity with confidence.
I could speak a lot of variety and quantity satisfactorily.
Voice Ownership and Agency**
OwnVoice: I felt as if the voice I heard when I spoke was mine.
VoiceFeatures: I felt as if the voice I heard when I spoke resembled my (real) voice in terms of tone, pitch, or other acoustical features.
VoiceAgency: I felt as if I caused the voice I heard.