Effortless Polite Telepresence using Intention Recognition

Telepresence technology creates the opportunity for people who were traditionally excluded from the workforce to work remotely. In the service industry, a pool of novice remote workers could teleoperate robots to perform short work stints to fill in the gaps left by the dwindling workforce. A hurdle is that consistently talking appropriately and politely imposes a severe mental burden on such novice operators, and the quality of the service may suffer. In this study, we propose a teleoperation support system that lets novice remote workers talk freely without considering appropriateness and politeness while maintaining the quality of the service. The proposed system exploits intent recognition to transform casual utterances into predefined appropriate and polite utterances. We conducted a within-subject user study where 23 participants played the role of novice remote operators controlling a guardsman robot in charge of monitoring customers’ behaviors. We measured the workload with and without using the proposed support system using NASA task load index questionnaires. The workload was significantly lower (p < .001) when using the proposed support system (M = 46.07, SD = 14.36) than when not using it (M = 62.74, SD = 12.70). The effect size was large (Cohen’s d = 1.23).


INTRODUCTION
In Japan, there is a growing shortage of workers who could physically fulfill some of the jobs in the service industry. However, part of the unemployed population could find time to work remotely and fill in for short periods using telepresence. Working using telepresence [29,34,54,55,63] has the potential to enhance the quality of life and operational efficiency by reducing transportation needs and addressing issues such as fatigue and exhaustion associated with traditional approaches [12,47]. To address this labor shortage, some companies have begun experimenting with teleoperated robots in the service industry. This has led to the emergence of a gig economy where remote workers frequently switch between teleoperating jobs. The service industry requires that workers present a considerable level of hospitality [18,31]. For example, workers in the tourism industry [14] or receptionists in contexts such as museums [10] need to use phrases that are appropriately polite for the context, which necessitates replacing casual expressions with more formal manners of language, such as "Sir, could you please ... ", "May I kindly remind you that ... ", "We would greatly appreciate it if ... ", or "Would you mind please considering ... " Maintaining politeness and appropriate speech while teleoperating can be mentally taxing for remote workers [2,5], particularly when faced with long and frustrating work sessions. Because of tiredness, repetition, excitement, impatience, anger, or rage, remote workers may be inclined to use informal language rather than maintain politeness, for example, using phrases such as "Hey, you ... ", "How many times do I have to tell you ... ", "Damn it, don't you know ... " or "For goodness' sake ... "
Considering this gig economy scenario, we created a teleoperation support system that "corrects" the operator. For example, in the guard robot task we selected to illustrate that concept (Figure 1), it is difficult for a novice to come up with the appropriate way of expressing their intents (e.g., how to admonish politely). As a result, the remote worker has a high workload and the service may not be well provided. The proposed system lets less-experienced workers perform the admonishing task with little effort while still providing a good quality of service.
The proposed system is based on intent recognition [3,19,28,35,42,45,46,48,62,67], as we believe this is an inclusive technology that could scale reasonably well. Novice operators talk freely without worrying about appropriateness or politeness. Then, intent recognition modules estimate the expressed intent, and an automatic procedure replaces the operator's utterance with a predefined utterance that has appropriate wording and is polite. Finally, the teleoperated robot delivers that utterance to the user facing the robot.

Robotics and Hospitality in the Service Industry
Client satisfaction has been widely referred to as a fundamental indicator of efficiency in the service industry and is dependent on the level of hospitality perceived [1,4,36,37]. The human-robot interaction model utilized in the foregoing context may affect the relationship between human service providers and their customers and make a significant impact on the level of hospitality. Making human-robot interaction smoother and more hospitable has become feasible with the emergence of the fourth industrial revolution, which paves the way to deploying this technology into applications such as tourism and has even given rise to the concept of hospitality robotics [57]. Hospitality robotics bears considerable pertinence to the scenario under study in this article, which involves a guardsman robot.

Although the importance of hospitality in service robotics has been well appreciated, the existing scientific literature has not paid particular attention to ensuring or improving hospitality while considering the requirements arising from telepresence. In other words, most studies have focused on what service robots could offer for better hospitality, but not on how the operator's inputs could be modified for the same purpose. More precisely, the trending perspective concerns enhancing service hospitality in a future society where artificial intelligence and automation technologies will take over traditionally assumed human responsibilities [8,9,24,25,33,39,41,44,52,53,60,61,64]. However, the impact of the operator's behavior on the level of hospitality perceived, as well as possible ways of revising it toward more hospitable conduct, seems to have been overlooked. One of the main motivations behind the model proposed in this article is alleviating this drawback. We focus on the appropriateness and politeness of the operator's utterances, as these are a prerequisite for a hospitable service.

Telepresence Robots
Owing to the ability of human partners to anthropomorphize robots [66], using robots for telepresence has found applications not only in shopping malls, as considered in this article, but also in a variety of other contexts. Relevant examples include a receptionist android [11] taking advantage of inverse reinforcement learning [7,15,22] and behavioral cloning [43] for inviting visitors to use hand sanitizers, educational telepresence robots [51,65] facilitating joint work and fostering the feeling of social presence, and a childcare robot [50] utilized for personality estimation.
Despite the abundance of studies on telepresence robots, including those employed in the service industry, the pivotal point in most of them has been to explore ways to engage the robot as widely and effectively as possible, while hospitality has been neglected as an essential criterion at the design and experimental evaluation stages. Remedying this deficiency is among the crucial incentives of the present study, where the telepresence model is set up and practically assessed while taking hospitality into account as the main indicator of efficiency.

Teleoperation Support Systems
In the literature, teleoperation support systems have been developed while adopting diverse types of modalities. For example, body postures have been assessed, optimized, and positioned for ensuring comfort and safety [13]. The system proposed in [11] required the operator to determine the appropriate timing for the robot's actions. In [65], vision recognition was the main responsibility of the system, where a facial avatar was used to produce movements. Similarly, in [50], the teleoperation support system was designed to display images.
In some of the relevant studies, the vocal modality has been incorporated as well. For example, in [51], the communication between the two partners was based on video recordings of the robot's 360-degree camera and LEDs, where audio streams were also provided using its microphone and speaker. The telepresence system introduced in [26] used live speech alongside several visual modalities to create conversational motions while synchronizing speech and motion among users. In [6], audio processing in noisy and dynamic environments was studied based on wearable arrays to achieve dynamic localization and direction-of-arrival estimation. The study reported in [49] concentrated on evaluating user speech to find out how remote users are being heard in noisy classrooms, thereby providing feedback.
Despite the crucial significance of vocal communication in human-robot interaction, manipulating words to improve the quality of services provided by teleoperation support systems has not been well studied in the literature. Most of the existing approaches are aimed at the visual modality. Even in those studying the vocal modality, the user's voice is transmitted directly to the other side. At best, they have analyzed or processed the speech, targeting, e.g., clarity or localization. To the authors' knowledge, none of the existing studies has offered a solution to automatically modulate the operator's words to provide a more hospitable service. Overcoming this shortcoming constitutes the central motivation of the present study. In particular, we want to guarantee that novice operators are able to talk appropriately and politely in order to consistently provide a hospitable service.

Fig. 2. A schematic illustration of the human-robot interaction scenario considered as a case study for evaluating the performance of the proposed system.

DESIGN
In this article, we investigate and try to improve the performance of teleoperated robots in terms of politeness (a prerequisite for providing a hospitable service). The goal is to ensure a consistently friendly and pleasant interaction despite typical adverse factors, including operator frustration, lack of suitable words, anger, fatigue, naivety, or inexperience.
To this end, we propose to exempt the operator from the task of deciding what wording to use to deliver their intention. More precisely, upon tracking the scene, the operator determines whether it is required to intervene and transmit a message, and tries to do so by speaking into a microphone, but is not expected to figure out in what exact form the intended notion will be framed. Instead, this task is performed by an intent recognition pipeline, which classifies statements and then picks the corresponding sentence from a carefully crafted dataset of prescribed messages covering the intentions supported by the proposed system.
The design considerations that we propose and implement within this study can be summarized as follows:
• The operator and the telepresence system complement and cooperate with each other;
• The operator observes the behaviors of the visitors, and talks into a microphone to thank, assist, instruct, or interdict them;
• The operator speaks freely, without being obliged to choose their phrases carefully;
• The intent recognition framework finds out what purpose is most probably meant to be communicated by the operator;
• For appropriate and polite language, the relevant intent label is used to pick up the corresponding sentence from a dataset of utterances.

THE SCENARIO
The human-robot interaction scenario considered for evaluating and demonstrating the effectiveness of the proposed model consists of a guardsman robot communicating various intentions from the remote operator to the visitors of a shopping mall. The scenario is schematically illustrated in Figure 2. The guardsman robot invigilates the visitors and interacts with them depending on the operator's perception of the circumstances. For example, when a visitor arrives, the robot greets them; when they seem to be confused or lost, the robot offers help; and when they misbehave, the robot tries to instruct them to reconsider their actions. The scenario was selected based on our past research on guardsman robots. We started by defining the tasks of the guard robot according to the knowledge we gathered during past research. Then, the task choice was refined by rounds of preliminary experiments, during which we also identified the difficulties for the novice operators. Finally, the scenario was designed to present the remote operators with different types of behaviors that are likely to occur in a real environment.

THE PROPOSED SUPPORT SYSTEM

5.1 Overview of the Support System
The proposed teleoperation support system is schematically summarized in Figure 3. The operator uses a control interface, including a joystick and a microphone, to move the robot and speak, respectively. Once a voice recording is ready, the transcript is prepared using speech processing, and then passed into the intent recognition pipeline to recognize the underlying intent category. Afterward, the corresponding appropriate and polite utterance is picked up from the prescribed dataset and pronounced by the robot. Using this strategy, the user can react to the situation and speak freely, while the remaining components of the proposed telepresence model handle the task of communicating the intended message using a ready-made appropriate and polite sentence.
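The flow described above can be sketched in a few lines of Python. This is a minimal illustration under our own naming, not the actual implementation: `recognize_intent`, `handle_operator_speech`, and the sample sentences in `POLITE_UTTERANCES` are hypothetical stand-ins for the intent classifier and the prescribed-utterance dataset.

```python
# Minimal sketch of the support-system pipeline. Function names and the
# sample sentences are illustrative stand-ins, not the actual implementation.

POLITE_UTTERANCES = {  # hypothetical excerpt of the prescribed dataset
    "greet": "Welcome. Please let me know if I can help you with anything.",
    "admonish_littering": "Excuse me, could you kindly use the trash bin? Thank you.",
}

def recognize_intent(transcript):
    """Stand-in for the BERT intent classifier; returns a label or None."""
    if "trash" in transcript or "pick" in transcript:
        return "admonish_littering"
    return None

def handle_operator_speech(transcript, recognize=recognize_intent):
    """Replace a free-form operator utterance with a predefined polite one."""
    intent = recognize(transcript)
    if intent is None:
        return None  # out-of-scope: the robot stays silent
    return POLITE_UTTERANCES[intent]
```

The key design point is that the operator's raw words never reach the visitor; only the looked-up polite sentence does.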

5.2 The Control Interface
The remote computer used by the operator, as shown in Figure 4, is equipped with a joystick and a headset. The joystick may be used to navigate the robot as desired. Vocal settings, e.g., the microphone volume, may be easily adjusted using a dedicated sound panel, without distracting the operator from their actions and commands.
In terms of software, ROS is employed with various RViz plugins specifically developed for this study, which let the operator conveniently observe and manipulate the relevant functionalities.The visual and vocal data recorded by the robot, as well as the estimated location and orientation, are provided to the operator in real-time.

5.3 Automatic Speech Recognition
A transcript of the operator's voice is produced using the Google ASR toolkit and shown within the interface described in Section 5.2. To isolate the operator's voice, i.e., to prevent undesired sound and noise from being sent to and processed by Google ASR, a certain button on the joystick needs to be pressed and held for the voice to go through. Therefore, another plugin made for the interface is responsible for displaying the voice-recording status to the operator and instructing them to press the button to start recording.
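Such push-to-talk gating can be sketched as below; the class and method names are our own illustration, not part of the actual interface.

```python
class PushToTalkGate:
    """Buffer audio only while the joystick button is held down."""

    def __init__(self):
        self.recording = False
        self._buffer = []

    def button_pressed(self):
        self.recording = True   # status is shown to the operator by a plugin
        self._buffer = []

    def feed(self, chunk):
        if self.recording:      # audio outside the button hold is dropped
            self._buffer.append(chunk)

    def button_released(self):
        self.recording = False
        return b"".join(self._buffer)  # this payload would go to the ASR
```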

5.4 Intent Recognition
5.4.1 The Architecture. Intent recognition is performed by a Bidirectional Encoder Representations from Transformers (BERT) architecture [32], which learns bidirectional representations of text. We use a pre-trained BERT model with a linear layer on top of the pooled output to perform sequence classification. We fine-tuned the model trained by the Tohoku University Natural Language Processing Group on the Japanese version of Wikipedia [56]. This BERT model has 12 layers, with 768-dimensional hidden layers, and 12 attention heads.
For the case study considered in this article, we conducted preliminary experiments to select a set of intentions that generally cover the needs of the operator. Through several rounds of preparatory experiments, it was revealed that a set of nine intent categories would suffice for handling a diverse range of situations typically encountered in the course of teleoperating the guardsman robot. These categories are listed in Table 1.
When provided with an utterance, our intent recognition model outputs a prediction vector with nine elements, i.e., one for each intent category, where the element associated with the largest value indicates the category predicted by the classifier.If the model is well trained, the magnitudes of the values in the prediction vector differ greatly between utterances that represent one of the intent categories and those that are not aimed to imply any of the intentions covered by the system.Therefore, we use an empirically adjusted threshold to discard predictions determined with unacceptably low probabilities.
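The rejection step can be sketched as follows. The softmax normalization and the threshold value of 0.7 are assumptions for illustration (the paper's threshold was adjusted empirically), and the label set is a made-up subset of the nine categories.

```python
import math

INTENTS = ["greet", "offer_help", "admonish_littering"]  # illustrative subset

def classify_with_rejection(logits, labels=INTENTS, threshold=0.7):
    """Return the predicted label, or None when the winning probability
    is below the rejection threshold (utterance treated as out-of-scope)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]      # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None
```

A confident prediction is returned as-is, while a near-uniform prediction vector falls below the threshold and is rejected.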

5.4.2 Data Collection.
Gathering a large dataset suitable for intent recognition is challenging and costly. However, Huggins et al. showed that fine-tuning a BERT language model to perform intent recognition requires only a small number of training utterances per intent category [23]. We use at least 30 training utterances per intent category, a number decided based on extensive rounds of trial-and-error, i.e., collecting data, training, and then practically examining the performance. A deterioration of performance when applying a recognition model to a real-world scenario, as compared to what is achieved during training, is a natural and well-known phenomenon. Therefore, we collected the data keeping in mind the importance of the data representing the actual prospective inputs as closely as possible.
The utterances were collected from nine participants who were staff and students of our laboratory. Some of them did not contribute sentences for all intents since not all intent categories were considered in the first phase of data collection. The participants were told to imagine themselves as a guardsman working in a mall. Then, for every intent category, each participant was asked to propose five appropriate utterances to express the intention and five additional inappropriate ones. An appropriate utterance was defined as an utterance with polite wording that guardsmen are expected to use in Japan. We collected 40 utterances for each of the nine intentions. After training the first model, we conducted a preliminary experiment with internal participants who were not aware of our research hypotheses. During the experiment, we collected an additional 125 utterances. As the participants proposed their utterances independently, some sentences were duplicated; after removing the duplicates, the dataset was reduced to 452 unique utterances. The distribution of these utterances is provided in Table 1.

5.4.3 Training and Performance.
The data collected from the participants was used to measure the performance. As previously mentioned, not all participants provided utterances for all intentions, and consequently, we were unable to use a leave-one-out approach. Instead, we created 11 pairs of 270 training and 90 test utterances, where all the utterances of a given participant are either in the training set or in the test set. More precisely, the 90 test utterances are from participants whose data is not used for training the model. For each pair, we have 30 training and 10 test utterances per intent. We used 20% of the training set for validation, and made 100 training and test runs for each of the 11 sets. For each set, we fine-tune the Japanese BERT model by minimizing the cross-entropy loss on the training set while monitoring the loss on the validation set to perform early stopping. The maximum number of epochs was set at 30, as the early stop using validation data usually occurred around the 15th epoch. We use the Adam optimizer [30] with a learning rate of 2e-5 and an epsilon of 1e-8. The average accuracy was 0.886 with a standard deviation of 0.028. Thus, we can expect the trained model to generalize well to new users.
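The validation-based early stop can be sketched as below. The patience value is an assumption for illustration; the paper only states that stopping usually occurred around the 15th epoch within a 30-epoch budget.

```python
class EarlyStopper:
    """Stop fine-tuning once validation loss fails to improve for
    `patience` consecutive epochs (patience value is assumed)."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss       # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1       # no improvement this epoch
        return self.bad_epochs >= self.patience
```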
The final architecture used as part of the proposed model was configured using all the 452 utterances that we collected for training, and exhibited an accuracy of approximately 0.90 during the preliminary test phase.In addition, for this model, the threshold to filter utterances that do not match any of the intent categories was adjusted by trial-and-error throughout the preliminary tests.The system could classify utterances as matching an intent category or not with an accuracy of 0.95.

5.5 Polite Speech Generation
After intent recognition is performed, the predicted intent category is used to select an alternative to the operator's utterance from a dataset of appropriate and polite sentences. This dataset contains three choices for each of the intent categories. These were selected from among utterances submitted during the data collection procedure and modified according to the suggestions of the participants in the preliminary experiment. As soon as an intent is recognized, the intent category, along with the corresponding appropriate and polite expression, is shown to the operator through the interface introduced in Section 5.2. The appropriate and polite utterance is then passed to the speech synthesis module and played by the robot's speaker.
For utterances that do not match any of the intent categories, no utterance is pronounced by the robot, and the operator receives a warning saying that their utterance did not match any category.
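The selection among the three stored variants, with the silent fallback for unknown intents, can be sketched as follows. The sentences are made up, and the random choice among variants is an assumption: the paper does not state how one of the three choices is picked.

```python
import random

POLITE_VARIANTS = {  # hypothetical excerpt: three variants per intent
    "admonish_littering": [
        "Excuse me, would you mind picking that up? Thank you.",
        "May I kindly ask you to use the trash bin over there?",
        "We would appreciate it if you could help keep this area clean.",
    ],
}

def select_polite_utterance(intent, rng=random):
    """Return one stored polite sentence, or None for out-of-scope intents."""
    if intent not in POLITE_VARIANTS:
        return None  # robot stays silent; the operator receives a warning
    return rng.choice(POLITE_VARIANTS[intent])
```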

5.6 The Robot
The guardsman robot is a Robovie2 humanoid robot [27]. Owing to its robust design, the robot can produce several types of motions and gestures, which helps to leave a human-like impression on the visitors, as well as to communicate with them using visual and vocal channels. Moreover, the robot is equipped with sensors and software enabling it to report and update a map representing the environment surrounding it, along with its location and orientation, which are displayed to the operator in real-time, thereby facilitating the task of moving the robot toward the desired pose quickly and relatively accurately.

5.7 Other Modules
In addition to the modules in charge of processing the operator's voice, there are modules processing the operator's motion commands (see the bottom track in Figure 3). The operator inputs motion commands using a joystick. These commands are processed by the controller, which outputs the desired linear and angular velocities. Before sending the velocities to the mobile base, they are modified by the collision avoidance module so that the robot cannot get too close to any obstacles. The collision avoidance module uses data from two onboard laser range finders, i.e., front and back, to detect obstacles around the robot. Consequently, even a novice operator can navigate the robot through the environment safely.
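One simple way such a module can modify the commanded velocities is to scale the linear velocity by the distance to the nearest laser reading. The sketch below is our own illustration with assumed distance thresholds, not the actual controller.

```python
def limit_velocity(v_lin, v_ang, laser_ranges, safety_dist=0.6, stop_dist=0.3):
    """Scale the linear velocity down as the nearest obstacle gets closer.

    laser_ranges: readings (in meters) from the front and back range
    finders, merged. The distance thresholds are illustrative assumptions.
    """
    nearest = min(laser_ranges)
    if nearest <= stop_dist:
        return 0.0, v_ang                      # too close: stop translating
    if nearest < safety_dist:
        scale = (nearest - stop_dist) / (safety_dist - stop_dist)
        return v_lin * scale, v_ang            # slow down proportionally
    return v_lin, v_ang                        # free space: pass through
```

Angular velocity is left untouched so the operator can always rotate the robot away from an obstacle.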

EXPERIMENTS

6.1 The Experimental Scenario
The experiments were carried out within our institution.They took place simultaneously in two locations, namely, the "control room" and the "corridor." The former is the room where the control interface was set up and the operator was left alone to control the robot that was in the latter.
The scenario presented the robot as a guardsman robot in charge of invigilating an area, i.e., the corridor, in a shopping center.The role of the operator was to control that robot and behave as the guardsmen do in shopping malls in Japan.This includes potentially greeting and helping visitors but also policing the place to stop visitors from engaging in undesirable behaviors.
We recruited external persons, i.e., "actors", to play the role of visitors. They were not told about our research hypotheses and worked in pairs. They acted on the basis of prescribed scripts, which listed the actions with specific timestamps and instructed them as to how to respond to the guardsman robot. Their reactions included cooperating, ignoring, or responding aggressively. For example, one action was "Enter the corridor, pretend to drink from the cup and throw the cup on the floor." The follow-up instructions were "If the guardsman robot talks to you, ignore it; If it talks again, respond aggressively and then leave the corridor." Each script had four sets of actions and instructions testing different combinations of reactions. Other actions were "Enter the corridor and pick up your phone as if it rang, and keep talking loud", "Enter the corridor and keep looking around, as if you are searching for something", "Enter the corridor and pretend to smoke", "Enter the corridor and keep checking the robot", "Enter the corridor and keep talking on your phone while walking", and "Enter the corridor and look at the map while drinking from the cup and throw the cup on the floor." Using these scripts, the actors could consistently reproduce the same scenario.

6.2 Hypotheses and Predictions
When using a conventional teleoperation system, the voice of the operator is recorded by a microphone and played through the robot's speaker. Consequently, the operator of a guardsman robot has to talk in a way that is fit for that role. In particular, the operator is expected to use the correct wording and be polite in all circumstances. The proposed teleoperation support system is designed such that the robot expresses what the operator intends politely and with appropriate wording. Consequently, the operator no longer has to choose the appropriate language. We expect this modification to impact the workload of the operator. Therefore, we make the following prediction:
Prediction 1 ("Workload"). An operator using the proposed teleoperation support system will have a lower workload than an operator using a conventional teleoperation system.
The speech produced by the operator of a guardsman robot may have various shortcomings, e.g., a lack of hospitality or tolerance for non-cooperative actions, reactions, or words; incomplete, hesitant, or unclear instructions; and an unfriendly or aggressive tone. With a conventional teleoperation system, any of these may be tangibly felt by a person who interacts with the robot. However, the proposed teleoperation support system filters out such deficiencies by design. Therefore, with the proposed model and teleoperation support system, we expect the robot to speak consistently, which enables us to make the following prediction as well:
Prediction 2 ("Politeness"). A robot controlled with the proposed teleoperation support system will speak more politely than a robot controlled with a conventional teleoperation system.

6.3.1 Participants.
We recruited 23 adult participants, i.e., 15 women and eight men.The average age was 21.9 and the standard deviation was 7.3 years.They were paid for their participation.

6.3.2 Conditions. We compared the following two conditions:
• Baseline: A robot controlled with a conventional teleoperation system;
• Proposed: A robot controlled with the proposed teleoperation system.
The study had a within-participant design, and the order was counterbalanced.
In both conditions, the operator observes the environment and controls the robot using the same interface and must use the same button for talking.Schematic overviews of the proposed and baseline systems are shown in Figures 3 and 5, respectively.

6.3.3 The Procedure. For each participant, the experiment took around 50 minutes. First, the context and task were presented in a short briefing. Then the participant experienced a 10-minute session with each condition, separated by a 5-minute intermission. For each session, the participant sat in the control room (see Figure 4), while the teleoperated guardsman robot and two actors playing the visitors were in the corridor (see Figure 6). Before the start, a reminder about the task was given, the teleoperation system used in that session was presented, and the participant was asked to move the robot and try pronouncing a few utterances to familiarize themselves with the system. The participant was then left alone, and the session was started. The participant controlled the robot using the teleoperation interface and interacted with the actors who played visitors. For each session, the participant had the opportunity to interact with eight visitors, who were played by the same two actors. The participant had the freedom to decide whether the guardsman robot should interact with a visitor or not. Once the actors went through their interaction scripts, the session ended.
After each teleoperation session, a questionnaire was administered, and a semi-structured interview was conducted after the second questionnaire was filled out, i.e., after the second session. During the semi-structured interviews, participants first talked freely and then answered three questions (if they had not already talked spontaneously about them): a question about the delay introduced by the system, a question about how they felt affected when a visitor refused to comply and answered back, and a question about their opinion on using the robot's voice versus their own voice. A debriefing was done at the end of each round of the experiment. The experimental protocol was approved by our institutional review board.

Workload
The workload was measured with a Japanese translation of the NASA TLX questionnaire [20,21]. We use the weighted NASA TLX workload. A score ranging from 0 to 100, i.e., from the lowest to the highest workload, respectively, was obtained by weighting according to the self-reported pairwise importance of the following six scales: mental demand, physical demand, temporal demand, performance, effort, and frustration.
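As a concrete illustration of the weighted scoring: each of the six scales receives a weight equal to the number of times it was preferred in the 15 pairwise comparisons, and the final score is the weighted average of the raw ratings. The ratings and weights below are made-up values, not data from the study.

```python
def weighted_tlx(ratings, weights):
    """Weighted NASA-TLX score on a 0-100 scale.

    ratings: scale -> raw rating in [0, 100]
    weights: scale -> number of times the scale was chosen among the 15
             pairwise comparisons of the six scales (weights sum to 15)
    """
    assert set(ratings) == set(weights) and sum(weights.values()) == 15
    return sum(ratings[s] * weights[s] for s in ratings) / 15

# Made-up example for one participant:
ratings = {"mental": 80, "physical": 20, "temporal": 60,
           "performance": 50, "effort": 70, "frustration": 40}
weights = {"mental": 5, "physical": 0, "temporal": 3,
           "performance": 2, "effort": 4, "frustration": 1}
score = weighted_tlx(ratings, weights)
```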

Politeness
We asked three external coders who were unaware of our study hypotheses to rate the politeness of the utterances spoken by the guardsman robot. To eliminate the influence of the difference in the robot voice between the conditions using the baseline and the proposed model, we first produced transcripts of the interactions that occurred during each session, and extracted the robot's utterances. The three coders rated the politeness on a five-point Likert scale, where the order of presentation of the sets was randomized. The resulting Pearson correlation coefficients were r12 = .46 between coders 1 and 2, r13 = .75 between coders 1 and 3, and r23 = .46 between coders 2 and 3, and the politeness rating of each session was obtained as the average of the three coders' ratings.
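The inter-coder agreement and the per-session rating can both be computed with a few lines of Python; the rating lists used in the usage check are made-up values, not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_rating(*coder_ratings):
    """Per-session politeness: the average of the coders' ratings."""
    return [sum(r) / len(r) for r in zip(*coder_ratings)]
```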

Observation.
The default starting position led to a desirable view of the corridor, and therefore, the operators focused on recognizing the behaviors and handling the verbal aspect of teleoperation. Consequently, in both conditions, most of the operators simply rotated the robot to face the target visitors but did not move closer.
Moreover, in both conditions, the operators used polite utterances. This was expected when using the baseline model but surprising when using the proposed model, as all operators tried simple impolite utterances in the familiarization phase. Two operators started by using a few impolite utterances when using the proposed system but soon switched to talking politely even though the intent recognition framework performed well.
Most operators adapted their way of talking to the proposed model if the intent recognition pipeline made a mistake.Only a few operators repeatedly used the same non-recognized utterances, resulting in the robot's silence for a certain time period.
The operators were more hesitant to admonish "walking while using a phone" compared to "smoking" or "littering."

Analysis of Performance.
A confusion matrix showing the classification results for the utterances used by all the participants during the experiment is depicted in Figure 7. Note that this matrix has 10 rows and 10 columns, as we consider an additional class for the utterances that are out of the set of intentions that our system covers. During the experiment, the participants expressed a total of 381 utterances while using the proposed model, i.e., each pronounced an average of 18.14 utterances (SD = 3.41). The accuracy for each participant was between 0.83 and 1.0, with an average of 0.96 (SD = 0.045).
The participants spoke 44 utterances that were out of the set recognized by the proposed model, but these were successfully classified as unknown at an accuracy rate of 0.9 and, in most cases, did not make the robot talk. For example, the operators often asked "not to talk on the phone", which bears an unknown intention, rather than "to be quiet", which resembles a known intent category. However, four instances were wrongly attributed to known intentions, and the robot undesirably pronounced an utterance. Nine valid requests were classified as unknown and ignored by the system, but no valid request was misclassified as another valid request.

Verification of Prediction 1.
A paired t-test was conducted to compare the workload when using the baseline system and when using the proposed system. A significant difference was observed (p < .001). The participants felt that the workload was lower when the proposed model was used (M = 46.07, SD = 14.36) than when the baseline system was run (M = 62.74, SD = 12.70), see Figure 8. The effect size was large (Cohen's d = 1.23).
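For reference, the analysis above can be sketched as follows. The per-participant scores here are hypothetical values sampled to match the reported means and SDs only approximately (the study's raw data are not reproduced), and the pooled-SD form of Cohen's d is one common reporting choice for paired designs.

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant NASA-TLX scores (n = 23), sampled to
# approximate the reported condition means and SDs.
rng = np.random.default_rng(0)
baseline = rng.normal(62.74, 12.70, size=23)  # workload without support
proposed = rng.normal(46.07, 14.36, size=23)  # workload with support

# Within-subject comparison: paired t-test over the same participants.
t_stat, p_value = stats.ttest_rel(baseline, proposed)

# Cohen's d using the pooled standard deviation of the two conditions.
pooled_sd = np.sqrt((baseline.std(ddof=1) ** 2 + proposed.std(ddof=1) ** 2) / 2)
d = (baseline.mean() - proposed.mean()) / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {d:.2f}")
```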
This result supports our prediction that an operator using the proposed teleoperation support system will have a reduced workload compared to an operator using the baseline system.

Verification of Prediction 2.
A Shapiro-Wilk test for normality showed that the distribution of the politeness ratings departed significantly from normality (W = 0.907, p = .036), so we conducted a Wilcoxon signed-rank test. It showed no significant difference (Z = −0.579, p = .57) between the baseline system (M = 3.75, SD = 0.79) and the proposed system (M = 3.83, SD = 0.70), see Figure 9.
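The test-selection logic above can be sketched as follows. The paired ratings are hypothetical illustrative values (not the study's data), and the normality check is shown here on the paired differences, which is the quantity the paired t-test assumption concerns.

```python
import numpy as np
from scipy import stats

# Hypothetical paired politeness ratings on a 1-5 scale (illustrative only).
baseline = np.array([4, 3, 4, 5, 3, 4, 2, 4, 3, 5, 4, 3,
                     4, 5, 3, 4, 4, 3, 5, 4, 3, 4, 4])
proposed = np.array([4, 4, 3, 5, 4, 4, 3, 4, 3, 5, 4, 4,
                     4, 4, 3, 5, 4, 3, 5, 4, 4, 4, 3])

# Normality check on the paired differences; a significant result argues
# for a non-parametric alternative to the paired t-test.
w, p_norm = stats.shapiro(baseline - proposed)

# Wilcoxon signed-rank test for paired ordinal data
# (zero differences are dropped by default).
res = stats.wilcoxon(baseline, proposed)
print(f"W = {w:.3f} (p = {p_norm:.3f}); Wilcoxon p = {res.pvalue:.2f}")
```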

Interview Results.
We will first report the opinions about the proposed system that several participants expressed spontaneously. Most participants (17 out of 23) said that the proposed system facilitated their task and made it mentally less demanding. Seven participants described the benefit of automatically generating polite utterances as not having to pay attention to the wording. However, nine participants who expressed intentions outside the scope of the system complained that the robot did not talk and found that this introduced lengthy delays in the interaction.
Additionally, several participants made interesting comments. Two felt that the delay made it hard to react on the spot when something happened. One participant commented that the delay introduced by ASR and intent recognition may be no greater than the time it takes to produce an appropriate answer with the baseline system. Two participants would have liked a list of simple utterances to control the robot. Another participant reported deliberately using utterances that should be well recognized by the system after a recognition failure.
Concerning the baseline system, some participants spontaneously expressed a few opinions. Out of the 23 individuals, 13 mentioned that it was hard to find the appropriate words, and six expressed the feeling of being under pressure to be polite while using the baseline system. However, eight participants reported that the baseline model was better than the proposed system, as they could express more nuance while talking to the visitors.
Several interesting opinions about the baseline system were expressed.One participant felt that her motivation was higher when she was responsible for generating the utterances on her own.One participant felt that using her voice had more impact on the visitor when admonishing.
We intentionally created a scenario where operators were confronted with non-cooperative visitors so that we could investigate its effect under the two conditions. If the participants did not mention it spontaneously, we asked them what they thought about the aggressive responses given by some of the visitors. About half of the participants (11 out of 23) found that the proposed system shielded them from the visitors' aggressiveness to a certain extent. Ten participants found that aggression from the visitors was strongly felt while using the baseline system. However, four participants mentioned that, in both conditions, simply using a teleoperated robot created a feeling of distance that made such tricky situations easier to handle than an in-person interaction.

Workload and Politeness
The proposed system significantly reduced the workload perceived by participants. The interview results suggest two probable causes. First, the proposed system eliminates the pressure to talk politely and to think of appropriate utterances on the spot for each situation. Second, the proposed model better protects the operator from the negative behaviors that visitors direct toward the robot. This may be because the system decouples the operator from the robot, enabling the feeling that the visitor's negative reaction is targeted at the robot, but not necessarily at the operator.
Many Japanese, especially of the younger generation, find that using very polite Japanese requires a conscious effort, and they express low confidence in their ability to use it [40]. Interestingly, the participants talked politely and still felt that the workload was reduced. With the proposed model, none of them talked too casually for an entire session; only two participants briefly "played" with the system to test a few less appropriate utterances. Considering the cultural aspect, it is worth acknowledging the importance of politeness and polite service in Japan [16,59] and how it could have influenced the participants in maintaining politeness, regardless of the system being used. During the interviews, several participants expressed the idea that the proposed system was like a "backup" that would correct their sentences, and they were relieved to know that any mistake would be corrected. This type of "backup" system may thus compensate for their lack of confidence in using very polite Japanese. We observed that participants seemed to talk more spontaneously when using the proposed system and to spend more time finding their words when using the baseline system. This could partly explain the change in perceived workload; however, to get a definitive answer, this hypothesis has to be tested in a follow-up study.
The scenario was designed to incite the participants to talk less appropriately.We expected that, being the target of aggressive visitors, the participants would relax their control.However, the participants did not lower the level of their speech.It is likely that after getting used to the proposed system and understanding how it works, operators would start expressing their intentions more casually.However, further research is needed for a solid conclusion.
Similarly, we expect that operators will not maintain an appropriate level of politeness when working for an extended period. In that case, the quality of the utterances is likely to decrease when using the baseline model, and visitors should perceive the degradation of the service. However, more research is needed to support this hypothesis.

Limitations
The generalizability of the findings in this study may be limited by several factors.
First, the study was conducted in a specific cultural context (Japan), where the emphasis on hospitality and politeness may differ from other cultures [16,59]. In particular, in Japan, even workers who are not directly in the service industry take great care to provide a good experience to the end customers and take pride in their work. This may seem to undermine the value our study brings to the service industry. Nevertheless, when it comes to luxurious services, shops, or hotels, or considering other motivations, such as increasing the chances of receiving tips even in casual settings, the importance of hospitality may still be appreciated in other cultures. Therefore, the effectiveness of the proposed system in other cultural contexts needs to be further investigated.
Concerning the level of politeness of the participants who operated the robot, we can think of two possible carryover effects. First, participants who experienced the baseline condition first may have carried over the habit of remaining polite to the proposed system. Second, the research staff interacted politely with the participants (operators), which may have induced a carryover effect as well.
As another limitation, the practicability of the proposed system is restricted by the performance of the intent recognition module. It is costly to increase the number of intent categories while keeping the recognition performance high. In our experiment, participants were confronted with the system's limited number of recognizable intents, as 11% of the utterances expressed by the operators were classified as unknown by the system. However, even with a small number of intent categories, the proposed system automates significant parts of the process of maintaining appropriateness and politeness, which is beneficial in various scenarios and can be seen as a major step toward reducing the mental burden imposed on remote telepresence workers. Furthermore, the performance of the intent recognition module may vary depending on the language used, and it is important to evaluate the system's effectiveness in languages other than the one used in this study. While the results of this study provide valuable insights into the use of intention recognition, further research is necessary to generalize these findings to a broader range of contexts and languages.
In addition, the proposed model has been tested only in a specific scenario, for a single task, and over a short period. This is not sufficient to claim that it would also be useful under other, arbitrary conditions. There are other limitations too; for instance, we conducted our experiment only with a specific robot and a specific voice. Further studies are needed to confirm the generalizability of our findings.
We did not directly involve professional guards in the design or testing of the system. In this study, our purpose is to support operators who are not professional guards so that they can act as guards. We designed our system with prior knowledge about guards from a previous study we conducted [38]. In addition, guards are prevalent in Japan; most people have interacted with guards in several contexts and have a good image of what is expected from a guard in terms of politeness. We believe this shared knowledge about what is expected from guards in Japan is sufficient for designing the system in this study. However, it would be interesting to study how the proposed support system is used or perceived by professional guards in a follow-up study.

Ethics and Broader Impact
In terms of ethical considerations, the proposed teleoperation support system has the potential to improve the well-being and mental health of remote operators by reducing their workload and cognitive burden.However, as with any technology, there are possible negative implications as well.
The proposed teleoperation support system, which transforms casual utterances into predefined polite and appropriate utterances, raises some concerns regarding employee autonomy and employer control. While the system may alleviate the mental burden on novice operators, it also takes away their responsibility to choose their own words and express themselves freely. The interview results show that some participants preferred the baseline system because they could express more nuance while talking with the visitors. The proposed system limits the operators' range of interactions and their ability to determine how they present themselves to customers: the predefined utterances restrict operators to a few predefined actions and limit their ability to engage in a personalized way. This lack of agency can have broader implications, potentially enabling employers to exert more control over employee behaviors and restrict their language in public spaces. Further development of the system should address these concerns and explore ways to strike a balance between providing support and maintaining employee autonomy. Open dialogue and collaboration with stakeholders, including users, robot designers, and ethicists, can help identify and mitigate potential negative consequences while ensuring the technology is used ethically and responsibly.
By automating the speech and language of the robot with the proposed method, there is a risk that the interaction becomes more scripted, homogeneous, and less personalized, which could ultimately reduce engagement and the overall quality of the user experience. A potential solution is to use a generative model to add variations to the utterances of the robot [17]. This impact on the users around the physical robots should be acknowledged and further studied to ensure a balanced and satisfactory user experience.

CONCLUSION
We introduced and validated the efficiency of a telepresence model in which the remote worker is exempted from the responsibility of maintaining appropriateness and politeness throughout verbal communication with visitors or clients in various contexts of human-robot interaction, such as hotels. Building on the positive contributions of intent recognition pipelines to fluency in human-robot interaction, the core idea of the proposed model is to reduce the workload and psychological anxiety imposed on the remote worker by using a machine learning architecture to recognize the intention underlying each utterance and having the robot pronounce a prescribed sentence corresponding to the relevant intent category instead. The case study designed to verify the effectiveness of the proposed model was based on a guard robot monitoring an area within a shopping mall and providing instructions or expressing disapproval in reaction to the visitors' behavior. The questionnaire and interview results revealed a considerable reduction in the workload and mental burden felt by the remote workers compared to the level associated with the baseline model. Potential future research directions include developing a hierarchical intent recognition framework, which could improve the flexibility of the model by enabling further tailoring of the utterances to the situation, and which could also guide the remote worker through the process of expressing their words by postulating and displaying candidate intent categories in stages. The effectiveness of the proposed system was evaluated in a Japanese context, where politeness is highly valued in the service industry; further research is required to check the applicability and feasibility of this system in other cultural contexts. This study focused on the operator's workload and the politeness of the service. An interesting follow-up study would be to evaluate the effectiveness of the robot's admonishments "in the wild" and investigate how the robot is perceived by passersby. In this study, some participants felt that the proposed system "shielded" them from non-cooperative visitors; it would be interesting to further investigate and better characterize this effect in a follow-up study.

Fig. 1 .
Fig. 1. Angry operator admonishing a visitor without (left) and with (right) the proposed teleoperation support system.

Fig. 3 .
Fig. 3. An overview of the proposed model.

Fig. 4 .
Fig. 4. The operator using the control interface to perform the admonishment.

Fig. 5 .
Fig. 5. A schematic overview of the baseline model.

Fig. 6 .
Fig. 6. The robot is about to admonish a visitor who was littering in the corridor where the role-played scenario took place.

Table 1 .
Distribution of the Utterances Representing the Nine Intent Classes within the Dataset Collected for This Study [23].