More Than Meets the Eye? An Experimental Design to Test Robot Visual Perspective-Taking Facilitators Beyond Mere-Appearance

Visual Perspective Taking (VPT) underpins human social interaction, from joint action to predicting others' future actions and mentalizing about their goals and affective/mental states. Substantial progress has been made in developing artificial VPT capabilities in robots. However, as conventional VPT tasks rely on the (non-situated, disembodied) presentation of robots on computer screens, it is unclear how a robot's socially reactive and goal-directed behaviours prompt people to take its perspective. We provide a novel experimental paradigm that robustly measures the extent to which human interaction partners take a robot's visual perspective during face-to-face human-robot-interactions, by measuring how much a robot's visual perspective is spontaneously integrated with one's own. The experimental task design of our upcoming user study allows us to investigate the role of robot features beyond its human-like appearance, which have driven research so far, targeting instead its socially reactive behaviour and task engagement with the human interaction partner.


INTRODUCTION
Successful human-robot collaboration relies on the complex interplay of multiple mechanisms [4,14,22]. Imagine you are building a bookshelf with your robot colleague and they ask you to pass them the screwdriver on the left. For this simple request to be fluid, both agents need to know not only what the other can see, but also how they see it. You need to take the robot's viewpoint to discern which screwdriver is being requested, and to recognise that what is to the left of the robot is to the right of yourself. This ability for visual perspective taking (VPT) helps us navigate our social environment [15], participate in joint action [10] and predict the next actions of others [2,8]. Accumulating evidence suggests that VPT could also underpin more complex abilities to reason about others' goals [11] and affective/mental states [3,19]. While impressive progress has been made developing VPT abilities in artificial agents, such that robots can take their human partners' perspective [1,7,9,16,17,33], not much is known about how people take a robot's perspective, and which robot features would help them to do so. Humans do not have specialised social-cognitive mechanisms for interacting with robots. Instead, the existing mechanisms for interaction with other humans are most likely repurposed [25,40]. Understanding the feature space that triggers these repurposed mechanisms [12,13] is critical for designing understandable robots that humans can interact with fluidly.
A problem for research on VPT is that most available tasks measure whether people take the perspective of robots shown on a computer screen (as cartoons, photos, or short movie clips) instead of in real-life interactions. These studies have suggested that whether people spontaneously take a robot's perspective primarily depends on its human-like appearance. Prior research has shown, for example, that people often describe numbers on a table as they appear to another person sitting opposite them, rather than to themselves (e.g., they denote the number 6 as a 9 if it looks like this to the other person [31]). For robots, this tendency to perspective-take increased the more human-like the robot's appearance was, but was surprisingly unrelated to the extent to which it was attributed a "mind" or an ability to represent visual input [45]. Other studies [35,36] came to similar conclusions. They relied on the phenomenon that people find it more difficult to report the number of dots in a room if another agent within the room cannot, from their perspective, see all of the dots [27]. For robot agents, this interference from another's perspective only occurred when the robot had a human-like body, but not a cat-like body. However, it was independent of whether the robot's head was human-like or not (a camera head), and independent of whether the robot was on or off (indicated by a red light).
It is striking that the "mere appearance" [35,36,45] of a robot as human-like suffices to cause people to spontaneously take its perspective, and that higher-order factors such as the ability to represent the environment or its "mind" contribute little, if at all. We hypothesize that this remarkable state of research might reflect, to a large extent, the artificial and disembodied mode of presentation on a computer screen, in which robot and human neither interact nor share a common task space. Indeed, there is evidence that objective measures of anthropomorphism increase for physically embodied robots [26], and that the tendency to attribute mental states to robots is modulated by socially interactive behaviour [32]. Moreover, there is suggestive evidence that action matters for perspective taking. Using the same 6/9 task as above, Zhao [44] reported that people were more likely to take the robot's perspective when they saw the robot reach out towards the numbers on the table (instead of merely gazing at them), but only if this action was shown as a video (instead of a photo). Thus, while this still falls short of full-scale interaction, perspective taking seems to increase the more people perceive the robot as a realistic agent that acts within its environment.
To the best of our knowledge, only one study has investigated perspective taking in a real human-robot interaction [42], but it was not able to demonstrate any perspective taking. Two robots (telenoids [23]) were seated at a table, one on either side of the participant. One telenoid was inanimate; the other looked at and waved to the participant at the beginning of the experiment but remained otherwise passive. Participants had to indicate whether letters that were oriented either towards the animate or the inanimate telenoid were mirrored or not. However, instead of the predicted overall advantage when letters were facing the animate robot, the authors only found that, for the previously animate agent, the response time difference between mirrored and non-mirrored letters increased. This difference was not pre-registered, is inconsistent with the usual perspective-taking results in such paradigms [37][38][39], and is therefore hard to interpret. In essence, it may only reflect that participants are biased to call a letter non-mirrored if it faces the animate agent, irrespective of whether they take its perspective.
Together, these studies make clear that the study of human-to-robot perspective taking requires real-life tasks that are not decoupled from the intricate behavioural dynamics of interacting with a robot embodied in physical space. Moreover, it requires tasks that are not affected by confounds and alternative explanations. For example, in the 6/9 task, just knowing that a 6 looks like a 9 to others could be enough to induce seemingly altercentric responses, without the participant visualizing how the robot perceives the environment [28]. Similarly, as the authors acknowledge [36], the results in the dot perspective task could simply reflect participants' attention being cued to the dots the robot is "looking" at, instead of any representation of what it can see [5].
Here, we present a task that overcomes the limitations of prior work on VPT towards robots [35,36,44,45] and can reliably measure perspective taking in face-to-face interactions. It relies on the well-established phenomenon that response times to identify a letter increase linearly the more the letter is oriented away from the participant (i.e., from 0 degrees angular disparity to 180 degrees), as participants first have to "mentally rotate" turned-away letters into their canonical orientation [29]. Our previous work [37][38][39] shows that this pattern is disrupted by another agent in the scene. People recognize rotated-away letters more quickly if the letters are oriented towards the other agent and can therefore be identified more easily from that agent's perspective. In contrast, they recognize rotated-away letters more slowly if the letters are rotated even further away from the other agent's perspective. These findings show both that people spontaneously "borrow" another's perspective if it provides better visual access to a stimulus than their own, and that this perspective also involuntarily intrudes into their own and disrupts visual processing.
Our task is well-validated for human-to-human interaction when presented as short photo animations on a computer screen [37][38][39,43], or in virtual reality [41]. After extensive piloting for human-human VPT, here we adapt it to investigate face-to-face interactions with a humanoid robot to test how factors beyond mere appearance, such as goal-driven and autonomous socially reactive behaviours, facilitate perspective taking. Importantly, our task measures both the facilitation and the disruption of perceptual processing through VPT; unlike other tasks [5,28], its findings can therefore not be explained through simple heuristics or attentional cuing.

METHODS AND MEASURES

Task and Experimental Conditions
Participants sit at a table, with a robot to their left or right. In each trial, an alphanumeric character is projected onto the table in front of the participant in either its normal or its mirror-inverted form. The character is shown at differing orientations, from upright to upside down relative to the participant's viewpoint. Participants simply judge with a button press whether the stimulus is in its canonical or mirror-inverted form ("R" vs. a mirrored "R"). We vary between experimental blocks whether the participant's partner (a robot with different behaviours, see Figure 1) is positioned 90° to the right or 90° to the left of them. In our prior research, individuals spontaneously incorporated another person's viewpoint when making these judgements. When letters are facing another person, people find it easier (measured in terms of recognition times, RTs) to judge whether the character is in its regular or mirrored presentation. Conversely, when the letter is facing away from another person, individuals find the task more difficult [37][38][39]. This suggests that people spontaneously integrate their own egocentric viewpoint with a generative model that simulates a likely estimate of the other person's altercentric frame of reference.
Here, we test whether people take the perspective of a physically present humanoid robot, and whether this depends on the robot's socially reactive behaviours in unconstrained human-robot interactions and/or its goal-directed engagement in a task alongside the participant, two factors that have been shown to be important for human-to-human perspective taking [10]. In a between-subjects design, participants are randomly assigned to one of four experimental conditions that arise from the factorial combination of the Task Reactivity and Social Reactivity factors (see Figure 1). Thus, across participants, the robot will be in one of four states: (i) being completely inanimate in and outside the task, (ii) actively completing the task alongside the participant but being passive outside of it, (iii) being passive during the task but actively engaging with participant and experimenter outside the task, and (iv) engaging in the task with the participant and interacting with them outside of it. Our predictions for the upcoming user study are as follows: (i) the robot's mere appearance will be sufficient to elicit overall VPT; (ii) if VPT reflects the ad-hoc generation of a model of the robot's mind for predicting its functional behaviour within the task, VPT should increase when participants are paired with the task-reactive robot; (iii) if VPT requires the attribution of agentive mental states to the robot, such that it is believed that the robot is capable of representing and interacting with a more unpredictable social environment, VPT should increase for the socially reactive robot modes of behaviour; (iv) if VPT requires evidence of both functional and agentive mental states, then both factors should interact and reveal VPT only when the robot is both socially and task-active.
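The 2 x 2 between-subjects assignment described above can be sketched as follows. This is a minimal illustration rather than the study's actual procedure: the function name, the factor labels, and the use of shuffled blocks of four (which keeps cell sizes balanced) are assumptions, since the text only states that assignment is random.

```python
import itertools
import random

# The two between-subjects factors named in the text; each is on or off.
FACTORS = ("task_reactivity", "social_reactivity")
CONDITIONS = [dict(zip(FACTORS, levels))
              for levels in itertools.product(("off", "on"), repeat=2)]


def assign_conditions(participant_ids, seed=0):
    """Randomly assign participants to the four cells of the design.

    Assumption: assignment is done in shuffled blocks of four so that
    cell sizes never differ by more than one participant.
    """
    rng = random.Random(seed)
    assignment = {}
    block = []
    for pid in participant_ids:
        if not block:                 # start a new shuffled block of four
            block = CONDITIONS[:]
            rng.shuffle(block)
        assignment[pid] = block.pop()
    return assignment
```

Calling `assign_conditions(range(8))` would place two participants in each of the four conditions, in a random order fixed by the seed.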

Robot and Behaviour
From the robots available to us, we chose to use the Pepper robot [24] due to its humanoid appearance and social interaction capabilities. Pepper's behaviour is governed by a simple Finite State Machine (FSM) programmed in Python 2.7 [34]. Transitions in Pepper's behavioural state (described below) are instigated by the Inquisit experimental script.
VPT towards humans has been shown to be driven by joint task behaviour [10] with agentive partners [30]. We therefore vary, across participants, Pepper's behaviour along two independent dimensions that allow us to evaluate whether perspective taking is driven by (i) Pepper's engagement in the same task as the participant, (ii) Pepper's social behaviour outside of the task, and (iii) an interaction of both factors.

Task Reactivity.
The Task Reactivity factor codes Pepper's behaviour during the task. In its task-on state, Pepper exhibits goal-oriented behaviour by appearing to complete the task alongside the participant. Its posture is focused and its gaze is directed towards the task stimuli. It squeezes each hand as if to respond to the stimuli, with a variable delay imitating human-like performance. Like the participant's, its hands are concealed under the table, such that participants are unable to see its responses (though its mechanical hand actuation is audible). In its task-off state, Pepper is completely stationary during the task, its head is lowered towards its chest and it does not respond to the task stimuli.

Social Reactivity.
The Social Reactivity factor codes whether Pepper interacts with the participant outside the task. In its social-on state, Pepper responds to environmental stimuli, turning its head to look at sources of sound (the participant or the experimenter speaking). It establishes eye contact with the participant and experimenter. The robot's posture is upright, and more relaxed than in the task state. There are slight periodic movements of Pepper's upper body. In its social-off state, Pepper is completely stationary, its head is lowered towards its chest and it does not respond to any behaviour of experimenter or participant.
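To make the two behavioural dimensions concrete, the following is a minimal sketch of a finite state machine of the kind described in this section. It is not the authors' implementation: the class, state, and event names are hypothetical, no robot API is called, and it is written for current Python rather than Python 2.7. It only illustrates how the two factors could select the robot's state inside and outside the task.

```python
class PepperBehaviourFSM:
    """Toy finite state machine for the robot's behavioural state.

    State and event names are hypothetical; the real system receives its
    transitions from the experimental script and drives robot motion.
    """

    def __init__(self, task_on, social_on):
        self.task_on = task_on        # Task Reactivity factor
        self.social_on = social_on    # Social Reactivity factor
        # Before the first block the robot shows its outside-task behaviour.
        self.state = self._outside_task_state()

    def _outside_task_state(self):
        return "social_reactive" if self.social_on else "inanimate"

    def _in_task_state(self):
        return "task_engaged" if self.task_on else "inanimate"

    def handle(self, event):
        """Update and return the state for an event from the experiment script."""
        if event == "block_start":
            self.state = self._in_task_state()
        elif event == "block_end":
            self.state = self._outside_task_state()
        else:
            raise ValueError("unknown event: %r" % (event,))
        return self.state
```

For example, a robot created with `task_on=True, social_on=False` would be `"inanimate"` before and between blocks and `"task_engaged"` while a block is running.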

Apparatus, experimental conditions and procedure
Once the practice trials are completed, participants are taken to the testing environment, where they first encounter the robot. Participants are seated at the table, with their robot partner 90° to the left or right of them. For participants paired with a social-on robot, Pepper reacts to the participant's presence, meets their gaze, and directs its attention to any noises the participant or experimenter makes. The social-off robot will not respond to the participant at this time. Participants then complete 8 blocks of 32 trials. At the start of each trial, a fixation cross is presented in the centre of the table for 1000 ms. Then, after a pseudo-random delay of 1100 ms to 1800 ms that prevents participants from anticipating exactly when the letter will appear [37], an alphanumeric character is presented on the table. The character is presented in its normal ("R", "4", "G", "F") or mirror-inverted form, in one of eight possible orientations relative to the participant (0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°). Characters are presented for 3000 ms. During this time, participants use labelled keys on a handheld miniature keyboard to judge whether the presented character is normal or mirror-inverted. Recognition times are measured from character onset.
For participants paired with a task-on robot, Pepper will complete the task alongside them.The task-of robot will not respond to the stimuli and remains passive.
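The trial structure described above can be sketched as a small generator of per-block trial specifications. This is an illustration only: the names are hypothetical, the eight orientations are assumed to be spaced in 45° steps, and the way 32 trials are drawn from the 64 possible character, form, and orientation combinations is not stated in the text (here they are sampled without replacement). Stimulus presentation itself is handled by the Inquisit script and is not modelled.

```python
import itertools
import random

CHARACTERS = ("R", "4", "G", "F")
FORMS = ("normal", "mirrored")
ORIENTATIONS = tuple(range(0, 360, 45))   # assumed: eight orientations in degrees

FIXATION_MS = 1000                        # fixation cross duration
DELAY_RANGE_MS = (1100, 1800)             # pseudo-random delay before the character
STIMULUS_MS = 3000                        # character presentation duration


def make_block(n_trials=32, seed=0):
    """Return a list of trial dictionaries for one experimental block."""
    rng = random.Random(seed)
    pool = list(itertools.product(CHARACTERS, FORMS, ORIENTATIONS))
    trials = []
    # Assumption: trials are drawn without replacement from the 64 combinations.
    for character, form, orientation in rng.sample(pool, n_trials):
        trials.append({
            "character": character,
            "form": form,
            "orientation_deg": orientation,
            "fixation_ms": FIXATION_MS,
            "delay_ms": rng.randint(*DELAY_RANGE_MS),  # inclusive bounds
            "stimulus_ms": STIMULUS_MS,
        })
    return trials
```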
After each block, participants and Pepper are moved around the table such that both sit at each of the four locations of the circular table over the course of the experiment, and Pepper has been on the participant's left and right an equal number of times. This ensures that all potential sources of visual or spatial differences in the testing area cancel each other out in subsequent analysis. Between experimental blocks the participant is able to rest. When Pepper is in its social-on state, it will engage with the participant as described above.
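One way to satisfy this counterbalancing with 8 blocks is to cross the four participant seats with the two robot sides. The sketch below assumes this full crossing, a random block order, and seat labels given as angles around the table with an arbitrary sign convention for "left"; none of these details are specified in the text.

```python
import itertools
import random

SEATS = (0, 90, 180, 270)     # hypothetical seat labels: angle around the table
SIDES = ("left", "right")     # robot position relative to the participant


def block_schedule(seed=0):
    """Return 8 blocks crossing participant seat with robot side, in random order."""
    rng = random.Random(seed)
    schedule = [{"participant_seat": seat, "robot_side": side}
                for seat, side in itertools.product(SEATS, SIDES)]
    rng.shuffle(schedule)
    return schedule


def robot_seat(participant_seat, robot_side):
    """Seat of the robot, 90 degrees to the participant's left or right.

    Assumption: 'left' corresponds to +90 degrees in the seat labelling.
    """
    offset = 90 if robot_side == "left" else -90
    return (participant_seat + offset) % 360
```

Under this scheme the participant and the robot each occupy every seat twice, and the robot is on the participant's left in four blocks and on their right in four.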

Experimental Measures
Analysis follows our prior work on this task [37][38][39]. Dependent measures are derived from average recognition times (RTs) measured from character onset, for each of the eight character orientations (coded relative to upright to the participant), depending on the agent's location relative to the participant (left, right). Participants' RTs for each orientation and condition are parameterised onto two orthogonal and statistically independent summary measures for each participant, such that comparison across conditions is possible without alpha inflation due to multiple comparisons. These summary measures result from representing each participant's RTs for each character orientation and partner location (left/right of the participant) as a vector in a polar coordinate system where RT serves as the magnitude of the vector, and character orientation the polar angle from the origin (upright to the participant).
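Projecting these RT vectors onto the axis through a given viewpoint amounts to a cosine-weighted average of the RTs. The sketch below illustrates this under stated assumptions: it uses the negative-cosine weighting that this section gives for the two summary measures, takes 0° as the participant's viewpoint and 270° as that of a robot on the left (as in the text), and assumes 90° for a robot on the right. Function names and the input format are hypothetical, and this is not the authors' analysis code.

```python
import math


def perspective_measure(mean_rts, reference_deg):
    """Average RT weighted by -cos(angular disparity to a reference viewpoint).

    mean_rts maps character orientation in degrees to a mean RT in ms.
    Positive values mean faster responses the more the character faces
    the reference viewpoint and slower responses the more it faces away.
    """
    total = 0.0
    for orientation, rt in mean_rts.items():
        disparity = math.radians(orientation - reference_deg)
        total += -math.cos(disparity) * rt
    return total / len(mean_rts)


def own_perspective(mean_rts):
    """Summary measure relative to the participant's own viewpoint (0 degrees)."""
    return perspective_measure(mean_rts, 0)


def other_perspective(mean_rts, robot_side):
    """Summary measure relative to the robot's viewpoint.

    Assumption: 270 degrees for a robot on the left, 90 for one on the right.
    """
    reference = 270 if robot_side == "left" else 90
    return perspective_measure(mean_rts, reference)
```

With eight equally spaced orientations, the two weightings are orthogonal: a pure mental-rotation pattern in the RTs contributes to `own_perspective` but leaves `other_perspective` at zero, and vice versa.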
The Own-Perspective summary measure (Equation 1) quantifies how much each participant is anchored in their own perspective when judging letter forms. It quantifies the extent to which response times across character orientations follow the classic mental rotation pattern [29], becoming faster the more characters are oriented towards the participant (0°, 45°, 315°) and slower when oriented away (180°, 135°, 225°). It is calculated as the average of the recognition times for each character orientation, with the contribution of each weighted by the negative cosine of the angular disparity between letter orientation and participant. Positive values represent faster recognition times the more letters are oriented towards the participant.
The Other-Perspective summary measure (Equation 2) is computed analogously, but quantifies how much the participant takes the other agent's perspective. It quantifies the extent to which a participant's recognition times are faster the more characters are oriented towards the robot (e.g., for a robot on the left, 270°, 225°, 315°) than when oriented away from it (for a robot on the left, 45°, 90°, 135°). It is calculated as the average of the recognition times for each character orientation, with the contribution of each weighted by the negative cosine of the angular disparity between letter orientation and the other agent. Positive values show that a participant's recognition times are faster the more the characters are oriented towards the robot and slower the more they are oriented away from it, that is, the extent to which they are spontaneously judging the letter forms from the robot's perspective.
For each participant and experimental condition, the mean is calculated independently on both summary measures. Note that because the other agent is always sitting at a 90° angle to the participant, the Other-Perspective and Own-Perspective measures are orthogonal to each other and statistically independent.

CONCLUSION
Human-robot collaboration is forecast to play a crucial role in numerous industries [20], notably manufacturing [18] and construction [6]. Visual perspective taking is an essential mechanism for human collaboration, and is therefore of considerable importance for human-robot collaboration. Although progress has been made in allowing a robot to take its human interaction partner's perspective, the design space of features that support human-to-robot perspective taking is severely under-researched. Prior work has predominantly found that a robot's human-like appearance promotes perspective taking, though we argue that the typical non-situated and disembodied mode of presentation in computer-based experimental designs can test little else.
Here, we present a novel task that resolves previous methodological problems and can reliably reveal whether human partners take a robot's perspective in a face-to-face interaction, and which robot behaviour features (i.e., social and task reactivity) promote such perspective taking. Importantly, this task provides a direct measure of whether participants assimilate the robot's spatial reference with their own egocentric perceptual processing, so that it is available for action planning and prediction in shared task spaces.
Our contribution is an experimental design and accompanying analysis that can robustly explore the VPT feature space with real-world interactions between individuals and a physically embodied robot. This enables investigation of attributes deeper than mere appearance, such as dynamic goal-driven and autonomous socially reactive behaviours. We are optimistic that our contribution will lay the groundwork for investigating behaviours that improve human-robot collaboration, and pave the way towards the design of more understandable robots. We look forward to updating the community on the findings of our participant study.

Figure 1: Examples of robot behaviours across the task and social reactivity dimensions.

Figure 2: Expected pattern of results: (a) Polar plot showing RT increase the more characters are rotated away from the participant, as well as RT decrease and increase when characters are rotated towards or away from the robot; (b) Bar plot showing example data for the Other-Perspective summary measure when VPT is driven by independent contributions of task and social behaviours, so that VPT is strongest when both are combined.