“Who Said That?” Applying the Situation Awareness Global Assessment Technique to Social Telepresence

As with all remotely controlled robots, successful teleoperation of social and telepresence robots relies greatly on operator situation awareness; however, existing situation awareness measurements, most being originally created for military purposes, are not adapted to the context of social interaction. We propose an objective technique for telepresence evaluation based on the widely accepted Situation Awareness Global Assessment Technique, adjusted to suit social contexts. This was trialled in a between-subjects participant study (n = 56) comparing the effect of mono and spatial (binaural) audio feedback on operator situation awareness during robot teleoperation in a simulated social telepresence scenario. Subjective data were also recorded, including questions adapted from Witmer and Singer’s Presence Questionnaire, as well as qualitative feedback from participants. No significant differences in situation awareness measurements were detected; however, correlations observed between measures call for further research. This study and its findings are a potential starting point for the development of social situation awareness assessment techniques, which can inform future social and telepresence robot design decisions.

Teleoperated social robots are suitable platforms to address this in practice, notably telepresence robots or social humanoid robots. The former are already in use commercially [40], providing the operator the experience of being socially present at a remote location. The latter, while still mostly restricted to research applications, could also act as a natural user interface for remote interactants on behalf of the operator, providing more human social cues (body language etc) through humanoid embodiment. As with any teleoperated robot, operator situation awareness (SA) is important for adequate performance and low cognitive load. Situation awareness is defined by Endsley as "the perception of environmental elements and events with respect to time or space, the comprehension of their meaning, and the projection of their future status" [11]; with the scope and subtleties of non-verbal information during social interaction [35,52], SA may be all the more important for telepresence robots.
Situation awareness as a whole has not been studied in the context of social robot teleoperation such as telepresence.
Existing methods for its measurement are anchored to the task-oriented military origins of the concept, and do not translate easily to the social domain; SA has been shown to improve for non-social robot teleoperation through representative instrumentation (e.g. minimaps, haptic feedback, etc), but social cues are less feasibly abstracted (possibly requiring the human information feedback of a natural user interface to be easily perceived). We propose that, if properly placed in the context of social interaction, situation awareness could be viewed as analogous to a low-level social awareness, which may be assisted or hindered by the teleoperation interface. By bridging conventional and social robot teleoperation with situation awareness methods, telepresence robot designers could measure and evaluate social situation awareness, potentially leading to design choices that increase operator performance and decrease mental and social fatigue.
In this paper, we aim to develop a repeatable methodology for measuring social situation awareness (SSA) by adapting existing SA tools. An initial trial of this methodology will be performed on a simulated social telepresence interface, examining the potential benefits of spatial audio feedback for social situation awareness alongside several auxiliary metrics and qualitative analysis.
After a review of related work in the next section, section 3 details the development of the SSA measurement technique by adapting the widely used Situation Awareness Global Assessment Technique (SAGAT). Following this, section 4 presents the methodology of the trial study employing this technique. A summary of experimental results is then presented in section 5, followed by their discussion in section 6 and a conclusion on the findings.

Situation awareness measurement techniques
While seldom if ever applied to social teleoperation, techniques for situation awareness (SA) evaluation are well-established for non-social tasks and contexts. A number of such techniques would be considered standard in human-robot interaction research [53]:
SAGAT. Endsley's Situation Awareness Global Assessment Technique (SAGAT) [10] is possibly the most commonly used SA evaluation technique. It requires a simulated remote environment for the teleoperated robot. Periodically, the simulation is paused and the operator is asked questions about the situation. SAGAT questions are divided into three categories, each measuring a particular degree of awareness (immediate, recall and future estimation). This reproducible, empirically validated technique results in an objective score measuring situation awareness, with the caveat of being incompatible with real-world robot teleoperation. While intended for the piloting of unmanned aerial vehicles, it has also been applied to medical simulations [15]. SAGAT has also been adapted to use real-time querying of the operator instead of pauses [25] in tools such as the Situation Present Assessment Method (SPAM) [33]. These tools can notably be administered during real-world robot operation (outside of simulation), but recent studies have called their validity into question [12].
SART. Unlike the objective measurement techniques above, Taylor's Situation Awareness Rating Technique (SART) [44] is a commonly-used subjective measurement. A form with three sets of experiential Likert-scale questions is completed by the operator after the teleoperation session, resulting in three scores describing different facets of awareness, which combine into a composite overall score. It is vulnerable to the same disadvantages as other forms of subjective assessment, such as bias from participants' feelings and emotional state [54], and was shown to be less accurate than SAGAT in particular [13], but it has the additional advantage of easy, universal administration.
Measuring other qualities. The connection between situation awareness and the distinct but similar concept of presence [36] has been pointed out in the literature [17,41], linking conventional robot teleoperation and social robotics. Goodrich et al. [17] explain that while the two do not share the same definition, a high sense of presence implies the operator has sufficient awareness of the remote situation to feel present. While no explicit techniques have been developed for evaluating situation awareness in social contexts, subjective questionnaires have been proposed for the measurement of qualities such as presence, one of the most commonly used being Witmer and Singer's Presence Questionnaire [49].

Spatial audio in teleoperation
With its adoption in interactive media such as video games [8], spatial audio has been employed in the past for a number of computer-mediated social interactions. The potential benefits of spatial audio during conference calls were investigated by Inkpen et al. [24] and Ahrens et al. [2], and this approach to conference calls was extended to that of a shared virtual room in recent years by Wong et al. [51].
Existing proposals of spatial audio feedback in robotics have predominantly been to enhance telepresence. This includes the technical work of Keyrouz and Diepold [27] and Saraiji et al. [42]; both concerned the teleoperation of a humanoid robot, using spatial audio in order to enhance the operator's sense of presence. A lack of spatial audio [29,30] or otherwise unnatural audio behaviour [32,37] has been expressed as an issue during teleoperation for some robot telepresence platforms.
Combining audio directionality with telepresence has seen applications outside of social robotics. While not involving an embodied teleoperated robot, a telepresence simulation concept by Torrejon et al. was enabled in part by spatial audio [45]. Telepresence was also employed for industrial machinery operation by Ahn et al. [1], using spatial audio feedback on the premise that it would enhance immersion and situation awareness.

DEVELOPING AN ASSESSMENT TECHNIQUE OF SOCIAL SITUATION AWARENESS
With caution taken to ensure reliability, this section details the development of a measurement of situation awareness (SA) when applied to social contexts, which we will name social situation awareness (SSA).

Situation awareness and social interaction
A defining feature of social robotics, and by extension social telepresence, is the use of a social interface for human-robot interaction [22,34]. The social interface can be seen as the communication medium used for everyday human-human social interaction, which includes channels such as speech, gesture and affect expression.
Manuscript submitted to ACM
According to Endsley's model of situation awareness [11], SA is defined strictly relative to performance in a task, towards one or multiple goals and objectives. Social interaction, on the other hand, is seldom goal- or task-driven; however, the design exercise of social telepresence has goals, one of which is to simulate the social interface as closely as possible to the experience of in-person interaction. SA in social telepresence can therefore be informed by this goal of successful social interaction.

Awareness traits of interest
Social awareness is not a concretely defined concept, but its mention in psychosocial literature [16,46] as well as technology-related publications [5] often concerns high-level awareness during social interactions, such as awareness of emotional states (as provided by empathy) or of socioeconomic factors. While it could be argued that these high-level concepts are more important for social interaction, it is difficult to conceptualise a consistent measurement of awareness thereof, as it possibly depends much more on the operator than on the teleoperation interface. Furthermore, if an operator fails to pick up on more basic cues (e.g. who is saying a phrase), then their ability to pick up on more complex ones (e.g. who is surprised to hear the phrase) is likely impaired as well.
Because of this, we will target a low-level form of social situation awareness: knowledge of basic non-emotional information about a social situation that can be obtained only through the social interface. This can include for example the names or explicit social roles of people interacted with, but not information about their appearance (which can be acquired non-socially). Targeting this low level of awareness will minimise the complexity of the problem at hand, potentially increasing the reliability of the technique, while also allowing for shorter, more practical procedures for participant testing.

Existing technique selection
To maximise reliability, an existing situation awareness assessment technique will be adapted to measure awareness of the above traits. Although Taylor's Situation Awareness Rating Technique (SART) is one of the most commonly used subjective techniques, its abstract experiential questions were judged to be potentially confusing to non-expert participants and overall unsuited to measuring situation awareness for social interaction. Although the questions of Witmer and Singer's Presence Questionnaire may be of value for their relevance to telepresence studies, an objective technique is preferred for consistency. While real-time variants of SAGAT may provide additional information of interest (workload measurements in the case of SPAM [33]), it is unknown whether the potentially intrusive nature of the questions might introduce a confounding factor during social interaction. This risk, combined with the widespread acceptance and stronger empirical validation of SAGAT, makes the latter the most reliable choice.

Adapting SAGAT for social contexts
The body of literature on SAGAT details requirements and guidelines for its administration. However, as its original intended use concerned the piloting of unmanned aerial vehicles [10], there are complications in adapting it to social interaction. As many of SAGAT's recommendations as possible will be replicated, with some changes so that it pertains to social situation awareness.
For one, it is instructed that SAGAT queries be created through a goal-directed task analysis [9] i.e. based on the operator's goal, but as mentioned in 3.1 the average real-world social situation is seldom intentionally goal-driven, and the telepresence goal of engaging in social interaction comparable to face-to-face interaction is too broad to analyze in this manner. We propose instead that analysis of the context of the social interaction determine the selection of important information to query.
SAGAT divides awareness into three levels: (1) Perception of data, (2) Comprehension of meaning and (3) Prediction of future events. Perception of data concerns information explicitly available via the social interface. Comprehension of meaning, while slightly more complex to translate, can still be adapted to social interaction. One potential consideration is the awareness of emotional states based on secondary social cues such as speech prosody [48], but expressing, perceiving and even defining emotional states in this manner is potentially difficult [43]. We will instead consider Level 2 awareness to concern implicit contextual objective data, requiring understanding of the social situation to identify. For instance, while awareness of the information "Brian wishes to leave" is a deeper comprehension of implicit social meaning, it is subjective and may be difficult to pick up on, so it is considered unreliable for participant testing. Contextual information such as "Brian is chairing the meeting", however, is more suitable; it is objective while remaining implicitly expressed and socially relevant, and it requires awareness and synthesis of lower-level social signals to comprehend.
The third level, Prediction of future events, is less applicable to social contexts. Social interaction can be inherently unpredictable, and while predictions during a conversation might occur intuitively, it is assumed that conscious attempts to anticipate the outcome of an interaction are unnecessary in the majority of cases (akin to considering social interaction goal-driven). This level shall therefore not be considered during generation of SAGAT queries.
Beyond the selection of queries, a substantial issue when applying SAGAT socially may be that of demand characteristics [38]. While awareness of socially relevant information is maintained passively, participants who are explicitly aware that they are being queried on that awareness may attempt to overcompensate, deliberately seeking pieces of information they would not otherwise. This could drastically bias results in a positive direction, particularly in the case of a repeated-measures trial, where participants may overcompensate in this manner during the second test condition. This foregrounds the need for measures mitigating demand characteristics and order effects; it is recommended that the study be conducted with a between-subjects design, and that the intent to measure situation awareness be obscured or distracted from.

Practical considerations
SAGAT requires the situation be simulated, so that it can be paused to administer queries. The simulation used for this must be a sufficiently valid representation of reality, in this case a valid representation of social interaction. The ubiquitous nature of remote video communication in modern life (and the acceptance of video chats as a valid form of social interaction) makes this easier to accomplish; the simulator does not need to have a similar level of immersion to in-person interaction, as long as it is at least as immersive as a video chat.
Communication through social interaction can nonetheless include dozens of channels [19], which cannot feasibly all be accounted for. Therefore, to construct a simulation of social interaction that faithfully conveys as many natural social signals as possible, live video recordings of actors must be used rather than computer-generated images. Consequently, conditions cannot be easily randomized, limiting experimental design. While the conditions of a flight simulator can be changed in software (aircraft positions, velocities, etc.), with potentially no two identical situations across all participants, for a social simulator every video used must be captured in advance. Although variations can be generated by editing videos together programmatically, two issues prevent this from being viable. First, to maintain immersion and realism, individual phrases should not be interrupted by a cut in the video, which means that the information provided in each utterance cannot be altered, drastically decreasing the random sample space and with it the utility of programmatic editing. Second, even without splitting sentences, cuts in the video may interrupt the natural flow of the overall interaction portrayed by actors, possibly destroying subtler social signals. This limits the available test conditions to the number of videos the researcher can feasibly record. While requiring more resources, a larger pool of different videos to draw from for each participant will mitigate confounding factors; however, it may also decrease the sensitivity of the trial. Regardless of the quantity of videos used, they should all concern the same type of social encounter, and the same SAGAT queries should apply across all, but with different ground truths determining the correct answer.

Summary
The final technique developed to assess social situation awareness (SSA) is as follows:
• A simple, common social context for the study is chosen, depending on the research focus
• The information to query participants about is chosen through analysis of this context, along the first two of the three SAGAT levels of awareness:
  - Perception of data: information explicitly available via the social interface
  - Comprehension of meaning: implicit information deduced using first-level information and through context
• This information must be:
  - socially relevant
  - objective in nature
  - non-emotional
• A simulation is created using recorded footage of actors, simulating social interaction in the chosen context
• SAGAT is administered as appropriate during testing, using queries on the selected information
Section 4 below details our trial use of this technique to assess the effect of spatial audio feedback on social situation awareness.

TRIAL STUDY: SPATIAL AUDIO FEEDBACK
One design choice that may enhance social situation awareness through the paradigm of a natural user interface is spatial audio feedback. The human brain can instinctively localize where sounds are coming from by combining several different cues, one of the most significant being the difference in sound arrival times between the ears [27]. By mathematically modelling the human head and using a pair of headphones, sounds can be made to be perceived as coming from different locations; this is often used in interactive media such as video games [8] to increase operator immersion in a virtual environment. The binaural nature of hearing is also a major contributor to the "cocktail party effect" [21], our ability to distinguish and focus on an auditory source in noisy environments, including speech when multiple people are talking simultaneously; however, this can occupy substantial mental resources (the "cocktail party problem") [14]. As spatial audio feedback increases the quantity of information a user receives through the natural user interface paradigm, it would seem logical that its inclusion over mono audio feedback would increase situation awareness of the remote location.
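To give a sense of scale for the arrival-time cue mentioned above, the following is a minimal sketch using Woodworth's classic spherical-head approximation. It is an illustration only and is not the model used in this study, which relied on head-related transfer functions; the head radius, the speed of sound and the function name are assumed values chosen for the example.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly 20 degrees C
HEAD_RADIUS_M = 0.0875      # assumption: a commonly quoted average head radius


def interaural_time_difference(azimuth_deg: float) -> float:
    """Approximate arrival-time difference between the ears, in seconds.

    Uses Woodworth's spherical-head formula for a distant source:
    ITD = (r / c) * (theta + sin(theta)), with theta the azimuth in radians.
    An azimuth of 0 is straight ahead; the formula is applied here only
    for sources in the frontal half-plane (-90 to 90 degrees).
    """
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth must lie within [-90, 90] degrees")
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))
```

Under these assumptions, a source directly to one side yields a difference of well under a millisecond, which illustrates how fine-grained the timing cue is.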
As an initial trial of the technique proposed in the previous section to measure social situation awareness (SSA), we applied it to a telepresence experiment wherein spatial audio feedback was compared to mono audio feedback during social robot teleoperation. A telepresence simulation was created with the robot at the centre of a social scene; participants were asked to follow along as though teleoperating the robot remotely. One group of participants received mono audio feedback as a baseline, the other received spatial audio, and were therefore capable of localising sounds in the virtual environment.

Stimuli creation
To simulate a social interaction as closely as possible while relying on pre-recorded video, omnidirectional footage was recorded of the portrayed scene.
Social context. The form of social interaction chosen for this study is a classroom quiz scenario. A scene like this is a structured, systemic way to represent a social interaction, as other typical social interactions can be more complex.
This situation is also a potential use-case for social robot teleoperation, in the form of robot-mediated remote learning or teaching.
Design. Actors were recruited (two men and four women) to portray one quiz master and five contestants. So that the simulation would adequately represent robot teleoperation, in contrast to a conventional video call, actors were seated at desks in a circle around the camera (as shown in Figure 1). This was to encourage the operator to look around the virtual environment rather than simply view it like a static camera feed, so that the workload would incorporate the control input component. Quiz questions that the quiz master asked of the contestants were explicitly chosen to be extremely difficult or very vaguely worded, so that although the questions and answers seemed conventional for a quiz, it was unlikely that contestants would be able to answer using prior knowledge. Two videos were recorded using two distinct scripts following the same format. Actor positions were shuffled, and a different set of quiz questions, character roles and names was used between the two. Quiz questions and answers for both videos are detailed in Appendix A.
Duration. One recommendation for the administration of SAGAT [9] is that a minimum of 3 minutes must have elapsed from the start of the simulation before the first pause, and that further pauses must be spaced apart by a minimum of 1 minute. In order to keep simulation time relatively short, both for ease of testing with participants and to facilitate video recording, it was decided that SAGAT pauses be administered twice per video: first at a random time between 2 and 4 minutes, and then at a random time at least one minute after the first. Each video would therefore last 5 minutes in total.
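The pause timing described above can be sketched as follows. This is a hedged illustration rather than the simulator's actual logic: uniform sampling, the constant names and the assumption that the second pause may fall anywhere up to the end of the 5-minute video are all choices made for this example.

```python
import random
from typing import Tuple

VIDEO_LENGTH_S = 300.0                 # each video lasts 5 minutes
FIRST_PAUSE_WINDOW_S = (120.0, 240.0)  # first pause between 2 and 4 minutes
MIN_GAP_S = 60.0                       # second pause at least 1 minute later


def sample_pause_times(rng: random.Random) -> Tuple[float, float]:
    """Draw the two SAGAT pause times (in seconds) for one participant."""
    first = rng.uniform(*FIRST_PAUSE_WINDOW_S)
    # Assumption: the second pause is drawn uniformly from the remaining
    # window, which ends when the video ends.
    second = rng.uniform(first + MIN_GAP_S, VIDEO_LENGTH_S)
    return first, second
```

Passing an explicitly seeded `random.Random` instance keeps each participant's pause times reproducible, which helps when matching logs to sessions afterwards.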
Script. The quiz master began by introducing the experiment to the operator, initiating a round of introductions where every contestant spoke their name. This provided the operator a chance to hear every contestant name, and also encouraged them to visually explore the room. Then the main loop of the scene began: every 15 seconds, the quiz master would ask a question to a specific contestant, addressing them by name (e.g. "Which European city hosted the 1936 Summer Olympics? Annie?"). The chosen contestant would begin by saying "I think the answer is-" to prime the operator for the answer. Then both the contestant and another "distractor" contestant would speak different answers simultaneously, of which one at random was correct, talking over one another (Annie: "I think the answer is...London"; Maria: "Berlin"). The quiz master would then congratulate the contestant who answered correctly ("That's right, Maria"), pause for the remainder of the 15 seconds, and continue with the next question. This main loop of the script would continue until the video time reached 5 minutes. A complete list of quiz questions and answers is provided in Appendix A.
Recording. Video was recorded using a Ricoh Theta Z1 omnidirectional camera. The camera captures videos in a spherical format using its dual fish-eye lenses, which are converted to a conventional 360-degree format (MPEG-4) using Ricoh Theta proprietary software. Audio was captured using a Sennheiser AMBEO VR Mic, an ambisonic microphone array consisting of four high-fidelity microphones in a tetrahedral arrangement, via the Zoom H6 Audio Recorder.
Sennheiser's AMBEO A-B proprietary software was used to convert the recording to standard Ambisonics B-format (WAV).

Implementation
Simulator. For its support in the literature for use in human-robot interaction simulators [3,31] as well as its ease in handling multimedia content, Unity was chosen as a framework to create the simulator. The 360-degree MPEG-4 videos were projected on the interior of a Sphere object in Unity, with a Camera object at its centre to represent the teleoperated robot. The open-source plugin Resonance Audio [18] was used to generate spatial audio in real time based on the ambisonic recordings, conveying both interaural level and time differences using head-related transfer functions.
Teleoperation interface. A conventional desktop PC setup was used as a teleoperation interface, for its widespread familiarity and ease of access. Audio feedback was provided through a pair of high-fidelity over-ear headphones.
Pressing the left and right arrow keys of the keyboard would rotate the camera object, as if rotating the camera feed of a robot; the spatial audio feed would rotate accordingly.

Task
In order to mitigate SAGAT query demand characteristics as explained in 3.4, as well as to engage and maintain focus on the scene, a simple task was created; participants were asked to follow along with the video shown to them, and to identify for each question the correct answer. After a question was answered, both the correct and incorrect answers that were uttered were displayed at the bottom of the screen, and the participant chose which one they believed was correct by pressing a key on the keyboard. In this manner, the task pertained to the situation and awareness thereof, without any direct overlap with the content of the SAGAT queries.

Manipulations
Two videos were recorded, and two levels of the independent variable (audio feedback) were to be evaluated, resulting in four test condition permutations. Each participant would view one of the four permutations in a between-subjects experimental design, resulting in two groups: one having received Mono audio feedback, the other Spatial audio feedback.
The increased sensitivity of a within-subjects (repeated-measures) trial would have been advantageous (presenting one of each video in random order with one for each audio feedback condition). However, it was judged based on informal testing of the simulator that ordering effects could be significant between the two audio feedback conditions, potentially introducing confounding factors, so the decision was made to use independent measures.
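As an illustration of how participants might be spread over the four video and audio permutations, the sketch below uses shuffled blocks of four. The allocation procedure is not described in the text, so the block randomisation, the condition labels and the function name are assumptions made for this example.

```python
import itertools
import random
from typing import List, Tuple

VIDEOS = ("video_1", "video_2")  # hypothetical labels for the two recordings
AUDIO = ("mono", "spatial")      # the two levels of the independent variable


def assign_conditions(n_participants: int, rng: random.Random) -> List[Tuple[str, str]]:
    """Return one (video, audio) pair per participant, balanced across cells."""
    cells = list(itertools.product(VIDEOS, AUDIO))  # the four permutations
    schedule: List[Tuple[str, str]] = []
    while len(schedule) < n_participants:
        block = cells[:]    # every block contains each permutation once
        rng.shuffle(block)  # randomise the order within the block
        schedule.extend(block)
    return schedule[:n_participants]
```

With a participant count that is a multiple of four, this scheme gives every permutation the same number of participants, and therefore equal Mono and Spatial groups.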

Measures
While the SAGAT score of each participant is the primary metric of interest to this study, measuring situation awareness, secondary metrics were also recorded and analyzed to gain further understanding of test results.
SAGAT score. This score of social situation awareness (SSA) is measured using the technique devised in Section 3. Queries were created that pertained socially to the quiz scenario, such as "Which contestant last answered correctly?".
No queries required the participant to identify the correct answer to a quiz question, as this task was already asked of participants (see 4.3). The full pool of questions, randomized between both SAGAT pauses, is shown in Table 1.
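One possible way to randomise the query pool across the two pauses and to score the responses is sketched below. It assumes that each of the ten pooled queries is asked exactly once, five per pause, and that a response counts only if it matches the single correct option; the function names and dictionary layout are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

QUERIES_PER_PAUSE = 5


def split_query_pool(pool: Sequence[str], rng: random.Random) -> Tuple[List[str], List[str]]:
    """Shuffle the ten-query pool and deal five queries to each pause."""
    if len(pool) != 2 * QUERIES_PER_PAUSE:
        raise ValueError("expected a pool of exactly ten queries")
    shuffled = list(pool)
    rng.shuffle(shuffled)
    return shuffled[:QUERIES_PER_PAUSE], shuffled[QUERIES_PER_PAUSE:]


def sagat_score(responses: Dict[str, str], ground_truth: Dict[str, str]) -> int:
    """Count correct responses across both pauses.

    A missing response or an "I don't know" response never matches the
    ground truth, so it contributes nothing to the score.
    """
    return sum(
        1 for query, answer in ground_truth.items()
        if responses.get(query) == answer
    )
```

Because the ground truth differs between the two videos, `ground_truth` would be built per video while the query texts stay the same.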
Each query allowed the participant to select from multiple response options: six options with a single correct answer, plus an additional "I don't know" option. Two SAGAT pauses of five queries each were administered per participant, resulting in a final SAGAT score out of 10 (a score of 10 showing that every question was answered correctly, indicating high situation awareness). A sample SAGAT query screen displayed during the simulation is shown in Figure 2.

| Level of awareness | Queries |
| --- | --- |
| Perception of Data | "Identify one person who was in your field of view just before the simulation paused." |
| Perception of Data | "What was the most recent question?" |
| Perception of Data | "Who is sitting left/right of the quiz master?" |
| Perception of Data | "Who is sitting two seats to the left/right of the quiz master?" |
| Perception of Data | "What is the colour of the quiz master's shirt?" |
| Comprehension of Meaning | "Who was asked to answer a question most recently?" |
| Comprehension of Meaning | "Who is the quiz master?" |
| Comprehension of Meaning | "Who last answered correctly?" |
| Comprehension of Meaning | "Who last answered incorrectly?" |
| Comprehension of Meaning | "What kind of social event is taking place here?" |

Table 1. Pool of all potential queries delivered during SAGAT pauses.

Presence Questionnaire. Given the value of presence in relation to situation awareness, particularly where telepresence is concerned, a subset of questions from Witmer and Singer's Presence Questionnaire [49] was administered to participants. Each response was provided through a 7-point Likert scale. Question labels were slightly adjusted and extended for clarity, as shown in Table 2. The sum of scores for all component questions constitutes an overall presence score; for 6 questions, this is a score out of 42.
| Label | Original number [49] | Original question [49] | Adjusted for clarity |
| --- | --- | --- | --- |
| P1 | 5 | "How much did the visual aspects of the environment involve you?" | "How much did the visual aspects of the environment involve you? In other words, how much did the visual component of the experience contribute to the awareness of the situation?" |
| P2 | 6 | "How much did the auditory aspects of the environment involve you?" | "How much did the auditory aspects of the environment involve you? In other words, how much did the audio component of the experience contribute to the awareness of the situation?" |
| P3 * | 15 | "How well could you identify sounds?" | "How well could you identify sounds?" |
| P4 | 16 | "How well could you localize sounds?" | "How well could you tell where sounds were coming from? In other words, how well could you localize the direction of sound?" |
| P5 * | 12 | "How much did your experiences in the virtual environment seem consistent with your real-world experiences?" | "How much did your experiences in the virtual environment seem consistent with your real-world experiences?" |
| P6 * | 23 | "How involved were you in the virtual environment experience?" | "How involved were you in the virtual environment experience?" |

* Question unchanged from original questionnaire
Table 2. List of questions used from Witmer and Singer's Presence Questionnaire [49], adjusted for clarity and administered post-experiment.

Positional metrics. Spatial audio can enable an operator to localise sounds without needing to see the source. It is therefore predicted that, by comparison, mono audio feedback will prompt participants to visually pan around the simulation more. To investigate this, the following metrics were derived using positional data and user input logs from the teleoperation interface:
• Mean answer time: the mean time in seconds for each participant to select which of the two answers to a quiz question they deemed correct.
• Ratio of time in motion: the ratio of experiment time spent turning the camera to total experiment time.
• Mean viewing angle: the mean angle in degrees (where an angle of 0 is facing the quiz master).
• Heading angle variance: we wish to evaluate the angular range of motion employed by each participant, but it is assumed that participants will pan to view the entire scene (360 degrees) at least once during the experiment, which would make the raw range uninformative. The variance of the heading angle over time for each participant is used instead, as it represents the variability of angles in the given timespan; a low variance indicates that a generally smaller angular range was viewed, while a high variance indicates the participant was more willing to cover wider ranges over the course of the experiment.
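The positional metrics listed above could be derived from a heading log along the following lines. This is a sketch under stated assumptions: the log format (timestamped heading samples at even intervals), the signed angle convention in [-180, 180) and the use of plain, non-circular mean and variance are all choices made for this example and are not specified in the text.

```python
from statistics import fmean, pvariance
from typing import Dict, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, heading in degrees)


def wrap_degrees(angle: float) -> float:
    """Map any angle to [-180, 180), where 0 faces the quiz master."""
    return (angle + 180.0) % 360.0 - 180.0


def positional_metrics(samples: Sequence[Sample],
                       answer_times_s: Sequence[float]) -> Dict[str, float]:
    """Compute the four positional metrics for one participant."""
    if len(samples) < 2:
        raise ValueError("need at least two heading samples")
    headings = [wrap_degrees(h) for _, h in samples]
    total = samples[-1][0] - samples[0][0]
    # An interval counts as "in motion" if the heading changed across it.
    moving = sum(
        t1 - t0
        for (t0, h0), (t1, h1) in zip(samples, samples[1:])
        if h0 != h1
    )
    return {
        "mean_answer_time_s": fmean(answer_times_s),
        "ratio_time_in_motion": moving / total,
        # Unweighted statistics: assumes evenly spaced samples.
        "mean_viewing_angle_deg": fmean(headings),
        "heading_variance_deg2": pvariance(headings),
    }
```

Treating headings as ordinary signed numbers keeps the sketch close to the wording of the metric definitions; circular statistics would be an alternative if headings cluster around the wrap-around point behind the camera.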
Task performance. A lesser metric is that of task performance: the total number of correct quiz answers identified. Every time a question is asked by the quiz master to a contestant, the participant is prompted on-screen to identify which answer was correct, as shown in Figure 3. The task was not created with the intent to measure performance but rather to mitigate demand characteristics (drawing attention away from the SAGAT queries), and with no particular precedent in the literature for the use of such a metric, it was considered auxiliary to the others.
Qualitative feedback. The post-experiment questionnaire includes the following optional fields for positive and negative open feedback from participants:
• "Is there anything in particular that you liked about the experience?"
• "Is there anything in particular that you disliked about the experience?"

Participants
The goal was to recruit a representative sample of the lay population, controlled for English language fluency, hearing or spatial awareness issues, and colourblindness. This was done through random recruitment of the footfall in semi-public locations.
Fig. 3. A view of the simulator screen shortly after a quiz question was asked, illustrating the task asked of the participant to identify the correct answer. Screenshot captured after testing was completed, at a lower video resolution than was used during data collection.

Procedure
Participants were recruited one by one. After providing informed consent, the participant would begin a 2-minute practice trial of the experiment, which included one SAGAT pause and during which no data was recorded. The participant was encouraged to ask the researcher any necessary questions during the trial. Once the trial was complete, the screen would fade to black and the participant was informed that they could begin the experiment proper when ready; it began once they accepted through the simulator interface. After the simulation, participants completed a questionnaire based on their experience and were thanked for their participation.

Analysis
Data was pre-processed using Python scripts. Statistical analysis was performed using R [39]. We wish to evaluate whether the difference in means for SAGAT scores between the two groups is significant. To first determine parametricity, a Shapiro-Wilk normality test was conducted. A two-tailed independent-measures t-test would be conducted for normally distributed data, and a Wilcoxon rank-sum test otherwise. The same process was followed to determine a difference in means between the answers given in the presence questionnaire. The chosen significance level in all cases was α = 0.05.
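The non-parametric branch of this procedure can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the R analysis used in the study: it computes the rank-sum statistic in its Mann-Whitney U form for the first sample, uses mid-ranks for ties, and obtains a two-sided p-value from a tie-corrected normal approximation without continuity correction. The function name `rank_sum_test` and the example scores are hypothetical. The Shapiro-Wilk step is omitted because it requires a statistics library.

```python
from math import sqrt
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic for x, approximate p-value)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all observations identical: no evidence of a difference
    z = (u - mean_u) / sqrt(var_u)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return u, p


# Hypothetical SAGAT scores for two small groups.
mono_scores = [7, 8, 8, 9, 6, 10]
spatial_scores = [8, 8, 9, 7, 10, 9]
print(rank_sum_test(spatial_scores, mono_scores))
```

Statistical packages may apply an exact test or a continuity correction by default, so their p-values can differ slightly from this approximation.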
A Pearson correlation matrix was also calculated across all quantitative variables, to investigate and evaluate their research value. Finally, qualitative analysis was conducted to identify any overarching themes in participant feedback.
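For reference, a Pearson correlation matrix over a set of per-participant measures can be sketched as follows. The measure names and values are hypothetical placeholders, not study data, and `pearson_r` and `correlation_matrix` are illustrative helper names.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    ss_a = sum((p - mean_a) ** 2 for p in a)
    ss_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(ss_a * ss_b)


def correlation_matrix(measures):
    """Map every ordered pair of measure names to its Pearson correlation."""
    names = list(measures)
    return {(r, c): pearson_r(measures[r], measures[c]) for r in names for c in names}


# Hypothetical values for four participants.
measures = {
    "sagat_score": [7, 8, 9, 10],
    "task_performance": [10, 12, 11, 14],
    "mean_answer_time": [4.1, 3.5, 3.8, 2.9],
}
matrix = correlation_matrix(measures)
print(round(matrix[("task_performance", "mean_answer_time")], 3))
```

The sketch reports coefficients only; testing each coefficient for significance would be a separate step.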

RESULTS
A total of 56 participants were recruited from two separate locations, a college building (28) and an office block (28).
Participant age ranged from 18 to 60 (M = 28.9, SD = 11.5), and the male-to-female ratio was 31:25. All participants reported being fluent in the English language, and none reported any form of colourblindness, hearing issues or spatial awareness issues.

Quantitative results
SAGAT score data across all participants was not normally distributed (W = 0.926, p = 0.002). A Wilcoxon rank-sum test showed that the group that received spatial audio feedback did not yield significantly higher situation awareness scores compared to the group with only mono audio (W = 338.5, p = 0.373). Indeed, the median score was the same for both groups (8). No instance of a participant selecting the "I don't know" answer was recorded for any of the queries.
SAGAT score data is summarised graphically in Figure 4 and numerically in Table 4, with detailed information on responses to each query shown in Table 3.
Although composite overall responses to the Presence Questionnaire formed a normal distribution (W = 0.976, p = 0.331), responses to the individual questions were not normally distributed (p ≤ 0.004). Cronbach's alpha showed that the six items of the Presence Questionnaire had poor internal consistency (α = 0.577). Wilcoxon rank-sum tests for each question individually did not show any significant differences between the spatial audio and mono audio groups (p ≥ 0.104). These results are summarised in Table 5.
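For readers unfamiliar with the internal consistency measure, the following is a minimal sketch of Cronbach's alpha using its standard definition: with k items, alpha equals k / (k − 1) multiplied by one minus the ratio of the summed item variances to the variance of the per-respondent total score. The function name `cronbach_alpha` and the example responses are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of k items, each a list with one score per respondent."""
    k = len(items)
    item_variance_sum = sum(variance(scores) for scores in items)
    # Total score of each respondent across all items.
    totals = [sum(responses) for responses in zip(*items)]
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses of five respondents to three questionnaire items.
example_items = [
    [5, 6, 4, 7, 5],
    [4, 6, 5, 7, 4],
    [6, 5, 4, 6, 5],
]
print(round(cronbach_alpha(example_items), 3))
```

Values near 1 indicate that the items vary together across respondents, while values near 0 indicate that they vary largely independently.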
The other quantitative measures (task performance, time to answer etc) are summarised in Table 4. Pearson correlations between all measures are shown in Table 6, identifying a number of significant correlations.

Table 3. SAGAT scores detailed by each individual SAGAT query, for the mono audio group, the spatial audio group, and overall. As query selection was random, not every query was equally represented. For each query, the table gives the number of times the query appeared, C the number of times the query was answered correctly, r the ratio of correct answers, and Δr the difference between the correct answer ratios of the spatial and mono audio conditions.
Table 5. Statistical summary of results of Presence Questionnaire questions, with results of Wilcoxon rank-sum tests between the Mono (n1 = 28) and Spatial (n2 = 28) audio feedback groups. Shown here with reminder phrases for question content (full questions as seen by participants are shown in Table 2).
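The per-query quantities described for Table 3 can be illustrated with a short sketch. The counts below are hypothetical, and the sign convention for Δr (spatial minus mono) is an assumption, since the caption does not state the direction of the difference. The names `correct_ratio` and `ratio_difference` are illustrative.

```python
def correct_ratio(correct, appearances):
    """Ratio r of correct answers C to the number of times a query appeared."""
    return correct / appearances if appearances else float("nan")


def ratio_difference(mono, spatial):
    """Difference in correct-answer ratio, assumed here to be spatial minus mono.

    Each argument is a (correct, appearances) pair for one query.
    """
    return correct_ratio(*spatial) - correct_ratio(*mono)


# Hypothetical counts for a single query in each condition.
mono_counts = (6, 10)
spatial_counts = (8, 10)
print(round(ratio_difference(mono_counts, spatial_counts), 2))
```

Because queries were selected at random, ratios based on few appearances should be read with caution.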

Qualitative results
32 items of positive feedback and 20 items of negative feedback were provided through the post-experiment questionnaire.
Thematic analysis identified several themes of interest, shown in Table 7. Individual answers to positive and negative open feedback questions are detailed in Appendix B along with their thematic relevance.
"Immersion" and "Entertainment" in overall positive feedback. The majority of feedback given across both groups (14 items total) was of appreciation for the immersion of the interface, in particular the ability to look around. A smaller number of items (8) across both groups reported appreciation for the entertainment of the experience.
"Sound localisation" in Spatial group positive feedback. Participants of the Spatial audio group reported appreciation for the spatial nature of audio feedback and the ability to localise sounds (6 items).
"Difficulty, Frustration" in Mono group negative feedback. Negative feedback provided by the group having received mono audio feedback describes experiencing difficulty and overall frustration with the task and interface (8 items, 6 more than in the Spatial audio group).
"Boredom, Impatience" in Spatial group negative feedback. Negative feedback provided from the spatial audio feedback group reports impatience with wait times between questions or a desire for a faster, more dynamic interface (6 items, 5 more than in the Mono group).

Quantitative findings
While the researchers predicted quantitative differences in favour of spatial audio, no statistically significant differences were found between the two groups (for a global significance threshold of α = 0.05) for situation awareness measurements, presence questionnaire results, or any secondary metrics. Although it is possible that spatial audio feedback has no effect on any of these factors by comparison to mono audio, the sensitivity of the study may have been poor for a number of reasons. For one, as shown in Figure 5, SAGAT scores were concentrated towards the upper end of the scale, with 75% of scores between 7 and 10. With data grouped around the maximum in this way, a ceiling effect may have occurred, whereby the upper limit placed on the measure reduces the meaningfulness of the data, potentially obscuring an effect that would otherwise be observable. Another reason is the low statistical power of the study (estimated at 0.45).
Table 6 shows several correlations of note, a number of which may be of interest for improving the SSA assessment technique or the experimental design of subsequent studies. Several were identified among component questions of the Presence Questionnaire; given how the Questionnaire was designed, the similarity of its component questions, and the way they are administered, significant correlations between them are to be expected. A strongly significant negative correlation is that between Task performance and Mean answer time, which is also unsurprising: participants who are more confident of their answers (or simply more focused on the simulator) would be more likely to input their answers quickly. The positive correlation between Task performance and the Presence Questionnaire component on Sound identification can be similarly explained; the task itself required identifying sounds.
A weaker positive correlation is that of SAGAT score and Task performance. This reflects the similar, albeit non-overlapping, nature of the SAGAT queries and the quiz task; both require information obtained through social signals, and the success rate of identifying the correct answer to a question based on events in the scene could be considered a very focused measurement of situation awareness. Finally, the positive correlation of Variance of viewing angle with Sound localisation suggests that participants would look around the room more to compensate for difficulty localising sounds. This also shows initial promise for the use of the viewing angle variance as an objective measure, either of ease of sound localisation directly, or of workload related to difficulty localising sounds.

Qualitative findings
Examining Table 7, negative feedback provided by the Mono audio group predominantly pertained to the difficulty and frustration theme, while negative feedback from the Spatial audio group pertained to the theme of boredom and impatience. This illustrates a substantial difference in how the trial was experienced between the two groups. What participants disliked about the Mono audio experience was the workload, particularly citing the audio feedback as a cause. By comparison, what participants said they disliked about the Spatial audio experience was tedium, along with a desire for it to be more dynamic and to engage more with the interface. Together with the fact that a substantial amount of positive feedback from the Spatial group concerned the spatial nature of the audio, it can be inferred that introducing spatial audio over mono likely decreased the workload for the task at hand. This is in accordance with the neurological basis of the "cocktail party problem" [14]; attempting to distinguish one thread of speech spoken simultaneously with others can heavily engage the brain, to the point of decreasing performance at concurrent tasks. The binaural hearing of spatial audio is known to be a major contributor to our ability to distinguish speech in this manner [21], so it can be intuited that this would decrease the associated workload.
For the above reasons, along with the overall feedback distribution (the spatial audio group having provided 29% more positive feedback items and 10% fewer negative feedback items than the mono audio group), it can be concluded that spatial audio was a qualitatively superior experience to mono audio, likely due in large part to the potential workload alleviation. This is consistent with observations made in the literature around spatial audio; in prior studies, participants often report appreciation for being able to localize sounds. Finally, the praise given by both groups to the ability to look around using the interface underlines the potential value of telepresence over conventional videoconferencing.

Limitations and recommendations
The most substantial limitation of this evaluation was its statistical power, estimated at 0.45 for an assumed effect size of 0.5. An increase in sample size could overcome this, as could a redesign of the experiment to allow for repeated measures over both test conditions for each participant, although, as explained in Section 3, it may be challenging to adapt to a within-subjects design because of ordering effects and demand characteristics.
While the robot being placed at the centre of the group of actors was relevant to some teleoperation situations, it may be less representative of the majority of real-life interactions. Future studies could draw on the domain of proxemics to improve on this, for example through the use of Kendon's F-formations [23,26].
Although care was taken to preserve sound directionality in the audio pipeline, a complementary study validating the spatial audio of the setup, such as that performed by Kiselev et al. [28], could confirm that spatial audio was properly conveyed. The setup of this experiment used first-order ambisonics; the fidelity of spatial audio could be improved by increasing the ambisonic order of the microphone (using a larger microphone array).
The potential ceiling effect observed on SAGAT scores is another limitation of note. Future studies can adjust for this using the data in Table 3, which breaks down the individual scores for each SAGAT query type.
These data can be used to assess the relative difficulty of future queries and, if a similar "quiz" scenario is repeated in a future study, the most suitable queries can be prioritized for re-use.
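To give a sense of scale for this limitation, the following sketch estimates power with a normal approximation for a two-sided, two-sample comparison of means at α = 0.05. This is an assumption made for illustration: the source does not state how the 0.45 figure was obtained, and an exact calculation (or one tailored to the rank-sum test) would differ slightly. The function names `approx_power` and `approx_n_per_group` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)  # expected value of the test statistic
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def approx_n_per_group(effect_size, power=0.8, alpha=0.05):
    """Approximate per-group sample size needed to reach the requested power."""
    z = NormalDist()
    return ceil(2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / effect_size) ** 2)


# Study configuration as reported: 28 participants per group, assumed effect size 0.5.
print(round(approx_power(0.5, 28), 2))
print(approx_n_per_group(0.5, power=0.8))
```

Under this approximation, the reported configuration gives a power of roughly 0.46, in line with the estimate above, and reaching the conventional 0.8 target would require more than double the number of participants per group.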

CONCLUSION
This study set out to develop a methodology for measuring social situation awareness through a novel application of the Situation Awareness Global Assessment Technique (SAGAT) to the domain of social interaction, evaluated by investigating whether spatial audio feedback during robot teleoperation would increase operator awareness of socially relevant details.
While qualitative analysis of the trial study showed tangible usability benefits to spatial audio feedback, the quantitative data of the trial study was mostly inconclusive, although correlations between measurements indicate potential value in the approach; more work is required to improve and validate the methodology. It is hoped that these findings can create a starting point for further social situation awareness studies, as well as highlight the value of qualitative analysis as a complement to quantitative statistics.
Much work remains to be done towards developing and perfecting social situation awareness measurement. The reliability of the technique detailed in this paper depends on that of SAGAT; future methods papers may consider moving away from a reliance on SAGAT and/or conducting assessments of consistency and sensitivity. It would also be of value to develop a method for it to be used in within-subjects trials, perhaps introducing alternative solutions to mitigate ordering effects and demand characteristics. Finally, its queries could be generalised to more easily apply the technique to any social context.
Concerning spatial audio in telepresence, future work might begin by investigating more aspects of spatial audio during social robot teleoperation, notably a formal workload evaluation using tools such as NASA-TLX [20]. A subsequent study could more closely focus on the cocktail party effect during robot teleoperation, studying the effect of spatial audio feedback in contrast to mono while also varying the level of interfering noise from other speakers. This could also be applied in a contrasting social context, such as a longer, more drawn-out explanation or storytelling session, to examine different forms of social awareness.

ACKNOWLEDGMENTS
We thank Tangent as well as The Digital Hub for enabling data collection. We also thank Maya Vizel-Schwartz (University College Dublin) for discussions on qualitative methods, and Ann Bell (The Digital Hub) for general helpfulness.
Table 9. Quiz questions asked during Video 2. Correct contestant and answer for each question shown in bold. Quiz master: "Megan", Contestants: "Michael", "Emma", "Amy", "Carol", "Paul"

B OPEN PARTICIPANT FEEDBACK
The feedback in the following tables was provided via the post-experiment questionnaire; positive feedback is listed in Table 10 and negative feedback in Table 11. This feedback was used for the qualitative analysis in Section 5.2.

Feedback item | Category

Mono group
Not having a wider field of view - not being able to see more than one person at a time | Field of view too small
Not having a 360 degree view and taking the time to shift between people. | Field of view too small
Was a bit strange being in the centre of a group, almost like I was being looked at from every direction | Other
difficult to differentiate the girl voices, I would need to see who speaks | Audio was poor (increasing difficulty)
sound quality! | Audio was poor (increasing difficulty)
sound had a bit of an echo to it | Audio was poor (increasing difficulty)
Sometimes I tried to lipread the participant that was asked to answer and the video lagged slightly making that difficult. | Audio was poor (increasing difficulty)
using the arrows to move left and right. in a real environment I would just position my body, move my head or just my eyes | Other
when answered together, I could only hear one answer rather than both | Audio was poor (increasing difficulty)
frustrating when cant make out answers as someones talking over | Audio was poor (increasing difficulty)
Found it rather annoying cause I like to see who is answering while they answer, but it takes a while to shift focus to a person and figure out who answered till the teachers says it out loud. | Slow camera panning speed

Spatial group
slow panning speed of the visuals | Slow camera panning speed
The wait time between the questions was quite long. If the video paused until the questions was answered, and then resumed straight away. I think it would've been slightly better. | It was repetitive / boring
No vertical control | No vertical camera control
you cant see up and down | No vertical camera control
the overlay of two answers required a lot of concentration and often the choices were not clear until the graphic with the answer selection appeared on screen. | Audio was poor (increasing difficulty)
the talking over each other so it was hard to concentrate on correct answers | Audio was poor (increasing difficulty)
wish the arrows would move quicker | Slow camera panning speed
It was somewhat repetitive | It was repetitive / boring
How I experienced volume differed to how I experience volume normally. I expected the people behind me to be quieter. | Other

Table 11. Negative feedback provided in answer to the question "Is there anything in particular that you disliked about the experience?"