Delay Threshold for Social Interaction in Volumetric eXtended Reality Communication

Immersive technologies like eXtended Reality (XR) are the next step in videoconferencing. In this context, understanding the effect of delay on communication is crucial. This article presents the first study on the impact of delay on collaborative tasks using a realistic Social XR system. Specifically, we design an experiment and evaluate the impact of end-to-end delays of 300, 600, 900, 1,200, and 1,500 ms on the execution of a standardized task involving the collaboration of two remote users that meet in a virtual space and construct block-based shapes. To measure the impact of the delay in this communication scenario, objective and subjective data were collected. As objective data, we measured the time required to execute the tasks and computed conversational characteristics by analyzing the recorded audio signals. As subjective data, a questionnaire was prepared and completed by every user to evaluate different factors such as overall quality, perception of delay, annoyance using the system, level of presence, cybersickness, and other subjective factors associated with social interaction. The results show a clear influence of the delay on the perceived quality and a significant negative effect as the delay increases. Specifically, the results indicate that the acceptable threshold for end-to-end delay should not exceed 900 ms. This article additionally provides guidelines for developing standardized XR tasks for assessing interaction in Social XR environments.


INTRODUCTION
The use of immersive technologies has aroused interest in several telecommunications-based applications, such as industrial training [42,54], telecare [10], and telemeetings [48]. However, 2D videoconferencing is still the most widely used technology for teleconferences, although it presents certain drawbacks that affect the user experience. According to Skowronek et al. [48], prolonged videoconferencing can strain human interaction factors in telemeetings, causing fatigue and increased cognitive load due to the unnatural communication, reduced mobility, and the added effort of non-verbal communication (known as videoconferencing fatigue). Therefore, 2D videoconferencing presents inherent limitations due to its 2D visual representation and the lack of free user movement.
To overcome the limitations of 2D videoconferencing, Social eXtended Reality (XR) has emerged as a promising solution by offering more natural and immersive communication. This is because of the inherent 3D nature of XR technology, which allows users to move around freely and interact with each other in a more realistic and engaging way [20,24,31]. In addition, under the XR paradigm, local and distant physical realities can be blended with virtual assets to offer realistic interactions in 6 Degrees of Freedom (DoF) that enhance the user experience. Within the possibilities offered by this paradigm, Social XR communications are expected to be the next step in immersive communications [24,31,48].
However, despite the increasing popularity of XR communications, the effects of system factors on user experience and performance have not yet been widely studied, with delay being among the most important of these factors. In contrast, the influence of delay in 2D videoconferencing is a well-studied field [2,6,11,45]. Previous studies show that delay affects users in different ways. On the one hand, desynchronization and echo severely degrade the quality of the system as perceived by users. On the other hand, when these effects are mitigated and the delayed streams are kept synchronized, users are able to withstand higher delays [45]. This is the most common and most studied aspect of delays in videoconferencing.
In earlier studies, the influence of delay on the adoption of videoconferencing technology has been examined through subjective experiments [6-8, 43, 46, 50]. Together with objective metrics, these experiments have identified acceptable delay thresholds for videoconferencing [35,37,41]. The recommended delay threshold for avoiding user annoyance is below 600 ms [37], but recent studies have suggested higher values, exceeding 900 ms [6,43]. While these values apply to 2D videoconferencing, they may not be applicable to richer Social XR communication scenarios. However, to the best of our knowledge, there are still no similar studies to establish the limits of delay for videoconferencing in Social XR. Moreover, there is still no established methodology for the evaluation of interactive videoconferencing in Social XR.
This article addresses the challenge of determining new appropriate delay limits to guarantee the user's acceptance in collaborative Social XR. For this purpose, a subjective experiment was conducted with remote users communicating verbally and visually using photorealistic 3D representations [25] within a shared virtual environment, under different delay conditions. Moreover, we present a new methodology for the evaluation of interactive videoconferences in XR, adapted from the standard for evaluation in 2D videoconferences. Our results show an impact of the delay on the user experience and conversation flow above 900 ms. These values are related to previous studies on video-based conferences that pointed to delay acceptance values above 600 ms [6,43]. Therefore, this study makes the following contributions:
-Set an acceptance limit at 900-ms end-to-end delay for Social XR.
-Provide a new evaluation protocol for interactive teleconferencing in Social XR.

RELATED WORK
The objective of this study is to evaluate the impact of interaction delay in immersive teleconferencing environments using a photorealistic Social XR system. Delay can be defined as the elapsed time between the transmission of a signal and its reception at the destination. In the context of videoconferencing, end-to-end delay refers to the delay between the movement of a user and the moment when the remote user sees that movement. Delay can have a detrimental effect on the communication process, leading to a decrease in the quality of interaction [41].
Audiovisual communication systems, including videoconferencing and streaming services, are highly reliant on user experience in terms of system acceptance [48,49]. One of the key factors that can impact user acceptance is delay [1,2,11]. In particular, delay is crucial for real-time applications such as videoconferencing.
This section analyses the delays in other non-immersive environments to provide an overview of the recommended values for more classical communications. In addition, the current state of immersive communications in Social XR is described along with examples of systems, and the current methodologies to assess the influence of technical factors on user acceptance.

Delay on 2D Communications
The conventional approach to videoconferencing involves the use of at least a display, a camera, a microphone, and an audio playback device for each participant. The transmission of audiovisual data may cause delays that affect the videoconference. Research on the acceptance of videoconferencing systems establishes that the threshold for synchronization between video and audio signals ranges, on average, from +90 ms to -185 ms [32]. Although synchronization issues may be resolved through the use of synchronizers, reducing the overall end-to-end delay within communication systems is not a straightforward task.
The perception of delay, as well as its effects on interaction, is a field of study within the areas of user experience, conversation, and interaction [7,14,37,47]. With respect to the factors used to evaluate user experience, we found analyses of both subjective factors through questionnaires and objective data [37]. The relevant data gathered by the standards include the perceived Quality of Experience (QoE), the annoyance using the system, the perception of the interruptions, and whether users notice the delay [37,41]. According to the evaluation of these factors, the overall delay tolerance for maintaining an acceptable experience is said to be under 600 ms [37]. Furthermore, recent studies set the acceptable delay for users at higher values (above 900 ms) [44,46]. From another point of view, some studies have analyzed the impact of delay in video-mediated interactions by assessing its effect on how conversations flow [15, 45-47]. In this sense, video delay in video-mediated interaction has significant implications for communication and user experience. High delay can result in a disjointed and unnatural conversation, with participants experiencing delays between their actions and the corresponding responses. This delay can hinder the flow of conversation, disrupt the natural turn-taking process, and negatively influence non-verbal cues and gestures [47]. Nevertheless, the results of these studies point out that users somehow adapt to these delays and attribute the technical difficulties to the poor fluency of the other users [45].
However, these results are intended for 2D videoconferencing. Due to the higher DoF and different imaging modalities that are being used in XR communications, the thresholds in delay may be different; therefore, these delay thresholds need to be revised. For this purpose, in this article, we conducted a comprehensive study of latencies in a Social XR environment.

Tasks for Evaluating XR Systems
The protocols established to assess the impact of the system on interaction in videoconferencing environments are varied and are reflected in different international recommendations [36-38, 41]. These protocols involve the performance of a task with more than one user. While such protocols are well established and standardized, there is a lack of protocols for the evaluation of Social XR systems. In the literature, several evaluation studies of Social XR systems can be found that include tasks such as watching a movie [25], collaborating to achieve a common pattern [22], and playing a game [17].
In this work, we have replicated a collaborative block-building task described in a standard recommendation on interactive test methods for 2D audiovisual communications [41]. To adapt the task as faithfully as possible, we present a Social XR environment that mimics the building block task using photorealistic representations of the users and the figures. Additionally, for the experimental design we have followed the recommendation for immersive video evaluation ITU-T P.919 [19]. All of these contributions together form a new protocol designed according to different international standards for interaction evaluation in Social XR.

XR Communications System
Social XR refers to a paradigm where individuals can interact with each other and their surroundings through the use of XR technologies. Therefore, Social XR systems enable remote and synchronous communication, providing an immersive experience that goes beyond 2D videoconferencing [31].
The main difference between Social XR and 2D videoconferencing is the DoF for user exploration and interaction [48]. DoF signifies how freely a user can view different angles of media content. The level of DoF in Social XR systems ranges from 3 DoF, which involves head movements (pitch, yaw, and roll), to 6 DoF, which adds translational coordinates (x, y, z). Therefore, Social XR should allow video viewing from different points of view.
In the literature, we can find different Social XR systems with different DoF capabilities. For example, Kachach et al. [21] present a virtual environment where users can interact with a distant environment in 3 DoF using a 360-degree camera. Another example is the work of Becher et al. [4], which presents an environment with purely virtual avatars where users interact with 6 DoF using their voice and controllers. However, this 6-DoF environment does not use video for user representation. Finally, Viola et al. [52] present a 6-DoF Social XR system using volumetric video through a set of coordinated color and depth cameras. Therefore, volumetric video is a promising approach for Social XR because it enables users to see each other in photorealistic detail from multiple perspectives.
Volumetric video is an emerging technology that further enhances the user experience in XR environments. Unlike 2D video formats, which offer fixed viewpoints, volumetric video enables users to see each other from various perspectives within the virtual space. This means that users can explore and interact with one another from different angles, providing a more natural and engaging way to communicate in virtual environments. This capability adds an extra layer of realism and interactivity to XR experiences, making them feel even more like face-to-face interactions [31,48]. With respect to volumetric video, we can find two representation techniques. One approach is mesh-based techniques. These techniques generate a set of dependent triangles that are positioned and colored according to the information received by the depth and color cameras. Some examples of mesh-based volumetric videoconferencing systems can be found in other works [5,27,55]. Although these techniques have been shown to provide good performance under loose grid conditions, the triangle generation process requires complex processing that can affect system delay [51].
Another approach to represent volumetric video is point clouds. Point clouds are generated by giving an independent volume in space to each color and depth pixel set provided by the cameras. The fact that the points are independent and derive directly from the camera streams makes point clouds more suitable for real-time systems [51]. In addition to the real-time requirement, the use case for videoconferencing in Social XR requires systems that are adapted to immersive technologies. Some state-of-the-art systems that use volumetric video in Social XR are Free Viewpoint Video Live [9], Holoportation in Microsoft Mesh [29], and VR2Gather [52].
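As an illustration of how each depth and color pixel can be given an independent position in space, the following minimal sketch back-projects a depth image through a pinhole camera model. It is not taken from VR2Gather or the CWIPC system; the `Intrinsics` fields, the function name, and the convention that a depth of zero means "no measurement" are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Color = Tuple[int, int, int]
Point = Tuple[float, float, float, Color]


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera parameters (hypothetical values are supplied by the caller)."""
    fx: float  # focal length in pixels, horizontal
    fy: float  # focal length in pixels, vertical
    cx: float  # principal point, horizontal
    cy: float  # principal point, vertical


def depth_to_points(depth_m: Sequence[Sequence[float]],
                    colors: Sequence[Sequence[Color]],
                    k: Intrinsics) -> List[Point]:
    """Turn every valid depth pixel into an independent colored 3D point."""
    points: List[Point] = []
    for v, row in enumerate(depth_m):
        for u, z in enumerate(row):
            if z <= 0.0:  # assumed convention: zero or negative means no depth reading
                continue
            x = (u - k.cx) * z / k.fx
            y = (v - k.cy) * z / k.fy
            points.append((x, y, z, colors[v][u]))
    return points
```

Because each point depends only on its own pixel, this step needs no connectivity information, in contrast to the triangle generation required by mesh-based techniques.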
In this work, the VR2Gather Social XR system [52] has been selected because it is a point cloud based volumetric videoconferencing system prepared for immersive environments. Moreover, it allows symmetric communication in terms of visualization between users. In other words, users see themselves and others in a reciprocal manner (Figure 1). Another decisive factor was that it is open source [52], allowing modifications to be made to introduce artificial latencies. In addition, it makes the experiment replicable, so that the protocol described in this article can be included as part of the tasks of a forthcoming recommendation for the evaluation of volumetric Social XR systems.

SOCIAL XR VIDEOCONFERENCE ENVIRONMENT
The objective of the system is to enable interactive videoconferencing using immersive technology. To achieve this, different modules are linked together, allowing users to see themselves in an XR environment where they can manipulate objects from their physical reality. Additionally, the system needs to be able to represent and display the remote user in the shared environment. Therefore, the system must capture aspects of two physical realities, namely where the two remote users are located, and position all of that information in a Social XR environment. As an illustration, Figure 1 shows two users placed in two different physical rooms (bottom), with each user wearing a Head-Mounted Display (HMD), and corresponding snapshots of the views generated from their HMDs (top). In this figure, it can be seen that both users are immersed in a virtual world with a virtual table that mimics the physical one while hands and physical blocks are visible. Additionally, the volumetric representation of the remote user is visible at the end of the virtual table.

Social XR System
The different elements that make up the Social XR system are defined here. The two roles related to the collaborative task, namely the instructor and the builder, are presented in Figure 2. Furthermore, each color (blue and orange) represents the flow of information from each role. The black border boxes represent the elements contained in each physical reality, in other words, the physical room where each user is located. In this study, we use a room with a table (see Figure 1). In each black frame of Figure 2, a user can be seen wearing an HMD and being captured by surrounding cameras. The cameras surrounding the users capture color and depth information from the physical reality to generate a point cloud representation. In addition, the HMD generates two types of information. It captures the user's voice with the built-in microphone and, through the integrated camera, captures the physical reality from an egocentric perspective (self-view). The audio and the point cloud are combined with information about the world and then encoded and transmitted to the remote user via the TCP transmission protocol. It is at this point that the remote user integrates this information into their virtual world to generate the view of the Social XR environment that will be reproduced by their HMD.
According to the diagram described previously, there are two information loops in the system: one for the generation of the self-view and another for the generation of the volumetric avatar (point cloud, audio or voice, and world position).
For the generation of self-view, the XR environment should represent the physical environment that usually includes the user's body and real objects. In our case, we capture the physical environment using egocentric cameras that are attached to the HMD, and by using image segmentation algorithms to crop the image, only the body of the user and some real objects are included within the Social XR environment (Figure 3).
For the generation of the user volumetric avatar, the system includes an acquisition setup that uses multiple cameras with depth sensors to capture volumetric data of the user from different angles [25]. In addition, the voice is captured by the HMD's built-in microphone. The captured data is then processed, transmitted, and integrated into the shared environment (Figure 4). An analysis of the different processes that contribute to the end-to-end delay is presented in the next subsection.

System Delay
The system has numerous sequential processes, each of which can add an intermediate delay that will affect the total end-to-end delay. Table 1 summarizes the different components of this delay, which are related to capturing, processing, display, transmission, and synchronization.
In the XR communication system, there are two different information loops that are sensitive to delay. The first one is the self-view. The Social XR system uses the egocentric camera for capturing the physical environment; then, this image is processed to include only the user's hands and some objects of the physical environment (see Figure 3). After that, the result is rendered in the virtual world and displayed in the HMD. In Figure 4, this loop is illustrated in the self-view element that passes through the world synchronizer to add the hands and some real objects into the generated view. Therefore, the elements that contribute to the composition of the self-view delay are
$$\text{self-view delay} = \tau_{cap} + \tau_{proc} + \tau_{disp}. \tag{1}$$
In Equation (1), $\tau_{cap}$ stands for the time the HMD camera frames are available in the processor memory. The $\tau_{proc}$ term includes the transformation of the camera to adapt to virtual reality and the segmentation process. The $\tau_{disp}$ term stands for the time that the XR engine takes to show the result of the processing in the HMD.
To generate the user representation, the process is more elaborate. First, a set of color and depth cameras should be placed around the user to cover their volume. Then, the captured information of each camera is processed with a common reference in real space (calibration). With this information, the system generates a point cloud representation of the user. Then, the point cloud is coded and transmitted to the remote user together with the microphone audio and the world information through a TCP connection. Then, the remote user server should receive the data, synchronize the audio and video, and render the point cloud into the remote user XR environment according to the world information. Therefore, the elements that contribute to the composition of the Social XR delay are
$$\text{Social XR delay} = \tau_{cap} + \tau_{proc} + \tau_{tx} + \tau_{sync} + \tau_{disp}. \tag{2}$$
In Equation (2), $\tau_{cap}$ stands for the time the HMD camera frames are available in the processor memory. The $\tau_{proc}$ term includes the point cloud generation. The $\tau_{tx}$ term stands for the transmission time of the volumetric avatar. The $\tau_{sync}$ term stands for the time of world synchronization, that is, audio and video synchronization plus world positioning. Finally, $\tau_{disp}$ stands for the time the XR engine takes to show the result of the processing in the HMD.
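The two delay compositions described above are plain sums of sequential components, which the following minimal sketch makes explicit. The function names and all numeric values are hypothetical and are not measurements from this study; they only illustrate how an artificial delay could be derived from a target end-to-end value.

```python
def self_view_delay_ms(cap: float, proc: float, disp: float) -> float:
    """Self-view loop: capture + processing + display."""
    return cap + proc + disp


def social_xr_delay_ms(cap: float, proc: float, tx: float,
                       sync: float, disp: float) -> float:
    """Remote-user loop: capture + processing + transmission + synchronization + display."""
    return cap + proc + tx + sync + disp


def extra_delay_needed_ms(target_ms: float, measured_ms: float) -> float:
    """Artificial delay to add so that the end-to-end delay reaches the target.

    Returns 0 when the system is already slower than the target, because
    delay can be added but not removed.
    """
    return max(0.0, target_ms - measured_ms)


# Hypothetical component values in milliseconds, for illustration only.
baseline = social_xr_delay_ms(cap=30.0, proc=60.0, tx=120.0, sync=40.0, disp=20.0)
padding_for_900 = extra_delay_needed_ms(900.0, baseline)
```

With these illustrative numbers the baseline is 270 ms, so 630 ms of artificial delay would be needed to reach a 900-ms condition.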
Although the local user client and remote user server capturing and display delays can be determined and stabilized, the transmission and processing delays are subject to network variables and computer capabilities. As a result, these delays can have an unexpected impact on the user experience. In the experiment, the delay under consideration represents the duration between the local camera capture and its rendering on the remote display.

EXPERIMENTAL DESIGN
The aim of this study is to assess the impact of interaction delay on immersive teleconferencing environments for Social XR, by utilizing photorealistic user representations. To accurately evaluate the effects of delay, a task was selected from the standard for interaction assessment in videoconferencing: the ITU-T P.920 [41]. This task involves collaborating to construct block-based figures, with one participant designated as the instructor and the other as the builder. The objective is for the instructor to guide the builder to reproduce the complete figure. Communication and interaction take place through both audio and visual channels, as the teleconferencing environment is audiovisual in nature. However, the task was originally intended for 2D videoconferencing using a basic camera and a 2D monitor, and thus modifications were necessary to adapt it to the immersive environment. Specifically, egocentric capture with chroma-based physical environment segmentation was employed to represent the local environment, whereas multi-camera-based volumetric capture was used to represent distant users. These adaptations are illustrated in Figure 1.
The Social XR system under consideration encompasses two distinct delays: the self-view delay and the XR delay. An assessment of the impact of the self-view delay on the block-building task's performance was conducted in a previous study [12], using an identical system configuration. To eliminate the effect of additional parameters, that experiment involved no remote user (typically responsible for providing instructions on the building process); instead, a pre-reconstructed 3D figure was incorporated into the setup to serve as a reference. The study determined the minimum latency of the system self-view to be 190 ms. Moreover, we tested the user's experience under different artificially introduced self-view delays of up to 587 ms. Our results showed that for delays lower than 338 ms, the user experience was unaffected. As a result, it is concluded that the self-view delay introduced by the system (190 ms) yields very good results in terms of user experience and does not influence the Social XR study presented in the current article.
This section introduces the methodology employed in the current study. The research involved the adaptation of the standardized ITU-T P.920 task, which entailed the collaborative construction of block-based figures within the Social XR environment. A description of the software utilized for synchronizing the virtual environments of two users and artificially manipulating delays is provided. Furthermore, the hardware configuration for each room, corresponding to the distinct task roles, is described. Moreover, the process of experimental design, encompassing task adaptation, administration of subjective quality questionnaires, and collection of objective data during experimental sessions, is outlined. Finally, it should be mentioned that the experimental process was refined based on pilot studies that were conducted with a limited participant pool, which are briefly reported.

Hardware
The experimental hardware utilized in this study covered a range of functionalities, namely local reality capture, point cloud capture and transmission, synchronization, and Social XR environment display, allocated per user. Local reality capture and environment display were achieved through the use of the HTC Vive Pro HMD, whereas point cloud capture and generation were handled by the CWIPC system [25], using Azure Kinect color and depth cameras. The synchronization of social worlds was managed by the VR2Gather software [25], installed on Windows 10 PCs with an Intel Core i7-4790 with a clock speed of 3.6 GHz and eight logical cores, alongside an NVIDIA TITAN Xp GPU.

Software
The predominant software used was VR2Gather, a socially immersive software platform designed by the Centrum Wiskunde & Informatica (CWI) using the Unity engine, which enables audiovisual communication in XR settings. To assess diverse delay circumstances, we adapted the software component tasked with synchronizing the audio and video components of an avatar, that is, the synchronizer. The synchronizer is responsible for matching the audio and volumetric video received by each user. In addition, it has the option of storing this information so that the total delay is controlled (taking into account the time it took to receive the audio and video from its capture). Therefore, the synchronizer makes the experiment possible, allowing the delay to be artificially varied. Additionally, we use OBS [28] software to capture the audio of the conversations. This software was configured to capture the microphone and headphones integrated into the HMD. Each of these sources was stored in a channel of an audio file to facilitate further analysis. The MIRO360 [13] application was used to conduct the questionnaires within the virtual environment.
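The idea of holding received audio and point cloud frames until a controlled total delay has elapsed since capture can be sketched as follows. This is not the VR2Gather synchronizer; the `DelayBuffer` class, its method names, and the assumption that sender and receiver share a common clock for capture timestamps are all hypothetical and introduced only for illustration.

```python
import heapq
from typing import Any, List, Tuple


class DelayBuffer:
    """Release each frame once `target_delay_ms` has passed since its capture.

    Assumes capture timestamps and `now_ms` come from synchronized clocks,
    so that the time already spent in processing and transmission counts
    toward the target delay.
    """

    def __init__(self, target_delay_ms: float) -> None:
        self.target_delay_ms = target_delay_ms
        self._heap: List[Tuple[float, int, Any]] = []
        self._counter = 0  # tie-breaker so payloads are never compared

    def push(self, capture_ts_ms: float, payload: Any) -> None:
        """Store a received frame together with its capture timestamp."""
        heapq.heappush(self._heap, (capture_ts_ms, self._counter, payload))
        self._counter += 1

    def pop_ready(self, now_ms: float) -> List[Tuple[float, Any]]:
        """Return, oldest first, every frame whose release time has arrived."""
        ready: List[Tuple[float, Any]] = []
        while self._heap and self._heap[0][0] + self.target_delay_ms <= now_ms:
            ts, _, payload = heapq.heappop(self._heap)
            ready.append((ts, payload))
        return ready
```

Keying the release time on the capture timestamp, rather than on the arrival time, means that variable network and processing delays are absorbed by the buffer as long as they stay below the target.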

Objective Data
During the experiment, objective data was captured to analyze the impact of delay on user performance. The time required by each pair of users to complete the task was recorded using a data log from Unity. Furthermore, the audio of the conversations was captured to identify the number of interventions and the activity time of each user.
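One way to derive the number of interventions and the activity time from a recorded channel is sketched below. It assumes that a voice activity decision per audio frame is already available (for instance, from an energy threshold) and that short pauses inside an utterance should not start a new intervention; the function name and the `max_pause_frames` parameter are assumptions for this example, not the analysis procedure used in the study.

```python
from typing import Sequence, Tuple


def conversation_stats(active: Sequence[bool], frame_ms: float,
                       max_pause_frames: int = 0) -> Tuple[int, float]:
    """Count interventions and total speech time for one user.

    `active[i]` is True when the user is speaking during frame i.
    A new intervention starts when speech follows more than
    `max_pause_frames` consecutive silent frames (or no speech at all).
    Returns (number_of_interventions, activity_time_ms).
    """
    interventions = 0
    active_frames = 0
    silence_run = None  # None until the first speech frame is seen
    for is_active in active:
        if is_active:
            active_frames += 1
            if silence_run is None or silence_run > max_pause_frames:
                interventions += 1
            silence_run = 0
        elif silence_run is not None:
            silence_run += 1
    return interventions, active_frames * frame_ms
```

Applying the function to each channel of the stereo recording separately gives per-user statistics that can then be compared across delay conditions.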

Questionnaire
To evaluate the influence of interaction delay, a combination of objective and subjective measures was employed. Subjective quality questionnaires were selected based on their previous use in assessing interaction quality. Table 2 presents the subjective factors evaluated in conjunction with their respective questions. The subjective factors analyzed included global quality, system annoyance, delay perception, and interruption perception, derived from international standards and specifically aimed at assessing the impact of delay on system acceptance [34,37,41]. Additionally, to evaluate the effect of delay on the perception of interaction with the local environment, a validated questionnaire for this type of environment was used [30]. This questionnaire was also used for the self-view delay experiment [12]. To further examine the impact on subjective social quality, questions from the work of Gupta et al. [18] used in an experiment with a similar task [53] were included to assess subjective social factors.

Experimental Conditions
The experimental conditions comprised the pairing of delay values and block-based figures. A pilot test was conducted to select the different delay conditions, for which an initial proposal of figures and delays was prepared. The delay intervals were anchored at 300 ms, which was deemed to be the base. To evaluate the effectiveness of the proposed experimental conditions, the pilot test was conducted with 10 participants who evaluated the system using four figures with four different delays. The pilot test established that quality degradation appeared in the range from 600 to 1,000 ms and that the degradation was more significant for the builder role. Additionally, the feedback from the participants suggested that the figures were relatively complex. Consequently, for the actual experiment, the number of delay values around 600 and 1,000 ms was increased, which was made possible by reducing the number of blocks for each figure. The following delay values were selected: 300 ms (minimum), 600 ms, 900 ms, 1,200 ms, and 1,500 ms. In addition, the selected block-based figures are shown in Figure 5. Each figure is composed of seven blocks. An essential consideration when establishing experimental conditions is randomization and balancing [33]. To ensure that conditions were balanced, a Graeco-Latin square design was used to organize the delay and figure conditions [23]. In this way, we ensured that the same number of pairs of conditions existed for each possible combination. In addition, the order of the conditions was randomized.
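A balanced pairing of five delays with five figures can be sketched with a 5x5 Graeco-Latin square. The construction below, which uses the two orthogonal Latin squares (i + j) mod n and (i + 2j) mod n, is a standard one for odd n and is only an assumption about how such a design could be built; the article does not state which square was used, and the figure labels are placeholders.

```python
import random
from typing import List, Sequence, Tuple

DELAYS_MS = [300, 600, 900, 1200, 1500]
FIGURES = ["figure_1", "figure_2", "figure_3", "figure_4", "figure_5"]  # placeholder labels


def graeco_latin_square(n: int) -> List[List[Tuple[int, int]]]:
    """Superimpose two orthogonal Latin squares of odd order n."""
    if n < 3 or n % 2 == 0:
        raise ValueError("this simple construction requires an odd order n >= 3")
    return [[((i + j) % n, (i + 2 * j) % n) for j in range(n)] for i in range(n)]


def session_plan(delays: Sequence[int], figures: Sequence[str],
                 rng: random.Random) -> List[List[Tuple[int, str]]]:
    """Return one row per session; each row is an ordered list of (delay, figure) tasks."""
    if len(delays) != len(figures):
        raise ValueError("delays and figures must have the same length")
    shuffled_delays = list(delays)
    shuffled_figures = list(figures)
    rng.shuffle(shuffled_delays)   # randomize which delay maps to which symbol
    rng.shuffle(shuffled_figures)  # randomize which figure maps to which symbol
    square = graeco_latin_square(len(delays))
    return [[(shuffled_delays[a], shuffled_figures[b]) for a, b in row] for row in square]


plan = session_plan(DELAYS_MS, FIGURES, random.Random(0))
```

In the resulting plan, every delay and every figure appears exactly once in each row and in each task position, and each of the 25 delay and figure combinations occurs exactly once across the five rows.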

Experiment Workflow
The experimental procedure involves several sequential steps. First, the participants are informed about the collaborative task and instructed to disregard any visual effects arising from egocentric capture and volumetric avatars. Subsequently, the roles of instructor and builder are assigned to the participants and they are located in separate rooms. Participants are informed of a training session during which they can familiarize themselves with the system. In the training session, users must complete two buildings under the best (300 ms) and worst (1,500 ms) delay conditions. This methodology is in line with the conventional practices in subjective experiments [33,40]. A 10-minute break follows the training session before the start of the actual experiment. The experiment consists of a repetition of five tasks with different delay conditions and figures. Figure 6 shows a flow diagram of the experiment. Each "task" involves the collaborative process between an instructor and a builder, utilizing an immersive videoconferencing system to construct a figure. At the start of each task, the instructor begins with a perfectly constructed figure, whereas the builder starts with a set of loose parts. The users then collaborate to enable the builder to replicate the figure held by the instructor. Once the users determine they have completed the task, the experimenter initiates a virtual environment where the users can respond to the questionnaire outlined in Table 2. After both users complete their questionnaires, they wait in an empty environment for the experimenter to disassemble the builder's constructed piece and replace the instructor's reference figure, preparing for the next iteration.

Participants
We conducted an experiment with 60 subjects (29 female and 31 male; ages between 20 and 33 years, mean: 22.8, standard deviation: 2.1). None of them were experts in the use of virtual reality. All users reported no vision problems in terms of color perception, and the HMD was adjusted in the training phase to ensure the best visual experience.

RESULTS
This section presents the results of the various factors assessed in the experiment. Each subsection comprises a normality test to assess the distribution of scores; an ANOVA to examine the impact of delay, figure, and role on voting outcomes; and a bar graph of the average score for each role and delay value. In addition, Tukey's HSD (honestly significant difference) post hoc analysis was performed to evaluate the differences between the delay values.

Subjective Performance Factors
Initially, normality was confirmed for each of the factors either by a Kolmogorov-Smirnov normality test or by checking that both skew and kurtosis were in the range (-2, 2) as established by George [16]. Table 3 shows the statistical results for each factor of the subjective performance of the system. For each factor, the table reports the statistical significance (by means of an ANOVA) of the different variables of the experiment (role, delay, and figure). If the role is found to influence the scores, an analysis by role is performed for that factor. In addition, for variables showing significance (p < 0.05), Tukey's HSD post hoc analysis was performed to identify statistically different delay pairs. According to the results, the role was significant for the delay perception and interruptions factors, which is why the analysis of these factors is done individually by role. Furthermore, the study examined the impact of different figures on the voting results and found that while certain figures significantly influenced Global QoE and the instructor's perception of delay influence, the effect was relatively small (η² < 0.06). Tukey's HSD analysis revealed significant differences between only two figures (Mazinger and Bird). In contrast, the delay was found to have a significant impact on voting for all factors (p < 0.05), with a generally large effect size (η² > 0.14).

Fig. 7. Subjective performance results.
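To make the effect-size thresholds quoted above concrete (η² < 0.06 small, η² > 0.06 medium, η² > 0.14 large), the following sketch computes η² for a single factor as the ratio of between-group to total sum of squares. It is a simplified one-way illustration: the study uses a multi-factor ANOVA, the lower bound of 0.01 for a "small" effect is the usual Cohen convention rather than a value from the text, and the scores are hypothetical.

```python
def eta_squared(groups):
    """One-way eta squared: SS_between / SS_total."""
    scores = [x for group in groups for x in group]
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total

def effect_label(eta2):
    """Map eta squared to a verbal label (0.01 bound is an assumption)."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"

# Hypothetical scores for three delay conditions (one list per condition).
scores_by_delay = [[5, 5, 4, 4], [4, 3, 4, 3], [2, 3, 2, 3]]
eta2 = eta_squared(scores_by_delay)
print(round(eta2, 3), effect_label(eta2))
```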
Figure 7 shows the average scores for each factor and delay with their 95% confidence intervals. For the factors of perceived delay and interruptions, we find differences between roles, with the builders being more sensitive to delay (i.e., they notice it earlier). Moreover, we find significant differences from 600 ms of delay for the two conditions and for the two roles. At the level of averages, the scores for the perception of delay and interruptions also drop significantly from a 900-ms delay onward. For overall quality and system annoyance, no differences were found between the roles, but differences were also found for the two factors from 900 ms, with the two worst delays (1,200 ms and 1,500 ms) reaching average scores of 3.5. At the level of QoE in the system, we could establish 900 ms as a threshold that guarantees an acceptable delay. This result is higher than that established in the recommendation [37]; however, it is in line with later studies [6,44].

Presence
This subsection examines the presence factors. First, we verified the normality of the ratings by checking that skew and kurtosis had absolute values less than 2. The results of the analysis of variance, presented in Table 4, include the role, delay, and figure variables for the presence factors under consideration, namely involvement, adaptation, and task. Additionally, Tukey's HSD post hoc analysis was performed to identify significant differences between pairs. After examining the influence of the role variable, it was determined that it only impacted the adaptation factor. Therefore, a separate analysis of the variables by role was conducted for this factor. Results indicate that the delay and task factors had a significant impact, with a medium effect size (η² > 0.06). The significant differences column reveals that differences between delays (1,200 and 1,500 ms) and delays of 600 ms or longer were observed. For the feeling of having completed the task correctly, the delay did not have a significant effect. According to the average results in Figure 8, we only found differences between the roles in the adaptation factor. Here, the builders suffered more from the delay than the instructors. This is in line with the idea that builders notice the delay earlier and that it is more difficult for them to adapt to the task since they need to interrupt the other user. For instructors, this effect is smaller, although it also affects them. The last factor of presence refers to whether users feel that they have completed the task. This result is good for all delays. It was probably influenced by the fact that they needed to agree on the completion of the task to move on to the next figure.

Social Factors
The present study examined some social factors. First, we verified the normality of the ratings by checking that skew and kurtosis had absolute values less than 2. Using an ANOVA (Table 5), it was determined that, for most of the social factors, only the delay factor had a significant impact on the ratings (p < 0.05), whereas the role and figure factors were not significant (p > 0.05). With respect to role, only the social annoyance factor shows statistically different results between instructors and builders (p = 0.01). For the social presence factor, we can see an effect of the figure on the results, but it is at the limit of statistical significance (p = 0.048) and the effect size is small (η² < 0.06).
Tukey's HSD post hoc analysis was subsequently conducted between delay pairs, revealing statistically significant differences between 600 ms and both 1,200 ms and 1,500 ms.
According to the average results from Figure 9, social collaboration and adaptation have similar behavior to the task completion factor for presence. Users have the feeling that they finished the task correctly, both from the self and the whole point of view. Social presence, however, suffered a clear impact of delay, degrading similarly on average to those obtained for the Global QoE values. Finally, for the social annoyance factor, instructors were able to understand the users' message better than builders for higher delay values. The average results of the builder were significantly influenced by the delay (on average) from 900 ms, whereas the instructors kept their averages relatively stable.

Duration
This section presents an analysis of completion time for each experimental condition, namely delay and figure. First, a normality test was conducted to determine the distribution of the data, which indicated a non-normal distribution with kurtosis that exceeded an absolute value of 2. Subsequently, a more detailed examination of the results was performed, revealing a significant [...] To investigate the influence of figures and delay on task completion time, an ANOVA was performed. The results revealed that the figure had a significant effect on task completion time, but the delay value did not. Subsequently, Tukey's HSD post hoc analysis revealed significant differences between two pairs of figures, namely Dog versus Rocket and Dog versus TRex. The mean times for each delay value are presented in Figure 10; the confidence intervals were wide and no significant differences were found between the delay values. In particular, the average completion time was found to be 160 seconds for delays ranging from 300 to 1,200 ms, whereas for the worst condition, an average of 190 seconds was obtained, representing a ∼19% increase.

Audio
During the experimental sessions, the conversations of the participants for each condition (delay and figure) were captured using OBS [28] software, which enabled the recording of both the microphone channel (representing the voice of the local subject) and the headphone channel (representing the voice of the remote user). These audio channels were recorded in an audio file, where the left and right channels represented local and remote audio, respectively.
To ensure uniformity and standardization of the audio signals, the audio files were normalized to -26 dBov according to ITU-T P.56 [39]. The activity time of each user was then determined by calculating the mean squared amplitude of each 200-ms audio segment and comparing it against a threshold value of -16 dBFS. Any audio segment with a level that exceeded the threshold value was classified as active. In Figure 11(a), an example of the audio signal (in blue) can be observed, with a running average of 200 ms (in orange) and a threshold of -16 dBFS (in red).
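The activity classification described above can be sketched as follows. This is an illustrative approximation, not the processing chain of the study: it assumes samples are floats in [-1, 1], that 0 dBFS corresponds to a mean squared amplitude of 1.0, and that non-overlapping 200-ms windows stand in for the running average shown in Figure 11(a). The function names and the test signal are hypothetical.

```python
import math

def segment_levels_dbfs(samples, sample_rate, window_ms=200):
    """Mean-square level in dBFS of consecutive non-overlapping windows."""
    win = sample_rate * window_ms // 1000  # samples per window
    levels = []
    for start in range(0, len(samples) - win + 1, win):
        chunk = samples[start:start + win]
        mean_sq = sum(s * s for s in chunk) / win
        # Digital silence has no finite level; report it as -inf.
        levels.append(10 * math.log10(mean_sq) if mean_sq > 0 else float("-inf"))
    return levels

def active_flags(levels, threshold_db=-16.0):
    """A window is active when its level exceeds the threshold."""
    return [level > threshold_db for level in levels]

# Hypothetical signal at 1 kHz: loud, quiet, then silent, 200 ms each.
sample_rate = 1000
signal = [0.5] * 200 + [0.01] * 200 + [0.0] * 200
levels = segment_levels_dbfs(signal, sample_rate)
flags = active_flags(levels)
print(flags)
```

Summing the active windows and multiplying by the window length gives the activity time per user.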
Once the threshold has been applied, we can see in Figure 11(b) the average time taken to finish the different figures for each role and delay. According to this graph, the average values increase at a delay of 1,500 ms for the instructors and from 1,200 ms for the builders. To check whether this increase in activity is due to longer interventions or to a larger number of interventions, we calculate the percentage of time occupied by each of the roles in the conversation. Figure 11(d) shows the average of the activity times of each construction divided by the total time of that construction. In addition, we calculated the average number of interventions of each role by counting each intervention as the time between two silences of more than 200 ms, following ITU-T P.1305 [37]. The number of interventions follows a pattern similar to that of the activity time per role. Together with the results shown in Figure 11(c), everything seems to indicate that for delays above 900 ms, the builder had to intervene more times than for shorter delays. Similarly, this effect can be seen for instructors at 1,200 ms and higher. However, the distribution of activity time was not altered. This indicates that users had to intervene more times to perform the same task from 900 ms onward.
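The intervention count and the activity share described above can be illustrated with a short sketch that operates on a per-frame activity sequence. It assumes a fixed frame length in milliseconds and treats a silence strictly longer than 200 ms as the boundary between interventions; the function names and the example sequence are hypothetical and not taken from the study.

```python
def count_interventions(active, frame_ms, min_silence_ms=200):
    """Count talk spurts; a silence longer than min_silence_ms ends one."""
    interventions = 0
    in_spurt = False
    silence_ms = 0
    for is_active in active:
        if is_active:
            if not in_spurt:
                interventions += 1  # a new intervention starts here
                in_spurt = True
            silence_ms = 0
        elif in_spurt:
            silence_ms += frame_ms
            if silence_ms > min_silence_ms:
                in_spurt = False  # the pause was long enough to end it
    return interventions

def activity_share(active):
    """Fraction of frames in which the user was speaking."""
    return sum(active) / len(active)

# Hypothetical activity sequence with 100-ms frames:
# a 200-ms pause (kept within one intervention) and a 300-ms pause (splits).
frames = [True, True, False, False, True, False, False, False, True, True]
n_interventions = count_interventions(frames, frame_ms=100)
share = activity_share(frames)
print(n_interventions, share)
```

Comparing the two quantities across delay conditions separates "more interventions" from "longer interventions", which is the distinction drawn in the text.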

DISCUSSION
We have analyzed subjective and objective factors varying the end-to-end delay of a photorealistic Social XR communication system. To do so, we have conducted an experiment on a system validated in terms of user experience, to which we have artificially introduced audiovisual delay in a collaborative Social XR task. Additionally, we have carried out an exhaustive analysis of the results for each subjective factor evaluated as well as of the possible elements that may introduce noise to the measures of the impact of delay on user experience. A discussion of the results follows.
The results of the experiment can be examined from a dual perspective: subjective and objective. Subjective results can be categorized into three distinct dimensions: overall perceived quality, presence, and social factors.
Although we could observe a reduction in the overall perceived quality as the delay increases, it is not too pronounced. The existing literature on conversations with delay [45,46] suggests that users partially attribute the delay to the inoperability of their peers, thus absolving the system of blame. This attribution allows for greater delays in synchronous environments, as observed in the presented experiment. In absolute terms, and taking into account the data obtained for the subjective assessment, we recommend not exceeding 900 ms of end-to-end delay for collaborative videoconference Social XR systems. This value is higher than the threshold established by the recommendations for 2D videoconferences (600 ms), but is in line with more recent 2D videoconference studies [6,44].
From an objective standpoint, the impact of delay on task completion time was analyzed. According to the results, an increase in the mean time required to construct the figures is evident. However, this increase is not statistically significant or as apparent as in the case of subjective results. This is attributed to the users' ability to adapt to the degraded environment, with their subjective perceptions of task performance remaining relatively unaffected by the deleterious effects of delay [12,15]. In the experiment, we conducted further analysis on the influence of delay on users' recorded conversations. Our observations indicate that the instructor's role accounted for most of the conversation time (∼45%), whereas the builder spoke for ∼25% of the time (see Figure 11(b)). The remaining 30% of the time corresponds to silence. This silence is attributed to the time required to assemble the figures. Importantly, this distribution of conversation time was not altered with increasing delay. Although, as mentioned previously, the interactions were prolonged with higher delays, an examination of the number of interventions made by each role in relation to delay reveals that there were more interventions with longer delays while the distribution between the respective roles remained consistent. In other words, there was an increased frequency of interventions, but the pace of the conversation remained unchanged. This fact supports the user adaptation hypothesis.
Nevertheless, according to the factors that compose the perception of delay [2] (prior experience, task complexity, and expectations), we can find a great influence of the type of task [3]. In particular, the block-building task represents the most common form of interactive collaboration in videoconferencing, that is, a conversation between two users who collaborate to perform a task [26]. However, other tasks could have a component that encourages users to interact as fast as possible. In this sense, the maximum acceptable delay value could vary. Therefore, further studies on the influence of delay are needed to set thresholds with respect to the specific use case.
Another aspect that has been addressed during this work is the adaptation of 2D videoconferencing protocols to the Social XR paradigm. In the same way that the first recommendations proposed tasks for telephone calls, there was a posteriori work to adapt these tasks and to propose different ones to evaluate the user experience in the field of videoconferencing. In this work, we have gone a step further and adapted a task for interactive videoconferencing to the Social XR paradigm. In this case, the differentiating element with respect to usual videoconferencing standards is that we consider 3D environments. At the system level, Social XR still faces a number of challenges associated with the 3D environment in which users are immersed. While in 2D videoconferencing environments the remote user occupies the entire screen, in Social XR environments the other user's avatar must be located in a shared space. This adds an extra dimension in that the shared virtual elements must be synchronized. Moreover, the Social XR system should guarantee that the two users can interact with each other and have a twin behavior in the shared space. For the building block task, it was crucial to configure the immersive environment in such a way that users can visually perceive the form of the figures that the remote user had in their hands without the ability to replicate them without asking the partner, while still maintaining sufficient proximity to prevent the task from becoming solely reliant on audio communication. Another important aspect regarding the social task is that the builder role was more sensitive to the delay even though the builder was the one who spoke the least. It is reasonable to believe that in the future we can focus the analysis only on the builder part and use some kind of confederate user that always repeats the instructor role. In this way, we can increase the number of conditions at the same time, even if we lose the information related to the role (but it has already been analyzed in this study).

Fig. 1. Two users sitting in two different physical rooms and meeting in the same Social XR environment during the experience.

Fig. 4. Physical environment of the instructor and the generated viewport of the builder in the Social XR environment.

Fig. 10. Mean score values of the task duration in seconds with 95% confidence intervals.

Table 2. Questionnaire Used in the Experiment

Table 3. Subjective Performance Analysis

Table 4. Presence Analysis

Table 5. Social Factors Analysis

Outliers were identified using |z-score| > 3; two outliers among the conditions were found and removed. After the elimination of these outliers, a normality test was conducted once again, which confirmed the normal distribution of the data, with kurtosis and skew being less than 2 in absolute value.
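The outlier screening described above can be sketched as follows. This is a minimal illustration under stated assumptions: z-scores are computed with the sample standard deviation over the whole set of durations (the text does not say whether they were computed globally or per condition), and the completion times are hypothetical.

```python
import statistics

def remove_outliers(values, z_limit=3.0):
    """Drop values whose absolute z-score exceeds z_limit."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [v for v in values if abs((v - mean) / sd) <= z_limit]

# Hypothetical completion times in seconds, with one extreme session.
durations = [150, 155, 160, 165, 170] * 4 + [600]
cleaned = remove_outliers(durations)
print(len(durations), len(cleaned))
```

After screening, the skew and kurtosis check would be repeated on the cleaned data before running the ANOVA.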