Social Simon Effect in Virtual Reality: Investigating the Impact of Co-actor Avatar's Visual Representation



ABSTRACT
When engaging in joint actions in the real world, each individual achieves unconscious coordination with the other by automatically activating representations of the other's behavior in the motor system. This has been investigated by examining the Social Simon Effect (SSE), a spatial stimulus-response interference induced by the presence and active engagement of a co-actor in a joint single-response go/nogo task (the joint Simon task). On the other hand, collaborative co-actors are not always perceived in the same way in the real world and in virtual reality (VR), as the visual representation of the avatar is thought to affect the perception of others' presence and actions. In order to obtain design guidelines for a virtual environment (VE) that facilitates collaboration between users, this study investigated (1) whether SSE can occur during joint actions of avatars in a VE, and (2) how the visual representation of the co-actor's avatar affects SSE, through the joint Simon task performed by two adjacent avatars. The results showed that SSE was induced when the co-actor's avatar was displayed in full-body or was entirely transparent, but the SSE was weak when only the two hands were visible. These results suggest that participants perceived the full-body avatar as a socially engaged co-actor, incorporating its motions into their own motor planning. The same phenomenon occurred when the whole body of the co-actor was invisible, which could be attributed to the visibility of response actions and their consequent effects facilitating the occurrence of SSE. In contrast, the hand avatar, with its non-human-like appearance and its seeming to function independently of an entire body, inhibited the action co-representation process, possibly because it was perceived as an unintentional artificial agent.

INTRODUCTION
Co-acting with others is a fundamental aspect of human life, which enables us to seamlessly collaborate with others in achieving common goals. Joint actions, such as lifting a heavy object, playing basketball, or performing surgery, are very common and important in our daily lives. Joint action refers to coordinated efforts in which two or more individuals synchronize their actions in both time and space to bring about a change in the environment [60,72]. During these actions, individuals do not merely focus on individual tasks; they share sensorimotor information, include each other's tasks in their error monitoring [8], coordinate their behavior and form interpersonal synergies [55], form a perceptual common ground [40], and shift from an individual to a collective sense of generating the joint action [44]. These findings highlight the importance of understanding the cognitive and neural mechanisms underlying joint action.
Previous studies have indicated that individuals in joint action tend to automatically represent their co-actor's task and integrate this representation into their own action planning, which is called "co-representation". Automatic task co-representation has classically been demonstrated by the Social Simon Effect (SSE, or joint Simon effect), which is a spatial stimulus-response interference induced by the presence and active engagement of a co-actor in a joint single-response go/nogo task (the joint Simon task) (for details of this task, see subsection 2.1) [61]. Co-representation serves various functions in joint actions, such as establishing control frameworks, facilitating action prediction, and directing action monitoring during online coordination episodes [30,32,45]. Coupled with perspective-taking (spontaneously mentalizing the spatial perspective of another and taking it into account in the interactive scene) [3,21], co-representation forms an irreducibly collective mode of cognition called the "we-mode" [15]. The we-mode illustrates that individuals perceive the joint action as being aimed toward what they are going to pursue collectively (as a "we"), and such 'we-ness' is a prominent feature of the psychology of collective behavior [59,71]. Despite the importance of co-representation (SSE) in mirroring another's motor plans and intentions to execute cooperative behavior [35,63], which is crucial for interpersonal coordination, it is not fully understood whether SSE can be observed in virtual reality (VR).
Joint actions and social interactions are increasingly taking place in VR. In particular, social virtual reality (social VR) and the metaverse are expanding as platforms for multiple people to collaborate in virtual environments (VEs) using avatars. The elicitation of SSE has been found to rely on social context; it is not elicited in situations where the co-actor is absent or not actively involved in the interaction [61,70], which is supported by findings from both neurophysiological and electrophysiological studies [34,70]. One of the main differences between VR and the real world is that people's bodies are substituted by their virtual avatars. Thereby, the information obtained from others and the way in which their presence is perceived may differ, which may affect SSE. Because virtual avatars in VEs have limited anthropomorphic properties due to the lack of social cues such as facial expressions and gaze, the sense of social presence (a mental condition in which people perceive virtual social actors as real active social entities, either through sensory or non-sensory ways, and experience a sense of being together with them [2,38]) towards avatars is diminished [1,17,29]. Although several solutions now offer the ability to reproduce facial expressions and gaze [36,46], this is not common in consumer applications. The decline in social presence towards avatars may negatively impact the manifestation of SSE, which makes the manifestation of SSE in avatars not straightforward. Moreover, according to previous studies, the SSE depends on confirmation of the co-actor's presence through visual cues [61] (or auditory cues [29], proprioceptive cues [9]). In VEs, users' avatars are not always represented in full-body; avatars with limited visual information, such as hand-only or fully transparent avatars, are frequently used in social VR and VR games. They also serve as effective tools to investigate the influence of avatars' visual representation on perception [49], feelings
[19], and behavior [33]. While manipulation of an avatar's appearance has been demonstrated to affect social presence [20], its effect on SSE remains undetermined. Taken together, we addressed the following questions: Can SSE be observed during interactions between two avatars in a VE? Does the visual representation of the co-actor's avatar affect SSE? To address these questions, we investigated whether SSE can be observed between two adjacent virtual avatars (participant and co-actor) performing the joint Simon task in a VE, and further explored the impact of the visual representation of the co-actor's avatar on SSE. The findings revealed here are expected to contribute to the development of design guidelines for VEs that promote collaboration among users.
The purposes of this study are as follows:
• This study introduced the joint Simon task into a VE and investigated the occurrence of SSE in two adjacent virtual avatars.
• This study investigated the impact of the visual representation of the co-actor's avatar on SSE. It explored the occurrence of SSE when the co-actor is represented as a full-body avatar, a hand avatar, or a fully transparent avatar.

RELATED WORK

2.1 Joint Simon Task, Social Simon Effect and Task Co-representation
The Simon effect is the difference in reaction time between trials in which stimulus and response are on the same side and trials in which they are on opposite sides, with responses being generally slower when the stimulus and response are on opposite sides. The effect is a kind of stimulus-response compatibility effect [4,72]. Stimulus-response (S-R) compatibility is the degree to which a person's perception of the world is compatible with the required action. S-R compatibility has been described as the "naturalness" of the association between a stimulus and its response, such as a left-oriented stimulus requiring a response from the left side of the body or a right-oriented stimulus requiring a response from the right side of the body. In the standard Simon task [54], a task to test the Simon effect, two types of stimulus (e.g., a red/green dot, or two auditory pitches; these stimuli themselves carry no spatial information) are presented at the left or right side of some reference point (e.g., a central fixation cross displayed on a screen).
A single participant is instructed to execute a left-hand action (e.g., a key press) for one stimulus type and a right-hand action for the other, irrespective of the stimulus's presented location (left or right).
Although the location of the stimulus is task-irrelevant, participants respond more swiftly when the target aligns with the response side than when it is on the opposing side.
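As a concrete illustration, the effect is typically quantified as the mean reaction-time difference between incompatible and compatible trials. The following sketch uses hypothetical trial data and a hypothetical helper, not data or code from the original study:

```python
# Hypothetical helper illustrating how the Simon effect is quantified:
# mean RT on incompatible trials minus mean RT on compatible trials.
def simon_effect(trials):
    """trials: iterable of (compatibility, rt_ms) pairs, where
    compatibility is 'compatible' or 'incompatible'."""
    comp = [rt for c, rt in trials if c == "compatible"]
    incomp = [rt for c, rt in trials if c == "incompatible"]
    return sum(incomp) / len(incomp) - sum(comp) / len(comp)

# Illustrative trial data (ms): responses are slower on incompatible trials.
trials = [("compatible", 350), ("compatible", 360),
          ("incompatible", 380), ("incompatible", 390)]
print(simon_effect(trials))  # 30.0
```

A positive value indicates the compatibility advantage that defines the Simon effect.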
It has been suggested that perception and action are processed in the same representational medium, using the same kinds of codes [22,23,52]. Therefore, the responses in the standard Simon task are represented by a composite of feature codes (codes that represent the features of all response-related perceivable effects), coding for the motoric patterns generating the response as well as its consequences. The task-irrelevant stimulus location is not entirely neglected; instead, it is automatically included in the stimulus representation's feature codes along with the task-relevant stimulus feature (color or sound type) [12,24]. Likewise, feature codes of the actions' location also exist in action representations (the tactile feeling of the left/right-hand key press). Accordingly, actions whose features overlap with the stimuli get primed, leading to faster responses. Conversely, a mismatch between features tends to slow responses down [25].
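This feature-overlap account can be caricatured in a few lines of code. The sketch below is a toy illustration only; the baseline and priming values are invented for demonstration, not parameters from the literature:

```python
# Toy model of feature-code overlap: when the task-irrelevant stimulus
# location matches the response location, the response is primed
# (faster); when it mismatches, the response is slowed.
BASE_RT = 380      # hypothetical baseline RT, ms
PRIMING_GAIN = 25  # hypothetical facilitation/interference, ms

def predicted_rt(stimulus_side, response_side):
    if stimulus_side == response_side:
        return BASE_RT - PRIMING_GAIN  # feature overlap primes the action
    return BASE_RT + PRIMING_GAIN      # feature mismatch slows it down

print(predicted_rt("left", "left"))   # 355
print(predicted_rt("left", "right"))  # 405
```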
In the go/nogo version of the standard Simon task, where participants are required to respond only to a specific stimulus type (e.g., respond to the red stimuli with the right hand and ignore the green stimuli), the Simon effect disappears. However, whenever participants perform the go/nogo standard Simon task alongside a co-actor (the joint Simon task), who sits next to the participant and is responsible for responding to the other type of stimulus (e.g., a co-actor who sits on the left side is required to respond to the green stimuli and ignore the red stimuli), the Simon effect reappears. This phenomenon is called the social Simon effect (SSE) [61]. An example of the visual version of the joint Simon task is shown in Figure 2.
To account for the SSE, Sebanz et al. presented the task co-representation account [61], suggesting the effect arises when participants represent their co-actor's task and integrate this representation into their own action planning. The Simon effect (SSE) reappears in the joint Simon task because it is as though the participants are performing the whole task (as they do in the standard Simon task), thereby reactivating the response codes [53,76]. The mismatch in spatial coding between stimuli and actions results in interference when the individual's response is required but the position of the stimulus primes the response of their co-actor [14].

Social Features of Social Simon Effect
A number of studies have indicated that the occurrence of SSE is affected by the co-actor's social features. Kuhbandner et al. revealed that individuals showed no SSE when they were in a bad mood induced by film clips [5]. Hommel et al. showed that individuals preferentially represent the actions of those they perceive in a socially positive way through their friendly and cooperative actions, but this is not the case for people they perceive in a socially negative way through their intimidating and competitive actions [26].

Figure 2: The red participant is responsible for responding to the red dot stimuli by pressing the button in front of them as quickly as possible, regardless of the location of the stimuli. If there is a co-actor sitting on the left side of the red participant (the Joint condition), the reaction time for the red participant is significantly faster when S-R is compatible (both the red participant and the red dot stimulus are on the right side) than when it is incompatible. Conversely, if the green co-actor is absent and no one is designated to respond to the green dot stimuli (the Solo condition), there is no significant difference in reaction time. In the auditory version of the joint Simon task, the red/green dots are substituted with two different types of auditory stimuli.
Research investigating the SSE between either two in-group members with the same skin color or two out-group members from different racial groups revealed that an out-group co-actor eliminates the SSE [43]. The occurrence of SSE between two in-group members and its absence between two out-group members was observed even when a group of strangers was artificially formed based on arbitrary categories [41]. Additionally, the SSE has shown sensitivity to aspects such as the degree of social self-construal [6], religious belief [7], and the competitiveness of the context [56]. Collectively, these findings imply that the SSE cannot emerge when the co-actor is perceived as socially irrelevant.
Research focusing on SSE in collaboration with unintentional artificial co-actors, such as robots or computers, reinforces the notion that an individual's ability to co-represent actions during a joint task is critically influenced by the co-actor's social identity [67,75]. Sahaï et al. reported the absence of SSE when a computer program substituted for a human partner in the joint Simon task [57]. Conversely, when the co-actor was a humanoid robot, SSE emerged when the robot was described as functioning in a biologically inspired way, but it disappeared when the robot was introduced as purely deterministic [66]. The effect of the co-actor's human likeness and animacy was also found in a study by Liepelt et al., in which SSE was more pronounced when participants believed they were interacting with a human hand as opposed to a wooden one [39]. Other studies extended these findings by showing the presence of SSE when individuals formed a vivid image of a non-biological co-actor with a wooden hand by watching a video fragment of Pinocchio prior to the task [42]. The absence of SSE in scenarios where the co-actor is non-biological or machine-like might arise for several reasons: a lack of attentional shift towards the machine-generated unintentional actions, leaving the individual's sensorimotor network inactive [58,66]; a failure to activate, for the machine's actions, the same mechanism that codes human motor behavior, as these actions are not within humans' experiential motor repertoire [42,68]; or an inability to perceive the physical causality between the machine's actions and their effects [67].
Dolk et al. found that SSE occurred when the co-actor was replaced with a Japanese waving cat or a ticking metronome [10]. These observations gave rise to an alternative explanation for SSE, termed the referential coding account. The waving cat served as a salient event that provides a reference point for referential coding of the response. Thus, the response codes are likely to refer to the response's horizontal location relative to the reference point, leading to compatibility or incompatibility in the spatial codes of S-R [10]. Furthermore, introducing salient events introduces a discrimination problem: the more salient the event, the more necessary it is for individuals to distinguish the cognitive representation of their own action from the representation of the salient event, and thus the larger the SSE. Any event that attracts the individual's attention and is located within the individual's peripersonal space (reaching distance) can be sufficient to induce the tendency to code one's action spatially in reference to this attention-attracting event [9,18]. Nevertheless, the referential coding account cannot fully explain the absence of SSE when individuals perform the joint Simon task with an unsocial or unintentional artificial co-actor [58]. It is undeniable that SSE is sensitive to the socialness of a situation and reflects the degree of interpersonal integration [10].

EXPERIMENT
We aimed to test whether the participants perceived the avatar in a VE as an active social agent, enabling the observation of SSE. We further investigated the influence of the visual representation of the co-actor's avatar on SSE. In the experiment, two avatars sat side by side (left: co-actor, right: participant) in a VE to perform the joint Simon task together; the perceived social presence of the co-actor's avatar was measured through a questionnaire. SSE was measured through the difference in reaction times between compatible and incompatible S-R.

Participants
Twenty-four right-handed healthy participants were recruited (12 males and 12 females, age: 25.708 ± 5.752) through social media, and they were paid based on the minimum wage in the authors' country as compensation for their 1 h 15 min participation. The number of participants recruited was based on related studies [70,74]. This sample size provides a power of 0.80 to detect a within-factors comparison effect with effect size f = 0.25 in a repeated measures design, according to G*Power 3.1 [13]. None of them had previous knowledge about the purpose and the hypothesis of this experiment, and at recruitment they were informed of an alternative aim of the experiment (that it was to test their reaction speed in a VE). Regarding familiarity with VR, one participant experienced VR 2-3 times a month, eight participants once a month, and 16 participants had never experienced VR before. This experiment was approved by the local Ethical Committee for Human-Subject Research.

Joint Simon Task in a Virtual Environment
During the joint Simon task in a VE, the participant's avatar on the right side sat adjacent to the co-actor's avatar on the left side. On the table in front of each avatar, 35 cm in front of and 10 cm to the right of the avatar, was a button used for stimulus reaction. A tablet was placed in the center of the table, 60 cm in front of the avatars. On the left and right sides of the tablet, two speakers were placed with a 100 cm distance between them (see Figure 3). No difference in participants' task performance and task co-representation between the visual and auditory versions of the joint Simon task was indicated by previous studies [10], so we adopted auditory stimuli to leave more attentional capacity for participants to confirm the appearance of their co-actor's avatar. Each trial of the joint Simon task began with a white fixation cross that appeared at the center of the tablet's screen and a warning sound (white noise, 300 ms in duration, without any spatial information). After 700 ms, one of two types of auditory stimulus, Sound A (300 Hz pure tone) or Sound B (600 Hz pure tone), with a duration of 300 ms, was presented from the speaker on either the left or the right side. Participants and their co-actor had to respond exclusively to the assigned auditory stimulus by pressing the button in front of them with the index finger of their right hand, no matter which side the auditory stimulus was presented from. When anyone responded to the stimulus within 1700 ms (the maximum time for response), or once 1700 ms had passed, an empty square was displayed around the fixation cross, which marked the end of the trial. The subsequent trial started after 1000 ms. The timeline of each trial is shown in Figure 4.
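The per-trial timing described above can be summarized as a simple sequence. This is an illustrative sketch only; the ms durations are our reading of the extracted values, and the 300 ms stimulus is treated as playing at the start of the 1700 ms response window rather than as a separate phase:

```python
# Sketch of the trial timeline (durations in ms, taken from the text).
# The auditory stimulus (300 ms) plays at the start of the 1700 ms
# response window, so it is not counted separately here.
TRIAL_PHASES = [
    ("fixation_and_warning", 700),   # fixation cross + 300 ms warning sound
    ("response_window", 1700),       # stimulus onset; max time to respond
    ("inter_trial_interval", 1000),  # empty square shown, then next trial
]

def max_trial_duration_ms():
    return sum(duration for _, duration in TRIAL_PHASES)

print(max_trial_duration_ms())  # 3400
```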
The experimental system consisted of the VR system and the Stimulus-Reaction system; both were developed using the game engine Unity (version 2021.3.11f1) and ran on a Windows laptop (GALLERIA ZL7C-R38H). The VR system provided participants with an immersive VE, displaying the avatars of the participant and their co-actor. Participants' head and hands were tracked by the Meta Quest 2 and its two hand controllers, and their position and rotation were reflected in their avatar. The lower body of the participant's avatar was in a sitting posture, and the gestures of the elbows and shoulders in the avatar's upper body were calculated by inverse kinematics. The participants embodied their avatar from the first-person perspective. The male avatar (Male_Adult_08) and the female avatar (Female_Adult_05) from the Microsoft Rocketbox Avatar Library were used [16]. These avatars have a white European appearance and are dressed in plain clothing.
The Stimulus-Reaction system, on the other hand, presented the participants with 3D Unity AudioSources (Sound A and Sound B, with spatial information) and a 2D Unity AudioSource (the warning sound, without spatial information). The system also received inputs from button presses and recorded reaction times. The frame rate of the Stimulus-Reaction system was around 650-750 frames per second; thus, the margin of error in reaction times was approximately 1 ms.

Questionnaire
Social presence, described as perceiving others as social entities and developing the sense of being together with others, was measured by a questionnaire from [2]. We considered two facets of social presence: co-presence (the degree to which individuals feel as if they are in the same space as the other) and perceived attentional engagement (the degree to which individuals allocate focal attention to the other). From the original questionnaire, we adopted only the "Perception of self" part (the extent to which the participant feels they are with their partner); the "Perception of the other" part (the degree to which the participant believes their partner feels in company with them) was excluded. These aspects were chosen based on their potential influence on SSE, as suggested by previous studies [9,10,18,58]. Questions from the original questionnaire were adapted to the current experimental situation. The questions are listed in Table 1. All of the questions were presented to the participants in random order and were rated on a 7-point Likert scale (1: strongly disagree; 7: strongly agree). The average score of Q1, Q2, Q3 (reversed), and Q4 (reversed) indicated the level of co-presence, and the average score of Q5, Q6 (reversed), and Q7 (reversed) indicated the level of perceived attentional engagement.
Table 1: All questions to measure co-presence (Q1-Q4) and perceived attentional engagement (Q5-Q7). Scores of questions marked with * were reversed.
Q1 I often felt as if the people sitting next to me and I were in the same virtual environment.
Q2 I was often aware of the people sitting next to me in the virtual environment.
Q3* I hardly noticed the people sitting next to me in the virtual environment.
Q4* I often felt as if we were in different places rather than together in the same virtual environment.
Q5 I paid close attention to the people sitting next to me.
Q6* I was easily distracted from the people sitting next to me when other things were going on.
Q7* I tended to ignore the people sitting next to me.
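The subscale scoring, with starred items reverse-coded as 8 minus the raw score on the 7-point scale, can be sketched as follows (an illustrative helper with made-up answers, not code from the study):

```python
# Subscale scoring sketch: reversed items (Q3, Q4, Q6, Q7) are recoded
# as 8 - score on the 7-point Likert scale, then items are averaged.
def reverse(score):
    return 8 - score

def social_presence_scores(answers):
    """answers: dict mapping 'Q1'..'Q7' to integers 1..7.
    Returns (co_presence, perceived_attentional_engagement)."""
    co_presence = (answers["Q1"] + answers["Q2"]
                   + reverse(answers["Q3"]) + reverse(answers["Q4"])) / 4
    engagement = (answers["Q5"]
                  + reverse(answers["Q6"]) + reverse(answers["Q7"])) / 3
    return co_presence, engagement

scores = social_presence_scores({"Q1": 6, "Q2": 5, "Q3": 2, "Q4": 3,
                                 "Q5": 4, "Q6": 3, "Q7": 2})
print(scores)  # (5.5, 5.0)
```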

Experimental Conditions
In accordance with whether the participants performed the joint Simon task alone or with their co-actor, as well as the representation of their co-actor's avatar, there were four conditions: Solo, Trans_Avatar, Hand_Avatar, and Full_Avatar. As the Simon effect arises trial by trial, the visual representation of the co-actor's avatar in conditions previously experienced by the participant has little impact on it in the current condition [3,25]. Therefore, we used a within-subject design; that is, the participants experienced all four conditions. The order of these four conditions was counterbalanced. In all conditions, the participants used a full-body avatar with all of the body parts displayed and visible. The gender of the participants' avatar matched their biological gender. While group membership has been shown to influence SSE [41,43], there is no conclusive evidence regarding the impact of the co-actor's gender on SSE. Therefore, to better simulate real collaborative scenarios, the gender of the co-actor's avatar was counterbalanced across participants instead of being matched with that of the participants. This means there was an equal probability (50%) that participants would be paired with a co-actor of the same or different gender. Details of each condition are as follows:
• Solo: the co-actor did not exist and the participant performed the joint Simon task alone.
• Full_Avatar: the co-actor used a full-body avatar of which all body parts were displayed and visible.
• Hand_Avatar: the co-actor used a hand avatar of which only the left and right hands were displayed and visible.
• Trans_Avatar: the co-actor used a transparent avatar of which none of the body parts was displayed or visible.
We introduced the Hand_Avatar condition, where only the hands of the co-actor's avatar were visible, as an intermediary between the Full_Avatar and Trans_Avatar conditions because hand avatars are commonly used in many VR games. Except for the Solo condition, the
participants were informed that their co-actor was immersed in the same VE remotely through the network. Throughout the experiment, they neither saw this co-actor in the physical room nor were they aware of the co-actor's actual physical presence. For reproducibility, the motions of the co-actor's avatar were pre-recorded, with the button press in the joint Simon task being software-controlled and the reaction times ranging from 300 to 450 ms based on previous studies [69]. The pre-recorded actions of the co-actor's avatar were designed to mimic real reactions as closely as possible. Initially, the co-actor's avatar displayed curiosity towards the participant's avatar, observing it, and then shifted its focus to the task at hand. None of the participants noticed this manipulation during the experiment. The hand avatars were generated by making all body parts except the hands of the male/female full-body avatars transparent. The transparent avatar, being fully transparent, contained no demographic information and was identical for all participants. For the transparent avatar, neither the body parts nor the movements of the co-actor's avatar could be visually confirmed. However, the participants could discern the effects of these movements: the virtual cube was moved during the cube-carrying task (see subsection 3.6), and the button was pressed (the button cap went down and up), resulting in the display of the empty square on the tablet during the joint Simon task.
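With four conditions there are exactly 4! = 24 possible orders, which happens to match the 24 participants. The paper does not state the exact counterbalancing scheme used, but one full-permutation assignment can be sketched as:

```python
# Illustrative counterbalancing sketch: four conditions yield 24
# permutations, so each of 24 participants can get a unique order.
# (The actual scheme used in the study is not specified.)
from itertools import permutations

CONDITIONS = ("Solo", "Trans_Avatar", "Hand_Avatar", "Full_Avatar")

def assign_orders(n_participants=24):
    orders = list(permutations(CONDITIONS))
    assert n_participants <= len(orders)
    # participant IDs start at 1; each gets a distinct condition order
    return {pid + 1: orders[pid] for pid in range(n_participants)}

orders = assign_orders()
print(len(orders))  # 24
```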

Hypotheses
The elicitation of SSE is dependent on social context; it is absent when the co-actor is non-existent or inactive [61,70]. Thus, we hypothesized that SSE in a VE may relate to the social presence perceived towards the co-actor's virtual avatar, which describes the degree to which individuals perceive the virtual avatar as a social entity and feel a sense of being together with it. Despite the lack of social cues of virtual avatars leading to a reduction in their perceived social presence [1,17,29], we hypothesized that SSE can still be observed between two adjacent avatars because the presence and active engagement of the co-actor can be visually confirmed. However, when all of the body parts, or all except the two hands, of the co-actor's avatar become transparent, thereby limiting the available visual information about the avatar, its social presence may further decrease [77], which may inhibit the elicitation of SSE.
The hypotheses are as follows:
• H1: SSE can be observed when the participants perform the joint Simon task with the co-actor's full-body avatar sitting adjacent to them in the Full_Avatar condition.
• H2: SSE cannot be observed when only the hands of the co-actor's avatar are visible in the Hand_Avatar condition or when the co-actor's avatar is invisible in the Trans_Avatar condition.
• H3: The absence of SSE is due to the decline in social presence towards the co-actor's avatar.

Procedure
Participants first signed a consent form and filled in a demographic questionnaire. The participants were then briefed on the experimental flow and the usage of the controllers. They wore the head-mounted display Meta Quest 2 to immerse themselves in the VE, and then looked at a virtual mirror in the VE while freely moving their bodies for 60 s to induce a sense of embodiment (a sense of being inside, having, and controlling a body in the VE) towards their avatar [64,65]. Then, a manipulation check of the auditory stimuli was performed, during which the warning sound, Sound A, and Sound B were played in random order and the participants had to rate the direction of the sound they heard on a 7-point Likert scale (1: the sound was from the left; 7: the sound was from the right). The manipulation check was to ensure that the spatial information of the auditory stimuli was presented and perceived properly. Afterwards, as a tutorial, they performed the joint Simon task with 16 trials alone (the experimental setup in the tutorial was the same as in the Solo condition). The tutorial was conducted only once, at the beginning of the experiment. Participants in the joint Simon task might concentrate more on completing the task than on the visual representation of their co-actor's avatar, a factor we hypothesized could impact the occurrence of SSE. Therefore, in each condition, a cube-carrying task was conducted prior to the start of the joint Simon task. Under the conditions where the co-actor exists (Full_Avatar, Hand_Avatar, Trans_Avatar), the purpose of this task was to allow participants to familiarize themselves with the visual representation of their co-actor's avatar. It also reinforced the perception that their co-actor was actively participating and engaging. During the cube-carrying task, the co-actor carried a virtual cube generated on the left side of the table to the middle. Then, the participants had to carry the virtual cube handed over by their co-actor to the right side as fast as
possible.This cube-carrying process was repeated for 60.Conversely, under the Solo condition, participants, to embed the belief that the co-actor was not present, performed the same task while they carried the virtual cube generated from the middle of the table to the right side on their own for 60.The function of carrying a cube was achieved through the right-hand controller; when the controller's button was pressed, the cube that the avatar's right hand was touching would attach to it, and when the button was pressed again, the cube would detach from it.
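The attach/detach toggle described above can be sketched as follows; this is an illustrative reconstruction, and the class and function names (`Cube`, `on_button_press`) are not taken from the study's implementation.

```python
class Cube:
    """Minimal stand-in for the virtual cube; tracks only its attachment state."""
    def __init__(self):
        self.attached = False

def on_button_press(cube, hand_touching_cube):
    """Toggle attachment on each button press: attach if the hand is touching a
    free cube, detach if the cube is currently attached. Returns the new state."""
    if not cube.attached and hand_touching_cube:
        cube.attached = True   # cube now follows the avatar's right hand
    elif cube.attached:
        cube.attached = False  # cube is released at its current position
    return cube.attached
```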
Following the cube-carrying task, the joint Simon task began. Participants placed the controllers on the table and positioned the fingertip of their right index finger on the button in front of them, and their avatar adopted the same posture (see Figure 3). The two types of auditory stimuli (Sound A/Sound B) were designated either as the participant's or the co-actor's target stimulus. For instance, participants might have Sound A while their co-actor had Sound B, or vice versa. Then, the participant performed the joint Simon task, which was divided into two sections. Each section consisted of 128 trials (spatial compatibility of the stimulus-response relationship (compatible/incompatible) × sound type (Sound A/Sound B) × 32 repetitions; for details of each trial and the configuration of the VE, see subsection 3.2), and there was a 2-min short break between the two sections. After the participant finished the joint Simon task in each condition, they filled in the questionnaire (see subsection 3.3) and then took a 5-min long break. The target stimuli of the participant and the co-actor were exchanged between conditions (e.g., from participant: Sound A, co-actor: Sound B to participant: Sound B, co-actor: Sound A), and the initial assignment of the target stimulus in the first condition was counterbalanced. All trials in which reaction times were longer or shorter than 3 standard deviations from that participant's mean reaction time were excluded from further analysis. As the normality of the data was confirmed by the Shapiro-Wilk test and a Q-Q plot, we first performed a two-way repeated measures ANOVA on the reaction times. A significant main effect of S-R compatibility (F(1, 23) = 16.282, p = 0.000516, η² = 0.003) and a significant main effect of experimental condition (F(3, 69) = 3.261, p = 0.027, η² = 0.014) were found. There was no statistically significant interaction between S-R compatibility and experimental condition (F(2.11, 48.62) = 0.462, p = 0.643).
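The per-participant outlier exclusion described above can be sketched as follows; this is a reconstruction for illustration, not the authors' analysis script.

```python
from statistics import mean, stdev

def exclude_outliers(reaction_times, n_sd=3.0):
    """Drop trials whose reaction time deviates more than n_sd standard
    deviations from this participant's mean reaction time."""
    m, s = mean(reaction_times), stdev(reaction_times)
    return [rt for rt in reaction_times if abs(rt - m) <= n_sd * s]
```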

Social Simon Effect
A two-way ANOVA did not reveal a significant interaction. One possibility is that the impact of the visual representation of the co-actor's avatar on SSE, as measured by the compatibility effect (the difference in reaction times between compatible and incompatible S-R), was not substantial enough among the Full_Avatar, Hand_Avatar, and Trans_Avatar conditions to yield a significant interaction. Alternatively, the lack of a significant interaction could be due to the sample size required to detect interaction effects being larger than that needed for main effects [73]. Considering that a large body of previous research indicates SSE's presence when the co-actor exists and its absence when the co-actor does not [58,61,62,70], the hypothesis that SSE differs between the Solo condition and the Full_Avatar condition is plausible and clear. Moreover, significant interaction effects can occur within a model even if the omnibus test is not significant, as shown in [37]. Therefore, we performed a paired t-test to examine SSE in each condition. Results are shown in Figure 6. We found a significant difference in reaction times between compatible and incompatible S-R in the Full_Avatar and Trans_Avatar conditions, but not in the Solo and Hand_Avatar conditions. Therefore, H1 was supported and H2 was partly supported.
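The per-condition test above amounts to a paired t-test on each participant's mean reaction times for compatible versus incompatible trials. A minimal sketch of the t statistic follows; the actual analysis would use a statistics package and evaluate the p-value against a t distribution with n - 1 degrees of freedom, and the data below are illustrative, not the study's.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(compatible_rts, incompatible_rts):
    """t statistic for the within-participant difference
    (incompatible - compatible); a positive value indicates a
    compatibility effect, i.e., slower incompatible responses."""
    diffs = [i - c for c, i in zip(compatible_rts, incompatible_rts)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```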
In order to examine whether the size of SSE, as measured by the size of the compatibility effect (the prolonged reaction times in incompatible S-R compared to compatible S-R), differed across conditions, we performed a one-way repeated measures ANOVA on the size of the compatibility effect among the four conditions; no significant difference (F(2.11, 48.62) = 0.462, p = 0.643) was revealed. Results are shown in Figure 7 (a); pairwise comparisons showed no significant difference in any condition pair. Results of the co-presence are shown in Figure 7 (b), and results of the attentional engagement are shown in Figure 7 (c). Results of linear regression revealed that neither the co-presence score nor the attentional engagement score could predict the size of SSE, so H3 was not supported.
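The regression check above, testing whether a questionnaire score predicts the size of SSE, reduces to a simple least-squares fit. A minimal sketch, with illustrative data rather than the study's:

```python
def linreg(x, y):
    """Ordinary least-squares fit y ≈ slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx
```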

DISCUSSION
The absence of SSE in the Solo condition suggested that participants were unable to represent the co-actor's actions and activate the feature codes for the locations of their responses, which are essential for SSE, due to the non-existence of a co-actor [53,76]. More importantly, the occurrence of SSE was observed in the Full_Avatar condition. These results revealed that the full-body avatar, which mirrors the physical movements of the co-actor and is responsible for the complementary portion of the joint Simon task, can be perceived as a socially relevant agent. Participants represented the actions of the full-body avatar in their own motor plan, leading to the formation of the task co-representation.
SSE was also significant in the Trans_Avatar condition. The absence of visible avatar body parts, due to their turning transparent in the Trans_Avatar condition, did not impede SSE. In this condition, the avatar's body movements (such as pressing and releasing the button in the joint Simon task) were invisible, while a portion of the response actions (the button cap going up and down) and their consequent response effects (the fixation cross appearing on the tablet's screen) remained perceivable. The observation of another agent's actions can modify actions programmed by the observer. According to the ideomotor theory, the context provided by another acting agent activates response codes in the observer that are functionally equivalent to an actually planned action [31,61]. The representation of observed actions can either facilitate (when the observed actions and internally driven actions are compatible) or interfere with (when these actions are incompatible) the observer's internally driven actions, providing a possible explanation for the emergence of SSE [28,74]. In addition, the visibility of the response actions in the present experimental settings might enable SSE through its role in attributing the sense of agency (the physical causality between an initiator of an action effect and the effect). Previous studies indicated the absence of SSE when participants believed that their co-actor responded via a brain-computer interface (BCI), i.e., without overt physical response actions or attributed agency [67], and that both SSE and the sense of agency were absent during a human-machine joint Simon task [58]. As the translation of the high-level representation of the goal and causality underpinning an action relies on its low-level visual information [27,50], the visibility of response actions is essential for attributing agency and for facilitating the occurrence of SSE. Furthermore, the mere observation of response effects alone, without any response actions, is insufficient to trigger SSE. Previous studies revealed that no SSE was observed when participants only witnessed the response effects of a co-actor who was in another room [62], even if the room's location was disclosed to the participants [74]. Taken together, SSE might have been elicited in the Trans_Avatar condition because both the response actions and their consequent response effects of the co-actor's avatar were observable. Furthermore, we believe that the fact that the participants were performing the joint Simon task in a VE may have had an impact on SSE. Tsai et al. observed SSE when participants believed they were cooperating with a biological agent in another room, even in the absence of any visual-auditory feedback from that agent [69], which contradicts the findings in [62,74]. This could be attributed to the co-actor in their experiment being familiar to the participant, and to the fact that they were allowed to communicate through an intercom system prior to the experiment, which reinforced the belief in the other's presence. In VR, where the presence of a transparent avatar is technically achievable, the participants' belief in the co-actor's presence in the VE may be promoted in a similar way that contributes to the occurrence of SSE.
SSE was not significant in the Hand_Avatar condition, but the p-value of 0.062, close to the threshold of statistical significance, suggests a weak SSE in this condition. One possible explanation for this attenuated SSE is that the participants perceived the avatar with only hands as a non-human artificial agent rather than an intentional humanoid agent. The non-human-like appearance of the co-actor's hand avatar made the participants perceive that it acted in a machine-like manner, thus impeding the action representation process [66]. Another possible explanation is that the participants in this condition required little effort to make a self-other discrimination during the joint Simon task. Dolk et al. proposed that a successful self-other discrimination is required in joint actions, and that the more similar the to-be-discriminated events are, the more difficult the discrimination becomes. This increased difficulty enhances the importance of using relative spatial coding of actions (the participant responding by pressing the button on the "right" and the co-actor responding by pressing the button on the "left") for self-other discrimination, which is responsible for the occurrence of SSE [10]. In contrast, when there is a distinct difference in appearance between the participant and their co-actor, spatial location does not serve as the primary distinguishing feature, thus impeding the spatial coding of the participant's action and, consequently, the observation of SSE [10]. One could argue that, in the Trans_Avatar condition, the co-actor's avatar differs even more in appearance from the participant's full-body avatar; consequently, the avatar's appearance, rather than the spatial coding of actions, should become more predominant for self-other discrimination, and SSE should thus also be absent. This could be explained by the participant's full-body avatar serving as a reference for imagining the appearance of the co-actor's avatar when the co-actor's avatar is fully invisible. This makes it harder to establish self-other discrimination according to appearance, because the appearance of the participant's avatar and that of the co-actor's avatar (in imagination) are consistent, so the spatial location of actions becomes more salient and SSE can thus be elicited in the Trans_Avatar condition. In contrast, the observable hands leave little room for the participants to imagine and complete the co-actor's body in the Hand_Avatar condition. Notably, while the analysis of SSE's size (Figure 7 (a)) revealed a considerable effect size (14.659 ms), the significance of SSE was undermined by a large standard deviation in reaction times. This reveals a large individual difference in the occurrence of SSE, which may need further research.
The analysis showed that co-presence significantly declined when all body parts, or all body parts except the hands, became transparent (Figure 7 (b)). However, attentional engagement remained at a similar level (Figure 7 (c)). These findings, along with the presence of SSE in both the Full_Avatar and the Trans_Avatar conditions and a weak SSE in the Hand_Avatar condition, suggest that neither co-presence nor attentional engagement is the decisive factor in the elicitation of SSE. This is evident from the Trans_Avatar condition, where a decrease in co-presence did not hinder SSE, and from the Hand_Avatar condition, where SSE was weak despite a stable level of attentional engagement. However, the roles of co-presence and attentional engagement in influencing SSE cannot be completely denied; they may just not be the primary factors affecting SSE in the current experimental settings, as shown by previous studies [18,25,58]. We propose that further studies of SSE in VR may take other factors into account, such as mimicry and intentionality.
Taken together, SSE was observed when two adjacent full-body avatars performed the joint Simon task together. This study investigated one of the underlying mechanisms of joint actions, demonstrating that individuals can represent their co-actor's task and establish a task co-representation even when both they and the co-actor are represented as avatars in VR. The results of our study encourage interpersonal coordination and social interaction in VR, as these activities may not significantly differ from those in the real world. Furthermore, the findings of this study suggest that the visual representation of another's avatar is one of the factors that can influence multi-person cooperation in VR. The weak SSE of the hand avatars underscores a potential perception gap: the hand avatars are not perceived as sufficiently social or active to engage in coordination, hindering their actions from being incorporated into others' action planning. This study may offer a reference for avatar design, especially in collaboration scenarios such as social VR and cooperative VR games, suggesting that the usage of hand avatars may need to be reconsidered. Reaction times to compatible stimuli were observed to be faster when a co-actor was present, represented either by a full-body or a transparent avatar. Consequently, this study could have practical applications in enhancing humans' ability in tasks that demand rapid reactions, such as emergency response, industrial control, or rhythm games.

LIMITATIONS & FUTURE WORK
In addition to the findings from our research, we believe that there are other aspects of SSE in avatars that need further investigation. First, the avatars utilized in our study featured standard appearances and plain clothing, and their visual representation was changed only in the short term. However, social VR and VR applications today offer users extensive customization options for their avatars, including the ability to freely edit or create a self-avatar through 3D human face scanning. Individuals tend to develop a stronger sense of acceptance towards a personalized avatar or an avatar that has been used for a long time [11,47]. Furthermore, an avatar's characteristics can influence users' perceptions and behaviors, an effect known as the "Proteus effect" [48,51]. Therefore, beyond the visibility of body parts, examining how customization, characteristics, and long-term avatar use impact SSE is required. Second, individual differences may influence SSE. In particular, participants more accustomed to VR may perceive the hand avatar as human-controlled and acknowledge the co-actor's presence when a transparent avatar is used, drawing on their VR experience, whereas those less familiar with VR may not. Third, the current study followed the classical design of the button-pressing joint Simon task in the real world. Given the increasing prevalence of hand tracking and gesture control in VR, future research may consider using pointing or pinching as the input technique.

CONCLUSION
In this study, we explored the occurrence of SSE, which has been proposed to reflect the task co-representation that plays an important role in joint actions, between two adjacent avatars through the joint Simon task. We further investigated the influence of the visual representation of the co-actor's avatar on SSE by varying the avatar appearance of the co-actor: from a full-body avatar to a hand avatar and a transparent avatar. We observed SSE when the co-actor was displayed as the full-body avatar. This suggests that the full-body avatar was perceived as a socially interactive agent, enabling individuals to integrate its movements into their motion planning. SSE persisted when the co-actor was displayed as the transparent avatar, which might be related to the visibility of the response actions and their consequent effects. Conversely, SSE was weak when the co-actor used the hand avatar, possibly because the hand avatar was perceived as an unintentional artificial agent and the participants required less effort to make a self-other discrimination. The weak SSE for the hand avatar may offer insights for avatar design in collaboration scenarios, suggesting that the usage of hand avatars may need to be reconsidered.

Figure 1 :
Figure 1: (A) Experimental settings in virtual reality (VR). Two adjacent virtual avatars (graphical representations of the users' bodies in VR; left: co-actor, right: participant) perform a joint Simon task in a virtual environment, which is used to measure the occurrence of the social Simon effect. (B) The participant's perspective in the joint Simon task. During this task, each person has to respond exclusively to their assigned auditory stimulus by pressing the button in front of them, while ignoring the presented location of the stimulus (from the left or right speaker). (C) Experimental settings in the real world. Participants are not aware of the co-actor's physical presence; they can only perceive the co-actor's presence through their avatar in the virtual environment.

Figure 2 :
Figure 2: An example of the visual version of the joint Simon task from the perspective of the red participant. The red participant is responsible for responding to the red dot stimuli by pressing the button in front of them as quickly as possible, regardless of the location of the stimuli. If there is a co-actor sitting on the left side of the red participant (the Joint condition), the reaction time of the red participant is significantly faster when S-R is compatible (both the red participant and the red dot stimuli are on the right side) than when it is incompatible. Conversely, if the green co-actor is absent and no one is designated to respond to the green dot stimuli (the Solo condition), there is no significant difference in the reaction time. In the auditory version of the joint Simon task, the red/green dots are substituted with two different types of auditory stimuli.

Figure 3 :
Figure 3: Experimental setup of the joint Simon task in the VE.

Figure 4 :
Figure 4: Timeline in each trial in the joint Simon task.

Figure 5 :
Figure 5: From left to right: the (male) avatars used by the co-actor in the Full_Avatar condition, the Hand_Avatar condition, and the Trans_Avatar condition.

Figure 6 :
Figure 6: The mean reaction times as a function of the visual representation of the co-actor's avatar and spatial S-R compatibility. Significant differences between compatible S-R and incompatible S-R, analyzed by paired t-tests, represent the occurrence of SSE. SSE is observed in the Full_Avatar condition and the Trans_Avatar condition, but not in the Solo condition and the Hand_Avatar condition. Error bars represent standard errors of the mean differences.

Figure 7 :
Figure 7: (a) The size of SSE, measured by the prolonged reaction times in incompatible S-R compared to compatible S-R. A higher score indicates a larger SSE. (b) The co-presence score. A higher score indicates that the participants feel more strongly that they and their co-actor are in the same VE together. (c) The attentional engagement score. A higher score indicates that the participants allocate more attention to their co-actor. Error bars in these three plots represent standard errors of the mean differences.