NeuroDog: Quadruped Embodiment using Neural Networks


Fig. 1. Our virtual reality (VR) quadruped embodiment system in action. Left: tracking the user's input motion using an HMD and three body trackers (located on the two feet and the pelvis). Middle: a third-person view of the virtual dog. Right: the first-person perspective experienced by the user in VR (the user is looking into a mirror in the virtual environment).
Virtual reality (VR) allows us to immerse ourselves in alternative worlds in which we can embody avatars to take on new identities. Usually, these avatars are humanoid or possess very strong anthropomorphic qualities. Allowing users of VR to embody non-humanoid virtual characters or animals presents additional challenges. Extreme morphological differences and the complexities of different characters' motions can make the construction of a real-time mapping between input human motion and target character motion a difficult challenge. Previous animal embodiment work has focused on direct mapping of human motion to the target animal via inverse kinematics. This can lead to the target animal moving in a way which is inappropriate or unnatural for the animal type. We present a novel real-time method, incorporating two neural networks, for mapping human motion to realistic quadruped motion. Crucially, the output quadruped motions are realistic, while also being faithful to the input user motions. We incorporate our mapping into a VR embodiment system in which users can embody a virtual quadruped from a first-person perspective. Further, we evaluate our system via a perceptual experiment in which we investigate the quality of the synthesised motion, the system's response to user input, and the sense of embodiment experienced by users. The main findings of the study are that the system responds as well as traditional embodiment systems to user input, produces higher quality motion, and users experience a higher sense of body ownership when compared to a baseline method in which the human-to-quadruped motion mapping relies solely on inverse kinematics. Finally, our embodiment system relies solely on consumer-grade hardware, thus making it appropriate for use in applications such as VR gaming or VR social platforms.

INTRODUCTION
With the growing prevalence of concepts such as the Metaverse, taking on new identities by means of embodying avatars in virtual worlds is becoming increasingly common. Traditionally, users of virtual reality (VR) have been limited to embodying avatars that are humanoid or have very strong anthropomorphic qualities. When animal avatars are available, their animations are typically generated via inverse kinematics (IK) based on the users' motions. The resulting animations are likely to be unrealistic or inappropriate for the animal type. Potentially, this could lead to a weaker virtual embodiment experience for the user.
To address this, we present NeuroDog, a VR quadruped embodiment system. NeuroDog is the first VR quadruped embodiment system to use deep learning to synthesise realistic motion for the virtual quadruped. The user, viewing themselves as a quadruped from a first-person perspective, is able to perform different actions in a natural, intuitive and relatively unconstrained manner. The virtual quadruped mimics their actions, switching gaits as the user moves faster or slower, sitting as the user sits, or raising its paws as the user raises their legs. At the core of NeuroDog is a novel architecture, built using two neural networks, for the real-time mapping of natural human motion to realistic quadruped motion. The architecture is designed to preserve, as much as possible, the synchrony between the user's legs and the virtual quadruped's front legs when performing tasks such as locomotion or paw-raises. We evaluate our system via a perceptual experiment in which we investigate the virtual embodiment experienced by users of NeuroDog, their perception of NeuroDog's motion, and their overall experience of using NeuroDog. Our experiment is the first to investigate the importance of animation fidelity for virtual animal embodiment. Comparing NeuroDog to a quadruped embodiment system relying solely on IK, the current standard for animal embodiment systems, we find that users do, in fact, rate the motion of NeuroDog as more natural, more pleasant and more dog-like, thus validating the quality of NeuroDog's output. Users also experience a higher sense of virtual body ownership when using NeuroDog. This is all achieved without reducing the levels of agency experienced by users. Finally, we find a strong positive correlation between the ownership levels experienced by users of quadruped embodiment systems and users' naturalness ratings of the virtual quadruped's motion. This suggests that animation quality should be an important factor when designing future virtual animal embodiment systems.
NeuroDog's reliance solely on consumer-grade devices, an HMD and three body trackers, makes it appropriate for VR gaming and VR social platforms. The realism of the synthesised animations means it is also suitable for multi-user VR scenarios. While we envisage these being the main applications for NeuroDog, the potential applications are much wider. In the film industry, NeuroDog could allow human actors cast in the roles of animals to visualise their performances on their virtual characters from a first-person perspective, live on-set. This could potentially lead to improved performances and increased satisfaction for the actors. Filmmakers, also immersed in this virtual reality, would be able to visualise the virtual characters' performances earlier in the film-making process, thus improving collaboration between filmmakers and actors and helping with the creative process. NeuroDog also has potential environmental applications. For example, allowing people to embody animals can endow them with empathy towards threatened species [Ahn et al. 2016].
To summarise, the main contributions of this paper are:
• A new neural network architecture for the real-time mapping of human motion to quadruped motion such that the synthesised quadruped motions are realistic while also being faithful to the input human motions.
• A novel quadruped embodiment system suitable for use in a range of VR applications.
• An experiment in which we present new perceptual results regarding virtual animal embodiment.

RELATED WORK
In the following subsections we briefly review the existing research related to our work.

Virtual embodiment
Kilteni et al. [2012] describe the sense of embodiment as consisting of three sub-components, namely the sense of body ownership, the sense of agency, and the sense of self-location. These terms refer to the user's feeling that the virtual character's body is the source of their experienced sensations, their sense that they are the cause of the virtual character's movements, and their feeling that they are located inside the virtual character's body, respectively. While most research explores the embodiment of humanoid and extended humanoid avatars, it has extended to the study of virtual animal embodiment. Krekhov et al. [2019b] find that virtual body ownership is also applicable to non-humanoids, and researchers have investigated embodying animals as diverse as turtles [Pimentel and Kalyanaraman 2022], cows [Ahn et al. 2016], cats [Li et al. 2022], birds [Oyanagi and Ohmura 2019], beavers [Sierra Rativa et al. 2020], spiders, tigers and bats [Krekhov et al. 2019b]. However, all previous methods have used simple inverse kinematics (IK) based solutions for mapping the input user human motions to the target virtual animal. This results in high levels of agency but low animation quality (when viewed in a mirror or from a third-person perspective).
No previous method has investigated the use of data-driven techniques, employing real animal motion-capture data, for animating the virtual animals in order to enhance the sense of human-to-animal embodiment. While our work is the first to explore this, there has been some work using data-driven techniques to enhance human-to-human embodiment in VR. For example, Ponton et al. [2022] combine a neural network for body orientation prediction, motion matching [Clavet 2016] and IK for animating self-avatars in VR. Winkler et al. [2022] use a reinforcement learning framework to simulate plausible and physically valid full-body motions from sparse input signals obtained from an HMD and two controllers.

Human to animal motion mapping
While not incorporated into VR animal embodiment systems, there are existing works which look at mapping human motion to animal motion. Many of the systems in these works are more akin to puppetry systems in which the user controls the target character, for example, using gestures or by performing a limited set of pre-determined motions. In contrast, our goal is for the user to be able to act in a more unconstrained fashion. For example, Komura and Lam [2006] use sensing gloves to map a user's hand motion to locomotion on the target character (a quadruped) in real time. Seol et al. [2013] present a puppetry system for non-human characters (for example, dinosaur, mammoth, shark, and spider). The final output target pose is a blend between a pose resulting from learned linear mappings between input human features and target character features and a pose determined via a frame-by-frame correspondence constructed from reference human motions and target character animations. Rhodin et al. [2015] estimate wave properties (amplitude, frequency and phase) from input user wave gestures in real time and use these to animate a target character (for example, a quadruped or a caterpillar) via a parametric motion graph [Heck and Gleicher 2007] in which the example animations have been annotated with amplitude, frequency and phase parameters. Yamane et al. [2010] and Celikcan et al. [2015] both present methods for animating non-humanoid characters (for example, penguin, squirrel, or dinosaur) using human motion capture data. Mappings between input human poses and target character poses are constructed via pose correspondences generated by selecting key poses from the human motion capture data and creating corresponding poses for the target character. The requirement to define key pose pairs and the use of a pose-to-pose mapping limit the realism and variation of the resulting animations. Also, these systems do not run in real time, a key requirement for us. Boehs and Vieira [2019] present a real-time system for controlling a quadrupedal character (a horse) using human motion as input. Similar to our method (and dissimilar to some of the puppetry methods discussed above), the user is able to move freely and naturally with their motion relatively unconstrained. Their system depends on two neural networks. However, the networks are used to choose the most appropriate frame to play back from pre-generated animations (with only one such animation used for each motion class) rather than to synthesise new animations on the fly. The playback of pre-generated animations is in contrast to our work, in which we synthesise motion frames using a large database of quadruped motion capture data. Also unlike us, they do not incorporate their method into VR.

Deep learning for motion synthesis
A key component of our work is the synthesis and control of realistic quadruped motion. Current state-of-the-art methods for motion synthesis are data-driven deep-learning models, and we also use such a model in our method. Mourot et al. [2021] provide a comprehensive survey of such methods for skeletal-based animation.
Researchers have used recurrent neural networks (RNNs) to learn the temporal dynamics of (human) motion [Fragkiadaki et al. 2015; Ghorbani et al. 2020; Pavllo et al. 2020]. Their models are purely predictive, i.e. they do not allow for control, which we require. Holden et al. [2015, 2016] train a convolutional autoencoder to learn the manifold of human motion. However, the requirement of this time-convolution architecture to "see the future" makes it unsuitable for online motion synthesis, an essential requirement for us.
Mixture of experts (MoE) architectures are a popular choice for motion synthesis and control [Holden et al. 2017; Mason et al. 2022; Starke et al. 2020, 2022; Zhang et al. 2018]. Here, the concept of motion phase(s) is used to force the progression in time of synthesised animations and to prevent regression to a mean pose. In these auto-regressive, but memory-less, models the weights of a feed-forward motion prediction network are dynamic, updating at each time-step according to the phase(s) of the motion. The theory is that by clustering poses according to their motion phase(s), any averaging of poses will be limited to poses of the same phase(s). We also use a MoE model, similar to that introduced in [Zhang et al. 2018]. However, unlike the MoE models discussed here, which are designed to take their control input from a gamepad, our model is required to take its input from a human user's motion.

Human versus dog motion
A key requirement for us is that the synthesised quadruped animations be realistic. Unlike humans, who mainly use two (symmetrical) gaits for locomotion (i.e. a run and a walk), quadrupeds use several, including a walk at slow speeds, a trot or pace at moderate speeds, and a canter or gallop at high speeds. These gaits have distinct movement patterns, and the transitions between them are discontinuous, as opposed to a smooth blend from one to the other. The gaits can be classified as symmetrical (walk, trot, pace) or asymmetrical (canter, gallop). A gait is symmetrical if the left and right legs of each pair move half a stride out of phase with each other. Otherwise it is asymmetrical. Figure 2 shows relative phases for some quadrupedal gaits. Alexander [2003] provides an in-depth discussion of animal locomotion. These considerations feed into our design choices, as discussed in section 3.

NEURODOG: QUADRUPED EMBODIMENT SYSTEM
We propose a new method for the VR embodiment of quadrupeds. Our quadruped embodiment system is the first to use deep learning for synthesising the virtual quadruped's movements at run-time. This leads to more realistic animation of the virtual quadruped when compared to the current standard of animal embodiment systems, which tend to rely on inverse kinematics. The user is able to behave in an intuitive and relatively unconstrained manner, performing actions such as locomotion at different speeds, sitting, or paw-raising, with the virtual quadruped mimicking their actions. As previous research [Kokkinara and Slater 2014] has found that enhanced embodiment is achieved when the real and the virtual bodies move synchronously, our architecture is designed to preserve, as much as possible, the synchrony between the user's legs and the virtual quadruped's front legs when performing tasks such as locomotion or paw-raises. This is intended to give the user a sense of agency over the quadruped.
3.0.1 Design choices. Our design choices for the system are determined by some key requirements. We want control of the virtual quadruped to be intuitive for the user. For example, if the user wants the quadruped to walk, sit or raise a paw, then they walk, sit or raise a limb, respectively. To increase the user's sense of agency over the virtual quadruped, we feel it necessary that they use their limbs to control some, or all, of the quadruped's limbs. For two reasons, we decided the user should take a bipedal stance (as opposed to a quadrupedal stance) in which their legs are mapped to the quadruped's front legs. First, requiring a quadrupedal stance has been found to be significantly more exhausting for users [Krekhov et al. 2019a]. Second, a key requirement for us is that the quadruped's animations be realistic. On analysing some sample motion capture data of potential users trying to replicate various quadruped gaits, we found users were unable, or at least struggled, to accurately replicate the gaits without significant practice. Thus, mapping all four user limbs to all four quadruped limbs could make realism a more difficult goal to achieve. With a bipedal stance, the user's legs move half a stride out of phase with each other during locomotion, which is also the case for the quadruped's front legs when performing symmetrical gaits (section 2.4). Thus, we feel this design choice is appropriate.
Another key requirement is that the system be accessible, i.e. not reliant on expensive equipment such as full motion capture setups. Thus, we limit ourselves to using consumer-grade devices. The system only requires a head-mounted display (HMD) and three tracking sensors, placed on the user's pelvis and feet.
3.0.2 Overview. The core component of our method is a novel architecture, depending on two neural networks, for mapping the input user (human) motion to output animation for the virtual quadruped. An outline of our method is shown in Figure 3b. The purpose of the first network is to predict the target root velocity of the virtual quadruped based on the velocities of the user's two feet. This predicted target velocity, along with the user's facing direction, is then used to construct a target trajectory for the virtual quadruped to follow. The second network then uses this constructed trajectory, together with other information extracted from the user (target action to perform, target feet contact labels, target feet velocities), to predict the quadruped's next pose based on its previous one.

Once the virtual quadruped's body pose has been predicted, inverse kinematics is used to animate the neck and head pose based on the HMD orientation. Finally, the user views themselves as the virtual quadruped in the virtual environment. We describe our system in more detail in the following sections.

Motion mapping
In this section, we describe our method for mapping the input user motion to the output virtual quadruped animations. The mapping relies on two neural networks, both of which are trained using a dataset of quadrupedal motion capture data. We describe the dataset and the networks below. The training and run-time pipelines are illustrated in Figures 3a and 3b, respectively.
3.1.1 Dataset. We use the dataset introduced in Zhang et al. [2018] for training the two neural networks. The data consists of unstructured motion capture data for a single dog. It contains various modes of locomotion (walk, pace, trot, canter), as well as other motions such as sitting, paw-raising, idling, lying, and jumping. After removing unwanted data clips (for example, because of noisy data or unwanted actions) and mirroring the remaining data, we were left with 284,502 frames of motion. An approximate breakdown of the data by motion type or action is given in Table 1. A skeleton fitted to the motion capture data contains 27 joints, including a root and 6 end-effectors (Figure 3). As mentioned previously, a goal of ours is for the virtual quadruped's front limbs to move in sync with the user's legs. On analysing sample human and quadruped motion capture data, we noted that quadruped strides tended to last significantly less time than human strides. To counter this, we slow the original quadruped motion capture down threefold. The data is processed differently for training each of the two neural networks; the different processing pipelines are explained in the following sections.

3.1.3 Target velocity prediction network. We want the virtual quadruped's steps to be synchronous with the user's steps. However, by taking a given number of steps, a quadruped and a human may travel very different distances. Thus, we cannot directly use the user's root velocity as the target velocity for the quadruped, as moving at this velocity may require the quadruped to take significantly more or fewer steps than the user. This motivates our use of a neural network to predict the target root velocity for the quadruped. The target velocity prediction network predicts the target 2D ground-plane root velocity $\hat{\mathbf{v}}_r^{i} \in \mathbb{R}^2$ for the virtual quadruped for the next time-step $i$. The run-time input to the network consists of the target 3D left and right front feet velocities $\mathbf{v}_{lf}^{i-K:i}$ and $\mathbf{v}_{rf}^{i-K:i}$ for the virtual quadruped for the next time-step $i$, together with the corresponding values from the previous $K$ time-steps, and the vector $\hat{\mathbf{v}}_r^{i-K:i-1} \in \mathbb{R}^{K \times 2}$ containing the target 2D ground-plane root velocities from the previous $K$ time-steps, i.e. the network's previous predictions are fed back in and used for the next prediction. All inputs and outputs to the network are in a root-centric coordinate system to make them independent of the global orientation (of the quadruped during training and the user at run-time). We use a value of 20 for $K$, meaning the network input and output are 166 and 2 dimensional, respectively.
The network itself has a simple feed-forward architecture with three fully-connected layers. The two hidden layers have 512 and 256 units, respectively. The final output layer has 2 units. Each hidden layer is followed by an ELU (exponential linear unit) [Clevert et al. 2016] activation layer.
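To make the shapes concrete, the following is a minimal PyTorch sketch of how such a network and its 166-dimensional input could be assembled. The class name, tensor layout and input-assembly details are our own illustrative choices, not the exact implementation.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Feed-forward network predicting the quadruped's 2D ground-plane root velocity."""
    def __init__(self, k: int = 20):
        super().__init__()
        # Feet-velocity window (k+1 time-steps, 2 feet, 3D) plus k previous root velocities.
        in_dim = (k + 1) * 2 * 3 + k * 2   # = 166 for k = 20
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ELU(),
            nn.Linear(512, 256), nn.ELU(),
            nn.Linear(256, 2),              # target 2D ground-plane root velocity
        )

    def forward(self, feet_vel_window, prev_root_vel):
        # feet_vel_window: (batch, k+1, 2, 3) left/right front foot velocities, root-centric
        # prev_root_vel:   (batch, k, 2) previously predicted root velocities
        x = torch.cat([feet_vel_window.flatten(1), prev_root_vel.flatten(1)], dim=1)
        return self.net(x)

# Example: k = 20 gives a 166-dimensional input and a 2-dimensional output.
net = VelocityNet(k=20)
out = net(torch.zeros(1, 21, 2, 3), torch.zeros(1, 20, 2))
print(out.shape)  # torch.Size([1, 2])
```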
The network is trained using the quadruped motion capture dataset described in section 3.1.1. Training pairs, consisting of the network inputs described above and the corresponding ground-truth root velocities $\mathbf{v}_r^{i}$, are extracted from the data. The network is trained in a supervised manner by minimising the mean squared error between the network output and the ground truth, i.e. by minimising $\mathcal{L}(\hat{\mathbf{v}}_r^{i}, \mathbf{v}_r^{i}) = \|\hat{\mathbf{v}}_r^{i} - \mathbf{v}_r^{i}\|^2$. As the network's run-time predictions depend on its previous predictions, we use scheduled sampling [Bengio et al. 2015] during training to increasingly expose the network to its previous predictions by replacing the ground-truth previous root velocities $\mathbf{v}_r^{i-K:i-1}$ with the predicted values $\hat{\mathbf{v}}_r^{i-K:i-1}$ in the training input. Full details of the training process, including the scheduled sampling, are provided in section 3.3.1. Despite the network being exclusively trained on quadruped data, the run-time inputs are constructed from human (i.e. the user's) motion. This process is described in section 3.2.
3.1.4 Pose prediction network. The purpose of the pose prediction network is to predict the virtual quadruped's pose, including global position and orientation, at the next time-step $i$. The network inputs and outputs are similar in style to those used in [Holden et al. 2017; Starke et al. 2020; Zhang et al. 2018], but with some adjustments and additions. The input to this network for time-step $i$ is a vector $\mathbf{X}^{i}$, while the output is a vector $\mathbf{Y}^{i}$. The trajectory terms $\mathbf{T}_p^{i-3S:i+3S:t}$, $\mathbf{T}_v^{i-3S:i+3S:t}$ and $\mathbf{T}_d^{i-3S:i+3S:t}$ define the trajectory for the virtual quadruped: past values are those that were achieved, while future values are targets. For a given time-step $j$, $\mathbf{T}_p^{j}$, $\mathbf{T}_v^{j}$, $\mathbf{T}_d^{j} \in \mathbb{R}^2$ represent the 2D ground-plane trajectory position, velocity and facing direction, respectively. Past and future trajectory values are used for the input, while only future trajectory values are output. Input trajectory values at time-step $j$ are relative to the root at time-step $i-1$, while output trajectory values are relative to the updated root at time-step $i$. The term $\mathbf{T}_a^{i-3S:i+3S:t}$ defines the virtual quadruped's actions along the trajectory. For a given time-step $j$, $\mathbf{T}_a^{j}$ is a 5D (idle, move, sit, stand, lie) one-hot vector encoding the quadruped's action. The next inputs $\mathbf{v}_{lf}^{i-3S:i:t}$ and $\mathbf{v}_{rf}^{i-3S:i:t}$ are the target 3D left and right front feet velocities for the virtual quadruped for both the next time-step $i$ and the past time-steps defined by the superscripts. For a given time-step $j$, the velocities are relative to the root's facing direction at that time-step. The $\mathbf{c}^{i-3S:i:t}$ are the desired contact labels for the quadruped's front feet for both the next time-step $i$ and the past time-steps defined by the superscripts, with $\mathbf{c}^{j} \in \{0, 1\}^2$ for a given time-step $j$. Other works [Holden et al. 2017; Starke et al. 2022; Zhang et al. 2018] have used one-second past and future windows for similarly styled network inputs; we use three-second windows to compensate for the fact that we have slowed down the quadruped data threefold. The outputs $\mathbf{v}_r^{i} \in \mathbb{R}^2$ and $\omega^{i} \in \mathbb{R}$ are the 2D ground-plane velocity and the angular velocity around the vertical axis, respectively, which define the virtual quadruped's global position and facing direction update for frame $i$. Both values are output relative to the root at time-step $i-1$. Finally, $\mathbf{p}^{i} \in \mathbb{R}^{J \times 3}$, $\mathbf{v}^{i} \in \mathbb{R}^{J \times 3}$ and $\mathbf{r}^{i} \in \mathbb{R}^{J \times 6}$ are the positions, velocities, and rotations, respectively, of all the virtual quadruped's $J$ joints at time-step $i$, all relative to the root at that time-step. Joint rotations are represented using the 6D continuous representation presented in [Zhou et al. 2019]. In our case, $J = 27$ and the two window-sampling parameters are set to 10 and 20. This results in the input vector $\mathbf{X}^{i}$ and the output vector $\mathbf{Y}^{i}$ having 523 and 384 dimensions, respectively.
The network itself is a mixture-of-experts model similar to that introduced by Zhang et al. [2018]. We use 8 expert networks. Each expert is a three-layer fully-connected multilayer perceptron (MLP). The two hidden layers have 512 units, while the final output layer has 384 units. Each hidden layer is followed by an ELU activation layer. The gating network, which outputs the weights for blending the experts, is also a three-layer fully-connected MLP. Its two hidden layers have 512 units, while its final output layer has 8 units. Again, each hidden layer is followed by an ELU activation layer. The input to the gating network is a subset of the total input. Similar to the target velocity prediction network (section 3.1.3), the pose prediction network is trained using the quadruped motion capture dataset described in section 3.1.1. Training pairs are extracted from the data and the network is trained in a supervised manner by minimising the mean squared error between the network output and the ground truth data, i.e. by minimising $\mathcal{L}(\hat{\mathbf{Y}}^{i}, \mathbf{Y}^{i}) = \|\hat{\mathbf{Y}}^{i} - \mathbf{Y}^{i}\|^2$. Full details of the training process are provided in section 3.3.1.
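As a concrete illustration, below is a minimal PyTorch sketch of a mixture-of-experts prediction network in this style, in which a gating MLP produces blend weights that combine the expert weight matrices per sample. The class names and the gating input size (`gate_in_dim`) are placeholders, since the exact subset of the input fed to the gating network is not specified here; this is a sketch of the technique rather than the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlendedLinear(nn.Module):
    """Linear layer whose weights are a per-sample blend of expert weight matrices."""
    def __init__(self, num_experts: int, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(num_experts, out_dim, in_dim))
        self.bias = nn.Parameter(torch.zeros(num_experts, out_dim))

    def forward(self, x: torch.Tensor, blend: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim); blend: (batch, num_experts), rows summing to 1
        w = torch.einsum('be,eoi->boi', blend, self.weight)   # blended weights
        b = torch.einsum('be,eo->bo', blend, self.bias)       # blended biases
        return torch.einsum('boi,bi->bo', w, x) + b

class PosePredictionMoE(nn.Module):
    def __init__(self, in_dim=523, out_dim=384, gate_in_dim=64, num_experts=8):
        super().__init__()
        # Gating network: maps a subset of the input to expert blend weights.
        self.gate = nn.Sequential(
            nn.Linear(gate_in_dim, 512), nn.ELU(),
            nn.Linear(512, 512), nn.ELU(),
            nn.Linear(512, num_experts),
        )
        # Expert-blended prediction layers: 523 -> 512 -> 512 -> 384.
        self.l1 = BlendedLinear(num_experts, in_dim, 512)
        self.l2 = BlendedLinear(num_experts, 512, 512)
        self.l3 = BlendedLinear(num_experts, 512, out_dim)

    def forward(self, x: torch.Tensor, gate_x: torch.Tensor) -> torch.Tensor:
        blend = F.softmax(self.gate(gate_x), dim=-1)   # per-sample expert blend weights
        h = F.elu(self.l1(x, blend))
        h = F.elu(self.l2(h, blend))
        return self.l3(h, blend)

# Example usage with the dimensions quoted above (gate_in_dim is a placeholder).
model = PosePredictionMoE()
y = model(torch.zeros(4, 523), torch.zeros(4, 64))
print(y.shape)  # torch.Size([4, 384])
```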
At run-time, the inputs to the pose prediction network are a blend between the network's predictions at the previous time-step and values obtained from the user's motions. This process is described in section 3.2.

System design
The architecture described in section 3.1 forms the core component of our embodiment system. Here, we describe its incorporation into the system and how the system works in practice. Figure 3b illustrates the process.
Together with the HMD, the user wears a tracker on their pelvis and on each of their feet. When they initially enter the virtual environment, a calibration step occurs in which the heights of the trackers off the ground are recorded. These values are then used to determine thresholds for classifying the action of the user (for example, when they are sitting) and determining when their feet are in contact with the ground.

The user's 3D feet velocities are captured using the tracking sensors placed on their feet. These are rotated into a root-centric coordinate system by removing the user's rotation about the vertical axis, which is obtained from the forward-facing direction of the pelvis tracker. The velocities are then scaled to be in a plausible range of quadruped feet velocities to obtain $\mathbf{v}_{lf}^{i-K:i}$ and $\mathbf{v}_{rf}^{i-K:i}$. Each input human foot velocity $\mathbf{v}_{h,s}$ can be scaled according to
$$\mathbf{v}_{q,s} = \mathbf{v}_{q,\min} + \frac{\mathbf{v}_{h,s} - \mathbf{v}_{h,\min}}{\mathbf{v}_{h,\max} - \mathbf{v}_{h,\min}} \left( \mathbf{v}_{q,\max} - \mathbf{v}_{q,\min} \right), \qquad (4)$$
where $s \in \{lf, rf\}$ indexes the left or right foot and the minimum and maximum velocity values determine the ranges of human ($h$) and quadruped ($q$) feet velocities. By setting the minimum human and dog feet velocities both to 0, equation 4 reduces to $\mathbf{v}_{q,s} = c \cdot \mathbf{v}_{h,s}$ for some scaling factor $c$. The scaling factor $c$ can be a 1D scalar (uniform scaling) or a 3D vector (non-uniform scaling), and can also take on different values depending on whether the different components of the human velocity are in a positive or negative direction. In practice, we use a single scalar value for $c$. Its value was determined manually, by observing some pilot users of our system and tweaking $c$ until the most visually appealing results were obtained. For simplicity, we used the same $c$ for all users in our experiment (section 4). An alternative approach would be to determine values for $c$ using minimum and maximum velocity values obtained from the quadruped motion capture dataset and a human motion capture dataset. User-specific scaling factors could be achieved by obtaining minimum and maximum human velocity values during the initial calibration phase.

The resulting scaled velocities are fed into the target velocity prediction network, along with this network's previous outputs, to obtain the target root ground-plane velocity $\hat{\mathbf{v}}_r$ for the quadruped. Next, the run-time inputs for the pose prediction network are calculated. The future values in $\mathbf{T}_p$, $\mathbf{T}_v$ and $\mathbf{T}_d$ are a blend between the network's outputs at previous time-steps and the target trajectory, which is calculated from the user's smoothed facing direction and the target root velocity $\hat{\mathbf{v}}_r$ predicted by the target velocity prediction network. The user's action is determined by their velocity and the heights of the trackers. For example, if the user's pelvis falls below a threshold (determined at the calibration stage), their action is classified as sitting. These classification values and the network's previous action outputs are blended to obtain the future values in $\mathbf{T}_a$. The target feet velocities $\mathbf{v}_{lf}$ and $\mathbf{v}_{rf}$ are obtained in the same fashion as for the target velocity prediction network. The front paw contact labels $\mathbf{c}$ are obtained by checking if the user's feet are below the height thresholds obtained in the calibration step. The purpose of these inputs is to try to force the virtual quadruped's front legs to move in sync with the user's legs, both for locomotion and other actions such as paw-raises.
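A minimal sketch of this run-time pre-processing follows, assuming a single scalar scaling factor and simple calibrated height thresholds. The function name, tracker data layout and threshold values are illustrative placeholders rather than the actual implementation.

```python
import numpy as np

def preprocess_user_input(foot_vel_world, pelvis_yaw, pelvis_height,
                          foot_heights, calib, c=1.0):
    """Convert raw tracker readings into root-centric, quadruped-scaled inputs.

    foot_vel_world: dict with 'left'/'right' 3D foot velocities in world space
    pelvis_yaw: user's facing angle (radians) from the pelvis tracker
    calib: per-user calibration (standing pelvis height, foot rest heights)
    c: manually tuned scalar mapping human foot velocity to a quadruped range
    """
    # Rotate velocities into a root-centric frame by removing the yaw rotation.
    theta = -pelvis_yaw
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    scaled = {side: c * (rot @ v) for side, v in foot_vel_world.items()}

    # Contact labels: a foot is "in contact" when close to its calibrated rest height.
    contacts = {side: float(foot_heights[side] < calib['foot_rest'][side] + 0.05)
                for side in ('left', 'right')}

    # Action classification from the calibrated pelvis height (illustrative threshold).
    action = 'sit' if pelvis_height < 0.6 * calib['standing_pelvis'] else 'move_or_idle'

    return scaled, contacts, action
```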
Once all the input information has been determined, the body pose (including global position and orientation) of the quadruped is predicted. Finally, inverse kinematics is used to animate the quadruped's neck and head so that its orientation matches that of the user's head, which is determined via the HMD.
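Putting the pieces together, the per-frame run-time flow can be summarised with the following Python-style sketch. Every function and object named here (the state container, `velocity_net`, `pose_net`, `build_pose_input`, `apply_head_ik`) is a hypothetical stand-in for the steps described above, not actual NeuroDog code.

```python
def neurodog_frame(trackers, hmd, state):
    """One run-time step of the human-to-quadruped mapping (illustrative only)."""
    # 1. Pre-process tracker data: root-centric, quadruped-scaled feet velocities,
    #    contact labels and a coarse action classification.
    feet_vel, contacts, action = preprocess_user_input(
        trackers.foot_velocities(), trackers.pelvis_yaw(),
        trackers.pelvis_height(), trackers.foot_heights(), state.calibration)

    # 2. Predict the quadruped's target root velocity from the feet-velocity window
    #    and the velocity network's own previous predictions.
    root_vel = velocity_net(state.feet_vel_history(feet_vel),
                            state.root_vel_history())

    # 3. Build the pose network input: blend previous predictions with the target
    #    trajectory derived from the user's facing direction and root_vel.
    pose_input = build_pose_input(state, root_vel, trackers.facing_direction(),
                                  feet_vel, contacts, action)

    # 4. Predict the next body pose (joint positions/rotations plus root update).
    pose = pose_net(pose_input)

    # 5. Head/neck IK from the HMD orientation, then update the stored state.
    pose = apply_head_ik(pose, hmd.orientation())
    state.update(root_vel, pose)
    return pose
```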
The position of the camera through which the user views the virtual environment follows the position of the quadruped's head. However, we smooth the camera's trajectory using a spring-damper system to mitigate any jerkiness in the quadruped's head movements which could cause the user discomfort. The orientation of the camera matches that of the HMD.
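For reference, a critically damped spring is a common way to implement this kind of camera smoothing. The sketch below is a generic implementation of that idea with illustrative parameter values; it is not necessarily the exact damping scheme used in NeuroDog.

```python
import numpy as np

def spring_damper_step(pos, vel, target, dt, halflife=0.15):
    """Critically damped spring: smoothly move `pos` towards `target`.

    `halflife` controls how quickly the smoothed camera catches up with the
    quadruped's head (smaller = snappier, larger = smoother).
    """
    omega = 2.0 * np.log(2.0) / halflife        # stiffness derived from the half-life
    j = vel + omega * (pos - target)            # helper term for the closed-form update
    decay = np.exp(-omega * dt)
    new_pos = target + (pos - target + j * dt) * decay
    new_vel = (vel - j * omega * dt) * decay
    return new_pos, new_vel

# Each frame:
# camera_pos, camera_vel = spring_damper_step(camera_pos, camera_vel, head_pos, dt)
```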
Figure 4 shows some sample poses produced by NeuroDog, while examples of NeuroDog in action can be seen in the accompanying video, which is available in the supplementary material.

Additionally, to mitigate error accumulation at run-time, scheduled sampling [Bengio et al. 2015] was used for training the target velocity prediction network. That is, the training data was split into roll-out windows of length 20 frames. For the first 70 epochs, the network is exclusively fed ground-truth previous root velocity values. For the following 50 epochs, the network is exposed to its previous root velocity predictions with linearly increasing probability for the duration of the roll-out windows. From epoch 120 onwards, the network is exclusively fed its previous root velocity predictions. Training of both networks continued for slightly less than 400 epochs. Training was done using an Nvidia GeForce RTX™ 3090 GPU.
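To illustrate the schedule, the following is a sketch of how the probability of feeding back the network's own predictions could be ramped over epochs; the linear interpolation within the ramp follows the epoch boundaries stated above, while everything else (names, defaults) is our own illustrative choice.

```python
def own_prediction_probability(epoch: int,
                               gt_only_until: int = 70,
                               ramp_epochs: int = 50) -> float:
    """Probability of replacing ground-truth previous root velocities with the
    network's own predictions during a roll-out window (scheduled sampling)."""
    if epoch < gt_only_until:                  # epochs 0-69: ground truth only
        return 0.0
    if epoch < gt_only_until + ramp_epochs:    # epochs 70-119: linear ramp
        return (epoch - gt_only_until) / ramp_epochs
    return 1.0                                 # epoch 120 onwards: predictions only
```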

Run-time implementation.
The run-time implementation of our quadruped embodiment system is done using the Unity 3D game engine. The user's motion is tracked using a VIVE Pro Eye HMD and three VIVE Trackers (3.0); no hand controllers are used. Neural network inference is done on the CPU via the Unity Barracuda library. The Final IK library for Unity is used for animating the quadruped's neck and head. Currently, the system runs at between sixty and seventy frames per second (fps) on a PC with the same GPU as specified in section 3.3.1 and an AMD Ryzen 9 5950X 16-core CPU.

PERCEPTUAL EXPERIMENT
As the purpose of our system is to provide users with a strong sense of virtual animal embodiment, we felt a perceptual experiment in which we investigated users' opinions of our system was the best method of evaluation.
In our experiment, we investigated the three dimensions of virtual embodiment [Roth and Latoschik 2020] experienced by users of NeuroDog, users' perception of NeuroDog's motion, and users' overall experience of using NeuroDog. We were interested to find out whether users agreed that NeuroDog's motion was, in fact, natural or dog-like and, if so, whether this led to an increased sense of virtual embodiment and an improved user experience in general when compared to more traditional virtual animal embodiment setups which rely solely on inverse kinematics (IK), for example, as in [Krekhov et al. 2019a].
Our main hypotheses for the experiment were as follows:
• H1: The system that produces more dog-like, natural motion will result in higher virtual body ownership of the virtual dog.
• H2: Agency will be higher for IK than NeuroDog due to the direct mapping of limbs.
• H3: Motion quality will be rated higher for NeuroDog as the network is trained on real dog motion capture.

Inverse Kinematics method
In order to test our hypotheses, we implemented a baseline VR quadruped embodiment method against which to compare NeuroDog. The baseline method relies solely on inverse kinematics (i.e. no machine learning is used) for animating the virtual quadruped's body. The Final IK library for Unity is used for the inverse kinematics.
The same user setup is used for the baseline system, i.e. the user wears three VIVE Trackers (3.0) located on the feet and pelvis, and the HMD (see Figure 1, left). The feet trackers determine the positions of IK targets for the virtual quadruped's limbs. The left/right tracker determines the IK targets for the virtual quadruped's two left/right legs, i.e. each of the user's legs controls two of the virtual quadruped's legs. This is similar to one of the setups described in [Krekhov et al. 2019a]. The neck and head are animated in the same fashion as with NeuroDog (i.e. also using IK). The virtual quadruped's global orientation and velocity match those of the tracker on the user's pelvis.
The quadruped motion synthesised by the baseline method exhibits some noticeable differences when contrasted with the motion synthesised by NeuroDog. Firstly, the front and back legs on one side of the quadruped's body always move perfectly in phase with one another. While this is similar to a dog's pacing gait, it is not accurate. An alternative design choice would be to have each human leg control a front leg and a back leg on opposite sides of the quadruped's body. This would have resulted in the virtual quadruped exhibiting motion similar to a trot. However, diagonally opposite legs being perfectly in phase with each other would still be unrealistic. (See section 2.4 and Figure 2 for more detail on quadruped motion.) Secondly, the global orientation and position updates of the user and virtual quadruped are identical for the baseline system, something which is not necessarily the case with NeuroDog. This can cause unrealistic motion of the virtual quadruped, for example, if the user rotates their body on the spot, something quadrupeds do not do. Other noticeable differences include the baseline's lack of tail animation and its struggle to produce convincing animation for some actions, for example, sit or paw-raise.
Examples of NeuroDog and the IK baseline method in action can be seen in the accompanying video which is available in the supplementary material.

Experiment design
The experiment was a within-subjects study in which all participants experienced two conditions: IK and NeuroDog. The exact same virtual environment was used for both conditions, with the only difference being whether NeuroDog or the baseline IK method was being used to map the users' motion to the virtual quadruped. The environment, illustrated in Figures 1 and 4, contained three mirrors, in which participants could view their virtual quadrupedal avatar and its movements, as well as ecologically valid objects of known size, such as dustbins and park benches, to help the participants with distance estimation in the scene.

Viewing the virtual body in a mirror reflection is a standard way in the literature to induce the sense of embodiment (e.g., [Kokkinara and Slater 2014; Krekhov et al. 2019b; Oyanagi and Ohmura 2018]). For each condition, the user spent approximately five minutes in the virtual environment embodying the virtual quadruped. During this time, they were (voice) prompted by the experimenter to perform a series of tasks in order to test the embodiment methods. The same (or very similar) prompts were used for all users and for both conditions. Example prompts included statements such as "walk forward towards the mirror," "turn ninety degrees to your right and walk towards the bench," "stand in front of the mirror and look around and down at your feet," "look around and check if you can see your tail," or "sit down and raise your paw."

The questionnaire contained 18 questions (see Table 2). The first 12 questions were the Virtual Embodiment Questionnaire (VEQ) constructed by Roth and Latoschik [2020], with the only adjustment being the word animal replacing the word human in question OW3. Questions E1, E2 and E3 were taken from Krekhov et al. [2019a] and were intended to capture fascination, ease of control, and fatigue, respectively. Questions M1, M2 and M3 were intended to measure the naturalness of the motion [Ferstl et al. 2021] and potential uncanny effects. They were taken from items of the Godspeed questionnaire [Ho and MacDorman 2010], but with the word dog replacing the word human.
Participants were asked to answer all questions on a scale from 1 to 7. Except for the last two questions, the scale went from strongly disagree (1) to strongly agree (7). In the last two questions, participants were asked to rate their impression of the virtual dog's movement on scales from very machine-like (1) to very dog-like (7) and very unpleasant (1) to very pleasant (7). At the end of the questionnaire, there was a space for general feedback where participants were free to write any thoughts they had about the overall experience.
For each participant, the order of exposure to the two conditions was alternated. After experiencing the first condition, participants removed the HMD and completed the questionnaire on a laptop. They then put the HMD back on in order to experience the second condition, after which they removed the HMD and completed the questionnaire for a second time. Following this, the experiment ended with an informal discussion in which participants were able to provide feedback or ask the experimenter questions.

Participants
We recruited 21 participants (split 10/11 by gender) for the study, with ages ranging from 19 to 35 (M = 24.7, SD = 5.7). Participants were asked to self-report their experience with both video games and virtual reality, the results of which are shown in Table 3.

Results
We performed a repeated measures ANalysis Of VAriance (ANOVA) with within-subjects factor System (IK, NeuroDog) for each questionnaire item that met the normality assumption, tested using the Shapiro-Wilk test. When normality was not verified, a non-parametric Friedman test was conducted. See Figure 5 for an overview of the individual item ratings.

4.4.1 Ownership. We also individually analysed each of the separate items and found a main effect of System for OW1 (χ²(1) = 4, p < 0.045), OW2 (χ²(1) = 5.4, p < 0.02), and OW3 (χ²(1) = 7.36, p < 0.007), with participants reporting higher ownership for NeuroDog for all items. No main effect was found for OW4, but a trend towards NeuroDog was observed (F(1, 20) = 3.902, p = 0.062, η²p = 0.163).

4.4.2 Agency. Similarly, we conducted a combined test for Agency and found no main effect, indicating that participants experienced similar levels of agency for both conditions.
We also individually analysed each of the separate items and found no main effect of System for AG1, AG2 or AG4. However, a main effect was found for AG3 (χ²(1) = 3.77, p < 0.035), indicating that participants felt that they were causing the movements of the virtual body more when using the NeuroDog system than with IK.

4.4.3 Change. A combined test was also conducted for Change, with no main effect of System found. Scores were generally quite low, which is common in the literature (e.g. [Krekhov et al. 2019b]), and no main effects were found for the individual items, indicating that the perceived body change was the same for both IK and NeuroDog.

4.4.4 Experience. Each Experience item was analysed separately, as in [Krekhov et al. 2019b]. We found a main effect for E1 (χ²(1) = 4.5, p < 0.034), with users finding the overall experience more fascinating for NeuroDog. No other main effects were found, indicating that participants coped well with the control and also did not find the avatar exhausting to control for either system.

4.4.5 Motion. Each of our motion items was analysed separately. A main effect was found for System on M1 (χ²(1) = 5.00, p < 0.025), M2 (χ²(1) = 7.12, p < 0.008), and M3 (χ²(1) = 7.12, p < 0.008), showing that the motion in NeuroDog was rated significantly more natural, more dog-like and more pleasant than the IK motion. Mean scores were generally high for motion quality, which also validates that users were not disturbed by slowing down the quadruped motions in the dataset, which was necessary to enable synchrony with human motion.
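For readers who wish to reproduce this style of per-item analysis, the following is a minimal Python sketch (Shapiro-Wilk normality check, then a two-condition repeated-measures test or a Friedman test, plus the Spearman correlation reported in section 5). The data layout, file name and the use of a paired t-test as the two-level RM-ANOVA equivalent (F = t²) are our own illustrative choices, not the paper's exact analysis pipeline.

```python
import numpy as np
import pandas as pd
from scipy import stats

def friedman_two_conditions(a, b):
    """Friedman chi-square for two repeated measures (no tie correction)."""
    ranks = np.apply_along_axis(stats.rankdata, 1, np.column_stack([a, b]))
    n, k = ranks.shape
    chi2 = 12.0 / (n * k * (k + 1)) * np.sum(ranks.sum(axis=0) ** 2) - 3 * n * (k + 1)
    return chi2, stats.chi2.sf(chi2, df=k - 1)

# Long-format ratings: one row per (participant, system, item) with a 1-7 score.
df = pd.read_csv("ratings_long.csv")

for item, d in df.groupby("item"):
    ik = d.loc[d.system == "IK"].sort_values("participant").score.to_numpy()
    nd = d.loc[d.system == "NeuroDog"].sort_values("participant").score.to_numpy()
    if stats.shapiro(nd - ik).pvalue > 0.05:
        # Two-condition repeated-measures ANOVA is equivalent to a paired t-test (F = t^2).
        t, p = stats.ttest_rel(nd, ik)
        print(f"{item}: F(1,{len(ik) - 1}) = {t**2:.3f}, p = {p:.3f}")
    else:
        chi2, p = friedman_two_conditions(ik, nd)
        print(f"{item}: chi2(1) = {chi2:.2f}, p = {p:.3f}")

# Spearman correlation between mean ownership and motion-naturalness (M1) ratings.
own = df[df.item.str.startswith("OW")].groupby("participant").score.mean()
nat = df[df.item == "M1"].groupby("participant").score.mean()
rho, p = stats.spearmanr(own, nat.loc[own.index])
print(f"Spearman rho = {rho:.3f}, p = {p:.3g}")
```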

DISCUSSION
Results from our perceptual experiment show a very positive effect of using a data-driven, deep-learning approach for animal embodiment. The motion quality of NeuroDog was judged as more pleasant, more natural, and less machine-like, and the experience was judged more fascinating, than animating the virtual dog directly using human movements via IK, the current standard for animal embodiment systems (e.g. [Ahn et al. 2016; Krekhov et al. 2019b; Oyanagi and Ohmura 2019; Pimentel and Kalyanaraman 2022]). This validates H3.
Previous work has shown the importance of animation fidelity, over other factors such as appearance, for the sense of embodiment of virtual human avatars [Fribourg et al. 2020]. Additionally, other studies have shown that lower-quality tracking can decrease embodiment [Eubanks et al. 2020] or enfacement [Kokkinara and McDonnell 2015], but can perform equally well on agency [Yun et al. 2023]. No previous work has investigated the importance of animation fidelity for the sense of embodiment of an animal, and our results show that higher-fidelity motion synthesised through machine learning can improve embodiment over traditional IK-based motion control. Additionally, a Spearman's correlation test showed a strong positive correlation between Ownership and Motion Naturalness ratings (ρ = 0.687, p < .001), which implies that higher ownership occurred due to the natural animal motion achieved by deep learning, validating H1.
Interestingly, similar levels of agency and change in body schema were achieved between NeuroDog and IK, which is promising for use in VR or Metaverse applications, but which contradicts H2.We believe this validates our neural network architecture as having successfully produced a human-to-dog mapping with high levels of agency and low levels of fatigue for users.Our system also produced motions that the user felt were in-sync with their motions, which validates that our network successfully synchronised the dog animation to the user input, which is an important element for enhancing embodiment [Kokkinara and Slater 2014].
Our system should be appropriate for situations where multiple users occupy a scene embodied as animals, since the third-person view of the motion would be more faithful to the animal than an IK-based system driven by a human. Future work will investigate the feeling of Social Presence [Slater and Steed 2002] of users in such scenes, where we believe systems like NeuroDog would evoke higher levels of Social Presence.

LIMITATIONS AND FUTURE WORK
While NeuroDog works well in most cases, producing plausible and realistic quadruped animations that are faithful to the user's performance and also inducing a strong sense of embodiment for the user, it has its limitations. NeuroDog's motion does, at times, exhibit artifacts (as evidenced in the accompanying video). Foot sliding artifacts could be reduced using IK-based cleanup. Additionally, while the global positional and rotational velocities of the user and virtual dog will, by design, not necessarily be the same (as discussed in section 3.1.3), drift artifacts can still sometimes be observed in NeuroDog's motion. This could potentially be improved using rule-based methods, for example, by overwriting the target velocity prediction network's output if the user's velocity falls below a certain threshold. Also, if the user performs very "un-quadruped-like" motions, such as spinning around quickly on the spot, the system is not guaranteed to output realistic quadruped motion and in the worst cases may just output noise (however, the architecture's memory-less nature means that it is likely to eventually recover once the user returns to performing in a more quadruped-like fashion).
For the paw-raise action, it may have been more intuitive to map the user's hands to the dog's front paws. However, this would either require the user to take a quadrupedal stance (contradicting our design choice, see section 3.0.1) or require the mapping to switch from the user's legs to their hands once they sit down. Currently, NeuroDog's motions are limited to idling, locomotion, sitting, and paw-raising. In theory, other actions could be incorporated, provided the quadruped motion capture dataset contains data for those actions and the corresponding human action can be identified by the action classifier. For example, the dataset used here does contain jumping. However, we omitted this as we limited ourselves to actions which could be performed safely and comfortably in VR.
NeuroDog performs various locomotion gaits. As evidenced in the accompanying video, this does include faster asymmetrical running gaits (for example, a canter or gallop, due to their presence in the training data) when the user speeds up. While the system is designed to preserve the synchrony between the user's legs and the virtual dog's front legs as much as possible, this synchrony is not guaranteed and is less likely to be present when NeuroDog is performing asymmetrical gaits, unless users are capable of performing these gaits with the correct timings. Future work includes considering whether alternative design choices can guarantee a constant sense of synchrony.
Our method uses a small set of trackers to capture the motion of the human. Future work will investigate if there are advantages to using more trackers or full-body motion capture. Currently, our system is not generic to all quadrupeds, working only for dog characters with a specific skeleton. Adapting it for other dog breeds or quadrupeds (for example, cats or horses) would require either capturing motion capture datasets for these animals or developing a solution for motion retargeting between quadrupedal animals (for example, by adapting the method of [Aberman et al. 2020] to animals). Addressing these limitations is a direction for future work.
Other possible future work could include retargeting the user's facial expressions to the virtual quadruped or looking at secondary motions. For example, the user's emotions could be used to control the virtual quadruped's tail or ears. All of this could potentially lead to an improved sense of embodiment being experienced by the user and also a higher quality of output quadruped animations.
In this work, our aim was to create a system for providing users with a strong VR quadruped embodiment experience. We do not fully solve the fundamental problem of mapping or retargeting motions between characters exhibiting extreme morphological differences. An obvious direction for future work is to try to learn (in an unsupervised fashion) a direct mapping between human and quadruped motion. For example, by using the method of Starke et al. [2022] to learn separate phase spaces for both human and quadruped motion, it may be possible to then learn a mapping between these phase spaces. Alternatively, it may be possible to extend or adapt existing state-of-the-art methods for retargeting between homeomorphic (and typically humanoid) skeletons (for example, [Aberman et al. 2020]) to also work in the case of non-homeomorphic skeletons.
In terms of perceptual experiments, our work has also opened a new avenue to explore other aspects of the unique concept of animal embodiment in VR. One dimension of embodiment that did not score highly for our system was change in body schema. Future work will investigate if tactile feedback could enhance the feeling of a highly different body shape or the ownership of fur, and thus improve embodiment overall.

CONCLUSION
We present NeuroDog, a novel system for VR quadruped embodiment, designed for ease of use and exploiting consumer-grade hardware. Our main technical contribution is a novel neural network architecture for the real-time mapping of human motion to quadruped motion, leveraging a mixture-of-experts model [Zhang et al. 2018]. This is validated via a perceptual experiment assessing motion responsiveness, naturalness, and the illusion of embodiment it evokes. Our other contribution lies in the results of our experiment, which show for the first time that higher animal body ownership can be achieved through natural animal motion than through anthropomorphic motion. We hope that our system and results will spark future work in this novel area, with the development of more fascinating embodiment experiences in the Metaverse and beyond.
3.1.2 Notation. In the following, a superscript of the form $i$ denotes a value of a quantity at a specific time-step $i$. A superscript of the form $i:j$ (where $i < j$) denotes the concatenation of values of a quantity from time-step $i$ to time-step $j$, inclusive. A superscript of the form $i:j:t$ denotes the concatenation of values of a quantity from time-step $i$ to time-step $j$ sampled every $t$ time-steps, i.e. the concatenation of values of a quantity at time-steps $i$, $i+t$, $i+2t$, ..., $j$. We use $S$ to represent the number of time-steps corresponding to one second of motion. A hat above a value indicates that it is a network-predicted value.
Fig. 3. An outline of our training (top) and run-time (bottom) pipelines for mapping human motion to quadruped motion. (a) Training: the target velocity prediction (left) and pose prediction (right) networks are trained separately. Both are trained using the quadruped motion capture data, i.e. no human data is used at the training stage. They are trained in a supervised fashion by minimising a mean squared error loss. In addition, the target velocity network also learns from its own predictions. (b) Run-time: three tracking sensors are used to capture the user's feet velocities and facing direction. These are scaled to ensure they are in an appropriate range for a quadruped's feet velocities. These scaled values, along with the past root target velocity predictions, are fed into the velocity prediction network to predict the desired root velocity for the dog. This, along with the user's facing direction, feet contacts, and action, is used to construct the input to the pose prediction network. The quadruped's neck and head are animated via inverse kinematics based on the user's HMD orientation.
Fig. 4. Corresponding human and virtual dog poses from NeuroDog. Note how the human feet correspond to the front paws of the dog.

3.3 Implementation details

3.3.1 Neural network training. PyTorch was used for the neural network training implementations. The two neural networks were trained using the Adam optimizer [Kingma and Ba 2017], with a learning rate beginning at 0.0001 and linearly decreasing to 0 over the course of 400 epochs. A mini-batch size of 64 was used. Dropout with a retention rate of 0.7 was used in training both networks.
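A minimal PyTorch sketch of this training configuration follows; the dataset and loss wiring are placeholders, and a dropout retention rate of 0.7 corresponds to `nn.Dropout(p=0.3)` in PyTorch.

```python
import torch
from torch import nn, optim

def train(model: nn.Module, loader, epochs: int = 400, lr: float = 1e-4):
    """Adam optimizer, learning rate decayed linearly to 0 over `epochs`, MSE loss."""
    opt = optim.Adam(model.parameters(), lr=lr)
    sched = optim.lr_scheduler.LambdaLR(opt, lambda e: 1.0 - e / epochs)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        for x, y in loader:              # mini-batches of size 64
            opt.zero_grad()
            loss = loss_fn(model(x), y)  # supervised regression to ground-truth targets
            loss.backward()
            opt.step()
        sched.step()                     # step the linear decay once per epoch
```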

Table 1. Breakdown by motion type of the quadruped motion capture dataset.

Table 2. Questionnaire used for the experiment.

ID   Question
OW1  It felt like the virtual body was my body.
OW2  It felt like the virtual body parts were my body parts.
OW3  The virtual body felt like an animal body.
OW4  It felt like the virtual body belonged to me.
AG1  The movements of the virtual body felt like they were my movements.
AG2  I felt like I was controlling the movements of the virtual body.
AG3  I felt like I was causing the movements of the virtual body.
AG4  The movements of the virtual body were in sync with my own movements.
CH1  I felt like the form or appearance of my own body had changed.
CH2  I felt like the weight of my own body had changed.
CH3  I felt like the size of my own body had changed.
CH4  I felt like the width of my own body had changed.
E1   The overall experience was fascinating.
E2   I coped with the control of the avatar.
E3   Controlling the avatar was exhausting.
M1   The motion of the virtual dog appeared natural.
M2   Please rate your impression of the virtual dog's movements. Scale: very machine-like (1) to very dog-like (7)
M3   Please rate your impression of the virtual dog's movements. Scale: very unpleasant (1) to very pleasant (7)

Note: Participants were asked to answer each question on a scale of 1 to 7. Except for the last two questions, the scale went from strongly disagree (1) to strongly agree (7). The scale endpoints for the final two questions are indicated in the table.

Table 3. Participant experience.