UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons

Automatic co-speech gesture generation draws much attention in computer animation. Previous works designed network structures on individual datasets, which resulted in a lack of data volume and of generalizability across different motion capture standards. In addition, the task is challenging because of the weak correlation between speech and gestures. To address these problems, we present UnifiedGesture, a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons. Specifically, we first present a retargeting network to learn latent homeomorphic graphs for different motion capture standards, unifying the representations of various gestures while extending the dataset. We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention to generate better speech-matched and realistic gestures. To further align speech and gesture and increase diversity, we incorporate reinforcement learning on the discrete gesture units with a learned reward function. Extensive experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness. All code, pre-trained models, databases, and demos are available to the public at https://github.com/YoungSeng/UnifiedGesture.


INTRODUCTION
Nonverbal behaviors, including gestures, play key roles in conveying messages in human communication [37]. Automatic co-speech gesture generation is considered an enabling technology for creating realistic 3D avatars in films, games, and virtual social spaces, and for interaction with social robots [56]. In the era of deep learning, existing data-driven gesture generation methods usually rely on a large dataset. Studies have shown that a larger amount of data can improve the generalization of a model and enhance its performance [10,57].
Thanks to the development of human pose estimation [46], it is easy to extract 3D human poses from the vast amount of 2D gesture data on the web, e.g., TED [89] and PATS [2], and some works [47,88,89] are based on such 2D gesture datasets. However, while large in quantity, 3D poses extracted from 2D datasets are poor in quality and difficult to use, so most works [4,17,36,43,49] opt for high-quality 3D mocap datasets.
There are two main challenges when utilizing 3D datasets. First, due to the high cost of motion capture, the typical 3D gesture datasets [16,19,38,49] are relatively small. The generalization of models trained on an individual dataset is therefore limited, and the ability of the trained algorithms is confined to the content of that dataset. For example, some datasets contain style information [19,49], while others do not [16,38]. Second, it is not straightforward to train algorithms directly on a mixture of datasets, since different datasets usually have different skeletons because they are captured with different mocap systems. Most current solutions use software such as Blender [18] or Maya [8] for automatic retargeting to a unified skeleton, which requires manual specification of the bone mapping and leads to unavoidable errors [1]. The irregular connectivity and hierarchical structure of the skeleton joint motion make the large-scale use of multiple skeletons difficult.
To tackle these challenges, we propose UnifiedGesture, a novel unified co-speech gesture synthesis model for multiple skeletons. The overview of our method is shown in Figure 2. Although the number and position of the joints differ between skeletons, they all correspond to homeomorphic (topologically equivalent) graphs [81]. Unlike sign language or hand gestures, there is only a weak correlation between speech and body gestures at a coarse-grained level [61,84]. Specifically, we assume that the gesture details associated with speech are contained in the primal skeleton gesture. According to this assumption, we first use a data-driven deep skeleton-aware [1] framework to learn latent homeomorphic graphs for different skeletons. The different skeletons are unified and retargeted to the primal skeleton while extending the dataset. Then we introduce a denoising-diffusion-based speech-driven co-speech gesture generation model, using WavLM features [12], based on a cross-local attention [64] and self-attention [75] architecture to better capture the temporal information between audio and gestures. Third, unlike speaking with the face or lips, the weak correlation between speech and gesture means there is no suitable criterion for learning the model. To refine the gesture generation model, we therefore employ inverse reinforcement learning (IRL) on discrete gesture units to train a reward model that evaluates the generated gestures and guides the diffusion model to generate high-quality and diverse gestures aligned with speech during the reinforcement learning (RL) process.

Figure 2: Overview of our method. We retarget the different skeletons to the primal skeleton. Given a speech segment (optional style, seed gesture), the output is the primal gesture after VQVAE encoding of the output of the diffusion model. We introduce reinforcement learning to refine the gesture generation network. Finally, a gesture of the skeleton is specified and generated, with physics guidance.
Our code, pre-trained models, and demos will be publicly available soon. The main contributions of our work are:
• We employ a skeleton-aware retargeting network to unify the different skeletons to a common primal skeleton while extending the dataset.

RELATED WORK
2.1 Motion Retargeting
Our task is to take advantage of multiple gesture datasets. There are two main cases. In the first, different datasets share the same motion capture standard (e.g., Trinity [16] and BEAT [49] both use Vicon suits); in the second, different datasets have different motion capture standards (e.g., ZEGGS [19] and Talking With Hands [41]). For the first case, we can select the body joints common to both datasets and unify the different skeletons by normalizing [88] by height, arm span, etc. The latter case is more challenging. Ma et al. [54] try to map multiple datasets to a defined skeleton, but the approach still partially relies on handcrafting and the results are limited to a specific motion skeleton. Some works [31,35] try to retarget the motion of different skeletons with a VAE, using standard convolution and pooling. However, unlike images or videos, different skeletons exhibit irregular connectivity. Villegas et al. [77] propose a neural network for motion retargeting that adapts input motion to target characters, achieving state-of-the-art results and using cycle consistency for unsupervised learning. Lim et al. [48] propose a pose-movement network for motion retargeting using a normalizing process and a novel loss function. Kim et al. [30] present an unsupervised motion retargeting model using temporal dilated convolutions that generates realistic and stable trajectories for humanoid characters. Villegas et al. [76] propose a motion retargeting method that preserves self-contacts and prevents interpenetration, using a recurrent network. Li et al. [44] propose an iterative network for unsupervised motion retargeting. Inspired by [1], we use a deep skeleton-aware framework for data-driven motion retargeting between skeletons.

Gesture Generation
2.2.1 End-to-end Co-speech Gesture Generation. Gesture generation is a complex task that requires understanding speech, gestures, and their relationships. Present data-driven studies mainly consider four modalities: text [6,80,89], audio [20,24,62], gesture motion [52,85,88], and speaker identity [3,4,50]. Some works extend the scale of the datasets. Liu et al. [49] present a large motion capture dataset for studying the correlation of conversational gestures with facial expressions, emotions, and semantics. Kucherenko et al. [38] provide further annotation of the specific properties and details of the gestures in the dataset. Ghorbani et al. [19] propose a dataset containing motion styles, in contrast to previous datasets containing speech styles. However, these methods have so far only been tried on a single dataset, which results in a lack of data volume and of generalizability across different motion capture standards. Some works [23,92] use motion-matching methods to generate co-speech gestures. Yang et al. [86] try to transform the original gesture motion space to the deep latent phase space [66], but the approach is still based on a traditional convolutional network, ignoring the hierarchy and connectivity between the skeleton joints. Besides, this approach requires careful design of the database, which directly affects the quality of the generated gestures. The length of matching needs to be balanced between quality and diversity. Furthermore, the approach also requires complex and time-consuming manual design of the matching rules.

Diffusion models [25] excel at modeling complicated data distributions and generating vivid motion sequences. Many works [29,63,72] integrate diffusion-based generative models into the motion domain. Some works [5,91,94] introduce diffusion models in gesture generation and demonstrate their potential for solving cross-modal, time-series problems. In our work, we use a well-designed attention architecture in the diffusion model to make the generated gestures match the speech better.

2.2.3 Quantization-based Pose Representation. Kipp [33] represents gestures as predefined unit gestures. Lucas et al. [53] propose to train a GPT-like model for next-index prediction. Li et al. [45] propose a pose VQ-VAE [74] to encode and summarize dancing units. In terms of gesture generation, several works [7,84,93] apply VQVAE to encode meaningful gesture units. Existing studies [21,27,53] have shown that quantization helps to reduce motion freezing during motion generation and retains the details of motion well. Unlike them, we encode the gesture units in the deep primal motion latent space.

Reinforcement Learning
The goal of reinforcement learning (RL) is to learn a policy that maximizes rewards through iterative interactions with the environment [69]. The agent takes actions based on the current state, receives rewards, and updates its policy. The trial-and-error nature of RL makes it a versatile method for making decisions in complex and dynamic environments [15,42,78]. RL algorithms can be broadly categorized into two types: value-based [13,68,82] and policy-based [9,22,83]. Value-based methods estimate expected rewards for actions in a given state. Policy-based algorithms directly learn a policy model that maps states to actions and update it using Policy Gradient [70]. Policy-based methods are popular for handling high-dimensional state and action spaces and non-differentiable reward functions. Since the rewards in RL do not need to be differentiable with respect to the model parameters, RL algorithms can be applied to a wide range of reward maximization problems [45,58,60]. Related to our work, Sun et al. [67] propose a contrastive pre-trained reward to evaluate the correspondence between gesture and speech sequences and employ conservative Q-Learning (CQL) [40] for model optimization. Although state transitions and reward functions are obtainable, the method relies on offline RL, which limits the capability of the model. Bailando [45] uses a hand-designed reward function to fine-tune the dance generation model. However, hand-designed reward functions require significant expert knowledge and make it difficult to evaluate actions comprehensively. In our work, we fine-tune our diffusion model with online RL on the training set, using the learned reward model to refine the gesture generation model.

OUR APPROACH
3.1 Multiple Skeletons Retargeting Network
The structure of a skeleton is typically hierarchical [1], so we use graphs [79] to represent motions. Connectivity is determined by the kinematic chains (the paths from the root joint to the end-effectors). Nodes represent the corresponding joints and, in particular, leaf nodes represent end-effectors. The adjacency lists are expressed as $\mathcal{N}^d = \{\mathcal{N}^d_1, \mathcal{N}^d_2, \ldots, \mathcal{N}^d_J\}$, where $J$ is the number of joints and $\mathcal{N}^d_i$ denotes the edges whose distance in the tree is equal to or less than $d$ from the $i$-th edge.
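As an illustration of these adjacency lists, the following minimal Python sketch computes, for every edge of a toy skeleton tree, the edges within distance $d$. The parent-array encoding, the function name `edge_neighborhoods`, and the convention that two edges are at distance 1 when they share a joint are assumptions made for this sketch, not details taken from the retargeting network.

```python
from collections import deque


def edge_neighborhoods(parents, d):
    """Return (edges, neighborhoods) for a skeleton tree.

    parents[j] is the parent joint of joint j, or -1 for the root.
    Edge e connects a joint to its parent. neighborhoods[e] lists the
    edges whose distance to e is at most d, where two edges are at
    distance 1 if they share a joint (an assumption of this sketch).
    """
    edges = [(p, j) for j, p in enumerate(parents) if p >= 0]

    # Map every joint to the edges that touch it.
    touching = {}
    for e, (p, j) in enumerate(edges):
        touching.setdefault(p, []).append(e)
        touching.setdefault(j, []).append(e)

    neighborhoods = []
    for start in range(len(edges)):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            e = queue.popleft()
            if dist[e] == d:  # do not expand beyond distance d
                continue
            for joint in edges[e]:
                for nxt in touching[joint]:
                    if nxt not in dist:
                        dist[nxt] = dist[e] + 1
                        queue.append(nxt)
        neighborhoods.append(sorted(dist))
    return edges, neighborhoods


# Toy skeleton: root(0) -> spine(1) -> {left hand(2), right hand(3)}, root(0) -> foot(4)
toy_parents = [-1, 0, 1, 1, 0]
toy_edges, toy_neighbors = edge_neighborhoods(toy_parents, d=1)
```

With `d=0` every edge only sees itself, and larger `d` widens the support of each skeletal convolution kernel.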
3.1.1 Reference Pose Unification. The current full-body motion capture datasets for speech-driven gestures mainly include Trinity [16] (244 min of audio, a male actor), ZEGGS [19] (135 min of audio, a female actor, 19 different motion styles), BEAT [49] (76 hours, 30 speakers, 8 different emotions), and Talking with Hands [41] (50 hours, two-person face-to-face conversations). More details on the different skeletons can be found in the supplementary material. Different gesture datasets have different reference poses and motion representations (number and position of joints). We take two datasets A and B as examples. We first need to set the position and rotation of the unified reference pose, such as a T-pose or A-pose. We then centralize the reference representation to the root joints (e.g., the Talking with Hands dataset uses the 'world' joint to maintain height). Two reference poses $\mathbf{P}_A$ and $\mathbf{P}_B$ can be aligned through global and local translation and rotation, described by a reference pose transfer matrix $\mathbf{Q}$.
The reference representation of a motion sequence of length $T$ based on reference pose $\mathbf{P}$ can be represented by the 3D position and 4D rotation of the root joint as $\mathbf{R} \in \mathbb{R}^{T \times (3+4)}$. The reference representations $\mathbf{R}$ of different skeletons can be retargeted after normalization according to the height of the reference pose.

Motion Unification.
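The motion of different skeletons consists of a static component $\mathbf{S} \in \mathbb{R}^{J \times 3}$ (joint offsets) and a dynamic one $\mathbf{D} \in \mathbb{R}^{T \times (J \times 4)}$ (joint rotations). To unify the motion of the different skeletons, we utilize a retargeting network architecture similar to [95]. The architecture is shown in Figure 3. Here we take the static component $\mathbf{S}_A$ and dynamic component $\mathbf{D}_A$ of skeleton A with $J_A$ joints as an example. First, we adopt skeletal convolution and pooling layers [1] in the encoders to extract a deep latent representation $\mathbf{L}_A$ of the motion $\mathbf{S}_A$ and $\mathbf{D}_A$, which can be formulated as

$$\vec{\mathbf{S}}_A = \mathrm{repeat}\big(E^{S}_A(\mathbf{S}_A)\big), \qquad \mathbf{L}_A = E^{D}_A\big(\mathbf{D}_A, \vec{\mathbf{S}}_A\big),$$

where $E^{S}_A$ and $E^{D}_A$ denote the static and dynamic encoders of skeleton A, the operator 'repeat' denotes tiling and concatenation along the time dimension, $\vec{\mathbf{S}}_A \in \mathbb{R}^{T' \times (J_A \times C')}$ and $\mathbf{L}_A \in \mathbb{R}^{T' \times (7 \times C)}$, $T' = T / r$, and $r$ is the temporal down-sampling rate. Assuming that all skeletons contain 5 end-effectors (2 hands, 2 feet, and head) and 2 mid-nodes, we use the latent representation $\mathbf{L}_A$ of the primal skeleton, which contains only 7 nodes, to represent all the different skeletons. $C'$ and $C$ are the numbers of deep static and dynamic latent channels.
A following skeletal de-convolutional decoder $\mathcal{D}_A$ projects $\vec{\mathbf{S}}_A$ and $\mathbf{L}_A$ back to the motion space as $\hat{\mathbf{D}}_A$, which can be formulated as

$$\hat{\mathbf{D}}_{A \to A} = \mathcal{D}_A\big(\mathbf{L}_A, \vec{\mathbf{S}}_A\big), \tag{4}$$

where $\hat{\mathbf{D}}_{A \to A}$ indicates that $\mathbf{L}_A$ is fed into the decoder $\mathcal{D}_A$, simplified as $\hat{\mathbf{D}}_A$.
During training, $\mathcal{D}_A$ tries to reconstruct the input motion, so the decoders are trained by minimizing the reconstruction losses:

$$\mathcal{L}_{rec} = \big\|\hat{\mathbf{D}}_A - \mathbf{D}_A\big\|_2^2 + \big\|\mathrm{FK}(\hat{\mathbf{D}}_A, \mathbf{S}_A) - \mathrm{FK}(\mathbf{D}_A, \mathbf{S}_A)\big\|_2^2, \tag{5}$$

where the operator 'FK' is the forward kinematics that yields the joint positions, which prevents the accumulation of error along the kinematic chain [59].
The skeleton-aware encoders enable retargeting motions of different skeletons into a common deep primal skeleton latent space. A latent consistency loss is applied to this shared representation to ensure that the retargeted motion retains the same dynamic features as the original clip:

$$\mathcal{L}_{lc} = \big\|E^{D}_B\big(\hat{\mathbf{D}}_{A \to B}, \vec{\mathbf{S}}_B\big) - \mathbf{L}_A\big\|_1, \tag{6}$$

where $\hat{\mathbf{D}}_{A \to B}$ indicates that $\mathbf{L}_A$ is fed into the decoder $\mathcal{D}_B$. Since different skeletons can share the same set of end-effectors (typically head, left hand, right hand, left foot, and right foot), the end-effectors of the original skeleton and the retargeted skeleton should have the same normalized velocity to avoid retargeting artifacts such as foot sliding. This can be formulated as

$$\mathcal{L}_{ee} = \sum_{e \in \mathcal{E}} \left\| \frac{\mathbf{V}_A^{e}}{h_A} - \frac{\mathbf{V}_B^{e}}{h_B} \right\|_2^2, \tag{7}$$

where $\mathbf{V}_A^{e}$ and $\mathbf{V}_B^{e}$ are the velocities of the $e$-th end-effector of skeletons A and B, respectively, $\mathcal{E}$ is the set of end-effectors, and $h_A$ and $h_B$ are the heights of skeletons A and B, respectively. We also use a discriminator for skeleton B to evaluate whether the retargeted motion is plausible, which defines an adversarial loss $\mathcal{L}_{adv}$. The loss of the retargeting network can be computed as:

$$\mathcal{L}_{retarget} = \mathcal{L}_{rec} + \lambda_{lc}\,\mathcal{L}_{lc} + \lambda_{ee}\,\mathcal{L}_{ee} + \lambda_{adv}\,\mathcal{L}_{adv}. \tag{9}$$

For details on the network structure, please refer to our supplementary material.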
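To make the end-effector term and the weighted total loss concrete, here is a minimal plain-Python sketch. It assumes a squared L2 difference of height-normalized velocities summed over end-effectors, a unit weight on the reconstruction term, and the weights 1, 2, and 0.25 reported for the latent consistency, end-effector, and adversarial terms in the implementation details. The function names are hypothetical and the individual loss values are passed in as plain numbers.

```python
def end_effector_loss(vel_a, vel_b, height_a, height_b):
    """Sum over end-effectors of the squared difference between
    height-normalized velocities of skeletons A and B.

    vel_a, vel_b: lists of (vx, vy, vz) tuples, one per end-effector.
    """
    total = 0.0
    for va, vb in zip(vel_a, vel_b):
        total += sum((x / height_a - y / height_b) ** 2 for x, y in zip(va, vb))
    return total


def retargeting_loss(l_rec, l_lc, l_ee, l_adv,
                     lambda_lc=1.0, lambda_ee=2.0, lambda_adv=0.25):
    """Weighted sum of the four retargeting terms (unit weight on l_rec assumed)."""
    return l_rec + lambda_lc * l_lc + lambda_ee * l_ee + lambda_adv * l_adv


# A tall and a short skeleton moving "equally fast" relative to their height
# produce zero end-effector loss.
same_motion = end_effector_loss([(1.8, 0.0, 0.0)], [(1.6, 0.0, 0.0)], 1.8, 1.6)
```

Normalizing by height is what lets a short and a tall character share the same target: only velocity relative to body size is penalized.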

Diffusion Model for Speech-driven Gesture Generation
Diffusion models [25] have made great progress in motion generation [72] due to their ability to learn to gradually denoise starting from pure noise. Having unified the gestures by retargeting the skeletons of the different gesture datasets to a primal skeleton, we obtain a set of deep primal skeleton gestures $[\mathbf{L}_A, \mathbf{L}_B, \ldots]$ with the corresponding speech set $[\mathbf{A}_A, \mathbf{A}_B, \ldots]$. To generate co-speech gestures with a diffusion model, we use DiffuseStyleGesture [85], which has recently achieved strong results on a single dataset, as our backbone model. As shown in Figure 4, the diffusion model consists of two parts: the forward process (diffusion process) $q$ and the reverse process (denoising process) $p_\theta$. We denote the generated gesture as $\mathbf{L}$ in the diffusion process, which has the same dimension as an observation $\mathbf{L}_0 \sim q(\mathbf{L}_0)$. The forward process gradually adds Gaussian noise according to a variance schedule $\beta_t$:

$$q(\mathbf{L}_t \mid \mathbf{L}_{t-1}) = \mathcal{N}\big(\mathbf{L}_t; \sqrt{1-\beta_t}\,\mathbf{L}_{t-1}, \beta_t \mathbf{I}\big). \tag{10}$$

In the denoising process, $p_\theta$ learns the parameters $\theta$ via a neural network. The noisy gesture $\mathbf{L}_t$ at step $t$ is used to learn $\mu_\theta$ and $\Sigma_\theta$, then

$$p_\theta(\mathbf{L}_{t-1} \mid \mathbf{L}_t) = \mathcal{N}\big(\mathbf{L}_{t-1}; \mu_\theta(\mathbf{L}_t, t), \Sigma_\theta(\mathbf{L}_t, t)\big). \tag{11}$$

During training, the noising step $t$ is sampled from a uniform distribution over $\{1, 2, \ldots, T_d\}$, with the same position encoding as [75]. The noisy gesture $\mathbf{L}_t$ has the same dimension as the real gesture $\mathbf{L}_0$ and is obtained by sampling from the standard normal distribution $\mathcal{N}(0, \mathbf{I})$. In the latent representation of the gesture, we also take the difference between two consecutive frames as the latent velocity, and the difference between two consecutive frames of latent velocity as the latent acceleration; therefore $\mathbf{L}_0 \in \mathbb{R}^{N \times (7 \times C \times 3)}$. Audio features are extracted with the pre-trained WavLM Large model [12]. Then we use linear interpolation to align the WavLM features and the gesture $\mathbf{L}_0$ in the time dimension. The styles of gestures are represented as one-hot vectors in which only the element of the selected style is nonzero. A seed gesture helps to make smooth transitions between consecutive syntheses [88]. The first $N_{seed}$ frames of the gesture clip are used as the seed gesture $\mathbf{d}$ and the remaining $N$ frames are used as the real gesture $\mathbf{L}_0$ to calculate the loss. Self-attention [75] and cross-local attention [64] based on relative position encoding (RPE) [34] are used to generate better speech-matched and realistic gestures. Random masks (RM) are added to the pipeline of seed gesture $\mathbf{d}$ and style $s$ feature processing for classifier-free learning [26]. During the training process, we combine the predictions of the conditional model $\mathrm{Denoise}(\mathbf{L}_t, t, c_1)$, $c_1 = [s, \mathbf{d}, \mathbf{A}]$, and the unconditional model $\mathrm{Denoise}(\mathbf{L}_t, t, c_2)$, $c_2 = [\varnothing, \varnothing, \mathbf{A}]$. Then, as for the style $s$ in the condition, we can generate style-controlled gestures when sampling by interpolating or even extrapolating the two variants using $\gamma$, as

$$\hat{\mathbf{L}}_0 = \gamma\,\mathrm{Denoise}(\mathbf{L}_t, t, c_1) + (1-\gamma)\,\mathrm{Denoise}(\mathbf{L}_t, t, c_2). \tag{13}$$

The Denoising module can be trained by optimizing the Huber loss [28] between the generated poses $\hat{\mathbf{L}}_0$ and the ground truth human gestures $\mathbf{L}_0$ on the training examples:

$$\mathcal{L} = \mathbb{E}_{\mathbf{L}_0,\, t}\big[\mathrm{HuberLoss}(\mathbf{L}_0 - \hat{\mathbf{L}}_0)\big]. \tag{14}$$

3.2.2 Sample Module. The final co-speech gesture is given by splicing a number of clips, each of a fixed duration with frame length $N$. The initial noisy gesture $\mathbf{L}_{T_d}$ is sampled from the standard normal distribution and every other $\mathbf{L}_t$ ($t < T_d$) is the result of the previous noising step. The seed gesture for the first
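clip can be generated by randomly sampling a gesture from the dataset or by setting it to the average gesture. Then the seed gesture for the other clips is the last $N_{seed}$ frames of the gesture generated in the previous clip. For every clip, in every noising step $t$, we predict the clean gesture $\hat{\mathbf{L}}_0 = \mathrm{Denoise}(\mathbf{L}_t, t, c)$ and add noise back to obtain $\mathbf{L}_{t-1}$ using Equation (10) with the diffusion process. This process is repeated from $t = T_d$ until $\mathbf{L}_0$ is reached (Figure 4, bottom). Please refer to our supplementary material for training details such as network structure and implementation details.

Gesture Generation Refinement
Here we train a VQVAE to summarize meaningful gesture units in order to reduce the exploration space for the following reinforcement learning. Each code represents a unique gesture. Besides, discrete spaces are more conducive to exploration in reinforcement learning [14,71]. The architecture of the primal gesture VQVAE is shown in Figure 5. Given the primal gesture $\mathbf{L}_0^{upper}$ of the upper body, an encoder $E_{vq}$ encodes it into features

$$\mathbf{u} = E_{vq}\big(\mathbf{L}_0^{upper}\big), \tag{15}$$

where $\mathbf{u} \in \mathbb{R}^{N'' \times C''}$ and $N'' = N / r_{vq}$, $r_{vq}$ is the temporal down-sampling rate in the VQVAE, and $C''$ is the channel dimension of the features. Then we quantize $\mathbf{u}$ by mapping each temporal feature $\mathbf{u}_i$ to its closest codebook [74] element $\mathbf{z}_j$ as $q(\cdot)$:

$$\mathbf{u}_{q,i} = q(\mathbf{u}_i) = \arg\min_{\mathbf{z}_j \in \mathcal{Z}} \big\|\mathbf{u}_i - \mathbf{z}_j\big\|, \tag{16}$$

where $\mathcal{Z}$ is a set of $K$ codes of dimension $d_z$, and $\mathbf{u}_q$ consists of elements of the codebook $\mathcal{Z}$, i.e., $\mathbf{u}_{q,i} \in \mathcal{Z}$. A following de-convolutional decoder $D_{vq}$ projects $\mathbf{u}_q$ back to the deep latent space as a primal gesture sequence $\hat{\mathbf{L}}_0^{upper}$ for the upper body, which can be formulated as

$$\hat{\mathbf{L}}_0^{upper} = D_{vq}(\mathbf{u}_q) = D_{vq}\big(q(E_{vq}(\mathbf{L}_0^{upper}))\big). \tag{17}$$

The VQVAE can be trained by optimizing $\mathcal{L}_{vq}$:

$$\mathcal{L}_{vq} = \mathcal{L}_{rec}\big(\hat{\mathbf{L}}, \mathbf{L}\big) + \alpha_1\,\mathcal{L}_{rec}\big(\hat{\mathbf{L}}', \mathbf{L}'\big) + \alpha_2\,\mathcal{L}_{rec}\big(\hat{\mathbf{L}}'', \mathbf{L}''\big) + \big\|\mathrm{sg}[\mathbf{u}] - \mathbf{u}_q\big\| + \beta\,\big\|\mathbf{u} - \mathrm{sg}[\mathbf{u}_q]\big\|, \tag{18}$$

where $\mathbf{L}$ abbreviates $\mathbf{L}_0^{upper}$ and primes denote first and second temporal differences. The first item is the reconstruction loss. The next two items are the velocity loss and the acceleration loss [45,86]. $\mathrm{sg}[\cdot]$ denotes the stop-gradient operation, and the term $\|\mathbf{u} - \mathrm{sg}[\mathbf{u}_q]\|$ is the "commitment loss" [74] with weighting factor $\beta$.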
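The quantization step, mapping each temporal feature to its closest codebook element, can be sketched in a few lines of plain Python. This is only an illustration: the tiny 2-D codebook and the function name `quantize` are made up for the example, squared Euclidean distance is assumed as the closeness measure, and a real implementation would run on batched tensors with a straight-through gradient.

```python
def quantize(features, codebook):
    """Map every feature vector to the index and value of its nearest code.

    features: list of feature vectors (one per down-sampled time step).
    codebook: list of code vectors of the same dimension.
    Returns (indices, quantized_vectors).
    """
    indices = []
    for u in features:
        # Squared Euclidean distance from u to every code.
        dists = [sum((a - b) ** 2 for a, b in zip(u, z)) for z in codebook]
        indices.append(min(range(len(codebook)), key=dists.__getitem__))
    return indices, [codebook[k] for k in indices]


toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
toy_features = [[0.1, -0.2], [0.9, 1.2], [3.0, 0.5]]
toy_indices, toy_quantized = quantize(toy_features, toy_codebook)
```

The resulting index sequence is the discrete "gesture unit" trajectory that the later reinforcement learning stage operates on.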

Reinforcement Learning Finetuning. To further enhance the alignment between speech and gesture and increase the diversity of the generated gestures, we employ reinforcement learning to fine-tune the gesture generation model. The reward signal is pivotal in balancing exploration and exploitation in reinforcement learning. Previous work [45] attempts to optimize partial performance metrics of the model through hand-designed reward functions. However, in our experience, designing heuristic reward functions that comprehensively evaluate the model's performance is challenging. Reinforcement learning training is less stable than supervised learning, and if the reward function only considers specific metrics while neglecting others, the model's overall performance may deteriorate.
In this paper, we adopt Inverse Reinforcement Learning (IRL) [55] to learn a neural network model from human demonstrations that fits the true reward function and explains human behavior. Specifically, our reward model training is shown in Figure 6, similar to [11]. First, we sample a speech-gesture pair from the VQVAE-encoded dataset $\mathcal{D}$, denoted as trajectory $\tau_0$. Then we randomly replace different numbers of codes in $\tau_0$ to obtain a tuple of $N$ trajectories with progressively more replaced codes. We sample $M$ tuples and thus get $M \times N$ trajectories to form the dataset $\mathcal{D}_\tau$ used to train the reward model. We make the weak assumption that the more codes are replaced with random codes, the worse the quality of the trajectory, including its alignment with speech and its diversity. Then, we let the reward model $R_\phi$ score these trajectories of different qualities (which may come from different human demonstrations with different speech) as $r = R_\phi(\tau)$, and train it to determine which trajectory of a pair $\tau_i, \tau_j$ with $i, j \in [1, \ldots, N]$, $i \neq j$, is better. The training loss is built from the sigmoid function $\sigma$ and the signum function $\mathrm{sgn}$, where $\mathrm{sgn}(x)$ is $1$ for $x > 0$, $0$ for $x = 0$, and $-1$ for $x < 0$. By learning this classification task, the reward model learns to output a scalar reward signal $r(\tau) = R_\phi(\tau)$ that makes reasonable evaluations of the quality of the trajectory $\tau$.
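The data construction and the pairwise objective can be sketched as follows. This is one plausible instantiation, not the exact loss used here: it assumes that a trajectory is a list of integer code indices, that "better" means "fewer replaced codes", and that each ordered pair contributes a logistic term $-\log\sigma(\mathrm{sgn}(k_j - k_i)(r_i - r_j))$, where $k_i$ is the number of replaced codes. The helper names `corrupt` and `pairwise_ranking_loss` are hypothetical.

```python
import math
import random


def corrupt(codes, k, codebook_size, rng):
    """Copy `codes` and overwrite k randomly chosen positions with random codes."""
    out = list(codes)
    for pos in rng.sample(range(len(out)), k):
        out[pos] = rng.randrange(codebook_size)
    return out


def sgn(x):
    """Signum: 1 for x > 0, 0 for x == 0, -1 for x < 0."""
    return (x > 0) - (x < 0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_ranking_loss(rewards, num_replaced):
    """Mean logistic loss over ordered pairs of trajectories.

    rewards[i]: scalar score the reward model gives trajectory i.
    num_replaced[i]: how many codes were replaced in trajectory i.
    The less corrupted trajectory of each pair should get the higher reward.
    """
    loss, pairs = 0.0, 0
    for i in range(len(rewards)):
        for j in range(len(rewards)):
            label = sgn(num_replaced[j] - num_replaced[i])
            if i == j or label == 0:
                continue
            loss += -math.log(sigmoid(label * (rewards[i] - rewards[j])))
            pairs += 1
    return loss / pairs if pairs else 0.0


rng = random.Random(0)
clean = [5] * 8                                   # toy code trajectory
tuple_of_trajectories = [corrupt(clean, k, 512, rng) for k in (0, 2, 4)]
```

A reward model that ranks the three toy trajectories correctly gets a lower loss than one that ranks them in reverse, which is the signal that drives training.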
Given the reward model, we use the REINFORCE algorithm [68] to improve the model:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[r(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\big],$$

where $\pi_\theta$ means the current policy, i.e., the gesture model, and $\pi_\theta(\tau)$ means the probability of $\tau$ given policy $\pi_\theta$. During the fine-tuning process, the reward model scores the gestures generated for the given speech, which improves the alignment between speech and gesture and increases gesture diversity.
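As a toy illustration of the REINFORCE update, the sketch below treats the policy as a single categorical distribution over gesture codes parameterized by logits. For a softmax policy, the gradient of $\log \pi(a)$ with respect to the logits is the one-hot vector of the chosen action minus the probabilities. The real gesture model is a diffusion network, so the function names, the learning rate, and the single-step setting are simplifying assumptions.

```python
import math


def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One gradient-ascent step on reward * log pi(action).

    d log pi(action) / d logit_k = 1[k == action] - pi_k for a softmax policy.
    """
    probs = softmax(logits)
    return [
        x + lr * reward * ((1.0 if k == action else 0.0) - p)
        for k, (x, p) in enumerate(zip(logits, probs))
    ]


start = [0.0, 0.0, 0.0]
rewarded = reinforce_step(start, action=1, reward=1.0)
penalized = reinforce_step(start, action=1, reward=-1.0)
```

A positive reward raises the probability of the sampled code and a negative reward lowers it, which is how the learned reward steers the generator without needing to be differentiable.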

Physics Guidance.
Inspired by [73], we consider that the foot should have contact with the ground when there is a left-right acceleration or an upward acceleration of the root.Then we use standard Inverse Kinematics (IK) optimization for physics guidance.
For more details please refer to the supplementary material.

EXPERIMENTS
4.1 Experiment Preparation
4.1.1 Implementation Details. We perform the training and evaluation on the Trinity [16] and ZEGGS [19] datasets. Even with motion capture, the quality of hand motion is still low [5,56,90], so we currently ignore hand motion. The number of joints for the two datasets is then $J_A = 26$ and $J_B = 27$, respectively. We choose the seven most typical and longest-duration styles (happy, sad, neutral, old, relaxed, angry, still) for training and validation. The Trinity dataset has no style labels, so we consider all of its styles to be 'neutral'. We split the data 8:1:1 into training, validation, and test sets. We first resample the motion of both datasets to 30 fps. All audio recordings are downsampled to 16 kHz. For the retargeting network, we set the temporal down-sampling rate to 4, so the primal gesture runs at 7.5 fps. We set all reference poses $\mathbf{R}$ to the T-pose at the origin with the foot in the Z-plane. The dimension $C$ of each node of the primal gesture in latent space after convolution is 16. We set $\lambda_{lc} = 1$, $\lambda_{ee} = 2$, and $\lambda_{adv} = 0.25$ for Equation (9) and use the Adam [32] optimizer with a batch size of 256 for 16000 epochs. Training the retargeting network on an NVIDIA V100 GPU takes about 3 days. While training the diffusion model and VQVAE, gesture data are cropped to a length of $N = 30$ (4 seconds). For the diffusion model, the Denoising module learns both the conditioned and the unconditioned distributions by randomly masking 10% of the samples using Bernoulli masks. The cross-local attention networks use 8 heads, 32 attention channels, 256 channels, a window size of 6 in which each window looks at the one window before it, and a dropout of 0.1. The self-attention networks are composed of 8 layers, 8 heads, 32 attention channels, and 256 channels, with a dropout of 0.1. We use the AdamW [51] optimizer (learning rate $3 \times 10^{-5}$) with a batch size of 256 for 1000000 steps. Our models have been trained with $T_d = 1000$ noising steps and a cosine noise schedule. The diffusion model can be learned in about 3 days on one NVIDIA V100 GPU. As for the VQVAE, the size of the codebook is set to 512 and the dimension of each code is 512. We set the VQVAE temporal down-sampling rate to 2. We set the commitment weight $\beta = 0.1$ and the velocity and acceleration weights $\alpha_1 = 1$ and $\alpha_2 = 1$ for Equation (18). We use the Adam optimizer (learning rate e-4, $\beta_1 = 0.5$, $\beta_2 = 0.98$) with a batch size of 128 for 200 epochs. The VQVAE is learned on one NVIDIA A100 GPU in several hours. For more datasets and training details please refer to the supplementary material.
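The implementation details mention 1000 noising steps with a cosine noise schedule but do not say which variant is used. The sketch below assumes the widely used cosine schedule from Nichol and Dhariwal's improved DDPM (offset `s = 0.008`, betas clipped at 0.999) and shows the closed-form forward noising of a clean sample to step `t`. Function names are hypothetical and a gesture is flattened to a plain list of floats.

```python
import math
import random


def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps."""
    def f(t):
        return math.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(num_steps + 1)]


def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Per-step variances beta_t = 1 - alpha_bar[t] / alpha_bar[t-1], clipped."""
    return [
        min(1 - alpha_bar[t] / alpha_bar[t - 1], max_beta)
        for t in range(1, len(alpha_bar))
    ]


def noise_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar[t]) * x0, (1 - alpha_bar[t]) * I)."""
    a = alpha_bar[t]
    return [math.sqrt(a) * x + math.sqrt(1 - a) * rng.gauss(0, 1) for x in x0]


alpha_bar = cosine_alpha_bar(1000)
betas = betas_from_alpha_bar(alpha_bar)
noisy = noise_sample([0.5, -1.0], 500, alpha_bar, random.Random(0))
```

As a side check on the frame-rate arithmetic in the same paragraph: 30 fps divided by a down-sampling rate of 4 gives 7.5 fps, so 30 latent frames cover 30 / 7.5 = 4 seconds.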

Evaluation Metrics. Canonical correlation analysis (CCA) [65] projects two sets of vectors into a joint subspace and then finds a sequence of linear transformations of each set of variables that maximizes the relationship between the transformed variables. CCA values can be used to measure the similarity between the generated gestures and the real ones. The closer the CCA is to 1, the better. The Fréchet gesture distance (FGD) [88] on feature space is used as a metric to quantify the quality of the generated gestures. To compute the FGD, we trained an autoencoder to extract the features. Lower FGD is better. Diversity [45] in feature space is used to evaluate the diversity of the gestures. We also report average jerk, average acceleration [35], Hellinger distance [36], and Beat Align Score [45,50] in the supplementary material.
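For intuition about what a Fréchet distance on feature space measures, the sketch below computes the Fréchet distance between two Gaussians fitted to feature sets under a diagonal-covariance simplification, where it reduces to the squared difference of means plus the squared difference of per-dimension standard deviations. The full FGD uses full covariance matrices and features from a trained autoencoder, so this simplified version and its function names are illustrative assumptions only.

```python
import math


def mean_and_std(features):
    """Per-dimension mean and population standard deviation."""
    n, dims = len(features), len(features[0])
    means = [sum(f[k] for f in features) / n for k in range(dims)]
    stds = [
        math.sqrt(sum((f[k] - means[k]) ** 2 for f in features) / n)
        for k in range(dims)
    ]
    return means, stds


def frechet_distance_diag(feats_a, feats_b):
    """Frechet distance between diagonal Gaussians fitted to two feature sets."""
    mu_a, sd_a = mean_and_std(feats_a)
    mu_b, sd_b = mean_and_std(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    spread_term = sum((a - b) ** 2 for a, b in zip(sd_a, sd_b))
    return mean_term + spread_term


real_feats = [[0.0, 0.0], [2.0, 0.0], [4.0, 2.0]]
shifted_feats = [[3.0, 0.0], [5.0, 0.0], [7.0, 2.0]]   # same spread, mean shifted by 3
distance = frechet_distance_diag(real_feats, shifted_feats)
```

The distance is zero only when both the center and the spread of the generated features match the real ones, which is why a lower value indicates generated gestures closer to the real distribution.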

Objective Evaluation.
We compare our proposed model with StyleGestures [4], Audio2Gestures [43], ExampleGestures [19], and DiffuseStyleGesture [85]. The quantitative results are shown in Table 1. On the global CCA, our proposed model outperforms all other existing methods. The highest global CCA shows a strong coupling between the generated gestures and the ground truth gestures. The CCA for each sequence is not as good as that of the other methods, and we suggest that this is because, for each speech, the model learns the gestures across the skeletons. Our method significantly surpasses the compared state-of-the-art methods on FGD, improving by 6.64 (63%) over the best compared baseline model, ExampleGestures. This shows the high quality of the generated gestures. Our model is not as good as StyleGestures in terms of Diversity. The video results show that StyleGestures produces a lot of cluttered movements, which increase diversity while decreasing human-likeness and appropriateness. However, we would like to emphasize that objective evaluation is currently not particularly relevant for assessing gesture generation [37]. Subjective evaluation remains the gold standard for comparing gesture generation models [37,39]. Current research on speech-driven gestures prefers to conduct only subjective evaluation [5,85]. Please refer to the supplementary video for more comparisons.

User Study.
To understand the real visual performance of our method, we conduct a user study on the gesture sequences generated by each compared method and the ground truth motion capture data. Following the evaluation in GENEA [52], we evaluate human-likeness and gesture-speech appropriateness. The length of the evaluated clips ranged from 22 to 50 seconds, with an average length of 35.4 seconds, as longer durations produce more pronounced and convincing appropriateness results [87]. For the human-likeness evaluation, each evaluation page asked participants "How human-like does the gesture motion appear?" For the appropriateness evaluation, each evaluation page asked participants "How appropriate are the gestures for the speech?" Participants rated at 1-point intervals from 5 to 1, with labels (from best to worst) of "excellent", "good", "fair", "poor", and "bad". More details about the user study are given in the supplementary material. The mean opinion scores (MOS) on human-likeness and appropriateness are reported in the last two columns of Table 1.
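A mean opinion score with a 95% confidence interval can be computed from the raw 1 to 5 ratings as sketched below. The sketch assumes a normal approximation (z = 1.96) with the sample standard deviation; how the intervals in Table 1 were actually computed is not stated, and the ratings in the example are invented.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of the confidence interval)."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


example_ratings = [5, 4, 4, 3, 4]          # made-up ratings for one system
mos, half = mos_with_ci(example_ratings)
```

Two systems are usually reported as significantly different when their intervals do not overlap, although a proper pairwise test is more reliable for small rating counts.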
In terms of human-likeness, our model significantly surpasses the compared state-of-the-art methods, except that it is not significantly different from ExampleGestures. This is because ExampleGestures uses a reference gesture as an 'example' during inference, which already provides a priori knowledge of the gesture, and samples from a gesture distribution to get the generated gesture, so its human-likeness is strong. For gesture and speech appropriateness, our model significantly outperforms StyleGestures, Audio2Gestures, and ExampleGestures, giving competitive results with DiffuseStyleGesture. One reason for the gap compared to DiffuseStyleGesture is that DiffuseStyleGesture uses kinematic parameters such as the position, rotation angle, velocity, and rotation angular velocity of the root, as well as the position, rotation angle, velocity, rotation angular velocity, and gaze direction of each joint of the original motion as features of the gesture, which has a much larger dimension than the feature dimension of the primal skeleton gesture and may contain fine-grained skeletal details related to speech. According to the feedback from the participants, our generated gestures are "more semantically relevant" and "more natural", while our method has "less power" compared to Ground Truth. We suggest that this observation is due to the downsampling in the retargeting network and the VQVAE network. Smaller downsampling coefficients may result in faster and more powerful movements.

Table 1: Quantitative results on test set. Bold indicates the best metric. Among compared methods, StyleGestures [4], Audio2Gestures [43], ExampleGestures [19], and DiffuseStyleGesture [85] are reproduced using officially released code with some optimized settings. Objective evaluation is recomputed using the officially updated evaluation code [37,45]. Human-likeness and appropriateness are the results of MOS with 95% confidence intervals.
Columns: Name | Objective evaluation: Global CCA, CCA for each sequence, FGD ↓, Diversity ↑ | Subjective evaluation: Human-likeness, Appropriateness
Ground Truth | Global CCA 1.000

The ablation results are reported in Table 2. The metrics on FGD indicate that after RL finetuning, the generated gestures have an increased distance from the distribution of human gestures in the dataset, indicating that the model has explored some gestures that do not belong to the existing distribution of gestures in the dataset but are considered reasonable by the reward model. From the CCA and diversity metrics, it can be seen that the reward model can indeed generalize to gestures outside the dataset, allowing the model to generate more diverse and high-quality gesture movements that are not limited to the dataset. When neither RL nor VQVAE is used, both FGD and diversity still decrease, which indicates the necessity of codebooks to generalize meaningful gestures. When we use only a single dataset, both FGD and diversity decrease a lot, which indicates the importance of learning on multiple datasets for gesture generation.

User Study. Similarly, we conduct a user study for the ablation studies. The MOS on human-likeness and appropriateness are shown in the last two columns of Table 2. In terms of human-likeness, the scale of the dataset has a significant effect on the results, which demonstrates the importance of unifying the gesture datasets. For speech and gesture appropriateness, the scale of the dataset also has the largest impact. In addition, appropriateness decreases without reinforcement learning, which shows the importance of data exploration.
Visual comparisons for this study can be found in the supplementary video.

Diverse, Controllable, and Stylized Gesture Generation
• Stylization. We can generate stylized gestures by setting the style-control parameters in Equation (13), which also determine the intensity of the stylization. As shown in Figure 7, for the same speech, different styles of gestures can be generated while preserving the match with the speech.
• Diversity. Due to the diffusion model architecture, different noisy gestures and different seed gestures can produce different gestures even for the same speech and style, as shown in Figure 7. This mirrors real human speech, where diverse co-speech gestures arise depending on the initial position.
• Controllability. Since we use VQVAE to generate gestures, it is easy to control the gesture or to take out the code for interpretation. We can exert a high level of control over speech-driven gestures at any time with a specified upper body code, as shown in the dashed boxes in Figure 7. For more details, please refer to the supplementary material.

DISCUSSION AND CONCLUSION
In this paper, we assume that the body gestures of the different skeletons are contained in the primal skeleton and present a unified gesture synthesis model for multiple skeletons. UnifiedGesture demonstrates three major strengths: 1) The skeleton-aware retargeting network unifies the different skeletons while extending the dataset, which gives the model stronger generalization. Ablation experiments on a single skeleton confirm that a larger amount of data improves the performance of the model. 2) Based on a diffusion model, probabilistic mapping enhances diversity while enabling the generation of high-quality, speech-matched, and style-controlled gestures. 3) VQVAE learns a codebook that summarizes meaningful gesture units, improving controllability and interpretability. Reinforcement learning with a learned reward function helps refine the gesture generation model, enabling it to explore the data and increase the diversity of the generated gestures. The physics-based kinematic constraints further improve gesture generation.

There is room for improvement in this research. Besides speech, more modalities (e.g., text, facial expressions) could be taken into consideration to generate more appropriate gestures. Removing the need to retrain the skeleton-aware encoder and decoder for each new skeleton is also a direction for our future research.
For the additional results in Table 4, Average Jerk, Average Acceleration, and Hellinger Distance are recomputed using [37]. As for Beat Align Score, we use the method in [43] to calculate the beats of audio and follow [45] to calculate the beats and diversity of gestures. For Average Jerk and Average Acceleration, the closer to Ground Truth, the better. For Hellinger Distance, the smaller the better. For Beat Align Score, the greater the better. The results show that the gestures generated by our model are closest to the real velocity and acceleration distributions. StyleGestures and DiffuseStyleGesture have motion velocity histogram distances that are more similar to the real gestures. This could be caused by the larger amount of hand movement in both of them; please refer to our supplementary video. DiffuseStyleGesture matches the beat of the speech better, which is consistent with the results of the human subjective evaluation. We also want to emphasize that there is currently a lack of valid objective metrics for gesture generation and that subjective evaluation remains the most effective [37,39]. Please refer to our video for further visualization and comparison.
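To make the kinematic metrics above more concrete, the following is a minimal sketch of how average acceleration, average jerk, and a Hellinger distance between speed histograms can be computed from joint positions. It is not the evaluation code of [37]: the function names, the finite-difference scheme, and the histogram binning are assumptions chosen for illustration, and each frame is assumed to be a flat list of coordinates.

```python
import math

def finite_difference(seq, dt):
    """First-order finite difference of a list of per-frame vectors."""
    return [[(b - a) / dt for a, b in zip(p, q)] for p, q in zip(seq, seq[1:])]

def norm(vec):
    """Euclidean norm of one frame vector."""
    return math.sqrt(sum(x * x for x in vec))

def kinematic_averages(positions, fps):
    """Return (average acceleration, average jerk) magnitudes over all frames."""
    dt = 1.0 / fps
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    avg_acc = sum(norm(v) for v in acceleration) / len(acceleration)
    avg_jerk = sum(norm(v) for v in jerk) / len(jerk)
    return avg_acc, avg_jerk

def speed_histogram(positions, fps, bin_width, num_bins):
    """Normalized histogram of per-frame speeds (binning is an assumption)."""
    speeds = [norm(v) for v in finite_difference(positions, 1.0 / fps)]
    counts = [0] * num_bins
    for s in speeds:
        # Speeds beyond the last bin edge are clamped into the last bin.
        counts[min(int(s / bin_width), num_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def hellinger(p, q):
    """Hellinger distance between two normalized histograms of equal length."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

# Hypothetical one-coordinate trajectory with constant acceleration.
toy_positions = [[0.0], [1.0], [4.0], [9.0], [16.0]]
toy_acc, toy_jerk = kinematic_averages(toy_positions, fps=1)
toy_hist = speed_histogram(toy_positions, fps=1, bin_width=2.0, num_bins=4)
```

In this sketch a system is judged by how close its average acceleration and jerk are to those of the ground-truth motion, and by how small the Hellinger distance between its speed histogram and the ground-truth histogram is.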

E.2 Ablation Studies
In Table 2, we observe that when the RL (Reinforcement Learning) component is removed from our model (Ours -RL), the FGD (Fréchet Gesture Distance) decreases from 3.850 to 3.132. This indicates that the gestures generated without RL are closer to the distribution of human gestures in the dataset. However, the slightly better FGD score does not necessarily represent better generalization. RL is essential for enabling the model to explore beyond the dataset and generate gestures that, though slightly further from the human distribution, are more diverse and considered reasonable by the reward model, as can be seen from the Global CCA and Diversity metrics. Therefore, while the ablated version without RL shows better FGD, the trade-off is in generalization and diversity.

When neither RL nor VQVAE (Vector Quantized Variational AutoEncoder) is used (Ours -RL -VQVAE), the FGD is higher than for Ours -RL, but still lower than for our full model. The absence of VQVAE causes a reduction in diversity, which suggests that the VQVAE module helps to generate meaningful gestures. In this ablation, without the RL module, the model is not encouraged to explore beyond the dataset, and without VQVAE, it struggles to generalize meaningful gestures. The combination of RL and VQVAE in the full model ensures that meaningful gestures are generated and that the model is encouraged to explore beyond the dataset, which enhances the diversity and quality of the generated gestures.

The ablation studies with -Skeleton A and -Skeleton B demonstrate the importance of having diverse training datasets. Removing either Skeleton A or Skeleton B increases the FGD dramatically, to 13.76 and 12.45 respectively. This indicates that the model does not generalize well without diverse training data. Similarly, Diversity is significantly reduced when training on a single dataset, highlighting the importance of training on multiple datasets for producing varied and high-quality gestures. Our full model, incorporating RL, VQVAE, and multiple datasets, achieves a balance across these aspects, as is evident in the objective and subjective evaluations.
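To illustrate why exploring beyond the dataset raises FGD, recall that FGD is a Fréchet distance between Gaussian fits to the feature distributions of real and generated gestures. The sketch below is a simplified illustration, not the evaluation code used for Table 2: it assumes diagonal covariances so that it needs only the standard library, and the feature vectors are hypothetical stand-ins for the learned gesture features.

```python
import math
import statistics

def gaussian_fit(features):
    """Per-dimension mean and variance of a list of feature vectors."""
    dims = list(zip(*features))
    means = [statistics.fmean(d) for d in dims]
    variances = [statistics.pvariance(d) for d in dims]
    return means, variances

def frechet_distance_diagonal(real, generated):
    """Fréchet distance between two Gaussians with diagonal covariances.

    With diagonal covariances, the trace term
    tr(S_r + S_g - 2 (S_r S_g)^(1/2)) reduces to sum((sigma_r - sigma_g)^2).
    """
    mu_r, var_r = gaussian_fit(real)
    mu_g, var_g = gaussian_fit(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_r, var_g))
    return mean_term + cov_term

# Hypothetical 2-D features: the generated set is the real set shifted by 3
# along the first dimension, so only the mean term contributes.
real_features = [[0.0, 0.0], [2.0, 2.0]]
shifted_features = [[3.0, 0.0], [5.0, 2.0]]
distance = frechet_distance_diagonal(real_features, shifted_features)
```

Under this simplified view, a model that produces plausible gestures outside the training distribution shifts the mean or spread of its features and therefore increases the distance, even if the gestures themselves are of good quality.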
Similarly, we computed additional objective measures for the ablation experiments. The results are shown in Table 5. All four metrics even improve when the model uses neither reinforcement learning nor VQVAE: in that setting, the average velocity, average acceleration, and velocity distribution histogram reach their best values. When the model was trained on a single skeleton dataset only, the average velocity and average acceleration decreased more significantly, but the other two metrics showed the opposite trend. These results contradict the subjective evaluation and the other objective evaluation metrics, so we question what these objective metrics show, although we still report them. We believe that subjective evaluation is still the most convincing method for now, and that objective metrics are all still inconsistent with subjective human perception. Please refer to the video for further comparison.

F USER STUDY
Human-likeness and Appropriateness are the two scoring dimensions that have been used in the gesture generation (GENEA) Challenge [37,39,90] and are currently the dominant metrics in gesture generation. Some user studies focus on other aspects, such as diversity [43], consistency [102], stylization [85], and appropriateness of interaction with the listener, which was added in this year's gesture generation challenge [100]. However, Human-likeness and Appropriateness are used in almost all user studies of gesture generation methods, so we followed these two metrics. To analyze the user studies statistically, we used MOS with 95% confidence intervals to represent the results of each metric. If the 95% confidence intervals of the ratings of two models do not overlap, the difference is considered statistically significant.
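The significance criterion above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis script used in the study: the interval is a normal-approximation interval (z = 1.96), the function names are hypothetical, and the ratings are made-up values, not data from the paper.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of an approximate 95% CI)."""
    mean = statistics.fmean(ratings)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem

def intervals_overlap(a, b):
    """a and b are (mean, half_width) pairs; True if the intervals overlap."""
    lo_a, hi_a = a[0] - a[1], a[0] + a[1]
    lo_b, hi_b = b[0] - b[1], b[0] + b[1]
    return lo_a <= hi_b and lo_b <= hi_a

# Hypothetical 1-5 ratings for two systems.
ratings_x = [4, 5, 4, 4, 5, 4, 5, 4]
ratings_y = [2, 3, 2, 3, 2, 2, 3, 3]

mos_x = mos_with_ci(ratings_x)
mos_y = mos_with_ci(ratings_y)
# Non-overlapping intervals are treated as a significant difference.
significant = not intervals_overlap(mos_x, mos_y)
```

Non-overlap of confidence intervals is a conservative criterion: two systems whose intervals overlap slightly may still differ under a direct pairwise test, which is worth keeping in mind when reading the tables.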
We passed all the generated primal skeleton gestures through the decoder of the ZEGGS dataset skeleton to generate the final gestures. The generated motion capture files (bvh) are rendered with Blender [18], and the camera stays still. Before starting the evaluation, we told participants that two dimensions would be evaluated for each video: (1) naturalness (human-likeness), i.e., the quality of the generated motion, without considering speech; and (2) suitability (appropriateness), i.e., the relationship between the generated gestures and the speech, e.g., whether the gestures match the audio rhythm or the text semantics. We asked participants to ignore the influence of the hands on the scoring and to focus only on the skeletal movements of the whole body. A total of 31 individuals took part in the subjective evaluation, with 2 subjects between the ages of 40 and 50 and the rest between the ages of 20 and 30. About 85% of the participants were male and 15% were female. They were all good English speakers. For both the comparison with the baseline models and the ablation experiments, we selected 10 segments of audio to be scored with their generated gestures. Five segments are male voices from the Trinity dataset, and the other five are female voices from the ZEGGS dataset. Ten models in total were scored. For each model, there were 10 segments (approximately 5 minutes in total) of audio with generated gestures. We paid each participant an hourly rate of approximately 10 USD, which is above the average salary level [39]. A screenshot of the subjective evaluation scoring screen is shown in Figure 10.

G GESTURE GENERATION FOR MULTIPLE SKELETONS
Take the general gesture generation of skeleton A as an example. Similarly, the primal gesture sequence is fed into the decoder of whichever skeleton the gesture is to be generated for, as shown in Figure 2.

H MORE DISCUSSION
We obtained similar results from our experiments on BEAT and TWH. The criteria for the motions in the BEAT and Trinity datasets are the same, and TWH is slightly more complex, especially for the shoulder joints, as shown on the right side of Figure 11. The figure shows that the three poses are generally similar. Because the retargeting network is constrained with terminal positions and uses 5 terminals + 2 intermediate joints, for a total of 7 joints, as the middle representation, more detailed information, such as the details of the elbow and shoulder, may be neglected.
We appreciate the concerns regarding the scalability of our proposed method as new skeleton data is incorporated, necessitating the retraining of the network. We agree that it is theoretically feasible to decouple the problem into two separate parts: one focusing only on learning a uniform skeleton representation for the autoencoder system, and the other focusing only on co-speech gesture synthesis. This approach allows only the retargeting part of the network to be retrained when new skeleton data is added, resulting in significant savings in computational cost and time. However, it is also important to acknowledge that while these two problems can be technically decoupled, there may be complex interactions between them in practice. There are indeed many works on learning to retarget between different skeletons [31,35,54,77] and on learning co-speech gesture generation [6,20,50,52]. Ours is the first approach to attempt to integrate both, and the integration of retargeting with speech-driven gestures can yield impressive results.

Figure 1: Gesture examples generated by our proposed method. Different skeletons are unified to the primal skeleton. The speech-driven primal skeleton generates gestures for the specified skeleton. The character used in the paper is publicly available.

Figure 2: Gesture generation pipeline of our proposed framework. We retarget the different skeletons to the primal skeleton. Given a speech segment (and optionally a style and a seed gesture), the output is the primal gesture obtained after VQVAE encoding of the diffusion model output. We introduce reinforcement learning to refine the gesture generation network. Finally, the gesture for a specified skeleton is generated with physics guidance.

Figure 3: Motion is represented using a static encoder and a dynamic encoder. Assuming that all skeletons contain 5 end-effectors (2 hands, 2 feet, and head) and 2 mid-nodes, we use the latent representation L of the primal skeleton to represent all the different skeletons.

Figure 4: (Top) Denoising module. A noising step and the noisy gesture sequence at that noising step, together with the conditions (seed gesture, style, and audio), are fed into the model. 'RM' is short for random mask. (Bottom) Sample module. At each noising step, we predict $L_0$ with the denoising process based on the corresponding conditions, then add noise back to obtain the sequence at the previous noising step with the diffusion process. This process is repeated from the final noising step until step 0.

Figure 5: Structure of the primal gesture VQVAE. After learning the discrete latent representation of the primal gesture of the upper body, the gesture VQVAE encodes and summarizes meaningful gesture units.

Figure 6: Reward model training. We first sample a VQVAE-encoded speech-gesture pair, denoted as the initial trajectory. Then, we randomly replace an increasing number of gesture codes (starting from one) with random codes, resulting in a set of speech-gesture trajectories with decreasing quality. Finally, we use the output of the reward model to classify the trajectories of different qualities and optimize the reward model with its loss function.
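The trajectory construction described in the Figure 6 caption can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name, the codebook size, and the example codes are hypothetical. Two details are assumptions made here for clarity: each degraded trajectory is built independently from the original rather than cumulatively, and a replacement code is forced to differ from the code it replaces.

```python
import random

def corrupted_trajectories(codes, max_replacements, codebook_size, rng):
    """Return [original, 1 code replaced, ..., max_replacements codes replaced]."""
    trajectories = [list(codes)]
    for n in range(1, max_replacements + 1):
        corrupted = list(codes)
        # Pick n distinct positions in the code sequence to corrupt.
        for position in rng.sample(range(len(codes)), n):
            # Draw from the codebook while skipping the original code, so the
            # corruption is guaranteed to change the trajectory (assumption).
            replacement = rng.randrange(codebook_size - 1)
            if replacement >= codes[position]:
                replacement += 1
            corrupted[position] = replacement
        trajectories.append(corrupted)
    return trajectories

# Hypothetical gesture code sequence of 10 codes from a codebook of 16 entries.
original_codes = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
ranked = corrupted_trajectories(original_codes, max_replacements=4,
                                codebook_size=16, rng=random.Random(0))
```

The resulting list is ordered from highest to lowest expected quality, which gives the reward model an ordering to learn from without requiring human preference labels.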

4.3.1 Objective Evaluation. Moreover, we conduct ablation studies to assess the performance effects of different components in the framework. We performed the experiments on the following components: (1) reinforcement learning, (2) VQVAE, and (3) multiple skeletons. The results of our ablation studies are summarized in Table 2.

Figure 7: Visualization of the stylization, controllability, and diversity of generated gestures. We randomly select a 2.67-second generated gesture clip (10 codes). We then set the style-control parameters in Equation (13) to control the style and vary the noisy gesture in the diffusion model to generate diverse gestures. The dashed boxes indicate clips whose codes we set to be the same.

Figure 9: Visualization of the network structure of the dynamic encoder. It mainly includes skeletal convolution, activation function, and skeletal pooling.

$L^{\text{upper}}_0$ and $L^{\text{lower}}_0$ are the upper body of the primal gesture sequence after VQVAE reconstruction and the lower body of the diffusion

Figure 11: Retargeting visualization on BEAT and TWH.On the left is the result using Auto-rig in Blender, in the middle is the real motion, and on the right is the result generated by the retargeting network.

Figure 10: A screenshot of the subjective evaluation scoring interface.

Table 2: Ablation study results. '−' indicates modules that are not used. Bold indicates the best metric.

Table 3: Hyper-parameters of the reinforcement learning.

Table 4: Additional quantitative results on the test set. Bold indicates the best metric, e.g., closest to Ground Truth or minimal.

Table 5: Additional ablation study results. '−' indicates modules that are not used. Bold indicates the best metric.