Semantics2Hands: Transferring Hand Motion Semantics between Avatars

Human hands, the primary means of non-verbal communication, convey intricate semantics in various scenarios. Due to the high sensitivity of individuals to hand motions, even minor errors in hand motions can significantly impact the user experience. Real applications often involve multiple avatars with varying hand shapes, highlighting the importance of maintaining the intricate semantics of hand motions across the avatars. Therefore, this paper aims to transfer the hand motion semantics between diverse avatars based on their respective hand models. To address this problem, we introduce a novel anatomy-based semantic matrix (ASM) that encodes the semantics of hand motions. The ASM quantifies the positions of the palm and other joints relative to the local frame of the corresponding joint, enabling precise retargeting of hand motions. Subsequently, we obtain a mapping function from the source ASM to the target hand joint rotations by employing an anatomy-based semantics reconstruction network (ASRN). We train the ASRN using a semi-supervised learning strategy on the Mixamo and InterHand2.6M datasets. We evaluate our method in intra-domain and cross-domain hand motion retargeting tasks. The qualitative and quantitative results demonstrate the significant superiority of our ASRN over the state-of-the-art methods.


The generation of realistic hand motions has demonstrated promising potential in diverse virtual avatar scenarios, including co-speech gesture synthesis [25,27,38] and sign language synthesis [14,30,39]. Human hands, being the primary means of non-verbal communication [31], convey subtle nuances during the execution of particular hand gestures. Given people's high sensitivity to hand motions, even slight errors can significantly impact the user experience in virtual avatar applications. Consequently, maintaining consistent hand motion semantics across various virtual avatar hands is paramount. However, due to the highly articulated nature of the human hand with multiple degrees of freedom (DoFs) and the varying hand shapes and proportions of different avatars, directly copying joint rotations would significantly compromise the intricate semantics of hand motions, as shown in Figure 1. Therefore, developing a methodology that can preserve the semantics of hand motions when retargeting them to diverse avatars is essential.
Previous research has focused on motion retargeting and hand-object interaction. Motion retargeting, pioneered by Gleicher [11], aims to identify the characteristics of source motions and transfer them to target motions on different characters. Early work [3,9,19] focused on optimization-based approaches. Recently, researchers have proposed data-driven approaches [1,35,41] using various network architectures and semantic measurements. These approaches can successfully retarget realistic body motions but do not apply to dexterous hand motion retargeting. Ge et al. [10] proposed a rule-based approach for retargeting sign language motions; however, their method is limited to a specific set of pre-defined hand movements and lacks sufficient testing. Hand-object interaction is a research area that focuses on synthesizing realistic hand motions during interactions with objects, including static grasp synthesis [12,34,45] and manipulation motion synthesis [24,37,40,43]. However, these methods fail to preserve the semantics of hand motions in communication scenarios. Furthermore, they do not apply to diverse hand models with varying shapes and proportions. Despite these efforts, a challenge remains: retargeting realistic, high-fidelity hand motions across different hand models while preserving intricate motion semantics.
This paper focuses on retargeting dexterous hand motions across different hand models while preserving the semantics of the source hand motions. This task is novel because hand motion retargeting requires a higher level of semantic measurement precision than body motion retargeting. The semantic measurements previously employed in motion retargeting, including cycle consistency [1,35] and the distance matrix [41], are inadequate due to the high density of hand joints within a limited space, which results in significant spatial interactions between finger joints and the palm.
Therefore, our central insight is that the spatial relationships between the finger joints and the palm are crucial for preserving hand motion semantics. Consequently, we encode the spatial relationships into a novel anatomy-based semantic matrix (ASM). We utilize ASM as the semantic measurement for precise hand motion retargeting. In particular, we first build anatomical local coordinate frames for finger joints on different hand models. Then we construct ASM based on the anatomical local coordinate frames. ASM quantifies the positions of the palm and other joints relative to the local frame of the given finger joint. Next, we acquire a mapping function from the source motion ASM to the target motion rotations using an anatomy-based semantics reconstruction network (ASRN). We train ASRN on two heterogeneous hand motion datasets [2,23]. Unlike template mesh-based methods [40,43] for semantic correspondence, our approach is not dependent on template meshes and can be applied to various hand models.
We conducted comprehensive experiments to assess the quality of the hand motions generated by our ASRN. These experiments encompassed both intra-domain and cross-domain hand motion retargeting scenarios involving intricate hand motion sequences and a diverse range of hand shapes. The qualitative and quantitative results show that our ASRN outperforms existing motion retargeting methods by a large margin.
To summarize, our contributions are three-fold:
• We propose a novel task: semantics-preserving retargeting of dexterous hand motions across diverse hand models.
• We introduce an anatomy-based semantic matrix (ASM) that quantifies hand motion semantics without relying on any template mesh, making it applicable to various hand models.
• We propose a novel framework for semantics-preserving hand motion retargeting, leveraging the ASM. Experimental results on both intra-domain and cross-domain hand motion retargeting tasks validate the superior performance of our framework over existing methods.

RELATED WORK

Motion Retargeting
Motion retargeting aims to identify the features of the source motions and transfer them to the target motions on a different character. The pioneering work by Gleicher [11] addresses motion retargeting as a spatial-temporal optimization problem with the source motion features as kinematic constraints. Subsequent studies propose solutions to this optimization problem with various constraints [3,5,19,33].
Recently, data-driven methods [1,7,15,20,35,41,44] have become increasingly appealing due to the growing availability of motion capture data. Delhaisse et al. [7] and Jang et al. [15] train neural networks for retargeting using paired training data. Subsequently, Villegas et al. [35] develop an adversarial neural network trained with cycle consistency [44], eliminating the need for paired ground truth. Aberman et al. [1] propose a skeleton-aware network for retargeting motions between skeletons with varying topologies. Zhang et al. [41] introduce the distance matrix for measuring body motion semantics.
However, all the methods above either truncate finger movements or merely replicate finger joint rotations during retargeting, resulting in the loss of intricate semantics in dexterous hand motions. In contrast, our framework carefully measures the hand motion semantics with an anatomy-based semantic matrix (ASM), and transfers these semantics to the target hand motion through a novel anatomy-based semantics reconstruction network (ASRN).

Hand-object Interaction Synthesis
The synthesis of hand grasping given an object has been extensively studied in robotics [4,8,29]. Recently, several data-driven methods have been proposed [6,12,32,45]. Among these methods, Karunratanakul et al. [17] and Jiang et al. [16] propose to represent the proximity between the hand and the object as an implicit function.
Object manipulation synthesis involves dynamic hand and object interaction, which makes it more relevant to our research. Previous researchers have tackled this issue by optimizing hand poses to meet different constraints [22,24,37,42]. In a recent study, Zhang et al. [40] employed hand-object spatial representations to learn object manipulation using motion capture data. Subsequently, Zhou et al. [43] devised a different object-centric spatiotemporal representation.
However, these representations cannot capture the semantics of hand motion as they neglect the interaction between the palm and the fingers. Furthermore, these representations are explicitly designed for a given template hand mesh, which restricts their applicability to different hand models. In contrast, our ASM quantifies hand motion semantics without depending on a template mesh, allowing its application to diverse hand models.

PROBLEM FORMULATION
This paper aims to learn a mapping function F that transfers the source hand motion to the target hand while preserving the semantics of the source hand motion. The inputs to the function are the source joint rotation sequence Q_A, the source hand shape parameter H_A, the source hand anatomical parameter M^rest_A, the target hand shape parameter H_B, and the target hand anatomical parameter M^rest_B:

Q_B = F(Q_A, H_A, M^rest_A, H_B, M^rest_B),

where Q_B is the target joint rotation sequence.

METHODOLOGY
Based on the formulation in Section 3, we have developed a framework for retargeting hand movements, as depicted in Figure 2. We introduce a novel anatomy-based semantic matrix (ASM) based on the finger anatomical coordinate frame. By utilizing the ASM, we train an anatomy-based semantics reconstruction network (ASRN) to predict the target joint rotation sequence using the source ASM, target hand shape parameter, and target hand anatomical parameter.
In the subsequent subsections, we briefly introduce the anatomical coordinate frame of finger movements in Section 4.1. Next, we elaborate on the definition of the ASM in Section 4.2. Finally, we describe the framework pipeline and training details in Section 4.3.

Twist-bend-splay Frame
The human hand exhibits a high degree of articulation. Directly predicting rotations of all 15 finger joints can lead to abnormal hand postures. Previous works [21,36] suggest that constraints can be applied to the finger joint rotations to prevent abnormal hand movements. Yang et al. [36] extended MANO [28] to develop a hand model called A-MANO incorporating anatomical constraints. A-MANO assigns a Cartesian coordinate frame, known as the Twist-bend-splay frame, to each joint in the hand's kinematic tree. The frame's x, y, and z axes align with the three revolute directions: twist, bend, and splay, based on hand anatomy. Most finger joints have only one degree of freedom (DoF) along the bend axis. While A-MANO shows promise in estimating MANO pose during hand-object interaction, it does not apply to hand models from external sources, such as the hands of Mixamo [2] characters. To mitigate this problem, we develop a tool for annotating the Twist-bend-splay frames of different hand models. Figure 3 demonstrates that our tool can readily provide the Twist-bend-splay frames for hands obtained from both InterHand2.6M [23] and Mixamo [2]. Details of our annotation tool can be found in Appendix A.

Anatomy-based Semantic Matrix
Our framework aims to preserve the intricate semantics while retargeting hand motions between hand models from different sources. This paper defines hand motion semantics as the spatial relationships between the fingers and the palm. Due to the absence of paired ground truth with dense semantic supervision, we introduce a novel anatomy-based semantic matrix (ASM) as a semantic measurement for hand motion retargeting. Compared to existing semantic measurements in body motion retargeting [1,35,41] and object manipulation synthesis [40,43], the proposed ASM captures the intricate semantics of hand motions and can be applied to hand models from different sources without any additional cost.
Our ASM is constructed based on the twist-bend-splay frame introduced in Section 4.1. The crucial insight behind constructing the ASM is that the orientation of the twist-bend-splay frame reveals the finger's structure. As shown in Figure 4, the splay axis (blue axis) extends from the finger pulp to the back surface of the finger, while the bend axis (green axis) stretches from the right side to the left side of the finger. The twist axis aligns with the finger bone. In this scenario, we can deduce the spatial relationships between the middle fingertip and the index fingertip based on the coordinates of the middle fingertip within the local twist-bend-splay frame of the index fingertip.
The proposed ASM applies to hand models composed of five fingers, each consisting of four joints (including a dummy fingertip joint). The semantic matrix comprises two components: inter-finger semantic features and palm-finger semantic features. Formally, at time t, the coordinates of the i-th finger joint within the global frame are represented as ^g x_i ∈ R^3, and ^g M_j represents the rotation matrix of the twist-bend-splay frame of joint j within the global frame. The coordinates of another joint i within the local frame of joint j are given by ^j x_i = ^g M_j^T (^g x_i − ^g x_j). We define ^j x_i as the inter-finger semantic feature of joint i with respect to joint j. Additionally, we introduce the palm-finger semantic feature to capture the overall hand posture, as depicted in Figure 4. Inspired by Yang et al. [36], we define nine palm anchors along the line connecting the metacarpophalangeal and wrist joints. We denote the palm-finger semantic feature of the k-th anchor with respect to joint j as ^j x_{p_k} = ^g M_j^T (^g x_{p_k} − ^g x_j), where ^g x_{p_k} represents the coordinates of the k-th anchor within the global frame. By combining the inter-finger semantic features and the palm-finger semantic features, we construct the semantic matrix for joint j as the concatenation D_j = [^j x_1, …, ^j x_20, ^j x_{p_1}, …, ^j x_{p_9}]. With semantic matrices for all 20 finger joints, we obtain the semantic measurement of the entire hand model without relying on any standard mesh template.
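To make the construction concrete, below is a minimal NumPy sketch of the semantic matrix for a single joint. The function name, array layouts, and the row ordering are our own assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def semantic_matrix(joint_pos, joint_frames, palm_anchors, j):
    """Sketch of the ASM entry for joint j (layouts are assumptions).

    joint_pos:    (20, 3) global coordinates of the finger joints
    joint_frames: (20, 3, 3) global rotation of each twist-bend-splay frame
    palm_anchors: (9, 3) global coordinates of the palm anchors
    Returns a (29, 3) matrix: 20 inter-finger rows + 9 palm-finger rows.
    """
    R_j = joint_frames[j]          # ^g M_j: frame of joint j in global coords
    origin = joint_pos[j]          # ^g x_j
    # inter-finger features: every joint expressed in joint j's local frame;
    # for row vectors, (x - o) @ R_j equals R_j^T (x - o)
    inter = (joint_pos - origin) @ R_j
    # palm-finger features: palm anchors expressed in joint j's local frame
    palm = (palm_anchors - origin) @ R_j
    return np.concatenate([inter, palm], axis=0)
```

Stacking this matrix for all 20 joints yields the full per-frame semantic measurement described above.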

Semantics-Preserving Retargeting
The hand retargeting pipeline comprises two stages: semantic feature extraction and semantics-preserving reconstruction. We extract semantic matrices from the source hand motion during the first stage. In the second stage, we employ the anatomy-based semantics reconstruction network (ASRN) to reconstruct hand motion on the target hand model from the source ASM while preserving the source semantics. The overall pipeline is depicted in Figure 2.
In the semantic feature extraction stage, the T-frame hand motion sequence in the twist-bend-splay frames, represented as quaternions of the 15 finger joints, is denoted as Q^tbs_A ∈ R^(T×15×4). After converting Q^tbs_A to the global frame using the rest orientations of the joint twist-bend-splay frames M^rest_A ∈ R^(15×3×3), we obtain Q_A ∈ R^(T×15×4). We then perform forward kinematics (FK) to derive the global coordinates of the finger joints X_A ∈ R^(T×20×3) and the global orientations of the twist-bend-splay frames M^tbs_A ∈ R^(T×20×3×3). It is important to note that the FK results include the dummy fingertip joints. Additionally, the shape parameter H_A ∈ R^(h_A) takes different forms depending on the model type. In the case of MANO models, H_A represents the shape PCA coefficients published by Romero et al. [28], while for Mixamo models, H_A corresponds to the normalized finger joint offsets. Finally, we extract the semantic matrices D_A from X_A and M^tbs_A using Equation 2, where D_A is the concatenation of D_j over the T frames.
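The two computations in this stage can be sketched as follows. The conjugation convention for moving rotations out of the twist-bend-splay frames and the simple per-finger chain FK are our assumptions about the setup, not code from the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def tbs_to_global(q_tbs, m_rest):
    """Convert joint rotations expressed in the twist-bend-splay frames to
    the global frame (the conjugation below is an assumed convention).
    q_tbs: (J, 4) quaternions (x, y, z, w); m_rest: (J, 3, 3) rest frames."""
    q_local = R.from_quat(q_tbs).as_matrix()               # (J, 3, 3)
    return m_rest @ q_local @ m_rest.transpose(0, 2, 1)    # (J, 3, 3)

def chain_fk(rotations, offsets):
    """Forward kinematics along one finger chain: accumulate rotations and
    bone offsets outward from the root, including the dummy fingertip."""
    positions, world = [np.zeros(3)], np.eye(3)
    for rot, off in zip(rotations, offsets):
        world = world @ rot
        positions.append(positions[-1] + world @ off)
    return np.array(positions)    # (len(offsets) + 1, 3)
```

Running `chain_fk` once per finger (with the wrist transform prepended) yields the joint positions X_A used to build the semantic matrices.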
Having obtained the semantic matrices D_A from the source hand motion, we utilize our ASRN to reconstruct the target hand motion Q^tbs_B ∈ R^(T×15×4) on the target hand model B. A ResNet-like [13] architecture is employed: consecutive 1D ResNet layers process the source ASM D_A. Additionally, ASRN receives the target hand shape parameter H_B and the target hand local frame rest orientation M^rest_B as inputs. An MLP first encodes H_B and M^rest_B, and the encoding is concatenated with the input of each ResNet layer. The output of the final ResNet layer is fed to a fully-connected layer, which predicts the target hand joint rotations Q^tbs_B in the target hand twist-bend-splay frames. Next, we extract semantic matrices D_B from the generated hand motion. In this work, hand motion semantics preservation is modeled as preserving the spatial relationships between the fingers and the palm. Accordingly, we define the semantic loss L_sem in Equation 3 as the weighted cosine similarity between the source and target semantic matrices, with the weights defined in Equation 4. This weighting scheme encourages the network to focus on close-finger interactions.
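A PyTorch sketch of such a weighted cosine-similarity loss is given below. The exponential weighting on the source feature length and the `sigma` parameter are illustrative stand-ins for the weighting scheme of Equation 4, whose exact form is not reproduced here.

```python
import torch

def semantic_loss(d_src, d_tgt, sigma=1.0):
    """Sketch of L_sem as a weighted cosine dissimilarity between source
    and target semantic matrices; the exp(-norm/sigma) weight is an
    assumption inspired by the close-interaction emphasis described above.
    d_src, d_tgt: (T, 20, 29, 3) semantic matrices."""
    cos = torch.nn.functional.cosine_similarity(d_src, d_tgt, dim=-1)
    # emphasize entries whose source features are short, i.e. close fingers
    w = torch.exp(-d_src.norm(dim=-1) / sigma)
    return ((1.0 - cos) * w).sum() / w.sum()
```

A perfectly retargeted motion (identical semantic matrices) drives the loss to zero regardless of the weighting.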
To mitigate abnormal hand postures generated by our network, we propose an anatomical loss, denoted as L_ana. Q^tbs_B is decomposed into three Euler angles, θ_twist, θ_bend, and θ_splay, aligned with the local twist-bend-splay frame axes. First, we apply a penalty to θ_twist for all the joints along the hand's kinematic tree. Additionally, a penalty is imposed on θ_splay if it exceeds the acceptable range. Finally, we penalize the rotation angle θ_bend if it exceeds π/2 or falls below 0. The anatomical loss in Equation 5 sums these three penalties. Since our network is trained on hand motion data from different hand models, the self-reconstruction supervision signals are only available when A and B belong to the same character. Therefore, ASRN is trained by minimizing the following loss function:

L = λ_sem L_sem + λ_ana L_ana + 1_{A=B} L_rec,

where L_rec denotes the self-reconstruction loss.
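The three anatomical penalties can be sketched as follows; the splay limit value and the mean reduction are illustrative choices, not the paper's exact formulation.

```python
import torch

def anatomical_loss(theta_twist, theta_bend, theta_splay, splay_lim=0.3):
    """Sketch of L_ana: penalize twist everywhere, splay outside an assumed
    acceptable range, and bend outside [0, pi/2] (limits are illustrative).
    All inputs are (T, J) angle tensors in radians."""
    l_twist = theta_twist.abs().mean()
    # splay is only penalized beyond the acceptable range
    l_splay = torch.clamp(theta_splay.abs() - splay_lim, min=0.0).mean()
    # bend is penalized above pi/2 and below 0
    over = torch.clamp(theta_bend - torch.pi / 2, min=0.0)
    under = torch.clamp(-theta_bend, min=0.0)
    l_bend = (over + under).mean()
    return l_twist + l_splay + l_bend
```

An anatomically plausible pose (no twist, small splay, bend within range) incurs zero loss under this sketch.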

EXPERIMENTS

Datasets
The evaluation of our framework encompasses both the Mixamo dataset [2] and the InterHand2.6M dataset [23]. The Mixamo dataset comprises animations performed by various virtual characters with different shapes; however, the dataset does not guarantee consistent hand motion quality and diversity. The InterHand2.6M dataset is a comprehensive collection of hand motion data captured using a multi-view camera system and supplemented with MANO [28] hand pose annotations. While the InterHand2.6M dataset offers high-quality hand motion data with considerable diversity, it has limitations regarding hand shape variations. During the training phase, we gathered 40,903 frames of hand motion data from nine distinct characters. In the testing phase, we obtained 14,316 frames of hand motion data from four different characters, ensuring that none of the testing characters were present during the network's training.

Implementation Details
The hyper-parameters λ_sem and λ_ana are set to 1.0 and 0.1, respectively. The network is trained for 100 epochs with a batch size of 64. We use the Adam optimizer [18] with the learning rate set to 10^−4. The input to the network is a sequence of 8 frames at a frame rate of 5 fps. The network is implemented in PyTorch [26] and trained on a single NVIDIA RTX 2080 Ti GPU. Further details can be found in Appendix B.

Evaluation Metrics
For hand motions with paired ground truth (GT) on different characters, we use the Mean Square Error (MSE) to measure how close the retargeted joint positions are to the paired GT. In the absence of paired GT, the following metrics are used to evaluate the quality of the retargeted hand motions: S_palm and S_finger represent the average cosine similarity between the semantic features of the retargeted hand motion and those of the GT hand motion. Higher values indicate better preservation of the original spatial relationships between the fingers and the palm in the retargeted hand motion.
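A minimal sketch of such a similarity metric is shown below, assuming it is computed between corresponding semantic-feature vectors; the function name and the caller-side split into palm versus finger rows are our assumptions.

```python
import numpy as np

def cosine_metric(d_ret, d_gt):
    """Sketch of the S_palm / S_finger metrics: average cosine similarity
    between corresponding semantic-feature vectors of the retargeted and GT
    motions. d_ret, d_gt: (..., 3) arrays of feature vectors."""
    num = (d_ret * d_gt).sum(-1)
    den = np.linalg.norm(d_ret, axis=-1) * np.linalg.norm(d_gt, axis=-1)
    # guard against zero-length features before averaging
    return (num / np.maximum(den, 1e-8)).mean()
```

Passing only the palm-anchor rows yields an S_palm-style score, and only the inter-finger rows an S_finger-style score.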

Qualitative Results
The results of hand motion retargeting among hands with various shapes are depicted in Figure 5. The TBS Copy method copies Q^tbs_A to Q^tbs_B, while the Copy method copies Q_A to Q_B. The DM method replaces our proposed ASM with the distance matrices proposed by Zhang et al. [41]. During training, the network did not encounter any of the source or target hands in the last row. Existing methods barely account for the intricate spatial relationships between the fingers and the palm, leading to inconsistent and unnatural hand motions. In contrast, our method effectively preserves the spatial relationships between the fingers and the palm, resulting in hand motions that are more natural and preserve semantics. Figure 10 shows the detailed spatial relationships in the results of our method.

Quantitative Results
Table 1 shows a comparison between our method and existing body motion retargeting techniques. We compare the methods across three tasks with different sources and targets: Mixamo to Mixamo (MX2MX), InterHand to Mixamo (IH2MX), and Mixamo to InterHand (MX2IH). Because the Mixamo dataset provides paired GT, we use MSE to assess the quality of the retargeted hand motions for the MX2MX task. For the other two cross-domain tasks, we use S_palm and S_finger as quality metrics. Because the Mixamo dataset may create a new character with an archived motion by motion copy, the Copy method has the lowest MSE. However, as the qualitative results reveal, this does not mean that motion copy is optimal. Our method achieves a reduction in MSE of 85.6% and 83.8% compared to SAN [1] and DM [41], respectively, which utilize distinct semantic measurements. Additionally, our method achieves the highest S_palm and S_finger in the IH2MX and MX2IH tasks, indicating its superior ability to preserve the original spatial relationships between the fingers and the palm in the retargeted hand motion. This observation suggests that our proposed ASM outperforms the distance matrices [41] and the implicit measurement learned by SAN [1].

User Study
We conduct a user study to evaluate the performance of our framework against Copy, SAN [1], and DM [41].We invited 26 participants and showed them six static hand posture pictures and six hand motion videos.Each picture and video contains one source motion and four anonymous results.Participants were instructed to rank the pictures and videos based on three aspects: preservation of static posture semantics (PS), preservation of motion semantics (MS), and motion quality (MQ), from best to worst.The average rankings are presented in Table 2. Overall, our method achieved the best performance in all three aspects.

CONCLUSION
In this paper, we propose the problem of semantics-preserving hand motion retargeting. We encode the spatial relationships between the fingers and the palm using anatomy-based semantic matrices (ASM). We train an anatomy-based semantics reconstruction network (ASRN) to retarget the motion semantics of the source hand onto the target hand, utilizing the source ASM. We evaluate our framework on both intra-domain and cross-domain retargeting tasks. Our method demonstrates superior performance to existing motion retargeting methods, both qualitatively and quantitatively.
A TWIST-BEND-SPLAY FRAME ANNOTATION
This section presents our Twist-bend-splay frame annotation tool. A previous study by Yang et al. [36] introduced A-MANO, a hand model that incorporates Twist-bend-splay frames. A-MANO, an extension of MANO, is limited in its applicability to other hand models. This paper presents a versatile Twist-bend-splay frame annotation tool applicable to any hand model with five fingers and 15 finger joints. The annotation tool can semi-automatically derive the Twist-bend-splay frame orientations of the finger joints from the model's kinematic tree and mesh information.
Specifically, our annotation tool first computes the twist axis n_twist as the vector from the child of the current joint to the joint itself. Next, we project rays onto the normal plane defined by n_twist and perform ray-mesh queries. The ray-mesh hit locations on the mutually perpendicular axes n_splay and n_bend are denoted as p_splay and p_bend, respectively; m_splay and m_bend represent the normal vectors of the mesh at p_splay and p_bend, respectively. We iterate through all the possible axis directions and minimize the following loss function:

L_annotate = ‖p_splay − o‖ − ‖p_bend − o‖ − (n_splay · m_splay + n_bend · m_bend),

where o is the location of the corresponding finger joint. The underlying insight of L_annotate is that the fingers are narrower from top to bottom than from left to right; therefore, we minimize ‖p_splay − o‖ − ‖p_bend − o‖. Moreover, we aim to align the axes with the mesh normals, thus maximizing n_splay · m_splay + n_bend · m_bend. Finally, our annotation tool displays the Twist-bend-splay frames on the hand model, as depicted in Figure 6. If needed, the user can manually adjust the orientation of the splay and bend axes.
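For a single candidate axis assignment, the loss can be sketched as below. The equal weighting between the width term and the normal-alignment term is our reconstruction of the equation and should be treated as an assumption.

```python
import numpy as np

def annotate_loss(o, p_splay, p_bend, n_splay, n_bend, m_splay, m_bend):
    """Sketch of L_annotate for one candidate axis assignment: prefer the
    narrow finger direction for the splay axis, and axes aligned with the
    mesh normals (the combination of the two terms is an assumption)."""
    # fingers are narrower top-to-bottom than left-to-right
    width_term = np.linalg.norm(p_splay - o) - np.linalg.norm(p_bend - o)
    # reward axes that agree with the mesh normals at the hit points
    align_term = n_splay @ m_splay + n_bend @ m_bend
    return width_term - align_term
```

The annotation tool would evaluate this loss for every candidate direction pair and keep the minimizer.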

B NETWORK ARCHITECTURE AND TRAINING DETAILS
As depicted in Figure 7, the proposed anatomy-based semantics reconstruction network (ASRN) architecture comprises two main components: the static encoders and the motion reconstruction convolutional network. Each static encoder consists of one MLP layer and two ResNet-like convolutional layers. The motion reconstruction convolutional network is composed of four ResNet-like convolutional layers. The input to each layer is the concatenation of the output of the previous layer and the output of the corresponding static encoder. The ASRN takes the source ASM, denoted as D_A, as input and generates the target joint rotations, denoted as Q^tbs_B, as output. To train the ASRN, we employ the Adam optimizer [18] with a learning rate of 10^−4 and a batch size of 64. The ASRN is trained for 100 epochs.
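A compact PyTorch sketch of this architecture is given below. All channel sizes, the single shared static encoder, and the way its encoding is broadcast over time are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """1D residual convolution block (channel counts are assumptions)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv1d(c_in, c_out, 3, padding=1)
        self.conv2 = nn.Conv1d(c_out, c_out, 3, padding=1)
        self.skip = nn.Conv1d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.conv1(x))
        return self.act(self.conv2(h) + self.skip(x))

class ASRN(nn.Module):
    """Minimal ASRN sketch: an MLP encodes the static target parameters
    (H_B, M_rest_B flattened), and the encoding is concatenated with the
    input of every ResNet layer; a final linear head predicts quaternions."""
    def __init__(self, d_asm, d_static, hidden=256, n_joints=15):
        super().__init__()
        self.static = nn.Sequential(nn.Linear(d_static, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden))
        chans = [d_asm, hidden, hidden, hidden, hidden]
        self.blocks = nn.ModuleList(
            ResBlock1d(c + hidden, n) for c, n in zip(chans[:-1], chans[1:]))
        self.head = nn.Linear(hidden, n_joints * 4)

    def forward(self, d_a, static_in):
        # d_a: (B, d_asm, T) flattened source ASM; static_in: (B, d_static)
        s = self.static(static_in)                          # (B, hidden)
        x = d_a
        for blk in self.blocks:
            s_t = s.unsqueeze(-1).expand(-1, -1, x.shape[-1])
            x = blk(torch.cat([x, s_t], dim=1))
        q = self.head(x.transpose(1, 2))                    # (B, T, J*4)
        return q.view(q.shape[0], q.shape[1], -1, 4)        # (B, T, J, 4)
```

In this sketch the ASM is flattened per frame into `d_asm` channels; the predicted quaternions would be normalized before use.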
Since the shape parameter H_B ∈ R^(h_B) varies based on the model type, we train an ASRN for each specific form of the shape parameter. In this study, we introduce two ASRNs, one for MANO and one for Mixamo. For MANO, H_B is a 10-dimensional vector, while for Mixamo it is a 45-dimensional vector. The ASRNs for both MANO and Mixamo are trained using identical hyper-parameters. Each network is trained on the InterHand2.6M and Mixamo datasets, but with distinct target hand models.

C ABLATION STUDY
The qualitative results of two ablated versions of our method are illustrated in Figure 8 and Figure 9.
Figure 8 compares the results with and without the inclusion of the anatomical loss L_ana. Excluding L_ana leads to a higher occurrence of unnatural finger poses, such as the abnormal splay of the interphalangeal joint of the little finger.
Figure 9 compares the results with and without the weighting scheme described in Equation 4. The weighting scheme promotes the network's attention toward proximal joint interactions. Consequently, the full model produces a motion where the thumb pulp contacts the index fingertip, while the ablated model fails to achieve this contact.

D SUPPLEMENTARY QUALITATIVE RESULTS
Figure 10 presents additional qualitative results of our method. Our approach effectively preserves accurate hand motion semantics.

Figure 1 :
Figure 1: Despite the accurate body motions, errors introduced by copying finger joint rotations make the "thumb-up" gesture illegible.

Figure 2 :
Figure 2: The figure presents an overview of the proposed pipeline consisting of two stages. The extraction stage involves the retrieval of ASM from the source hand motion. The reconstruction stage utilizes the source ASM, target hand shape parameter, and target hand anatomical parameter to reconstruct the target hand motion.

Figure 3 :
Figure 3: Left: Twist-bend-splay frames obtained from different hand models using our annotation tool. Right: Finger movements in the twist, splay, and bend directions. Note that the bend and splay directions of the thumb joints differ significantly from those of the other four fingers.

Figure 4 :
Figure 4: Left: The inter-finger semantic features capture the subtle semantics of finger movements. Right: The palm-finger semantic features capture the overall hand posture. Yellow cubes represent the palm anchors.
where λ_sem and λ_ana are hyper-parameters. The indicator function 1_{A=B} takes the value 1 if A and B belong to the same character, and 0 otherwise.

Figure 5 :
Figure 5: Qualitative comparison between the proposed framework and the state-of-the-art methods.

Figure 6 :
Figure 6: Our annotation tool allows the user to adjust the splay axis (red axis) and bend axis (black axis) directions for Mixamo and MANO hands.

Figure 7 :
Figure 7: The network architecture of the proposed ASRN.

Figure 8 :
Figure 8: Comparison of results with and without the inclusion of the anatomical loss L ana .

Figure 9 :
Figure 9: Comparison of results with and without the weighting scheme described in Equation 4.

Figure 10 :
Figure 10: Our framework maintains precise spatial relationships among the fingers.

Table 1 :
Comparison with the state-of-the-art methods. Ours w/o L_ana is the model without the anatomical loss in Equation 5; Ours w/o weight is the model without the weighting scheme in Equation 4. Higher S_palm and S_finger are better (↑).

Table 2 :
Ranking results of the user study.We invite 26 participants to compare the retargeting results from three aspects: static posture semantics (PS), motion semantics (MS), and motion quality (MQ).