MOCHA: Real-Time Motion Characterization via Context Matching



ABSTRACT
Transforming neutral, characterless input motions to embody the distinct style of a notable character in real time is highly compelling for character animation. This paper introduces MOCHA, a novel online motion characterization framework that transfers both motion styles and body proportions from a target character to an input source motion. MOCHA begins by encoding the input motion into a motion feature that structures the body part topology and captures motion dependencies for effective characterization. Central to our framework is the Neural Context Matcher, which generates a motion feature for the target character with the most similar context to the input motion feature. The conditioned autoregressive model of the Neural Context Matcher can produce temporally coherent character features in each time frame. To generate the final characterized pose, our Characterizer network incorporates the characteristic aspects of the target motion feature into the input motion feature while preserving its context. This is achieved through a transformer model that introduces the adaptive instance normalization and context mapping-based cross-attention, effectively injecting the character feature into the source feature. We validate the performance of our framework through comparisons with prior work and an ablation study. Our framework can easily accommodate various applications, including characterization with only sparse input and real-time characterization. Additionally, we contribute a high-quality motion dataset comprising six different characters performing a range of motions, which can serve as a valuable resource for future research.

INTRODUCTION
Transforming styleless motions to embody a particular character is invaluable for animating characters in feature films or in VR. As the popularity of interactive and immersive VR and AR applications grows, so does the need for real-time user motion characterization. In this paper, we propose a novel technique for transforming a variety of user motions into motions that express a specific character (e.g., Princess, Clown, etc.) in real-time.
We adopt an example-based approach, presuming the availability of a stylistic motion database for the target character. However, the character's body proportions will not be an exact match to the user's, and the character may not even possess a human-like physique. In this work, we posit that a character's distinctive movement style is intricately linked with their unique body shape, so we need to solve both the motion stylization and motion retargeting problems jointly.
Current data-driven motion stylization techniques often perceive style as a consistent feature within data. One approach is to extract style elements from the target examples and apply them to the source motion [Aberman et al. 2020b; Jang et al. 2022]. Another approach treats style as a conditioning variable for motion generation [Park et al. 2021; Tao et al. 2022a]. While these methods perform well on stereotypical locomotion datasets, we challenge their assumption by introducing a high-quality, diverse dataset of professional performances. Our data clearly illustrates style as context-dependent. For example, the expression of "happiness" in a jumping motion manifests differently than in a crawling motion, and their characteristics are not interchangeable.
To this end, we propose a novel framework that can characterize a user's diverse motions in real-time. The basic idea is that when a user performs a motion, we search for a target character's motion with the most similar context from a motion database and then transfer the style elements of the found motion to the user's motion. However, this approach has the disadvantages of having to store the character motion database and search it for a suitable motion in real time, as well as limiting target motions to those in the database. To improve upon this, inspired by the learned motion matching technique [Holden et al. 2020], we train a neural network, dubbed Neural Context Matcher (NCM), to generate target motion features suitable for characterizing a source motion instead of searching a database. Modeled as a conditional VAE running autoregressively, the NCM can generate temporally coherent motion features for the target character suitable for the input motion, which is key to obtaining high-quality motion characterization for diverse motions.
In addition, the motions generated by our framework reflect not only the motion style aspects but also the target character's body proportions, making additional motion retargeting to the target character unnecessary. Hence we call our technique motion characterization. This is possible because our framework encodes both motion styles and body proportions into the motion feature and effectively transfers them to the source motion. To the best of our knowledge, our work is the first that performs motion stylization and retargeting concurrently. Besides, we propose several crucial ideas, such as introducing a contrastive loss and incorporating adaptive instance normalization (AdaIN) [Huang and Belongie 2017] into the transformer decoder, that significantly enhance the motion stylization quality.
Figure 1 shows snapshots of our characterization results. Various motions, such as walking, sitting, and jumping (white character), can be characterized to match Zombie, Princess, and Clown. We demonstrate the effectiveness of our framework through comparisons with previous work and an ablation study. Additionally, we showcase its capability for real-time live characterization from streamed motion data. We also show that our framework can accommodate sparse inputs, enabling its application in VR tracker-based motion capture systems.
The major contributions of our work can be summarized as follows:
• We present the first online motion characterization framework that can transfer both the motion style aspects and body proportions of characters to a variety of user motions.
• We release a high-quality character motion dataset that contains a total of 6 characters performing various actions, with each action conducted with 5 emotions.

RELATED WORK

Motion Style Transfer
Research in motion style transfer centers around two fundamental questions: the definition and the representation of style. Conventionally, style is regarded as time-invariant variations within the same motion content, structure, or context, which can vary based on individual characteristics [Ma et al. 2010] or emotions [Amaya et al. 1996]. A large body of research investigates decomposition algorithms to factorize and parameterize style from motion content [Brand and Hertzmann 2000; Mason et al. 2018; Min et al. 2009; Rose et al. 1998; Shapiro et al. 2006; Unuma et al. 1995; Yumer and Mitra 2016]. Style can also be represented non-parametrically or as a label. For instance, Gaussian Processes have been used to learn a latent space of pose styles [Grochow et al. 2004] or motion styles [Ikemoto et al. 2009; Wang et al. 2007]. More recently, explicit style labels have been used to condition generative models to stylize the output motion [Smith et al. 2019; Tao et al. 2022b]. The style labels can also be latents learned from data [Park et al. 2021].
Treating style as a static feature limits its representative power. These approaches can often capture stereotypical pose features, but fall short of delivering complex and nuanced characteristics. Xia et al. [2015] addressed this issue by adapting style parameters in real-time based on local nearest neighbors. They also contributed a dataset that supported many follow-up studies. Style may also be spatially varying. Motion Puzzle [Jang et al. 2022] integrates style elements based on body parts into a single character. In our work, we consider style as non-parametric and context-dependent, and therefore choose a different class of approach more akin to motion matching [Clavet 2016].
Recent research in motion style transfer is heavily influenced by the vast body of work in image style transfer. In this context, motion styles are analogous to image textures. The pioneering work of Neural Style Transfer [Gatys et al. 2015] introduced the use of the Gram matrix of latent features for style representation, an approach later adopted in the motion domain [Du et al. 2019; Holden et al. 2016]. Its successor, the Adaptive Instance Norm (AdaIN) layer [Huang and Belongie 2017], has since become the predominant technique with widespread adoption (e.g., [Aristidou et al. 2022]). Additionally, Generative Adversarial Networks (GANs), in combination with contrastive and cycle-consistency losses [Zhou et al. 2016], have proven effective in self-supervised style learning [Aberman et al. 2020b; Dong et al. 2017], which we also apply in our work.

Motion Matching
Motion matching [Clavet 2016] provides continuous and controllable animation in real-time for interactive gaming via nearest-neighbor search of motion features. To improve its scalability in memory and speed, learned motion matching [Holden et al. 2020] approximates this process using a neural network. Our Neural Context Matcher (NCM) applies learned motion matching on learned context features to find the closest matching target motion, ensuring real-time and high-quality output.

Puppeteering and Motion Retargeting
Puppeteering and motion retargeting can be regarded as a special case of motion style transfer, where the control or source motion must be mapped and adapted to the target character's design and style. Automatically mapping between two arbitrary characters remains a notable challenge with no unique answer. To address this, heuristic-based solutions have been developed to analyze either the structure [Kry et al. 2009] or the motion space between characters [Dontcheva et al. 2003; Seol et al. 2013]. For the less complex task of mapping between two bipedal skeletons - considered homeomorphic graphs - graph convolutional networks have been successful [Aberman et al. 2020a; Park et al. 2021]. We therefore employ this strategy to extract motion features, facilitating retargeting across a diverse range of characters and styles.

MOTION DATA REPRESENTATION AND PROCESSING
Figure 2 illustrates the motion representation of our method. The reference frame of a motion, denoted as root_t, is located at the ground projection of the pelvis joint at the current frame t and aligned to the ground normal direction and the pelvis' forward-facing direction. A motion sequence at frame t is represented in two ways, X^t and Y^t, whose elements are expressed with respect to root_t and with respect to the parent joint, respectively:

X^t = {p_j, r_j, ṗ_j, ω_j}_{j ∈ J},   Y^t = {p̄_j, r̄_j, ṗ̄_j, ω̄_j}_{j ∈ J},

where J denotes all joints (including the pelvis), p_j ∈ R^3 and r_j ∈ R^6 (6 for two orthogonal axes) are joint translations and rotations, and ṗ_j ∈ R^3 and ω_j ∈ R^3 are joint linear and angular velocities, all local to root_t. Likewise, p̄_j, r̄_j, ṗ̄_j, and ω̄_j denote the joint translation, rotation, and linear and angular velocities local to the parent, with respect to root_t. As a result, the total dimensions of our human motion feature with T frames are X ∈ R^{T × N_j × 15} and Y ∈ R^{T × N_j × 15}, where N_j is the number of joints.
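As a rough sketch of this representation (the dimensions T = 60 and N_j = 24 below are illustrative placeholders, not values from the paper; the actual joint count depends on the skeleton), the per-joint 15-D feature can be assembled like this:

```python
import numpy as np

def joint_feature(p, r6, v, w):
    """Concatenate one joint's translation (3), 6D rotation (6),
    linear velocity (3), and angular velocity (3) into 15 dims."""
    return np.concatenate([p, r6, v, w])

# Illustrative clip: T frames, N_j joints, random placeholder values.
T, N_j = 60, 24
rng = np.random.default_rng(0)
X = np.stack([
    np.stack([joint_feature(rng.normal(size=3), rng.normal(size=6),
                            rng.normal(size=3), rng.normal(size=3))
              for _ in range(N_j)])
    for _ in range(T)
])
```

The resulting tensor has shape (T, N_j, 15), matching the X ∈ R^{T × N_j × 15} layout above; the parent-relative Y is built the same way from the parent-local quantities.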

MOCHA FRAMEWORK
Figure 3 (a) illustrates our motion characterization framework at runtime. Input to our framework is a motion sequence of length T (1 second, 60 frames) from the past to the current frame t, represented with X^t. For output, we generate a characterized motion sequence of the same length, represented as Y^t, but only use the final characterized pose y^t.
Our framework comprises a bodypart encoder, a neural context matcher (NCM), and a characterizer network. At each time frame, the bodypart encoder transforms the input source motion into a feature vector (dubbed source feature) that structures human motion into six parts and captures sequential motion dependencies. Subsequently, given a target character, the NCM generates a corresponding feature of the target character (dubbed character feature) that shares the most similar context with the source motion. Lastly, the characterizer uses the source feature and the context-matched character feature to synthesize a characterized pose while preserving the source motion's context.
The bodypart encoder and characterizer are character-agnostic, allowing a single trained network to work for all characters.In contrast, the NCM is trained separately for each target character, so the number of NCMs increases linearly with the number of characters.

Bodypart Encoder
The bodypart encoder in Fig. 3 (a) consists of two components: body patch embedding, which reduces the spatial and temporal resolutions of the input motion while preserving the bodypart structure, and a transformer encoder, which uses a transformer-based structure to learn sequential motion dependencies.
Body patch embedding. Following the approach of flattening 2D patch inputs for the standard vision transformer [Dosovitskiy et al. 2020], we employ body patch embedding, which maintains the graph structure of the human skeleton as much as possible. Specifically, we use spatial-temporal graph convolutional (STGCN) blocks [Yan et al. 2018] to reduce the spatial (joint) and temporal (frame) resolutions. The STGCN blocks project an input motion into a sequential feature embedding E. We define the embedding process as follows:

E = STGCN(X^t) + P,

where P denotes a positional encoding with learnable parameters. E comprises a total of (T/4) × N_b patches with C channels, where N_b (= 6) is the number of bodyparts (head, spine, arms, and legs).
Transformer encoder. We capture spatial-temporal dependencies of body patches by using a transformer structure. We feed the embedding sequence E into the transformer encoder to generate a source feature F_s. Each layer of the encoder consists of a multi-head self-attention module (MSA) and a feed-forward network (FFN).
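A minimal numpy sketch of one such encoder layer follows. It is heavily simplified (single-head attention, no layer normalization, random untrained weights, and the toy dimensions are assumptions), but it illustrates the MSA + FFN structure applied over the patch sequence:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(E, Wq, Wk, Wv, W1, W2):
    """One simplified encoder layer: self-attention over the patch
    sequence, then a ReLU feed-forward network, each with a residual."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # patch-to-patch attention
    H = E + A @ V                                # residual after attention
    return H + np.maximum(H @ W1, 0.0) @ W2      # residual after FFN

rng = np.random.default_rng(1)
n_patch, C = 90, 32   # e.g. (T/4) * N_b = 15 * 6 patches of C channels
E = rng.normal(size=(n_patch, C))
Ws = [rng.normal(scale=0.1, size=(C, C)) for _ in range(5)]
F_s = encoder_layer(E, *Ws)
```

Stacking several such layers (with multi-head attention and normalization) yields the source feature F_s described above.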

Neural Context Matcher
After bodypart encoding, we search for a character feature that shares a similar context with the source feature. This character feature will later be used to imbue its style aspects into the source feature.
Inspired by the learned motion matching (LMM) approach [Holden et al. 2020], we train the neural context matcher (NCM) to generate the best matching character feature. Unlike LMM, which performs matching every few frames, the NCM runs at every frame to maximize responsiveness to the input motion, which is crucial for producing temporally continuous character features. To this end, we model the NCM using an autoregressive conditional variational autoencoder. The NCM implicitly models a distribution of possible next character features that match the current source context feature given the previous character feature. Samples are drawn from this distribution and passed through the NCM decoder to create a character feature for the next frame, one at a time in an autoregressive fashion (yellow part in Fig. 3 (a)).
A context feature C(F) is extracted from an encoded feature F via the context mapping network. The context space made by the context mapping is character-agnostic, capturing shared information on context across character domains. It enables context matching between different characters, which is a crucial step in characterizing. The context mapping is learned from unlabeled motion data in an unsupervised manner with a set of loss terms, as will be discussed in Sec. 6.5.
Prior Net. The distribution over the possible latent variable z^t ∈ R^d for the character feature F^t_ch is described by a learned prior [Rempe et al. 2021] conditioned on the previous character feature F^{t-1}_ch and the current source context feature C(F^t_s):

z^t ~ N( μ_θ(F^{t-1}_ch, C(F^t_s)), σ_θ(F^{t-1}_ch, C(F^t_s)) ),

which parameterizes a Gaussian distribution with diagonal covariance via a neural network.
NCM decoder. The character feature F^t_ch is predicted by the NCM decoder, which takes as input the latent variable z^t while being conditioned on the previous character feature F^{t-1}_ch and the current source context feature C(F^t_s):

F^t_ch = Dec(z^t; F^{t-1}_ch, C(F^t_s)).

We use a transformer-based C-VAE model for the NCM. In the training phase, both the NCM encoder and decoder are trained as detailed in Section 5.2, while only the decoder is used for inference.
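The autoregressive sampling loop can be sketched as follows. Note that prior_net and ncm_decoder below are hypothetical stand-ins (simple elementwise maps, not the paper's learned prior and transformer decoder); only the control flow mirrors the text: sample z^t from the conditioned prior, decode the next character feature, and feed it back.

```python
import numpy as np

rng = np.random.default_rng(2)
D, Z = 16, 8          # feature and latent dims (illustrative)

def prior_net(prev_ch, ctx):
    """Stand-in learned prior: maps the condition to (mu, log_sigma)."""
    h = np.tanh(np.concatenate([prev_ch, ctx]))
    return h[:Z], h[Z:2 * Z] * 0.1       # mu, log_sigma

def ncm_decoder(z, prev_ch, ctx):
    """Stand-in decoder producing the next character feature."""
    return np.tanh(np.concatenate([z, prev_ch, ctx]))[:D]

f_ch = np.zeros(D)                       # initial character feature
trajectory = []
for t in range(60):                      # one character feature per frame
    ctx = rng.normal(size=D)             # source context C(F_s^t)
    mu, log_sigma = prior_net(f_ch, ctx)
    z = mu + np.exp(log_sigma) * rng.normal(size=Z)  # sample latent z^t
    f_ch = ncm_decoder(z, f_ch, ctx)     # autoregressive update
    trajectory.append(f_ch)
trajectory = np.stack(trajectory)
```

Because each step is conditioned on the previous character feature, consecutive features stay temporally coherent, which is the property the text emphasizes.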

Characterizer
The characterizer transfers the style aspects (e.g., skeleton proportions, characteristic movements, etc.) of the character feature to the source feature. For this, the character transformer decoder with adaptive instance normalization (AdaIN) [Huang and Belongie 2017] and multi-head cross-attention generates a characterized decoded feature F_d (dubbed translated feature), which is then upsampled with De-STGCN blocks to obtain the final characterized motion.
Character transformer decoder. After context matching, we feed the source feature and character feature into the transformer decoder. The character transformer decoder generates a translated feature F_d = D(F_s, F_ch), which merges the context of the source motion and the character (style) aspects of the target character motion.
As shown in Fig. 3 (b), instead of employing a traditional transformer decoder block, we model our decoder with AdaIN, a multi-head cross-attention module (MCA) with the context mapping function, and an FFN layer. The AdaIN module transfers the global statistics of the character feature, as in [Jang et al. 2022], by taking F_s as input and injecting the character feature F_ch as:

F''_s = σ(F_ch) · (F_s − μ(F_s)) / σ(F_s) + μ(F_ch),

where μ and σ are the channel-wise mean and standard deviation, respectively. AdaIN scales the normalized F_s with a learned affine transformation whose scales and biases are generated by F_ch. In the second step, we feed the globally-stylized feature F''_s and the character feature F_ch to the MCA. We use the context feature C(F''_s) to generate the query Q, the character context feature C(F_ch) to generate the key K, and the character feature F_ch to generate the value V, where Q, K, V ∈ R^{N × d_h}. Then, the output sequence F_d of the transformer decoder is obtained by applying the MCA followed by the FFN.

Body patch expanding and output. The translated feature F_d ∈ R^{((T/4) × N_b) × C} is upsampled with additional De-STGCN blocks, which have an architecture symmetric to that of the STGCN, to yield the output translated motion Y^t_d. Finally, we pick the last frame pose y^t_d from Y^t_d as the final output.
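The AdaIN step can be sketched in a few lines. This simplified version uses the character feature's raw channel statistics as the scale and shift, whereas the paper generates the affine parameters with a learned layer:

```python
import numpy as np

def adain(f_s, f_ch, eps=1e-5):
    """Adaptive instance normalization sketch: normalize the source
    feature channel-wise, then rescale and shift it with the character
    feature's channel statistics."""
    mu_s, sig_s = f_s.mean(axis=0), f_s.std(axis=0) + eps
    mu_c, sig_c = f_ch.mean(axis=0), f_ch.std(axis=0) + eps
    return sig_c * (f_s - mu_s) / sig_s + mu_c

rng = np.random.default_rng(3)
f_s = rng.normal(loc=0.0, scale=1.0, size=(90, 32))   # source feature
f_ch = rng.normal(loc=2.0, scale=0.5, size=(90, 32))  # character feature
out = adain(f_s, f_ch)
```

After this global statistics transfer, the stylized feature is handed to the cross-attention module, which injects the finer, patch-level character details.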
Root motion. We maintain the root angular velocity of the source motion while scaling the linear velocity according to the ratio of the average hip velocities of the source motion X^t and the output motion Y^t_d. This strategy maintains the overall shape of the root trajectory of the source motion, while varying the global linear velocity to match the target character.
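A sketch of this root-motion strategy (the toy hip-velocity arrays are hypothetical inputs):

```python
import numpy as np

def retarget_root(src_ang_vel, src_lin_vel, src_hip_vel, out_hip_vel):
    """Keep the source root angular velocity; scale its linear velocity
    by the ratio of average hip speeds of the output vs. source motion."""
    scale = (np.linalg.norm(out_hip_vel, axis=-1).mean()
             / (np.linalg.norm(src_hip_vel, axis=-1).mean() + 1e-8))
    return src_ang_vel, src_lin_vel * scale

src_ang = np.array([0.0, 0.3, 0.0])          # root angular velocity
src_lin = np.array([1.0, 0.0, 0.0])          # root linear velocity
src_hip = np.tile([1.0, 0.0, 0.0], (10, 1))  # source hip velocities
out_hip = np.tile([0.5, 0.0, 0.0], (10, 1))  # slower character's hips
ang, lin = retarget_root(src_ang, src_lin, src_hip, out_hip)
```

Here the character's hips move half as fast as the source's, so the root linear velocity is halved while the turning rate is left untouched, preserving the trajectory's shape.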

TRAINING
Our training pipeline consists of two stages: the bodypart encoder and characterizer are trained first, followed by the NCM. In the first stage, our framework learns to extract encoded features from a motion and transfer the style aspects of one motion (e.g., motion style and skeleton proportions) to the other motion to synthesize a characterized motion. At this stage, the context mapping network is also trained as a part of the transformer decoder. The first-stage training is conducted in an unsupervised way. The second stage trains the NCM in a supervised manner to generate target character features corresponding to the input context features from the source.

Stage-1
Figure 4 shows the training architecture in stage-1. Without the NCM, we randomly choose a source motion X_s and a character motion X_ch to train the bodypart encoder and the characterizer jointly. In addition to using common loss terms for learning style transfer, such as identity and cyclic consistency losses, we newly introduce a contrastive loss to enhance local context preservation both spatially and temporally. To simplify the notation, we use E(·) and D(F_s, F_ch) to denote the bodypart encoder and characterizer, and omit the superscript t.
Identity and cycle loss. To design the identity and cycle consistency losses, we first define a reconstruction loss that computes the difference of two motions both in terms of X and Y [Holden et al. 2020], as well as their velocities:

L_rec(X_a, X_b) = w_x ‖X_a − X_b‖ + w_y ‖Y_a − Y_b‖ + w_ẋ ‖v(X_a) − v(X_b)‖ + w_ẏ ‖v(Y_a) − v(Y_b)‖,

where v(X) represents the rate of change of X over the time step h, and w_x, w_y, w_ẋ, and w_ẏ are the relative weights. Part of v(X) corresponds to the joint accelerations.
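A simplified version of such a reconstruction loss, using mean absolute differences of the features and of their finite-difference velocities (the specific weights and norm here are assumptions, not the paper's values):

```python
import numpy as np

def recon_loss(Xa, Xb, w_pos=1.0, w_vel=1.0, dt=1.0 / 60.0):
    """Reconstruction loss sketch: feature difference plus the
    difference of finite-difference velocities, each weighted."""
    vel = lambda X: (X[1:] - X[:-1]) / dt   # frame-to-frame velocity
    return (w_pos * np.abs(Xa - Xb).mean()
            + w_vel * np.abs(vel(Xa) - vel(Xb)).mean())

rng = np.random.default_rng(5)
Xa = rng.normal(size=(60, 24, 15))          # two toy motion tensors
Xb = rng.normal(size=(60, 24, 15))
```

Identical motions give zero loss, and the velocity term penalizes jitter that a pure positional difference would miss.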
The identity loss ensures that the input motion remains unchanged when it is used for both the source and character motions:

L_id = L_rec(X_s, X̃_s) + L_rec(X_ch, X̃_ch),

where X̃_s = D(F_s, F_s) and F_s = E(X_s). X̃_s is obtained by feeding X_s to both the X_s and X_ch inputs in Fig. 4. Likewise, X̃_ch = D(F_ch, F_ch) with F_ch = E(X_ch).
To guarantee that the resulting motion Y_d (hence X_d) preserves the context of the source motion X_s and the characteristics of the character motion X_ch, we employ a cycle consistency loss [Choi et al. 2020]:

L_cyc = L_rec(X_s, D(E(X_d), F_s)),

where X_d = D(F_s, F_ch) is the characterized motion.

Body patch contrastive loss. To ensure that the characterized output motion not only maintains the overall context C(F_s) ∈ R^{((T/4) × N_b) × C} of the source motion X_s, but also preserves the context of body patches at a specific location between source and output, we introduce a body patch contrastive loss. For example, in Figure 5, the context of a princess leg at frame t − i should be closer to that of the input zombie leg at frame t − i than the other patches of the same input.
To define the body patch-level context loss between C(F_s) and C(F_d), we use the infoNCE loss [Oord et al. 2018]:

ℓ(v, v⁺, v⁻) = −log [ exp(v·v⁺/τ) / (exp(v·v⁺/τ) + Σ_{v⁻} exp(v·v⁻/τ)) ],

where τ is the temperature parameter, and v⁺ and v⁻ denote the positive and negatives for v. We set pseudo positive samples between the body patch-level contexts of the source C(F_s) and the characterized motion C(F_d); for a body patch C(F_d)_n ∈ C(F_d), we set its positive patch C(F_s)_n as the patch at the same location in C(F_s), and its negative patches as all other patches of C(F_s). The loss is summed over all patches:

L_patch = Σ_{n=1}^{N} ℓ( C(F_d)_n, C(F_s)_n, C(F_s)_{m≠n} ),

where n ∈ {1, 2, ..., N} and N (= T/4 × N_b) is the number of body patches. Figure 5 illustrates how the body patch-level contrastive context loss works, with the procedure to define positive and negative samples.
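This body patch contrastive term can be sketched with cosine-similarity logits, where each output patch's positive is the source patch at the same index and all other source patches are negatives (the patch count and feature size are illustrative):

```python
import numpy as np

def patch_infonce(ctx_src, ctx_out, tau=0.07):
    """infoNCE over body patches: for each output patch, the positive is
    the source patch at the same (bodypart, time) location; all other
    source patches act as negatives."""
    s = ctx_src / np.linalg.norm(ctx_src, axis=1, keepdims=True)
    o = ctx_out / np.linalg.norm(ctx_out, axis=1, keepdims=True)
    logits = o @ s.T / tau                          # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # diagonal = positives

rng = np.random.default_rng(4)
ctx = rng.normal(size=(90, 32))
loss_same = patch_infonce(ctx, ctx)                 # matched contexts
loss_diff = patch_infonce(ctx, rng.normal(size=(90, 32)))
```

Matched patch contexts drive the loss toward zero, while unrelated ones leave it high, which is exactly the pressure that keeps left/right body parts from being swapped.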
The total loss function of stage-1 is thus:

L_total = λ_id L_id + λ_cyc L_cyc + λ_patch L_patch,

where λ_id, λ_cyc, and λ_patch are weights.

Stage-2
In stage-2, the NCM is trained to infer a target character feature F_ch that shares the same context with a source feature F_s. Briefly, this is achieved by providing the ground truth character feature obtained by searching a target feature database. To begin with, we roughly annotate the motion data with action labels (e.g., walk, jump, and crawl) to facilitate the nearest neighbor search; limiting the search to the relevant subset with the same action label saves time and enhances the accuracy of the results.
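The label-restricted nearest-neighbor search used to build these ground-truth pairs can be sketched as follows (the 2-D context features and labels are toy data; the real context features are high-dimensional):

```python
import numpy as np

def match_character_feature(src_ctx, db_ctx, db_labels, action):
    """Nearest-neighbor context search restricted to database entries
    that share the source motion's action label."""
    idx = np.flatnonzero(db_labels == action)   # candidate subset
    d = np.linalg.norm(db_ctx[idx] - src_ctx, axis=1)
    return idx[np.argmin(d)]                    # index of best match

# Toy database: 4 context features with action labels.
db_ctx = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]])
db_labels = np.array(['walk', 'jump', 'jump', 'walk'])
best = match_character_feature(np.array([1.0, 1.0]),
                               db_ctx, db_labels, 'jump')
```

Restricting candidates to the same action label both shrinks the search and avoids spurious matches to contextually unrelated motions.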
As the NCM operates autoregressively, training the NCM encoder and decoder is conducted with a sequence of source features {F^i_s}_{i=0}^{t} and its corresponding sequence of target features {F^i_ch}_{i=0}^{t} obtained with nearest neighbor search. These feature sequences represent continuous motions in the motion database. In training, the NCM transformer takes the previous character feature and the current source context feature as condition to reconstruct the context-matched target character feature. At the same time, it attempts to shape the latent variable z into a standard normal distribution N(0, I) using a KL-divergence term. This process is illustrated in Fig. 6. For details on the training procedure, please see Algorithm 1 in the supplementary material.
At run-time, the encoder is discarded and the decoder is used to predict matched character features. At each time frame, we pass C(F^t_s) through the decoder to predict a matched character feature F^t_ch (Sec. 4.2).

EVALUATION AND EXPERIMENTS
We conduct comparisons with other methods, perform an ablation study, and run additional experiments to demonstrate the utility of our framework. For visual animation results, please refer to the supplementary video.

Datasets
We constructed a high-quality character motion dataset with five professional actors. The dataset comprises a total of 6 characters (Clown, Ogre, Princess, Robot, Zombie, and AverageJoe) performing various actions (Dance, Fight, Jump, Crawling, Run, Walk, and Sit) with 5 emotion variations for each action (Angry, Happy, Neutral, Sad, and Scared). Every character preserves the same body proportions as the actors, except that we scale the forearm of the Ogre character to be twice as long to convey its style. The dataset contains a total of 573k frames, captured at 60 fps, resulting in approximately 159 minutes of data. Since different emotions are manifested with unique styles, we consider a character-emotion pair as an individual character, such as "Neutral Zombie" and "Happy Ogre". We additionally tested our algorithm on the Adult2child dataset [Dong et al. 2020] for quantitative evaluation. It consists of 17 subjects, including nine adults (older than 18 years) and eight children (5-10 years old), performing actions such as Jump as high as you can in place, Punch, Kick, Walk, and Hop Scotch. We treat each subject as a character due to their personalized styles. The Adult2child dataset is suitable for our evaluation since the skeleton proportions and motion styles vary depending on the subject.

Qualitative evaluation
First, we qualitatively compare our method with two baselines: Motion Puzzle [Jang et al. 2022] and MOCHA with the NCM replaced by nearest-neighbor search (Nearest Neighbor). Since Motion Puzzle only works for a single body proportion, we had to replace its reconstruction loss with ours to accommodate skeleton variations at training. Note that Motion Puzzle conducts an offline, full sequence-to-sequence translation, while MOCHA is an online translator.
Figure 7 compares the characterization results with Neutral AverageJoe as the source (1st row) among MOCHA, Nearest Neighbor, and Motion Puzzle. Unlike MOCHA, both Motion Puzzle and the Nearest Neighbor method require a specific reference motion as input. Therefore, we manually selected a reference from each target character performing the most similar action (2nd row) for them.
Compared to Motion Puzzle, MOCHA better preserves the unique style of each target character. Our results accurately capture the limping legs of the Neutral Zombie, the energetic waves of the Angry Clown, the elegantly crossed legs of the Princess, and the wildly flailing arms of the Happy Zombie. In contrast, Motion Puzzle dampens the target style in favor of the source motion.
Nearest Neighbor exhibits noticeable discontinuity in the output, due to the relatively sparse context space and the insufficient temporal features used in the search. It cannot respect the source motion context when a similar context is not available for the target character. Thanks to the autoregressive CVAE, our method using the NCM mitigates these issues to produce smooth and consistent output.

Quantitative evaluation
We quantitatively compare the degree of motion quality, context preservation, and style reflection with Motion Puzzle and the Nearest Neighbor method. Following prior work [Jang et al. 2022; Park et al. 2021], we use Frechet Motion Distance (FMD), context recognition accuracy (CRA), and style recognition accuracy (SRA) as metrics. We train the CRA classifier and the SRA classifier using a vision transformer [Dosovitskiy et al. 2020] for its superior accuracy. The two datasets, MOCHA and Adult2child, are each split into a 90% training set and a 10% test set. Results are detailed in Table 1. We did not apply CRA on the MOCHA dataset because it does not contain action labels. We also had to exclude Motion Puzzle from the Adult2child dataset comparison as it failed to train. When compared to Motion Puzzle, our method yields superior FMD and SRA scores, demonstrating higher motion quality and a more precise style reflection. A comparison with the Nearest Neighbor method underscores the significance of our NCM module, as we consistently achieve higher SRA scores due to superior temporal coherence. The better CRA score for Nearest Neighbor can be attributed to its direct usage of features from the database.

Ablation study
We conducted ablation studies to demonstrate the effectiveness of the AdaIN layer and the contrastive loss.For additional ablation studies, please refer to the supplementary material.
Effect of AdaIN. The AdaIN module facilitates successful characterization by incorporating a global style before the cross-attention layer in the characterizing transformer decoder block (Figure 3 (b)). To assess its impact, we trained a Characterizer without the AdaIN module. Results in Figure 8 show that the ablated model cannot capture the Ogre's arm-raising style.
Effect of contrastive loss. To examine the effect of the patch-level contrastive loss, we tested stage-1 training without the contrastive loss. Since the advantage of our contrastive loss is to preserve spatial-temporal context features across all characters, we randomly picked a source and a character motion to generate output. Figure 9 shows that the left and right legs of the output Ogre character are swapped (red circle) without the patch-wise contrastive loss, because the attention module alone may mismatch the left leg of the source with the right leg of the character. In contrast, our full model can distinguish the left and right sides of the body. Additionally, despite significant differences between the two motions, our method reasonably preserves the arm-swinging motion of the source (red arrow), which cannot be achieved with the ablated model.

Context space analysis
Successful context matching across different characters is possible thanks to the shared character-agnostic context space constructed by the context mapping module. To visualize its result, we randomly sampled a character motion and computed its context feature as a query, then searched another character's context database for the closest match. Figure 10 (a) shows semantically similar yet diverse postures across four different characters that match the query. In another experiment, given a source motion X_s, we searched for the nearest context feature C(F_ch) of a target character and generated a characterized motion Y_d. Subsequently, both the source and characterized motions were mapped to the context space, as shown in Figure 10 (b). One can see that all three context feature points (C(E(X_s)), C(F_ch), and C(E(Y_d))) are located very close together, showing that the learned motion context indeed shares the same space across characters.

Applications
Input from unseen subjects. We test the effectiveness of our framework on inputs from unseen subjects with varying heights. Figure 11 (a) shows that our framework works successfully even when motions from unseen subjects with different body proportions are given as input.
Sparse input. We hypothesized that our framework could work with sparse inputs from the hip and end-effectors, as they may contain the essential information about context and style. To verify this, we reduced the input joints to only six (hip, head, hands, and feet) and retrained the entire framework. As shown in Figure 11 (b), our framework successfully characterizes motion as Neutral Princess from sparse input. This experiment suggests that our framework can accommodate 6-point tracker input data for real-time applications.
Live characterization from streamed motion data. We demonstrate the ability of our framework to characterize streamed motion data in real time. Figure 11 (c) shows a snapshot of live characterization of a motion streamed from an Xsens Awinda sensor. The supplementary video shows that our method produces successfully characterized motion even with network delays and noisy input. The overall inference time is under 16 ms, achieving a frame rate of 60 Hz or higher on a 2080 Ti GPU.
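The real-time claim amounts to a per-frame latency budget: at 60 Hz each frame must be produced within 1000/60 ≈ 16.7 ms. A minimal budget check, assuming a measured per-frame inference time, could look like the following (the function name and interface are illustrative, not part of the paper's system):

```python
def meets_realtime_budget(inference_ms, target_hz=60.0):
    """True if a per-frame inference time (in milliseconds) fits inside
    the frame budget implied by the target frame rate."""
    frame_budget_ms = 1000.0 / target_hz
    return inference_ms <= frame_budget_ms
```

Under this check, the reported sub-16 ms inference time fits the 60 Hz budget, while anything slower than ~16.7 ms would force dropped or delayed frames.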

DISCUSSION AND CONCLUSION
We introduced MOCHA, a motion characterization framework that transforms user motions to embody the distinct styles of characters in real time, and demonstrated its effectiveness through a number of experiments and analyses.
Our method has several limitations that necessitate further exploration. First, our method is most effective when a character motion dataset contains a single characteristic style per context. If multiple motions with different styles share the same context for a target character, our NCM may have difficulty generating temporally consistent character features, leading to discontinuous motions. We observed this phenomenon when consolidating all emotion sets of a character into one, thereby allowing emotional style variations within the same motion. Future work is needed to enable such style variations within a single character.
Second, while our unsupervised learning of context mapping is effective without manual style labels, it is not flexible enough to encompass all the stylistic diversity of the character data. For instance, Angry Ogre motions in our database have a unique style of running on all fours and jumping sideways. Our algorithm currently cannot link them to the running and jumping motions of other characters, as shown in the supplementary video. More advanced methods for context learning are desirable to improve the characterization quality.

Figure 1:
Figure 1: Our characterization framework transforms neutral motions to express the distinct styles of characters in real time.

Figure 2:
Figure 2: Illustration of the motion representation at the current frame. We define the character's forward-facing direction at the current frame as the reference frame, denoted as the root frame.

Figure 3:
Figure 3: Network configuration. (a) Overall architecture for motion characterization at run time. Our framework consists of the body-part encoder, neural context matcher, and characterizer networks. (b) Detail of the characterizer transformer decoder block.

Figure 4:
Figure 4: Stage-1 training process. A source motion and a character motion are randomly selected from different characters in the motion dataset, and the networks are trained so that the characterized output preserves the context of the source motion while reflecting the characteristic aspects of the character motion.
The cycle reconstruction is obtained by feeding the source and character motions to the top and bottom inputs in Fig. 4 to get the characterized output, followed by feeding the characterized output and the source motion to the top and bottom inputs in Fig. 4 to get the reconstructed motion.

Figure 5:
Figure 5: Illustration of the body patch-level context contrastive loss. The blue box indicates a positive sample, while the yellow boxes denote negative samples.

Figure 6:
Figure 6: Training architecture of the Neural Context Matcher. Once trained, only the decoder is used for inference.

Figure 7:
Figure 7: Qualitative evaluation. Source motions of Neutral AverageJoe (top) are characterized with each method (rows 3-5). The reference motion is used for the Nearest Neighbor and Motion Puzzle methods.

Figure 8:
Figure 8: Ablation study on AdaIN. The result obtained without AdaIN fails to capture the jumping style of the Ogre.

Figure 9:
Figure 9: Contrastive loss test. The Ogre's left and right legs are mistakenly swapped (red circle) without the contrastive loss. Furthermore, the contrastive loss helps carry the left arm's swing phase of the source motion over to the characterized motion (red arrows).

Figure 10:
Figure 10: (a) The four best-matching motions in the entire dataset given the white character's motion (top: running, bottom: walking). (b) A t-SNE visualization of all context features in the test dataset, with colors differentiating characters. The three context feature points corresponding to a source, matched target, and output motion are located very close to each other.

Figure 11:
Figure 11: Applications of our framework. Characterization results from an unseen subject (left), with sparse input from six joints (middle), and from streamed motion data (right).

Figure 12:
Figure 12: Ablation test on the Prior Net. The result obtained without the learned prior distribution fails to generate the Neutral Clown sitting pose.

Figure 13:
Figure 13: Ablation test on the autoregressive scheme of the NCM. Our characterized output motion achieves higher temporal consistency and more natural transitions compared to what is obtained without autoregressively feeding the previous character feature.

Table 1:
Fréchet Motion Distance (FMD), context recognition accuracy (CRA), and style recognition accuracy (SRA) on samples generated by each method, as a quantitative comparison.