Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at https://github.com/mesnico/text-to-motion-retrieval.


INTRODUCTION AND RELATED WORK
Pose-estimation methods [12] can detect 3D human-body keypoints in a single RGB video stream. The keypoints detected in individual frames constitute a simplified spatio-temporal representation of human motion in the form of a so-called skeleton sequence. As indicated in [40], the analysis of such a representation opens unprecedented application potential in many domains, ranging from virtual reality, through robotics and security, to sports and medicine. The ever-increasing popularity of skeleton data calls for technologies able to effectively and efficiently access large volumes of such spatio-temporal data based on its content.
The trained architectures can serve as motion encoders that express the motion semantics by a high-dimensional feature vector extracted from the last hidden network layer. This concept can be transferred to the motion retrieval task to support content-based access based on the query-by-example paradigm [5,38,40], which aims at identifying the database motions that are the most similar to a user-defined query motion. Besides balancing descriptiveness and indexability of the motion features, the most critical issue is to specify a convenient query motion example. The example can be selected from available skeleton sequences [39], drawn in a visualization-driven graphical user interface [4], modeled by puppet interfaces [31], specified as a set of logical constraints [18], or artificially generated [10]. However, such a query example may not exist at all, or its construction may require professional modeling skills. This paper focuses on motion retrieval but simplifies query specification by enabling users to formulate a query by free text.
With the current advances in cross-modal learning, especially in the field of textual-visual processing, the trend is to learn common multi-modal spaces [28] so that similar images can be described and searched with textual descriptions [27]. A representative example is the CLIP model [36], which learns an effective common space for the visual and textual modalities. This allows the use of open vocabularies or complex textual queries for searching images.
Our work has many analogies with the text-to-video retrieval task [13,22,23,41,50], given that the moving skeleton also evolves in space and time. Despite the popularity of such powerful and versatile text-vision models, no effort has been made for the skeleton-data modality. Differently from video data, the skeleton is anonymized and avoids learning many common biases present in video datasets. To the best of our knowledge, there is only one approach [20] that relates to text-to-motion matching. However, it uses pre-training and tackles only the classification task. A few available datasets providing the training data for text-to-motion retrieval (e.g., the KIT Motion Language [35] and the recently-released HumanML3D [15] datasets) are primarily used for motion generation from a textual description [16,34,44,47,48], where the idea is to align text and motion embeddings into a common space, but without ever explicitly handling the text-to-motion retrieval task.

Contributions of this Paper
We tackle the above-mentioned gap by introducing a novel text-to-motion retrieval task, which aims at searching databases of skeleton sequences and retrieving those that are the most relevant to a detailed textual query. For this task, we define evaluation metrics, establish new qualitative baselines, and propose the first text-to-motion retrieval approach. These initial contributions can be employed for future studies on this challenging yet unexplored task.
Specifically, one of the main paper contributions is the proposal of a fair baseline by adopting promising (1) motion encoders already employed as backbones in other motion-related tasks and (2) text encoders successfully applied in natural language processing (NLP) and text-to-image retrieval. The core of this baseline is a two-stream pipeline where the motion and text modalities are processed by separate encoders. The obtained representations are then projected into the same common space, for which a metric is learned in a similar way as in CLIP [36] or ALADIN [30] in the text-to-image scenario. The choice of a two-stream pipeline is strategic to make the approach scalable to large motion collections, as feature vectors extracted from both modalities can be easily stored in off-the-shelf indexes implementing efficient similarity search access.
Inspired by recent advances in video processing [3], we also propose a transformer-based motion encoder, the Motion Transformer (MoT), that employs divided space-time attention on skeleton joints. We show that MoT reaches competitive results with respect to a state-of-the-art motion encoder, DG-STGCN [11], on both KIT Motion Language and HumanML3D datasets.
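The scalability argument for the two-stream design can be illustrated with a minimal sketch: motion embeddings are computed and normalized once, offline, and a text query only requires one encoder pass followed by a similarity search. The function names, the toy three-dimensional vectors, and the brute-force scan (standing in for an off-the-shelf similarity index) are all illustrative assumptions, not the released implementation.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def build_index(motion_embeddings):
    """Pre-compute normalized motion embeddings once, offline."""
    return {mid: l2_normalize(emb) for mid, emb in motion_embeddings.items()}

def search(index, text_embedding, k=3):
    """Rank indexed motions by cosine similarity to the query text embedding."""
    query = l2_normalize(text_embedding)
    scored = [(sum(a * b for a, b in zip(query, emb)), mid) for mid, emb in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mid for _, mid in scored[:k]]

# Toy 3-dimensional common space; the vectors are made up for illustration.
index = build_index({
    "walk_forward": [0.9, 0.1, 0.0],
    "jump": [0.0, 1.0, 0.2],
    "turn_left": [0.1, 0.0, 1.0],
})
top = search(index, [1.0, 0.2, 0.0], k=2)
```

Because the motion side never depends on the query, the brute-force scan in `search` could be replaced by an approximate nearest-neighbor index without touching either encoder.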

TEXT-TO-MOTION RETRIEVAL PIPELINE
The main idea of our approach is to rely on a two-stream pipeline, where motion and text features are first extracted through ad-hoc encoders and then projected into the same common space, as schematically illustrated in Figure 2. In this section, we sketch the whole pipeline, which consists of (i) the text encoder, (ii) the motion encoder, and (iii) the loss function used to optimize the common space.

Text Encoders
Inspired by recent works in NLP, we rely on two pre-trained textual models, namely BERT [19] and the textual encoder from CLIP [36].
BERT. We use the implementation from [14], which performed the task of motion synthesis conditioned on a natural language prompt. This model stacks together a BERT pre-trained module and an LSTM model composed of two layers for aggregating the BERT output tokens, producing the final text embedding. We take the final hidden state of the LSTM model as our final sentence representation. As in [14], the BERT model is fixed. At training time, we only update the LSTM weights.
CLIP. It is a recently-introduced vision-language model trained in a contrastive manner for projecting images and natural language descriptions in the same common space [36]. Here, we use the textual encoder of CLIP, which is composed of a transformer encoder [45] with modifications introduced in [37], and employs a lower-cased byte pair encoding (BPE) representation of the text. We then stack an affine projection on top of the CLIP representation, which, similarly to the BERT+LSTM case, is the only layer to be trained.

Motion Encoders
Differently from the textual pipeline, which takes as input an unstructured natural language sentence, the input to motion encoder models is a vector $x \in \mathbb{R}^{T \times J \times D}$, where $T$ is the time length of the motion, $J$ is the number of joints of the human-body skeleton, and $D$ is the number of features used to encode each joint.
Bidirectional GRU. This architecture is widely adopted in time-series processing, and an early variant that used LSTM was applied to frame-level action detection in continuous motion data [6]. In particular, we first increase the dimensionality of the input, which is $D = 9$ in our case, by using a two-layer feed-forward network (FFN) before feeding it into the GRU: $\overrightarrow{h}, \overleftarrow{h} = \mathrm{GRU}(\mathrm{FFN}(x))$. Then, we compute the final motion embedding by concatenating the representations $\overrightarrow{h}$ and $\overleftarrow{h}$.
Upper-Lower GRU. To better learn semantics of different body parts, we adopt the model in [14] to independently process the upper and lower parts of the skeleton using two GRU layers.
DG-STGCN. This architecture [11] recently reached state-of-the-art results in motion classification. Their GCN module features a spatial module, built of affinity matrices to capture dynamic graphical structures, and a temporal module that performs temporal aggregation using group-wise temporal convolutions. We refer the reader to the original formulation [11] for further details.
MoT. Our proposed architecture builds on the successful transformer-based video processing network ViViT [3]. In the original implementation, which processes a sequence of frames, the dimension $J$ is the number of grid-arranged rectangular patches from each frame. In our case, instead, the spatial features come from the joints. Instead of using all individual skeleton joints as $J$, we first aggregate them, obtaining features for five different body parts, similar to the pre-processing performed in Upper-Lower GRU. In this way, $J = 5$, which is far less than the total number of skeleton joints. This is beneficial from a computational point of view, and we found that this solution also reaches the best performance.
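To make the idea of divided space-time attention over body parts concrete, the following is a minimal, dependency-free sketch. It is not the MoT implementation: it assumes single-head attention with identity query/key/value projections, applies the temporal step before the spatial one, and omits residual connections, normalization, and feed-forward layers. The helper names and the toy input are hypothetical.

```python
import math

def self_attention(tokens):
    """Single-head self-attention over a list of equally sized feature vectors.

    Simplification: queries, keys and values are the tokens themselves
    (identity projections); a real encoder would use learned matrices.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[c] for w, value in zip(weights, tokens))
            for c in range(dim)
        ])
    return outputs

def divided_space_time_attention(x):
    """x is a nested list of shape [frames][body parts][features]."""
    num_frames, num_parts = len(x), len(x[0])
    # Temporal step: each body part attends over its own trajectory in time.
    after_time = [[None] * num_parts for _ in range(num_frames)]
    for p in range(num_parts):
        track = self_attention([x[t][p] for t in range(num_frames)])
        for t in range(num_frames):
            after_time[t][p] = track[t]
    # Spatial step: within each frame, body parts attend to one another.
    return [self_attention(frame) for frame in after_time]

# Toy input: 4 frames, 5 body parts, 3 features per part (values are made up).
motion = [[[0.1 * t, 0.2 * p, 1.0] for p in range(5)] for t in range(4)]
encoded = divided_space_time_attention(motion)
```

With T frames and P spatial tokens, the divided scheme compares on the order of T·P·(T + P) token pairs instead of the (T·P)² pairs required by joint attention, which is why keeping the spatial dimension at five body parts is computationally convenient.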

Optimization
We explore two widely-adopted metric learning loss functions, namely the symmetric triplet loss widely used in text-to-image retrieval [26] and the InfoNCE loss, introduced for cross-modal matching in [49] and employed in CLIP [36] and recent cross-modal works [23]. We assume $(m_i, c_i)$ is the $i$-th motion and caption embedding pair, $S(\cdot, \cdot)$ is the cosine similarity, and $B$ is the batch size.
The symmetric triplet loss is defined as:
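The formulas below are a hedged sketch of the commonly used forms of the two losses, not necessarily the exact equations adopted here. They write $(m_i, c_i)$ for the $i$-th motion and caption embedding pair, $S(\cdot, \cdot)$ for the cosine similarity, and $B$ for the batch size. The margin $\alpha$, the temperature $\tau$, the use of the hardest in-batch negative in the triplet loss (a sum over all negatives is an equally common variant), and the averaging over the batch are assumptions.

```latex
% Assumed hardest-negative form with margin \alpha, where [z]_+ = \max(0, z)
\mathcal{L}_{\mathrm{triplet}} = \sum_{i=1}^{B} \Big(
    \max_{j \neq i} \big[ \alpha + S(m_i, c_j) - S(m_i, c_i) \big]_+
  + \max_{j \neq i} \big[ \alpha + S(m_j, c_i) - S(m_i, c_i) \big]_+ \Big)

% Assumed symmetric InfoNCE form with temperature \tau
\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{B} \sum_{i=1}^{B} \Big(
    \log \frac{\exp\big(S(m_i, c_i)/\tau\big)}{\sum_{j=1}^{B} \exp\big(S(m_i, c_j)/\tau\big)}
  + \log \frac{\exp\big(S(m_i, c_i)/\tau\big)}{\sum_{j=1}^{B} \exp\big(S(m_j, c_i)/\tau\big)} \Big)
```

In both cases the first term treats the motion as the anchor and contrasts it against the captions in the batch, while the second term does the converse, which is what makes the losses symmetric.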

EXPERIMENTAL EVALUATION

Metrics
Exact-search. Exact-search metrics leverage the intrinsic ground truth available in the employed datasets, where motions come with one (or more) textual descriptions. We can consider motions associated with the given textual query as the exact solutions, while all the other ones as irrelevant by default. In this context, the recall@k measures the percentage of queries that find the correct result within the first k elements in the results list, while the median and mean ranks represent the median and mean rank of the exact result computed among all the queries.
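The exact-search metrics can be sketched in a few lines of Python. This is an illustrative computation under stated assumptions: each query comes with a complete ranked list of motion identifiers, the rank of the exact result is taken as the position of the first motion paired with the query, and the toy identifiers are made up.

```python
from statistics import mean, median

def exact_ranks(results, ground_truth):
    """1-based rank of the first exact motion for each query.

    results: query id -> ranked list of motion ids (assumed to cover all motions).
    ground_truth: query id -> set of motion ids paired with that text.
    """
    ranks = []
    for query, ranked in results.items():
        relevant = ground_truth[query]
        rank = next(pos for pos, motion in enumerate(ranked, start=1) if motion in relevant)
        ranks.append(rank)
    return ranks

def recall_at_k(ranks, k):
    """Percentage of queries whose exact result appears within the first k positions."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

results = {
    "q1": ["m3", "m1", "m2", "m4"],
    "q2": ["m2", "m4", "m1", "m3"],
    "q3": ["m1", "m2", "m4", "m3"],
    "q4": ["m1", "m2", "m3", "m4"],
}
ground_truth = {"q1": {"m1"}, "q2": {"m2"}, "q3": {"m3"}, "q4": {"m4"}}

ranks = exact_ranks(results, ground_truth)
summary = {
    "R@1": recall_at_k(ranks, 1),
    "R@2": recall_at_k(ranks, 2),
    "median_rank": median(ranks),
    "mean_rank": mean(ranks),
}
```

Higher recall@k is better, whereas lower median and mean ranks are better.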
Relevance-based. There can exist motions relevant to a certain extent to the given textual query that are not paired in the dataset. In this context, the normalized Discounted Cumulative Gain (nDCG) metric is widely employed. The DCG takes into consideration the relevance a specific item has with the query, discounting it with a logarithmic factor that depends on the rank of that item: $\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{\mathrm{rel}_i}{\log_2(i+1)}$, where $\mathrm{rel}_i$ is the relevance of the item at rank $i$ and $p$ is the length of the results list. The nDCG normalizes DCG by its maximum theoretical value and thus returns values in the [0, 1] range. We define the relevance similarly to previous works in image-to-text retrieval [7,26,29], which use a proxy relevance between textual descriptions that is much easier to compute. In this work, we use two textual relevance functions: (i) the SPICE relevance [2], a handcrafted relevance that exploits graphs associated with the syntactic parse trees of the sentences and has a certain degree of robustness against synonyms; and (ii) the spaCy relevance obtained from the spaCy Python tool, which implements a deep learning-powered similarity score for pairs of texts.
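The DCG and nDCG computation described above can be sketched as follows. The relevance values are made-up placeholders for whatever the textual relevance function returns, and the normalization assumes that the maximum theoretical DCG is obtained by re-sorting the same list of relevances in descending order; an evaluation over a full collection may compute the ideal ordering differently.

```python
import math

def dcg(relevances):
    """DCG of a ranked list: the relevance at rank i is discounted by log2(i + 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize DCG by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance of the 1st, 2nd and 3rd retrieved motions to the query (toy values).
ranked_relevances = [0.2, 1.0, 0.5]
score = ndcg(ranked_relevances)
```

A list that is already sorted by decreasing relevance yields an nDCG of 1, and placing highly relevant motions further down the list lowers the score.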

Datasets and Evaluation Protocol
We employ two recently introduced datasets, HumanML3D [15] and KIT Motion Language [35]. Both datasets carry one or more human-written descriptions for each motion. We employ the same pre-processing pipeline for both datasets, namely the one developed in the codebase of the HumanML3D dataset [15]. We employ $D = 9$ features to represent each joint: six features encoding continuous rotation representation plus three features encoding rotation-invariant forward kinematics joint positions.
KIT Motion-Language Dataset contains 3,911 recordings of full-body motion in the Master Motor Map form [43], along with textual descriptions for each motion. It has a total of 6,278 annotations in English, where each motion recording has one or more annotations that explain the action, like "A human walks two steps forwards, pivots 180 degrees, and walks two steps back".
HumanML3D is, in its essence, very similar to KIT Motion Language Dataset. However, it is a more recent dataset developed by adding textual annotations to already-existing and widely-used motion-capture datasets: AMASS [25] and HumanAct12 [17]. It contains 14,616 motions annotated by 44,970 textual descriptions.
The results are reported on the test set of the respective datasets after removing possibly redundant queries. In particular, we use 938 and 8,401 textual queries to search among 734 and 4,198 motions for the KIT and HumanML3D datasets, respectively. For HumanML3D, these motions are obtained by splitting the originally provided ones using the available segment annotations associating a motion subsequence with the text that describes it. In this sense, HumanML3D enables a finer retrieval, as texts are more likely to describe the correct subsequence instead of the whole motion.

Results
We report text-to-motion retrieval results in Table 1, obtained with the InfoNCE loss (see Section 3.3.1, the ablation study, for a comparison of loss functions). The best results are competitively achieved by both DG-STGCN and our transformer-based MoT. The first remarkable insight is the superiority of CLIP over BERT+LSTM on all the metrics in both datasets. With CLIP, the effectiveness of DG-STGCN and MoT over GRU-based methods is evident, especially on the KIT dataset, where the mean rank is almost 30% lower. The nDCG metric, through the highly-semantic text-based relevance scores, confirms the trend of the recall@k values, suggesting that the CLIP model paired with GCNs and Transformers can retrieve both exact and relevant results in earlier positions in the results list. Notably, from an absolute perspective, all the methods reach overall low performance on exact search, confirming the difficulty of the introduced text-to-motion retrieval task. This may be due to (i) some intrinsic limitations that are hard to eliminate (e.g., textual descriptions are written by annotators possibly looking at the original video, which the network has no access to) or (ii) difficulties in capturing high-level semantics in motion or text data. In Figure 1, we report two qualitative examples of text-to-motion retrieval using CLIP + MoT on HumanML3D. We can notice the potential of such a natural-language-based approach to motion retrieval. Specifically, note how the approach is sensitive to asymmetries: in the first case, where the counterclockwise adjective is specified in the query, only the correctly-oriented motions are returned in the first positions; in the second case, where no right or left is specified, both the original and mirrored motions are returned (e.g., the 1st and 2nd results).

Ablation Study on Loss Function and Space Dimensionality
In Figure 3, we report performance when varying the dimensionality of the common space, for the two motion models DG-STGCN and MoT employing the CLIP text model. We can notice how, on both metrics in Figure 3a/3b, the effectiveness remains quite high even for very small dimensions of the common space, with a negligible improvement after 256 dimensions. Specifically, with only 16 dimensions instead of 256, the performance drops by only about 6% on nDCG with SPICE relevance and by 15% on average on Recall@10, considering both motion encoders. This suggests that the intrinsic dimensionality of the learned space is quite small, opening the way for further studies and feature visualization in future works.
In Figure 4, we also report the remarkable performance gain achieved by the InfoNCE loss over the standard symmetric triplet loss. We can see how the InfoNCE loss induces the best results in basically all the configurations, confirming its power even in the under-explored text-motion joint domain. Breaking down the contributions of this variation on the text and motion models in Figures 4a and 4b respectively, we notice how the best gains are achieved by using the CLIP textual model and the MoT motion model.

CONCLUSIONS
In this paper, we introduced the task of text-to-motion retrieval as an alternative to the query-by-example search, and inherently different from searching with a query label from a fixed pool of labels. We employed two state-of-the-art text-encoder networks, as well as widely adopted motion-encoder networks, for learning a common space and producing the first baselines for this novel task. We demonstrated that the CLIP text encoder also works best for encoding domain-specific natural sentences inherently different from image-descriptive ones, and that Transformers and GCNs obtain better motion representations than GRU-based encoders. In future works, we plan to train the models jointly on the two datasets and perform some cross-dataset evaluation to measure their generalization abilities and robustness. Other improvements include the use of the video modality in addition to the motion and some unsupervised pre-training methods for boosting performance.

Figure 2: Schematic illustration of the learning process of the common space of both the text and motion modalities.

Table 1: Text-to-motion retrieval results on both the KIT Motion Language Dataset and HumanML3D Dataset. We report the best and the second-best results with bold and underlined font, respectively.