DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation

When hearing music, it is natural for people to dance to its rhythm. Automatic dance generation, however, is a challenging task due to the physical constraints of human motion and rhythmic alignment with target music. Conventional autoregressive methods introduce compounding errors during sampling and struggle to capture the long-term structure of dance sequences. To address these limitations, we present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation. This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model. To bridge the gap between music and motion for conditional generation, DiffDance employs a pretrained audio representation learning model to extract music embeddings and further align its embedding space to motion via contrastive loss. During training our cascaded diffusion model, we also incorporate multiple geometric losses to constrain the model outputs to be physically plausible and add a dynamic loss weight that adaptively changes over diffusion timesteps to facilitate sample diversity. Through comprehensive experiments performed on the benchmark dataset AIST++, we demonstrate that DiffDance is capable of generating realistic dance sequences that align effectively with the input music. These results are comparable to those achieved by state-of-the-art autoregressive methods.


INTRODUCTION
See the music, hear the dance.
George Balanchine
Dance serves as a universal language that transcends not only cultural boundaries but also those between species. In recent times, dance videos have emerged as the most popular video category on social media platforms such as TikTok and YouTube. Dance is inherently intertwined with music, as individuals naturally move to the rhythm. However, creating a satisfactory dance is challenging, as it requires elegant movements that synchronize with the music on both a global style and local rhythm level, a feat that may take professional dancers years of practice. Consequently, the task of automatic dance generation from music has garnered significant interest from the deep learning community in recent years.
Existing works [14,20,21,34] primarily formulate music-to-dance generation as a sequence-to-sequence generation task that autoregressively generates dance sequences. However, these approaches, trained with a teacher-forcing strategy, are susceptible to compounding errors introduced by autoregressive generation, which becomes problematic for generating long sequences. Besides, conventional methods rely on handcrafted spectrogram features as music conditions, including MFCC, onset strength, constant-Q chromagram, etc. These features may lack a deep understanding of the music-dance relationship and could be suboptimal for music-to-dance generation.
Diffusion models, a newly developed class of non-autoregressive generative models, achieve impressive results in various tasks [5,18,41]. For conditional synthesis, diffusion models also demonstrate a strong capacity to generate diverse and realistic samples [27,31]. Recent work [38] proposes the Motion Diffusion Model (MDM), which achieves state-of-the-art results in text-to-motion and action-to-motion. However, we argue that directly applying existing motion diffusion models to dance generation is problematic since they are designed for motion with low temporal resolution and short sequence length. In contrast, dance sequences are usually much longer and more complex than general human motion, exhibiting global structures such as symmetrical repetitive movements. Therefore, these models struggle to model extremely long sequences and fail to produce realistic dance with long-term structure.
In this work, we aim to generate high temporal resolution dance sequences aligned with input music by leveraging diffusion models. To this end, we propose a novel cascaded human motion diffusion model framework named DiffDance. Specifically, DiffDance is a two-stage method that contains a music-to-dance diffusion model, which first generates low-resolution dance sequences, and a sequence super-resolution model, which upscales the low-resolution sequence by filling in intermediate motion between input motions. To enable conditional generation, we use Wav2CLIP [39] to map input music into learned embeddings instead of conventional handcrafted features. Models in both stages are conditioned on the learned embeddings and use classifier-free guidance [13] to improve sample quality. Since the Wav2CLIP audio encoder shares an embedding space solely with images and text, we align its embedding space to motion by freezing the motion encoder in MotionCLIP [37] and fine-tuning our audio encoder using paired data in AIST++ [21]. After fine-tuning, the audio encoder can produce latent representations aligned with motion semantics, thus further boosting music-to-dance performance. Lastly, we incorporate multiple geometric losses during training, as derived from existing motion generation literature [17,29,32], introducing key joint position and rotation regularization losses to prevent unnatural artifacts such as foot sliding and instant rotation. We further add a dynamic loss weight to encourage model sampling at large timesteps and the correction of unnatural motion at small timesteps. As illustrated in Fig. 1, our DiffDance can generate diverse, realistic, and coherent dance sequences guided by various music inputs.
Our contributions can be summarized as follows:
• We propose a cascaded motion diffusion model that generates long-form, high-resolution dance sequences;
• We align the CLIP embedding space of music and dance for better feature representation and demonstrate the effectiveness of classifier-free guidance in music-to-dance generation;
• We incorporate a variety of geometric losses and a dynamic loss weight schedule to produce realistic samples while maintaining diversity;
• Extensive experiments demonstrate our proposed DiffDance surpasses the state-of-the-art model in terms of dance quality and music-dance correspondence.

RELATED WORK

Music to Dance Generation
Generating dance sequences from music, which aims to produce realistic choreographed movements aligned with input music, is a challenging task built on motion synthesis [1,3,6,8]. Early works [19,33,44] favored the generation of 2D dance sequences from music, predominantly due to the abundant data from online dance videos. Recently, AIST++ [21], a large-scale 3D motion dataset, greatly pushed the development of 3D dance generation. Various works explore this task leveraging different network architectures, such as LSTMs [40], Transformers [14,21,34], and GANs [16]. Among them, works with transformer architecture achieve state-of-the-art results, highlighting the superiority of Transformers in sequence modeling. For instance, FACT [21] introduced a full-attention cross-modal transformer that generates high-quality 3D dance sequences in an autoregressive manner. Bailando [34] designed a two-stage method consisting of a pose VQ-VAE and a motion GPT. The motion GPT is fine-tuned via actor-critic learning to realize temporal coherency. MNET [16] proposed a conditional GAN framework including a Transformer Generator and Discriminator to produce dance motions conditioned on multiple music genres. In contrast, our method leverages a cascaded diffusion model framework to directly generate whole dance sequences, avoiding the compounding errors introduced by autoregressive generation.
It is also worth mentioning that existing methods largely overlook the aspect of music representation, typically directly using handcrafted music features extracted by Librosa [25] as conditional music features, such as MFCC, onset strength, and the constant-Q chromagram. However, recent advancements in text-to-image and text-to-video generation [10,31] have demonstrated that using conditional features extracted from large-scale representation learning models can markedly enhance cross-modal generation performance. Motivated by this finding, our model deviates from traditional handcrafted music features and instead employs the Wav2CLIP [39] audio encoder, a robust audio representation learning method, for music-driven dance generation, resulting in improved dance generation performance.

Diffusion Models
Diffusion models [11,35,36] are a class of likelihood-based generative models that learn to recover samples from random noise via a denoising process. They have achieved great success in generating high-fidelity images [5], and demonstrate strong abilities to generalize to other domains such as audio [15] and language [22]. For conditional generation, existing models often use classifier guidance [5] or classifier-free guidance [13] to improve sample synthesis quality. Recently, some seminal works [38,42] adapt diffusion models to motion synthesis and achieve impressive results. Specifically, MDM [38] proposed a transformer-based diffusion model that leverages classifier-free guidance to solve text-to-motion synthesis. However, dance sequences are much more difficult for diffusion models to synthesize since they have longer sequence lengths and more complex movements. In the vision domain, cascaded diffusion models [10,12] are effective methods that can generate high-resolution samples while keeping each sub-network relatively simple. Inspired by that, we solve music-to-dance generation via a cascaded motion diffusion model. Besides, we add multiple geometric losses during training to ensure that the model output is physically plausible.

METHOD
Our aim is to generate dance sequences $x^{1:N} = \{x^i\}_{i=1}^{N}$ with length $N$ given the music condition $c$. For 3D dance generation, the dance sequence is represented in $D$-dimensional features of $J$ body joints, resulting in a motion representation $x^i \in \mathbb{R}^{J \times D}$. In the following, we first briefly discuss the preliminary knowledge of diffusion models in Sec. 3.1. Then, we introduce our proposed DiffDance, a cascaded motion diffusion model trained with multiple geometric losses, in Sec. 3.2. Finally, in Sec. 3.3, we discuss aligning the music embedding space to motion and using classifier-free guidance for music-conditional generation.

Preliminaries of Diffusion Models
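Diffusion models define a fixed Markovian process by gradually adding noise to sample data $x_0 \sim q(x_0)$. The forward diffusion process is defined as:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \mathbf{I}\right),$$
where $t \in [1, T]$, $\beta_t$ is a pre-defined variance schedule that controls the rate of noise injection, and $q(x_t \mid x_{t-1})$ is a Gaussian transition kernel. After $T$ steps, the amount of noise becomes sufficiently large, and the Markovian chain approximately converges to a standard Gaussian distribution $\mathcal{N}(0, \mathbf{I})$.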
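As a minimal sketch of the forward process described above, the snippet below noises a toy feature vector using the closed-form marginal $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t)\mathbf{I})$ from DDPM [11], with $\bar{\alpha}_t = \prod_{s \le t}(1 - \beta_s)$. The linear schedule endpoints, the choice of $T = 1000$, and all function names are illustrative assumptions and are not taken from this paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (assumed schedule, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s <= t} (1 - beta_s) for every t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    scale, sigma = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


T = 1000  # hypothetical number of diffusion steps
alpha_bars = cumulative_alphas(linear_beta_schedule(T))
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]                   # toy pose features
x_T = q_sample(x0, T, alpha_bars, rng)   # almost pure Gaussian noise
```

Under this assumed schedule, $\bar{\alpha}_T$ is close to zero, which is the sense in which the chain approximately converges to a standard Gaussian.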
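In order to convert noise back to a sample for generation, diffusion models learn a reverse process via:
$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right),$$
where $\theta$ parameterizes a neural network that predicts the mean of the Gaussian. In practice, we simply set $\Sigma_\theta(x_t, t) = \sigma_t^2 \mathbf{I}$ following [11]. Diffusion models are trained by minimizing the variational upper bound on the negative log-likelihood of data. Typically, we use a simplified version with an $L_2$ loss [11]:
$$\mathcal{L}_{simple} = \mathbb{E}_{x_0, \epsilon, t}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|_2^2\right]. \tag{5}$$
Once the diffusion model is trained, we can generate a new sample $x_0$ by iteratively running the reverse diffusion process from timestep $T$ to 0.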

Cascaded Motion Diffusion Model
In this section, we formulate our DiffDance, a cascaded motion diffusion model for music-to-dance generation. Fig. 2 summarizes the cascaded pipeline of our DiffDance.
Framework. DiffDance consists of a Music-to-Dance (M2D) diffusion model and a Sequence Super-Resolution (SSR) diffusion model. Specifically, the M2D model is similar to MDM [38], a transformer-based diffusion model as illustrated in Fig. 2. Different from conventional diffusion models that predict the noise $\epsilon_t$, our model $G_\theta(x_t, t, c)$ directly predicts the original data point $x_0$, given the noised data $x_t$, timestep $t$, and music representation $c$. Both the timestep $t$ and music representation $c$ are projected into the dimension of the transformer. Next, we sum these embeddings and concatenate the result with the noised input $x_t$ to guide the generation of our model.
For the SSR model, we employ conditioning augmentation [12]. Conditioning augmentation has proven to be an effective strategy to significantly improve the sample quality and model robustness in cascaded generation pipelines [12,31]. In practice, we add Gaussian noise corresponding to a random diffusion timestep $t_{aug}$ to corrupt the input motion during training. We also supply $t_{aug}$, in addition to the original diffusion timestep $t$, as conditional information. At inference time, we sweep over all possible values of $t_{aug}$ and fix the one that yields the best quality. Adding some noise to the inputs can eliminate unnatural artifacts in generated low-resolution dance sequences, thus bridging the gap between the ground-truth distribution during training and the model output distribution at inference time.
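The conditioning augmentation step can be illustrated with a short sketch: the low-resolution conditioning sequence is corrupted with Gaussian noise at a chosen augmentation timestep, sampled at random during training and fixed at inference. The linear noise schedule, the 100-step horizon, and the helper names below are assumptions made for illustration; the paper does not specify these details here.

```python
import math
import random


def alpha_bar_linear(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta) over the first t steps of an assumed linear schedule."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod


def augment_condition(low_res, aug_t, num_steps, rng):
    """Corrupt the low-resolution conditioning sequence at noise level aug_t.

    aug_t = 0 returns the values unchanged, i.e. no conditioning augmentation.
    """
    a_bar = alpha_bar_linear(aug_t, num_steps)
    scale, sigma = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [[scale * v + sigma * rng.gauss(0.0, 1.0) for v in frame] for frame in low_res]


rng = random.Random(0)
num_steps = 100                                  # hypothetical number of SSR diffusion steps
low_res = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]   # toy sequence: 3 frames, 2 features

# Training: draw a random augmentation level for each example.
train_aug_t = rng.randint(0, num_steps)
noisy_train = augment_condition(low_res, train_aug_t, num_steps, rng)

# Inference: use a single fixed level selected by a sweep.
noisy_infer = augment_condition(low_res, 30, num_steps, rng)
```

The augmentation level would also be passed to the SSR network as an extra conditioning signal, which this sketch omits.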
Training Objectives. Generating physically plausible motion sequences is challenging using only the original denoising objective in Equation 5. MDM adds a set of geometric losses [29,32] to regularize the training process. These losses, denoted as $\mathcal{L}_{ajp}$, regularize the positions and position velocities of all joints equally. However, we observe undesirable motions, such as instant movements and rotations, generated by our vanilla model trained with $\mathcal{L}_{ajp}$, which are unreasonable for dance choreography. To produce fluent and natural dance sequences, we further add $\mathcal{L}_{kjp}$ to regularize key joints such as 'hand' and 'foot'. This regularization also covers both positions and position velocities, and is formulated as:
$$\mathcal{L}_{kjp} = \sum_{j \in \{f, h\}} \left( \left\|\hat{\mathbf{p}}_j - \mathbf{p}_j\right\|_2^2 + \left\|\hat{\mathbf{v}}^{p}_j - \mathbf{v}^{p}_j\right\|_2^2 \right),$$
where $f$ and $h$ denote the foot and hand key joints respectively, and $\mathcal{L}_{kjp}$ constrains the difference between predicted key joint positions $\hat{\mathbf{p}}$ and ground truth $\mathbf{p}$, as well as between predicted position velocities $\hat{\mathbf{v}}^{p}$ and ground truth $\mathbf{v}^{p}$. To account for the high complexity of motion sequences, we compute joint linear velocities relative to the velocity of the 'root' node. Therefore, our position loss excludes the 'root' joint's linear velocity.
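A key-joint position loss of the kind described above can be sketched as follows. The snippet assumes a mean squared error over the selected joints, finite-difference velocities between consecutive frames, and hypothetical joint indices; the exact norm, normalization, and root-relative velocity handling used by the authors are not reproduced here.

```python
def finite_difference(seq):
    """Per-frame velocities: seq[i + 1] - seq[i] for every coordinate."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(seq, seq[1:])]


def mean_squared_error(pred, target):
    """Mean of squared coordinate differences over all frames."""
    total, count = 0.0, 0
    for frame_pred, frame_gt in zip(pred, target):
        for a, b in zip(frame_pred, frame_gt):
            total += (a - b) ** 2
            count += 1
    return total / count


def key_joint_position_loss(pred, target, key_joints):
    """Position term plus velocity term, restricted to the given joint indices.

    pred / target: list of frames, each a list of joints, each an (x, y, z) list.
    """
    def select(seq):
        # Flatten the coordinates of the key joints for every frame.
        return [[c for j in key_joints for c in frame[j]] for frame in seq]

    p_pred, p_gt = select(pred), select(target)
    position_term = mean_squared_error(p_pred, p_gt)
    velocity_term = mean_squared_error(finite_difference(p_pred), finite_difference(p_gt))
    return position_term + velocity_term


# Toy data: 3 frames, 3 joints (index 0 = root, 1 = hand, 2 = foot; indices are hypothetical).
target = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]] for _ in range(3)]
pred = [[list(joint) for joint in frame] for frame in target]
for frame in pred:
    frame[1][0] += 1.0  # constant offset on the hand joint

loss = key_joint_position_loss(pred, target, key_joints=[1, 2])
```

In this toy case the constant offset is penalized only by the position term, since a constant shift leaves the finite-difference velocities unchanged.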
Besides, we propose to regularize the rotations of key joints explicitly. Note that the 'root' joint is also regularized as a key joint for barycentric motion consistency. The rotation loss is formulated as:
$$\mathcal{L}_{kjr} = \sum_{j \in \{f, h, r\}} \left( \left\|\hat{\mathbf{r}}_j - \mathbf{r}_j\right\|_2^2 + \left\|\hat{\mathbf{v}}^{r}_j - \mathbf{v}^{r}_j\right\|_2^2 \right),$$
where $r$ denotes the 'root' key joint, and $\mathcal{L}_{kjr}$ constrains the difference between predicted key joint rotations $\hat{\mathbf{r}}$ and ground truth $\mathbf{r}$, as well as between predicted rotation velocities $\hat{\mathbf{v}}^{r}$ and ground truth $\mathbf{v}^{r}$. We also find it unsuitable to apply these losses uniformly across all diffusion timesteps. As described in the analysis in [4], the backward diffusion process can be roughly divided into a generation stage and a denoising stage, corresponding to large and small timesteps, respectively. When $t$ approaches the total timestep $T$, the noise becomes sufficiently large, corrupting the input dance into a noised version that has lost almost all of its geometric information. Intuitively, applying geometric losses at the generation stage will not impose physical constraints on samples but may instead impair sample quality and diversity. To address this, we introduce a simple dynamic loss decay weight $w_t = 1 - \gamma \frac{t}{T}$, which linearly decreases as the diffusion timestep $t$ increases. The overall training loss can be expressed as:
$$\mathcal{L} = \mathcal{L}_{simple} + w_t \left( \lambda_1 \mathcal{L}_{ajp} + \lambda_2 \mathcal{L}_{kjp} + \lambda_3 \mathcal{L}_{kjr} \right),$$
where $\gamma$ is the decay coefficient, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the hyper-parameter loss weights for the all-joints position loss $\mathcal{L}_{ajp}$, key-joints position loss $\mathcal{L}_{kjp}$ and key-joints rotation loss $\mathcal{L}_{kjr}$ respectively.
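To make the timestep-dependent weighting concrete, the sketch below combines a denoising loss with three geometric loss terms under a linearly decaying weight. The specific form $w_t = 1 - \gamma\, t / T$, the way the weight multiplies the sum of geometric terms, and all names are assumptions inferred from the description; they should be read as one plausible instantiation rather than the authors' exact formula.

```python
def dynamic_weight(t, total_steps, decay=0.1):
    """Assumed linear decay: equals 1 at t = 0 and 1 - decay at t = total_steps."""
    return 1.0 - decay * t / total_steps


def total_loss(simple, all_pos, key_pos, key_rot, t, total_steps,
               weights=(1.0, 1.0, 1.0), decay=0.1):
    """Denoising loss plus the timestep-weighted sum of geometric losses."""
    lam1, lam2, lam3 = weights
    geometric = lam1 * all_pos + lam2 * key_pos + lam3 * key_rot
    return simple + dynamic_weight(t, total_steps, decay) * geometric


# The geometric terms count slightly less at large (noisy) timesteps.
early = total_loss(1.0, 0.5, 0.5, 0.5, t=0, total_steps=1000)
late = total_loss(1.0, 0.5, 0.5, 0.5, t=1000, total_steps=1000)
```

With a larger decay coefficient the geometric terms would be suppressed more strongly near $t = T$, where the noised input carries little geometric information.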

Conditional Generation
Given input music, our objective is to extract an effective music representation containing rich semantic information, such as the style and rhythm of the music. Subsequently, we utilize this representation as the condition to generate dance sequences aligned with the input music.
Music Representations. Previous approaches have not placed significant emphasis on music representations for conditional generation. These methods primarily employ handcrafted music features, e.g., onset strength as rhythmic features and constant-Q chromagram as chroma features. One distinct drawback of these features is that they lack the high-level semantics critical to cross-modality generation. CLIP [30], a large-scale visual-textual embedding model, has demonstrated its efficacy for text-guided generation works [10,31]. Likewise, we propose to use the audio encoder of Wav2CLIP [39], which encodes an audio clip into a 512-dimensional vector that shares an embedding space with text and image in CLIP. However, there still exists a significant domain gap between extracted music representations and dance motions since the Wav2CLIP audio encoder is solely trained on general audio-visual datasets. To better align the embedding space of music audio with dance motions, we fine-tune our audio encoder by adding multi-layer perceptrons as adapter layers and mapping its output to a motion encoder, as illustrated in Fig. 3. Specifically, we use the motion encoder in [37], which also extracts a 512-dimensional embedding aligned with the CLIP joint representation. To mitigate modal collapse, we freeze the weights of both the motion encoder and music encoder during fine-tuning and only train the adapter layers with the InfoNCE loss [2]. The music-to-dance contrastive loss for the $i$-th pair between music clip $m_i$ and dance sequence $d_i$ is formulated as:
$$\mathcal{L}_i^{m \to d} = -\log \frac{\exp\left(\mathrm{sim}(m_i, d_i) / \tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}(m_i, d_j) / \tau\right)},$$
where $\mathrm{sim}(m_i, d_j)$ represents the cosine similarity between the embeddings, $B$ is the number of pairs in a batch, and $\tau$ is a learnable temperature parameter.
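The contrastive alignment objective can be sketched with a small, dependency-free implementation of the InfoNCE loss in the music-to-dance direction. The fixed temperature of 0.07, the toy 2-dimensional embeddings, and the function names are illustrative assumptions; in the paper the temperature is learnable and the embeddings are 512-dimensional.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def music_to_dance_info_nce(music_emb, dance_emb, temperature=0.07):
    """Mean over i of -log softmax_j(sim(m_i, d_j) / temperature) evaluated at j = i."""
    losses = []
    for i, m in enumerate(music_emb):
        logits = [cosine_similarity(m, d) / temperature for d in dance_emb]
        max_logit = max(logits)  # subtract the maximum for numerical stability
        log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


music = [[1.0, 0.0], [0.0, 1.0]]
aligned_dance = [[1.0, 0.0], [0.0, 1.0]]      # each music clip matches its own dance
mismatched_dance = [[0.0, 1.0], [1.0, 0.0]]   # pairs are swapped

loss_aligned = music_to_dance_info_nce(music, aligned_dance)
loss_mismatched = music_to_dance_info_nce(music, mismatched_dance)
```

The loss is small when each music embedding is closest to its paired dance embedding and large when the pairing is broken, which is the signal used to train the adapter layers.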
Classifier-free Guidance. After learning the music-dance joint embedding space, we extract music representations and leverage classifier-free guidance [13] to improve sample quality. In practice, we jointly train a single diffusion model on both conditional and unconditional objectives by randomly dropping the music condition $c$. During sampling, we can improve sample quality by adjusting the $x_0$ prediction using:
$$\hat{x}_0 = G_\theta(x_t, t, \emptyset) + s \cdot \left( G_\theta(x_t, t, c) - G_\theta(x_t, t, \emptyset) \right),$$
where $G_\theta(x_t, t, c)$ and $G_\theta(x_t, t, \emptyset)$ correspond to the conditional and unconditional model respectively, and $s$ is the guidance strength, which is typically set greater than 1 to enable classifier-free guidance. We apply classifier-free guidance for both models in the two-stage pipeline.
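Both halves of classifier-free guidance, condition dropout during training and extrapolation at sampling time, fit in a few lines. The sketch below uses the standard guidance combination from [13] applied to $x_0$ predictions; representing the null condition as `None` and the function names are assumptions for illustration.

```python
import random


def drop_condition(cond, drop_prob, rng, null_cond=None):
    """Training-time dropout: replace the condition with a null token with probability drop_prob."""
    return null_cond if rng.random() < drop_prob else cond


def guided_prediction(cond_pred, uncond_pred, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), applied element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]


rng = random.Random(0)
music_embedding = [0.1, 0.2, 0.3]

# Roughly 10% of training steps see the null condition.
dropped = sum(drop_condition(music_embedding, 0.1, rng) is None for _ in range(10000))

# Sampling: a scale above 1 pushes the prediction further toward the conditional output.
cond_pred, uncond_pred = [1.0, 2.0], [0.0, 1.0]
guided = guided_prediction(cond_pred, uncond_pred, scale=2.5)
```

A scale of 1 recovers the purely conditional prediction and a scale of 0 the unconditional one, so values above 1 extrapolate beyond the conditional output.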

EXPERIMENTS

Dataset
AIST++ [21] is a large-scale publicly available 3D human dance dataset containing 1363 3D dance sequences and music pairs. There are 980 training sets, 40 test sets, and 343 candidate sets in AIST++. Note that the candidate sets are not recommended by AIST++ for training or evaluation. In terms of data format, dance motion is represented as 60-FPS 3D pose sequences in SMPL format [23]. We conduct all the experiments on the AIST++ dataset and follow the experimental setting in [34].

Implementation Details
Alignment Setting. For fine-tuning the Wav2CLIP adapter, which consists of 2 MLP layers with a hidden size of 512, we use AdamW [24] with a learning rate of $1 \times 10^{-5}$ and train for 100 epochs with a batch size of 64. Music is loaded by Librosa [25] and split into multiple clips of 6 seconds. Dance sequences are represented in rotation 6d format [43] and also split into 6-second clips correspondingly. Similar to the experimental setting in MotionCLIP [37], we down-sample the frame rate of dance clips from 60 FPS to 30 FPS.
Cascaded Diffusion Model Setting. We train our cascaded diffusion model for 500 epochs using AdamW with a learning rate of $1 \times 10^{-4}$. Music is loaded by Librosa and split into 20-second clips, and the dance sequences are split correspondingly with rotation 6d representation. Music is mapped to 512-dim vectors via the frozen Wav2CLIP adapter. For classifier-free guidance, we randomly mask 10% of the music condition $c$ at each training step. The diffusion transformer has 12 layers with a 768-dim hidden size and 6 heads. The dropout ratio is set to 0.1. All the geometric loss weights $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 1.0, and the decay coefficient $\gamma$ for $w_t$ is set to 0.1. For the base M2D model at the first stage, we set the batch size to 32. We use dance sequences of 20 seconds to learn the long-term dance semantics and down-sample the FPS from 60 to 15. For the SSR model at the second stage, we set the batch size to 8 and keep the FPS of the high-resolution dance sequences at the default value of 60. The whole cascaded framework trains on 4 Tesla V100 GPUs in 24 hours.
In the inference stage, we generate a 20-second (1200 frames) dance sequence, guided by a 2-second (120 frames) seed dance sequence, and set the classifier-free guidance weight $s$ to 2.5. We set the inference diffusion timestep $T$ to 100, as we observe no significant difference between samples generated using the original timesteps and the reduced 100 timesteps.
Table 1: Baselines comparison on AIST++. Best values are in bold, and runner-up values are underlined. Quantitatively, our model achieves state-of-the-art performance on $\mathrm{FID}_k$ and Beat Align Score. Qualitatively, our model generates more realistic dance sequences and outperforms baseline approaches in the user study. ↓ indicates lower is better, and ↑ indicates higher is better. * Note that Li et al. generates discontinuous and jittery motion, leading to abnormally high $\mathrm{Div}_k$, which is also reported in [21,34].
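The frame counts quoted in the implementation details follow from simple arithmetic on clip length and frame rate, as the sketch below shows. Down-sampling by keeping every fourth frame is an assumption made for illustration; the paper states the target frame rates but not the resampling method.

```python
def num_frames(seconds, fps):
    """Number of frames in a clip of the given duration."""
    return seconds * fps


def temporal_downsample(frames, src_fps, dst_fps):
    """Keep every (src_fps // dst_fps)-th frame; assumes an integer rate ratio."""
    if src_fps % dst_fps != 0:
        raise ValueError("source FPS must be an integer multiple of target FPS")
    return frames[:: src_fps // dst_fps]


full_clip = list(range(num_frames(20, 60)))            # 20-second clip at 60 FPS
seed_frames = num_frames(2, 60)                        # 2-second seed sequence
low_res_clip = temporal_downsample(full_clip, 60, 15)  # input resolution of the base model
upsampling_factor = len(full_clip) // len(low_res_clip)
```

Under this assumption the SSR stage has to fill in three intermediate frames between each pair of consecutive low-resolution frames.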

Evaluation Metrics
We follow previous works [21,34] to quantitatively evaluate the generated samples in terms of dance quality, dance diversity, and music beat alignment. To evaluate dance quality, we first extract kinetic features [28] and geometric features [26] of ground truth and generated samples. Then we calculate the Fréchet Inception Distance (FID) [9] score, including $\mathrm{FID}_k$ based on kinetic features and $\mathrm{FID}_g$ based on geometric features. For dance diversity, we calculate the average feature distance of kinetic features and geometric features following [21], denoted as $\mathrm{Div}_k$ and $\mathrm{Div}_g$, respectively. As for dance-music consistency, we use the Beat Alignment Score (BAS) introduced in [21], which calculates the average distance between each music beat and its nearest dance beat:
$$\mathrm{BAS} = \frac{1}{|B^m|} \sum_{t^m \in B^m} \exp\left( -\frac{\min_{t^d \in B^d} \left\|t^d - t^m\right\|^2}{2\sigma^2} \right),$$
where $B^d = \{t_i^d\}$ is the set of dance beats defined as the local minima of the kinetic velocity, $B^m = \{t_j^m\}$ is the set of music beats extracted using the Librosa [25] toolbox, and $\sigma$ is a normalization parameter, which is set to 3 in all the experiments.
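The Beat Alignment Score described above can be sketched directly from its verbal definition: each music beat is matched to its nearest dance beat, and the distances are passed through a Gaussian kernel and averaged. Measuring beat times in frames, detecting dance beats as strict local minima of a speed curve, and the function names are assumptions for illustration.

```python
import math


def kinetic_beats(speed):
    """Indices of strict local minima of a per-frame speed curve."""
    return [i for i in range(1, len(speed) - 1)
            if speed[i] < speed[i - 1] and speed[i] < speed[i + 1]]


def beat_alignment_score(music_beats, dance_beats, sigma=3.0):
    """Average over music beats of exp(-d^2 / (2 sigma^2)), d = distance to the nearest dance beat."""
    if not music_beats or not dance_beats:
        raise ValueError("both beat lists must be non-empty")
    total = 0.0
    for t_m in music_beats:
        nearest = min(abs(t_d - t_m) for t_d in dance_beats)
        total += math.exp(-(nearest ** 2) / (2.0 * sigma ** 2))
    return total / len(music_beats)


speed = [3.0, 1.0, 2.0, 0.5, 4.0]   # toy speed curve
dance_beats = kinetic_beats(speed)  # frames where the motion momentarily slows
perfect = beat_alignment_score([10, 20, 30], [10, 20, 30])
offset = beat_alignment_score([10], [13])
```

A score of 1 means every music beat coincides with a dance beat, and the score decays smoothly as dance beats drift away from the music beats.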

Comparison with Baselines
We mainly compare our proposed DiffDance with several existing methods, including Li et al. [20], DanceNet [44], DanceRevolution [14], FACT [21] and the current state-of-the-art method Bailando [34]. Following [34], we generate 40 dance sequences for 20-second music clips in the AIST++ test set and calculate the evaluation metrics. According to the comparisons shown in Table 1, our proposed DiffDance demonstrates generative ability comparable with state-of-the-art approaches, achieving 2 best and 2 runner-up results across the 5 objective evaluation metrics. Moreover, more users prefer our generated dance sequences compared to other methods in the user study.
Motion Quality. As shown in Table 1, our DiffDance achieves an $\mathrm{FID}_k$ score of 24.09 and an $\mathrm{FID}_g$ score of 20.68. Compared with existing methods, our model achieves the best $\mathrm{FID}_k$, outperforming the state-of-the-art method Bailando by 14.45%, a margin of 4.07. This indicates that the kinetic features of generated samples, which guarantee dance characteristics, including motion velocities and energies, are much closer to those of the ground truth dance distribution.
As for the $\mathrm{FID}_g$ score, which reflects the quality of choreography units, we achieve the second-best performance, which is 6.47% better than FACT with a margin of 1.43.
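The relative improvements quoted above can be checked with a short calculation. The baseline scores below are not quoted from Table 1; they are implied by adding the stated margins to the DiffDance scores, and the function names are illustrative.

```python
def implied_baseline(ours, margin):
    """Baseline score implied by our score and the stated absolute margin (lower is better)."""
    return ours + margin


def relative_improvement(ours, baseline):
    """Fractional reduction of a lower-is-better metric relative to the baseline."""
    return (baseline - ours) / baseline


# Kinetic FID: DiffDance 24.09 with a margin of 4.07 over Bailando.
bailando_fid_k = implied_baseline(24.09, 4.07)
fid_k_gain = 100.0 * relative_improvement(24.09, bailando_fid_k)

# Geometric FID: DiffDance 20.68 with a margin of 1.43 over FACT.
fact_fid_g = implied_baseline(20.68, 1.43)
fid_g_gain = 100.0 * relative_improvement(20.68, fact_fid_g)

print(f"FID_k improvement: {fid_k_gain:.2f}%")
print(f"FID_g improvement: {fid_g_gain:.2f}%")
```

Both percentages are computed relative to the baseline score, which matches the figures reported in the text.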
We investigated the reasons that the $\mathrm{FID}_g$ score of Bailando is much better than that of other methods, including ours. Firstly, Bailando adopts a Choreographic Memory Codebook to record and quantize dancing units from the 980 dances of the AIST++ training set, which remembers the inherent dancing units of AIST++ to a certain extent. Secondly, all 1363 dances in the AIST++ dataset are used during evaluation (the train/test split is based on music-dance pairs). As a result, more than 70% of the ground truth dancing units of the evaluation set have been memorized and quantized during training. These two aspects result in an overestimation of performance on the $\mathrm{FID}_g$ score, which depends on the distance of geometric features between generated dance sequences and ground truth. As reported in Bailando, the quantization of dancing positions is essential, which helps the $\mathrm{FID}_g$ score improve from 147.28 to 9.2, a considerable margin. Therefore, we argue that Bailando's $\mathrm{FID}_g$ evaluation is overestimated. Our DiffDance is comparatively more general, as our model does not directly memorize dance units in the dataset, and the $\mathrm{FID}_g$ score of DiffDance still outperforms other methods except Bailando.
Motion Diversity. Motion diversity is represented as the average Euclidean distance of generated dances in the kinetic and geometric feature spaces. Table 1 shows that our DiffDance achieves a $\mathrm{Div}_k$ of 6.02 and a $\mathrm{Div}_g$ of 2.89. The diversity of geometric features underperforms several previous methods, partly due to the introduction of multiple regularization losses, which might limit the solution space of generated dance even with a dynamic loss weight. Besides, the guidance strength $s$ used in classifier-free guidance also introduces a trade-off between sample quality and diversity. Setting $s > 1$ over-emphasizes the importance of condition $c$ during sampling, which might lead to higher quality but less diverse samples.
Beat Align Score. It is important for dance generation approaches to produce dance sequences that align well with input music. Our DiffDance achieves the best beat align score of 0.2418, which outperforms Bailando by 3.69%. We even obtain a better beat align score than the ground truth (0.2418 vs. 0.2374). As shown in Fig. 4, the distance between music beats and dance beats of our DiffDance is smaller than that of ground-truth beats. This reflects that our model can generate dance sequences that tightly follow the beats of music, and has a deep understanding of music semantics corresponding to dance movements, indicating the effectiveness of aligning the Wav2CLIP music embedding space to motion and the use of classifier-free guidance.
However, it is crucial to note that this does not necessarily imply that DiffDance surpasses human performances in every aspect of dance. The BAS strictly rewards dance beats synchronized with music beats, yet a precise one-to-one mapping between the two does not always exist. Human performances may excel in other aspects of dance, such as expressiveness, creativity, and emotion, without necessarily maintaining a perfect alignment with the musical beat. This observation suggests that a new metric is needed in the future for further assessment of rhythmic alignment between music and dance.
User Study. Compared with objective evaluations for dance generation, subjective evaluation can provide a comprehensive performance comparison. Therefore, we conduct extensive user studies where we ask the participants to choose their preferred dance sequence generated by each previous method (including ground truth)

Ablation Studies
Model Architecture. We explore the effectiveness of the following 4 architecture settings as shown in Table 2. a) Two-stage pipeline. Compared to the one-stage model, denoted 'w/o two-stage', DiffDance improves $\mathrm{FID}_k$ by 5.46 (22.67%), which reflects that the cascaded framework is essential for high-resolution dance generation. We also present the qualitative results of both stages to demonstrate how our cascaded framework enhances the overall performance of dance generation. As depicted in Fig. 5, we visualize three distinct functions of the SSR model. First, the SSR model can increase the temporal resolution by generating meaningful dance frames as interpolations, which is its primary function. Second, the SSR model can create new relative motion inspired by low-resolution dance [...]
[...] Table 4, where we sweep the conditioning augmentation timestep $t_{aug}$ from 0 to 40. Note that $t_{aug} = 0$ means no conditioning augmentation is used for training and testing. Our model achieves the best quality and diversity at $t_{aug} = 30$, which is 30% of the SSR diffusion timestep $T$. This indicates that adding a moderate amount of noise augmentation is beneficial for the cascaded generation pipeline. We fixed this noise ratio during sampling for all other experiments.

CONCLUSION
In this paper, we introduce a cascaded motion diffusion framework called DiffDance for music-driven dance generation. DiffDance comprises a base music-to-dance diffusion model and a sequence super-resolution diffusion model capable of generating high-resolution, long-form dance sequences with temporal consistency. To enhance semantic music representation, we align the music embedding space with motion by fine-tuning our music encoder using a contrastive objective. Additionally, we employ classifier-free guidance in the music-to-dance diffusion process. We also incorporate various geometric losses and a dynamic loss decay weight to improve the fidelity and diversity of dance samples. Comprehensive experimental results demonstrate the superiority of our method from both qualitative and quantitative perspectives.

Figure 2: (Left) Overall framework. DiffDance uses a frozen music encoder to extract music features. The music-to-dance diffusion model maps the music representation into a low-resolution (15-FPS) dance sequence, while the sequence super-resolution diffusion model increases its temporal resolution (60-FPS). (Right) Model Details. Both models receive a noised dance $x_t$, diffusion timestep $t$, and music representation $c$. $c$ is extracted via the Wav2CLIP encoder and then summed with $t$. The sequence super-resolution model additionally receives a low-resolution dance and an augmentation timestep, as depicted in the dotted box. After conditioning augmentation, the low-resolution dance is channel-wise concatenated with the noised dance.

Figure 3: Alignment of music and motion.We introduce adapter layers to the music encoder and employ a contrastive loss to align the embedding spaces of music and motion.

Figure 4: Beats alignment visualization.We present a visualization of music beats, kinematic beats of generated dances, and two ground truth dances.The distances between generated beats and music beats are smaller, indicating better rhythmic alignment.

Figure 5: Visualization of outputs from both stages of our cascaded model.The base M2D model generates low-resolution dance sequences containing key motions.The SSR model improves low-resolution outputs by interpolating dance frames for increased temporal resolution, producing novel and meaningful relative dance frames, and rectifying ataxic poses.This highlights the advantages of our two-stage configuration.
Table 1 column headers: Method; Motion Quality ($\mathrm{FID}_k$ ↓, $\mathrm{FID}_g$ ↓); Motion Diversity ($\mathrm{Div}_k$ ↑, $\mathrm{Div}_g$ ↑); Beat Align Score ↑; User Study.

Table 2: Ablation study of model architectures. We compare the performance of full DiffDance and several architecture variants.

Table 3: Ablation study of loss functions. We compare the performance of our base music-to-dance model trained with different losses.

Table 4: Ablation study of conditioning augmentation. We compare various augmentation timesteps $t_{aug}$ used in the SSR model during sampling. $t_{aug} = 0$ represents the model without conditioning augmentation.