Abstract
Synthesizing human motion from music (i.e., music-to-dance generation) is appealing and has attracted considerable research interest in recent years. The task is challenging: dance requires realistic and complex human motions, and, more importantly, the synthesized motions must be consistent with the style, rhythm, and melody of the music. In this article, we propose a novel autoregressive generative model, DanceNet, which takes the style, rhythm, and melody of the music as control signals to generate 3D dance motions with high realism and diversity. To handle the high long-term spatio-temporal complexity of dance, we introduce dilated convolutions to enlarge the receptive field, and adopt gated activation units as well as separable convolutions to enhance the fusion of motion features and control signals. To boost the performance of the proposed model, we capture several synchronized music-dance pairs performed by professional dancers and build a high-quality music-dance pair dataset. Experiments demonstrate that the proposed method achieves state-of-the-art results.
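The combination the abstract describes — a causal, dilated convolution over past motion frames, fused with music control signals through a gated activation unit — can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' implementation; the weight names (`Wf`, `Wg`, `Vf`, `Vg`) and dimensions are hypothetical.

```python
import numpy as np

def shift(x, d):
    """Shift frames back by d steps (zero-padded), so frame t sees frame t-d."""
    pad = np.zeros((d, x.shape[1]))
    return np.concatenate([pad, x], axis=0)[: x.shape[0]]

def gated_dilated_block(x, cond, Wf, Wg, Vf, Vg, d):
    """One causal, dilated, gated block (kernel size 2, dilation d).

    x:    (T, C)  motion features, one pose feature vector per frame
    cond: (T, K)  music control signals (style / rhythm / melody features)
    Wf, Wg: (2, C, C) convolution weights for the filter and gate branches
    Vf, Vg: (K, C)    projections of the control signals into each branch
    """
    f = x @ Wf[0] + shift(x, d) @ Wf[1] + cond @ Vf   # filter branch
    g = x @ Wg[0] + shift(x, d) @ Wg[1] + cond @ Vg   # gate branch
    return np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))    # gated activation: tanh(f) * sigmoid(g)
```

Stacking such blocks with dilations 1, 2, 4, ... grows the receptive field exponentially with depth, which is why dilated convolutions help an autoregressive model capture long-term motion dependencies without a proportionally deep network.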
Music2Dance: DanceNet for Music-Driven Dance Generation