Music2Dance: DanceNet for Music-Driven Dance Generation

Published: 16 February 2022

Abstract

Synthesizing human motion from music (i.e., music-to-dance) is appealing and has attracted much research interest in recent years. It is challenging because dance requires realistic and complex human motion; more importantly, the synthesized motion should be consistent with the style, rhythm, and melody of the music. In this article, we propose a novel autoregressive generative model, DanceNet, which takes the style, rhythm, and melody of music as control signals to generate 3D dance motion with high realism and diversity. Because dance exhibits high long-term spatio-temporal complexity, we adopt dilated convolution to enlarge the receptive field, and use gated activation units together with separable convolution to enhance the fusion of motion features and control signals. To boost the performance of the proposed model, we capture several synchronized music-dance pairs performed by professional dancers and build a high-quality music-dance pair dataset. Experiments demonstrate that the proposed method achieves state-of-the-art results.
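The two architectural ideas the abstract highlights, dilated convolution to enlarge the temporal receptive field and a gated activation unit that fuses motion features with music control signals, are in the spirit of WaveNet-style conditioned convolution stacks. The following is a minimal plain-Python sketch of those two mechanisms, not the paper's implementation: the function names, scalar per-frame features, and toy weights are all illustrative assumptions.

```python
import math

def causal_dilated_conv1d(x, w, dilation):
    """1-D causal dilated convolution: the output at time t only sees
    x[t], x[t - dilation], x[t - 2*dilation], ... (no future frames)."""
    k = len(w)
    out = []
    for t in range(len(x)):
        s = 0.0
        for i in range(k):
            idx = t - (k - 1 - i) * dilation  # tap i looks back (k-1-i)*dilation steps
            if idx >= 0:
                s += w[i] * x[idx]
        out.append(s)
    return out

def gated_block(x, cond, w_filter, w_gate, dilation):
    """Gated activation unit conditioned on a per-frame control signal
    (e.g., a music feature): z = tanh(conv_f(x) + c) * sigmoid(conv_g(x) + c)."""
    f = causal_dilated_conv1d(x, w_filter, dilation)
    g = causal_dilated_conv1d(x, w_gate, dilation)
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    return [math.tanh(fi + c) * sigmoid(gi + c) for fi, gi, c in zip(f, g, cond)]

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated convolutions: doubling the
    dilation each layer grows the field exponentially with depth."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# With kernel size 2 and dilations 1, 2, 4, 8, four layers already
# cover 16 motion frames, which is why dilation helps with the
# long-term temporal structure of dance.
print(receptive_field(2, [1, 2, 4, 8]))  # 16
```

The gate (sigmoid term) decides, per frame, how strongly the filtered motion features pass through, which is one way a network can let the music signal modulate the generated motion.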


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2 (May 2022), 494 pages.
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3505207

  Publisher

  Association for Computing Machinery, New York, NY, United States

  Publication History

  • Published: 16 February 2022
  • Accepted: 1 September 2021
  • Revised: 1 July 2021
  • Received: 1 January 2021

      Qualifiers

      • research-article
      • Refereed
