Will You Ever Become Popular? Learning to Predict Virality of Dance Clips

Published: 16 February 2022

Abstract

Dance challenges are going viral on short-video platforms such as TikTok. Once a challenge becomes popular, thousands of short-form videos are uploaded within a couple of days. Predicting the virality of dance challenges is therefore of great commercial value, with a wide range of applications such as smart recommendation and popularity promotion. In this article, we propose a novel multi-modal framework that integrates skeletal, holistic appearance, facial, and scenic cues for comprehensive dance virality prediction. To model body movements, we propose a pyramidal skeleton graph convolutional network (PSGCN) that hierarchically refines spatio-temporal skeleton graphs. Meanwhile, we introduce a relational temporal convolutional network (RTCN) to exploit appearance dynamics with non-local temporal relations. Finally, an attentive fusion approach adaptively aggregates predictions from the different modalities. To validate our method, we introduce a large-scale viral dance video (VDV) dataset, which contains over 4,000 dance clips from eight viral dance challenges. Extensive experiments on the VDV dataset demonstrate the effectiveness of our approach. Furthermore, we show that short-video applications such as multi-dimensional recommendation and action feedback can be derived from our model.
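The attentive fusion step described above can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' exact formulation: each modality branch (skeleton, appearance, face, scene) is assumed to output a scalar virality score, and the attention logits, which in a full model would be predicted by a small network from the modality features, are given directly here for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attentive_fusion(modality_scores, attention_logits):
    """Adaptively aggregate per-modality virality predictions.

    modality_scores  : shape (M,), one predicted score per modality
                       (e.g. skeleton, appearance, face, scene).
    attention_logits : shape (M,), relevance logits (hypothetical here;
                       a learned network would produce them). Softmax
                       turns them into fusion weights that sum to 1.
    """
    weights = softmax(attention_logits)
    return float(np.dot(weights, modality_scores))

# Example: four modality branches, skeleton weighted highest.
scores = np.array([0.8, 0.6, 0.5, 0.4])  # skeleton, appearance, face, scene
logits = np.array([2.0, 1.0, 0.5, 0.5])
fused = attentive_fusion(scores, logits)
```

Because the softmax weights are convex, the fused score always lies between the lowest and highest per-modality prediction, so a single noisy branch cannot dominate the output.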



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2
  May 2022, 494 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3505207

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 November 2020
• Revised: 1 May 2021
• Accepted: 1 July 2021
• Published: 16 February 2022

Published in TOMM Volume 18, Issue 2


        Qualifiers

        • research-article
        • Refereed
