Abstract
Dance challenges now go viral on video platforms such as TikTok. Once a challenge becomes popular, thousands of short-form videos are uploaded within a couple of days. Predicting virality from dance challenges is therefore of great commercial value and has a wide range of applications, such as smart recommendation and popularity promotion. In this article, we propose a novel multi-modal framework that integrates skeletal, holistic appearance, facial, and scenic cues for comprehensive dance virality prediction. To model body movements, we propose a pyramidal skeleton graph convolutional network (PSGCN) that hierarchically refines spatio-temporal skeleton graphs. Meanwhile, we introduce a relational temporal convolutional network (RTCN) to exploit appearance dynamics with non-local temporal relations. Finally, we propose an attentive fusion approach that adaptively aggregates predictions from the different modalities. To validate our method, we introduce a large-scale viral dance video (VDV) dataset, which contains over 4,000 dance clips from eight viral dance challenges. Extensive experiments on the VDV dataset demonstrate the effectiveness of our approach. Furthermore, we show that short-video applications such as multi-dimensional recommendation and action feedback can be derived from our model.
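The abstract does not spell out how the attentive fusion step is computed. The following is only a minimal sketch of one common way to aggregate per-modality predictions: a softmax over per-modality attention scores, followed by a weighted sum. The function names, the modality keys, and the use of fixed scalar attention scores are all assumptions made for illustration; in a trained model the scores would typically be produced by a learned layer, and the authors' actual formulation may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_fusion(predictions, attention_scores):
    """Fuse per-modality predictions with softmax attention weights.

    predictions:      dict mapping modality name -> predicted virality score
    attention_scores: dict mapping modality name -> unnormalized attention
                      score (hypothetical; assumed to come from a learned layer)
    Returns the fused prediction and the normalized weight of each modality.
    """
    modalities = list(predictions)
    weights = softmax([attention_scores[m] for m in modalities])
    fused = sum(w * predictions[m] for w, m in zip(weights, modalities))
    return fused, dict(zip(modalities, weights))


if __name__ == "__main__":
    # Hypothetical per-modality outputs for a single clip.
    preds = {"skeleton": 0.8, "appearance": 0.6, "face": 0.4, "scene": 0.2}
    scores = {"skeleton": 2.0, "appearance": 1.0, "face": 0.0, "scene": 0.0}
    fused, weights = attentive_fusion(preds, scores)
    print(fused, weights)
```

With equal attention scores the fusion reduces to a plain average of the modality predictions; raising the score of one modality pulls the fused value toward that modality's prediction.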
Will You Ever Become Popular? Learning to Predict Virality of Dance Clips