
Socializing the Videos: A Multimodal Approach for Social Relation Recognition

Published: 16 April 2021
Abstract

As a crucial task for video analysis, social relation recognition among characters not only provides a semantically rich description of video content but also supports intelligent applications such as video retrieval and visual question answering. Unfortunately, due to the semantic gap between visual and semantic features, traditional solutions may fail to reveal the accurate relations among characters. Meanwhile, the rise of social media platforms has promoted the emergence of crowdsourced comments, which may enrich the recognition task with semantic and descriptive cues. To that end, in this article we propose a novel multimodal solution to the character relation recognition task. Specifically, we capture the target character pairs via a search module and then design a multistream architecture that jointly embeds the visual and textual information, in which feature fusion and an attention mechanism are adopted to better integrate the multimodal inputs. Finally, supervised learning is applied to classify character relations. Experiments on real-world data sets validate that our solution outperforms several competitive baselines.
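The abstract describes fusing visual and textual streams with an attention mechanism before classification. As a rough illustration only (not the paper's actual architecture, whose dimensions, streams, and scoring function are unspecified here), the following sketch shows attention-weighted fusion of two hypothetical per-stream embeddings: each stream is scored against a query vector by scaled dot product, the scores are normalized with softmax, and the fused feature is the weighted sum of the streams.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_fuse(streams, query):
    """Fuse per-stream feature vectors via attention weights.

    Each stream vector and the query share dimensionality d; the
    attention score for a stream is its scaled dot product with the
    query, and the fused vector is the softmax-weighted sum of streams.
    """
    d = len(query)
    scores = [sum(q * v for q, v in zip(query, vec)) / math.sqrt(d)
              for vec in streams]
    weights = softmax(scores)
    fused = [sum(w * vec[i] for w, vec in zip(weights, streams))
             for i in range(d)]
    return fused, weights

# Hypothetical 4-d visual and textual (comment) embeddings.
visual = [0.9, 0.1, 0.3, 0.5]
textual = [0.2, 0.8, 0.4, 0.1]
fused, weights = attention_fuse([visual, textual],
                                query=[0.5, 0.5, 0.5, 0.5])
```

Because the weights sum to one, the fused vector is a convex combination of the stream vectors, so each fused component stays between the corresponding visual and textual components; in a trained model the query would typically be learned rather than fixed as here.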

