Abstract
As a crucial task in video analysis, recognizing the social relations among characters not only provides a semantically rich description of video content but also supports intelligent applications, e.g., video retrieval and visual question answering. Unfortunately, due to the semantic gap between visual and semantic features, traditional solutions may fail to reveal the accurate relations among characters. Meanwhile, the rise of social media platforms has produced abundant crowdsourced comments, which can enrich the recognition task with semantic and descriptive cues. To that end, in this article, we propose a novel multimodal solution to the character relation recognition task. Specifically, we capture the target character pairs via a search module and then design a multistream architecture for jointly embedding the visual and textual information, in which feature fusion and attention mechanisms are adopted to better integrate the multimodal inputs. Finally, supervised learning is applied to classify character relations. Experiments on real-world datasets validate that our solution outperforms several competitive baselines.
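The fusion step described above can be illustrated with a minimal sketch: attention weights are computed over the streams, the weighted stream features are summed into one fused vector, and a linear layer scores each relation class. This is a toy illustration under assumed toy weights and feature dimensions, not the paper's actual architecture; all names (`attention_fuse`, `classify`, the relation labels) are hypothetical.

```python
# Toy sketch of attention-based multistream fusion for relation
# classification. All weights and features below are invented values
# for illustration only.
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_fuse(streams, scores):
    """Weight each stream's feature vector by a softmax over its score,
    then sum the weighted vectors into a single fused vector."""
    weights = softmax(scores)
    dim = len(streams[0])
    return [sum(w * s[i] for w, s in zip(weights, streams)) for i in range(dim)]

def classify(fused, weight_rows, labels):
    """Linear scoring of the fused vector against each relation class."""
    logits = [sum(w * f for w, f in zip(row, fused)) for row in weight_rows]
    return labels[max(range(len(logits)), key=lambda i: logits[i])]

visual = [0.9, 0.1, 0.4]   # e.g., pooled frame features for a character pair
textual = [0.2, 0.8, 0.5]  # e.g., embedded crowdsourced comments
fused = attention_fuse([visual, textual], scores=[1.0, 0.5])
relation = classify(fused, [[1, 0, 0], [0, 1, 0]], ["friend", "rival"])
```

In a real system the attention scores themselves would be learned from the inputs rather than fixed, so the model can shift weight toward whichever modality is more informative for a given clip.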
Socializing the Videos: A Multimodal Approach for Social Relation Recognition