Abstract
In this article, we address the problem of understanding human-to-human interactions as a fundamental component of social event analysis. Inspired by the recent success of multi-modal visual data in many recognition tasks, we propose a novel approach to modeling dyadic interactions by means of features extracted from synchronized 3D skeleton coordinates, depth, and Red Green Blue (RGB) sequences. From skeleton data, we extract new view-invariant proxemic features, named the Unified Proxemic Descriptor (UProD), which incorporates both intrinsic and extrinsic distances between two interacting subjects. A novel key frame selection method is introduced to identify salient instants of the interaction sequence based on the joints' energy. From Red Green Blue Depth (RGBD) videos, more holistic features are extracted by applying an adapted pre-trained Convolutional Neural Network (CNN) to optical flow frames. To better understand the dynamics of interactions, we expand the boundaries of dyadic interaction analysis by proposing a fundamentally new model for a previously untreated problem: discerning the active interactor from the passive one. Extensive experiments have been carried out on four multi-modal, multi-view interaction datasets. The experimental results demonstrate the superiority of our proposed techniques over state-of-the-art approaches.
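The two skeleton-based ideas in the abstract — a descriptor built from intrinsic (within-subject) and extrinsic (between-subject) joint distances, and key frame selection by joint energy — can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the concatenation layout, and the use of squared displacement as "energy" are illustrative assumptions. Pairwise Euclidean distances are unchanged by any rigid rotation or translation applied to the scene, which is what makes such features view-invariant.

```python
import numpy as np

def proxemic_descriptor(skel_a, skel_b):
    """Hypothetical UProD-style descriptor for two skeletons.

    skel_a, skel_b: (J, 3) arrays of 3D joint coordinates for the
    two interacting subjects in a shared camera frame.
    """
    # Intrinsic distances: pairwise joint distances within each subject.
    intra_a = np.linalg.norm(skel_a[:, None] - skel_a[None, :], axis=-1)
    intra_b = np.linalg.norm(skel_b[:, None] - skel_b[None, :], axis=-1)
    # Extrinsic distances: distances between joints of the two subjects.
    inter = np.linalg.norm(skel_a[:, None] - skel_b[None, :], axis=-1)
    # The intra matrices are symmetric with a zero diagonal, so keep
    # only their strict upper triangles; the inter matrix is kept whole.
    iu = np.triu_indices(skel_a.shape[0], k=1)
    return np.concatenate([intra_a[iu], intra_b[iu], inter.ravel()])

def key_frames_by_energy(seq, k):
    """Select the k frames following the most energetic joint motion.

    seq: (T, J, 3) array of joint positions over time.
    Returns sorted frame indices.
    """
    vel = np.diff(seq, axis=0)            # per-frame joint displacements
    energy = (vel ** 2).sum(axis=(1, 2))  # scalar motion energy per step
    idx = np.argsort(energy)[-k:]         # k most energetic transitions
    return np.sort(idx + 1)               # indices of the frames reached
```

For J joints per subject this yields 2 * J*(J-1)/2 intrinsic distances plus J*J extrinsic ones per frame; in practice such per-frame descriptors would be computed only at the selected key frames before classification.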
Understanding the Dynamics of Social Interactions: A Multi-Modal Multi-View Approach