Abstract

We propose a data-driven approach to detect conversational groups by identifying spatial arrangements typical of these focused social encounters. Our approach uses a novel Deep Affinity Network (DANTE) to predict the likelihood that two individuals in a scene are part of the same conversational group, considering their social context. The predicted pairwise affinities are then used in a graph clustering framework to identify both small groups (e.g., dyads) and large groups. Results from our evaluation on multiple established benchmarks suggest that combining powerful deep learning methods with classical clustering techniques can improve the detection of conversational groups in comparison to prior approaches. Finally, we demonstrate the practicality of our approach in a human-robot interaction scenario. Together, these results indicate that our work advances group detection not only in theory, but also in practice.
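
To make the two-stage idea concrete, the following is a minimal sketch of how pairwise affinities might be turned into groups. It is not the clustering method used in the paper, which the abstract does not specify. As a simple stand-in, it thresholds the affinities and takes connected components of the resulting graph. The function name `cluster_by_affinity`, the dictionary input format, and the `threshold` value of 0.5 are all assumptions made for illustration, and the affinity scores are hand-written rather than produced by a network.

```python
from typing import Dict, List, Set, Tuple


def cluster_by_affinity(
    n: int,
    affinity: Dict[Tuple[int, int], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Set[int]]:
    """Group n people by linking every pair whose affinity meets the threshold.

    This is connected-components clustering over a thresholded affinity
    graph, used here only as a simple stand-in for a graph clustering step.
    """
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), score in affinity.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: Dict[int, Set[int]] = {}
    for person in range(n):
        groups.setdefault(find(person), set()).add(person)
    # Order groups by their smallest member so the output is deterministic.
    return sorted(groups.values(), key=min)


# Hand-written affinities for six people (illustrative values only).
example_affinity = {
    (0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.7,  # a likely group of three
    (3, 4): 0.95,                           # a likely dyad
    (2, 3): 0.1,                            # weak link, below the threshold
}
example_groups = cluster_by_affinity(6, example_affinity)
```

In this sketch, person 5 has no affinity above the threshold and ends up as a singleton. A single strong link is enough to merge two groups here, which is one reason a more careful clustering method may be preferable in practice.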
Improving Social Awareness Through DANTE: Deep Affinity Network for Clustering Conversational Interactants