Abstract
In this paper we propose a novel deep learning-based approach to generating realistic three-party head and eye motions from novel acoustic speech input together with speaker markings (i.e., the speaking time of each interlocutor). Specifically, we first acquire a high-quality, three-party conversational motion dataset. Then, based on the acquired dataset, we train a deep learning-based framework to automatically predict the dynamic head and eye directions of all the interlocutors from the speech signal input. By combining the predicted motions with existing lip-sync and speech-driven hand/body gesture generation algorithms, we can produce realistic three-party conversational animations. Through extensive experiments and comparative user studies, we demonstrate that our approach generates realistic three-party head-and-eye motions for novel speech recorded from new subjects of different genders and ethnicities.
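The abstract describes a sequence model that maps per-frame acoustic features, together with a speaker marking (who is currently speaking), to head and eye directions for all three interlocutors. The paper's actual architecture and feature set are not specified in the abstract; the sketch below is an illustrative assumption, using a hand-rolled single-layer LSTM with untrained weights, 13 MFCC-like features, a one-hot speaker marking, and four output angles (head yaw/pitch, eye yaw/pitch) per interlocutor.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACOUSTIC = 13   # e.g., 13 MFCCs per audio frame (assumption)
N_SPEAKERS = 3    # three-party conversation
N_IN = N_ACOUSTIC + N_SPEAKERS   # acoustic features + one-hot speaker marking
N_HID = 32
N_OUT = N_SPEAKERS * 4   # per interlocutor: head (yaw, pitch) + eye (yaw, pitch)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    """Single-layer LSTM regressor (illustrative, untrained weights)."""
    def __init__(self):
        s = 0.1
        # one stacked weight matrix for the input/forget/cell/output gates
        self.W = rng.normal(0, s, (4 * N_HID, N_IN + N_HID))
        self.b = np.zeros(4 * N_HID)
        self.Wo = rng.normal(0, s, (N_OUT, N_HID))

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

    def predict(self, frames):
        h = np.zeros(N_HID)
        c = np.zeros(N_HID)
        out = []
        for x in frames:
            h, c = self.step(x, h, c)
            out.append(self.Wo @ h)   # head/eye angles for this frame
        return np.stack(out)

# 100 frames of synthetic features; interlocutor 0 is marked as speaking
T = 100
frames = np.hstack([rng.normal(size=(T, N_ACOUSTIC)),
                    np.tile(np.eye(N_SPEAKERS)[0], (T, 1))])
angles = TinyLSTM().predict(frames)
print(angles.shape)  # (100, 12): 3 interlocutors x 4 angles per frame
```

In a real system the weights would be trained on the captured conversational motion dataset, and the predicted angle streams would drive the characters' head and eye joints alongside separately generated lip-sync and gesture motion.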
Supplemental Material
Available for Download
Supplemental movie, appendix, image, and software files for "A Deep Learning-Based Model for Head and Eye Motion Generation in Three-party Conversations".