A Deep Learning-Based Model for Head and Eye Motion Generation in Three-party Conversations

Published: 26 July 2019

Abstract

In this paper, we propose a novel deep learning-based approach to generating realistic three-party head and eye motions from novel acoustic speech input together with a speaker marking (i.e., the speaking time of each interlocutor). Specifically, we first acquire a high-quality, three-party conversational motion dataset. Based on this dataset, we then train a deep learning-based framework to automatically predict the dynamic eye and head directions of all interlocutors from the speech signal. Combined with existing lip-sync and speech-driven hand/body gesture generation algorithms, our method produces realistic three-party conversational animations. Through extensive experiments and comparative user studies, we demonstrate that our approach generates realistic three-party head-and-eye motions for novel speech recorded from new subjects of different genders and ethnicities.
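The pipeline described above consumes per-frame acoustic features plus a speaker marking and predicts head and eye directions for all three interlocutors. As a rough sketch of that data interface only (the feature names, dimensions, and one-hot encoding below are illustrative assumptions, not taken from the paper), the per-frame network input might be assembled like this:

```python
from dataclasses import dataclass
from typing import List

N_INTERLOCUTORS = 3  # three-party conversation


@dataclass
class Frame:
    acoustic: List[float]  # hypothetical acoustic features for this audio frame
    speaker: int           # speaker marking: 0-2 for whoever is talking, -1 if silence


def build_input_vector(frame: Frame) -> List[float]:
    """Concatenate acoustic features with a one-hot speaker marking.

    The resulting vector would be fed, frame by frame, to a sequence model
    that predicts head/eye directions for all interlocutors.
    """
    one_hot = [0.0] * N_INTERLOCUTORS
    if 0 <= frame.speaker < N_INTERLOCUTORS:
        one_hot[frame.speaker] = 1.0
    return list(frame.acoustic) + one_hot


# Example: a frame in which interlocutor 1 is speaking
vec = build_input_vector(Frame(acoustic=[0.12, -0.03, 0.44], speaker=1))
# vec == [0.12, -0.03, 0.44, 0.0, 1.0, 0.0]
```

Encoding the speaker marking as a one-hot vector alongside the acoustic features is one common way to condition a sequence model on who is currently speaking; the paper's actual feature representation may differ.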


Supplemental Material

