Research Article

Talking Face Generation via Facial Anatomy

Published: 25 February 2023

Abstract

To generate a talking face from a speech audio clip and a face image, the variations in facial appearance must match the speech audio in the subtle movements of different face regions. However, the facial movements produced by existing methods either lack detail and vividness or are restricted to a specific person. In this article, we propose a novel two-stage network that generates talking faces for any target identity through annotations of facial action units (AUs). In the first stage, an audio-to-AU network learns the relationship between audio and AUs, producing an AU group consistent with the input audio. In the second stage, this AU group and a face image are fed into a generation network that outputs the resulting talking face image. Experimental results confirm that, compared with state-of-the-art methods, our approach produces more realistic and vivid talking faces for arbitrary targets, with richer details of facial movements such as cheek and eyebrow motion.
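The abstract's two-stage design lends itself to a compact illustration. Below is a minimal PyTorch sketch of such a pipeline, assuming illustrative choices throughout: the AU count, audio feature dimensions, module architectures (`AudioToAU`, `AUConditionedGenerator`), and the conditioning scheme are all placeholders, not the paper's actual implementation.

```python
# A minimal sketch of the two-stage pipeline described in the abstract.
# All names, sizes, and the AU vocabulary below are illustrative assumptions;
# the paper's actual architectures are not specified on this page.
import torch
import torch.nn as nn

NUM_AUS = 17          # assumed: number of facial action units per frame
AUDIO_FEAT_DIM = 80   # assumed: e.g., log-mel spectrogram bins per audio frame


class AudioToAU(nn.Module):
    """Stage 1: map a window of audio features to an AU activation group."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(AUDIO_FEAT_DIM, 128, batch_first=True)
        self.head = nn.Linear(128, NUM_AUS)

    def forward(self, audio):           # audio: (B, T, AUDIO_FEAT_DIM)
        _, h = self.encoder(audio)      # h: (1, B, 128), final hidden state
        return torch.sigmoid(self.head(h[-1]))  # AU intensities in [0, 1]


class AUConditionedGenerator(nn.Module):
    """Stage 2: repaint a face image conditioned on the AU group."""

    def __init__(self):
        super().__init__()
        # AU intensities are broadcast to a spatial map and concatenated
        # with the image channels -- a common conditioning scheme.
        self.net = nn.Sequential(
            nn.Conv2d(3 + NUM_AUS, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, face, aus):       # face: (B, 3, H, W), aus: (B, NUM_AUS)
        B, _, H, W = face.shape
        au_map = aus.view(B, NUM_AUS, 1, 1).expand(B, NUM_AUS, H, W)
        return self.net(torch.cat([face, au_map], dim=1))


if __name__ == "__main__":
    audio = torch.randn(2, 50, AUDIO_FEAT_DIM)   # ~0.5 s of audio features
    face = torch.rand(2, 3, 128, 128) * 2 - 1    # identity image in [-1, 1]
    aus = AudioToAU()(audio)                     # stage 1: audio -> AU group
    frame = AUConditionedGenerator()(face, aus)  # stage 2: AUs + face -> frame
    print(aus.shape, frame.shape)                # (2, 17) (2, 3, 128, 128)
```

Broadcasting AU intensities to a spatial map and concatenating them with the image follows a GANimation-style conditioning convention; the paper may fuse the two inputs differently.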



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3
  May 2023, 514 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3582886
  Editor: Abdulmotaleb El Saddik

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 25 February 2023
          • Online AM: 17 November 2022
          • Accepted: 11 November 2022
          • Revised: 16 July 2022
          • Received: 8 December 2021
Published in TOMM Volume 19, Issue 3
