Abstract
To generate a talking face from a speech audio clip and a face image, it is essential to match the variations in facial appearance to the speech through subtle movements of different face regions. However, the facial movements produced by existing methods either lack detail and vividness or are restricted to a specific person. In this article, we propose a novel two-stage network that generates talking faces for any target identity using annotations of action units (AUs). In the first stage, an audio-to-AU network learns the relationship between audio and AUs, producing an AU group consistent with the input audio. In the second stage, the AU group from the first stage and a face image are fed into a generation network, which outputs the resulting talking-face image. Extensive results confirm that, compared with state-of-the-art methods, our approach produces more realistic and vivid talking faces for arbitrary targets, with richer details of facial movements such as cheek and eyebrow motion.
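The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the linear audio-to-AU mapping, the feature dimensions, the 17-dimensional AU vector, and the channel-concatenation conditioning are all assumptions made for demonstration; a real system would use trained networks (e.g., a conditional GAN for the generator).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (audio-to-AU): map an audio feature window to action-unit
# intensities. A random linear projection stands in for the trained
# network here; the real model is learned from AU annotations.
N_AUDIO_FEATS, N_AUS = 128, 17  # illustrative sizes, not from the paper
W_audio = rng.standard_normal((N_AUS, N_AUDIO_FEATS)) * 0.1

def audio_to_au(audio_feats):
    """Predict an AU group with intensities in (0, 1) from audio features."""
    logits = W_audio @ audio_feats
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps AU intensities bounded

# Stage 2 (generation): condition the face image on the predicted AU
# group. For illustration we tile the AU vector spatially and append it
# as extra channels -- a common conditioning scheme, assumed here.
def generate_face(face_img, aus):
    """Build the generator input: face image with AU maps as channels."""
    h, w, _ = face_img.shape
    au_maps = np.broadcast_to(aus, (h, w, aus.shape[0]))
    return np.concatenate([face_img, au_maps], axis=-1)

audio_feats = rng.standard_normal(N_AUDIO_FEATS)
face_img = rng.random((64, 64, 3))
aus = audio_to_au(audio_feats)          # one AU group per audio window
out = generate_face(face_img, aus)
print(aus.shape, out.shape)             # (17,) (64, 64, 20)
```

In the full method, stage 2 is a generative network that synthesizes the output frame from this conditioned input, so per-frame AU predictions drive region-level motion (mouth, cheeks, eyebrows) in the rendered face.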
Index Terms
Talking Face Generation via Facial Anatomy