research-article

Progressive Transformer Machine for Natural Character Reenactment

Published: 17 February 2023

Abstract

Character reenactment aims to control a target person’s full-head movement with a driving monocular sequence taken from a driving character’s video. Current algorithms use convolutional neural networks within generative adversarial networks, extracting historical and geometric information to generate video frames iteratively. However, convolutional neural networks capture only local information within limited receptive fields and ignore the global dependencies that play a crucial role in face synthesis, leading to unnatural video frames. In this work, we design a progressive transformer module that introduces multi-head self-attention with convolutional refinement to capture global and local dependencies simultaneously. Specifically, we use a non-overlapping window-based multi-head self-attention mechanism with a hierarchical architecture to obtain larger receptive fields on low-resolution feature maps and thus extract global information. To better model local dependencies, we introduce a convolution operation that further refines the attention weights in the multi-head self-attention mechanism. Finally, we stack several progressive transformer modules with down-sampling operations to encode the appearance information of previously generated frames and the parameterized 3D face information of the current frame, and similarly stack several progressive transformer modules with up-sampling operations to generate video frames iteratively. In this way, the model captures global and local information, producing video frames that are globally natural while preserving sharp outlines and rich detail. Extensive experiments on several standard benchmarks show that the proposed method outperforms current leading algorithms.
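To make the core mechanism concrete, the sketch below illustrates in PyTorch how non-overlapping window-based multi-head self-attention with convolutional refinement of the attention weights might look. This is a minimal sketch under stated assumptions, not the paper’s exact architecture: the module name ConvRefinedWindowAttention, the depth-wise 3x3 refinement convolution, its residual placement before the softmax, and all sizes are illustrative choices.

```python
# A minimal, assumed sketch of the progressive transformer module's attention:
# non-overlapping window-based multi-head self-attention whose attention maps
# are refined by a depth-wise convolution (global windows + local refinement).
import torch
import torch.nn as nn


class ConvRefinedWindowAttention(nn.Module):
    def __init__(self, dim, window_size=8, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Depth-wise conv over the attention maps; a hypothetical placement of
        # the "convolution refinement" the abstract describes.
        self.refine = nn.Conv2d(num_heads, num_heads, kernel_size=3,
                                padding=1, groups=num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C) feature map; H and W assumed divisible by window_size.
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition into non-overlapping windows -> (B * num_windows, ws*ws, C).
        x = x.view(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, ws * ws, C)
        Bn, N, _ = x.shape
        qkv = self.qkv(x).reshape(Bn, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (Bn, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (Bn, heads, N, N)
        attn = attn + self.refine(attn)                # convolutional refinement
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        out = self.proj(out)
        # Reverse the window partition back to (B, H, W, C).
        out = out.view(B, H // ws, W // ws, ws, ws, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)


if __name__ == "__main__":
    block = ConvRefinedWindowAttention(dim=64, window_size=8, num_heads=4)
    y = block(torch.randn(2, 32, 32, 64))
    print(y.shape)  # torch.Size([2, 32, 32, 64])
```

Stacking several such modules, with down-sampling between stages on the encoder side (ingesting previously generated frames and the parameterized 3D face of the current frame) and up-sampling on the decoder side, would yield the hierarchical, progressive encoder-decoder generator the abstract describes.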



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2s (April 2023), 545 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3572861
Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 February 2023
      • Online AM: 25 August 2022
      • Accepted: 19 August 2022
      • Revised: 25 July 2022
      • Received: 24 May 2022
Published in TOMM Volume 19, Issue 2s
