Abstract
Character reenactment aims to control a target person's full-head motion with a driving monocular video of another character. Current algorithms employ convolutional neural networks within generative adversarial networks, extracting historical and geometric information to generate video frames iteratively. However, convolutional neural networks capture only local information within limited receptive fields and ignore the global dependencies that play a crucial role in face synthesis, which leads to unnatural video frames. In this work, we design a progressive transformer module that combines multi-head self-attention with convolutional refinement to capture global and local dependencies simultaneously. Specifically, we use a non-overlapping window-based multi-head self-attention mechanism with a hierarchical architecture to obtain larger receptive fields on low-resolution feature maps and thus extract global information. To better model local dependencies, we introduce a convolution operation that further refines the attention weights of the multi-head self-attention mechanism. Finally, we stack several progressive transformer modules with down-sampling operations to encode the appearance of previously generated frames and the parameterized 3D face information of the current frame, and, symmetrically, stack several progressive transformer modules with up-sampling operations to generate video frames iteratively. In this way, the model captures global and local information and produces video frames that are globally natural while preserving sharp outlines and rich detail. Extensive experiments on several standard benchmarks show that the proposed method outperforms current leading algorithms.
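The two key ingredients described above, non-overlapping window-based self-attention and convolutional refinement of the attention weights, can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single-head numpy toy (no learned projections, no multi-scale hierarchy), where `window_attention_with_conv_refine` and its `refine_kernel` argument are hypothetical names, and the depth-wise smoothing of the attention map stands in for the paper's convolution refinement under our own simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(feat, win):
    """Split an (H, W, C) feature map into non-overlapping (win*win, C) windows."""
    H, W, C = feat.shape
    feat = feat.reshape(H // win, win, W // win, win, C)
    return feat.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def window_attention_with_conv_refine(feat, win=4, refine_kernel=None):
    """Self-attention inside each non-overlapping window; if a small 2D
    kernel is given, the (N x N) attention-weight map of each window is
    refined by a convolution before the weights are applied to the values.
    Single head and no query/key/value projections, purely for clarity."""
    H, W, C = feat.shape
    windows = window_partition(feat, win)          # (num_win, N, C), N = win*win
    out = np.empty_like(windows)
    for i, x in enumerate(windows):
        attn = softmax(x @ x.T / np.sqrt(C))       # (N, N) attention weights
        if refine_kernel is not None:              # convolutional refinement
            k = refine_kernel.shape[0] // 2
            padded = np.pad(attn, k, mode="edge")
            refined = np.zeros_like(attn)
            for r in range(attn.shape[0]):
                for c in range(attn.shape[1]):
                    refined[r, c] = np.sum(
                        padded[r:r + 2 * k + 1, c:c + 2 * k + 1] * refine_kernel)
            attn = softmax(refined)                # re-normalise each row
        out[i] = attn @ x
    # reverse the window partition back to (H, W, C)
    out = out.reshape(H // win, W // win, win, win, C)
    return out.transpose(0, 2, 1, 3, 4).reshape(H, W, C)
```

Because each window attends only within itself, stacking such modules with down-sampling (as the abstract describes) is what grows the effective receptive field toward global coverage at the coarser resolutions.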
Progressive Transformer Machine for Natural Character Reenactment