ABSTRACT
Face forgery by deepfake is widely spread over the internet, raising severe societal concerns. In this paper, we propose a novel video transformer with incremental learning for detecting deepfake videos. To better align the input face images, we use a 3D face reconstruction method to generate a UV texture map from a single input face image. The aligned face image also provides pose, eye-blink, and mouth-movement information that cannot be perceived in the UV texture image, so we use both the face images and their UV texture maps to extract image features. We present an incremental learning strategy that fine-tunes the proposed model on a smaller amount of data and achieves better deepfake detection performance. Comprehensive experiments on various public deepfake datasets demonstrate that the proposed video transformer with incremental learning achieves state-of-the-art performance in deepfake video detection, with enhanced feature learning from sequential data.
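The two-stream idea in the abstract (features from the aligned face image and from its UV texture map, combined per frame and then related across the frame sequence) can be illustrated with a small sketch. This is not the authors' architecture: the function names (`fuse_tokens`, `self_attention`, `fake_probability`), the concatenation-based fusion, the identity attention projections, the mean pooling, and the linear head are all assumptions chosen to keep the example short and dependency-free.

```python
import math


def fuse_tokens(face_feats, uv_feats):
    """Build one token per frame by concatenating the face-image feature
    vector with the UV-texture feature vector (assumed fusion scheme)."""
    if len(face_feats) != len(uv_feats):
        raise ValueError("both streams need one feature vector per frame")
    return [f + u for f, u in zip(face_feats, uv_feats)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention across frames.
    Query, key, and value projections are the identity here, which is a
    hypothetical simplification; a trained model would learn them."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(weights[j] * tokens[j][i] for j in range(len(tokens)))
                    for i in range(d)])
    return out


def fake_probability(face_feats, uv_feats, head_weights, bias=0.0):
    """Attend over fused frame tokens, mean-pool over time, and apply a
    linear head with a sigmoid to obtain a per-video score in (0, 1)."""
    tokens = self_attention(fuse_tokens(face_feats, uv_feats))
    pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
    logit = sum(w * x for w, x in zip(head_weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

Under these assumptions, an incremental learning step in the spirit of the abstract would amount to continuing to update `head_weights` (and, in a real model, the attention projections) on a smaller set of new videos rather than retraining from scratch.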
Index Terms
- Video Transformer for Deepfake Detection with Incremental Learning