DOI: 10.1145/3474085.3475332
research-article

Video Transformer for Deepfake Detection with Incremental Learning

Published: 17 October 2021

ABSTRACT

Deepfake face forgeries are widely spread over the internet, which raises severe societal concerns. In this paper, we propose a novel video transformer with incremental learning for detecting deepfake videos. To better align the input face images, we use a 3D face reconstruction method to generate a UV texture map from a single input face image. The aligned face image also provides pose, eye-blink, and mouth-movement information that cannot be perceived in the UV texture image, so we use both the face images and their UV texture maps to extract image features. We present an incremental learning strategy that fine-tunes the proposed model on a smaller amount of data and achieves better deepfake detection performance. Comprehensive experiments on various public deepfake datasets demonstrate that the proposed video transformer with incremental learning achieves state-of-the-art performance in deepfake video detection, with enhanced feature learning from the sequenced data.
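The data flow described in the abstract (per-frame face-image features and UV-texture features, combined and processed as a sequence) can be illustrated with a toy sketch. This is not the authors' implementation: the fusion by concatenation, the single attention head with identity projections, the mean pooling, and the names `fuse_frame_features`, `self_attention`, and `video_score` are all assumptions made purely for illustration, and the feature values are random placeholders rather than outputs of a real feature extractor.

```python
import math
import random


def fuse_frame_features(face_feat, uv_feat):
    """Build one token per frame by concatenating the two feature vectors.

    Concatenation is an assumed fusion rule, not one stated in the abstract.
    """
    return list(face_feat) + list(uv_feat)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over frame tokens.

    Queries, keys, and values are the tokens themselves (identity
    projections), a simplification chosen to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, tokens)) for i in range(d)])
    return out


def video_score(face_feats, uv_feats, weights, bias=0.0):
    """Return a value in (0, 1) for a clip: fuse, attend, mean-pool, classify."""
    tokens = [fuse_frame_features(f, u) for f, u in zip(face_feats, uv_feats)]
    attended = self_attention(tokens)
    d = len(attended[0])
    pooled = [sum(t[i] for t in attended) / len(attended) for i in range(d)]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Placeholder features: 4 frames, 3 face-image dims and 3 UV-texture dims each.
random.seed(0)
face = [[random.random() for _ in range(3)] for _ in range(4)]
uv = [[random.random() for _ in range(3)] for _ in range(4)]
score = video_score(face, uv, weights=[0.1] * 6)
```

In this sketch the classifier weights are fixed by hand. Under the incremental learning strategy the abstract describes, a trained model would instead be fine-tuned on a smaller amount of new data; that step is not shown here because the abstract gives no details of the procedure.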

Supplemental Material

MM21-mfp0940.mp4 (mp4, 17.7 MB)

      • Published in

        MM '21: Proceedings of the 29th ACM International Conference on Multimedia
        October 2021
        5796 pages
        ISBN: 9781450386517
        DOI: 10.1145/3474085

        Copyright © 2021 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 995 of 4,171 submissions, 24%
