Using ensemble models to detect deepfake images of human faces

Deepfakes, synthetic media generated through advanced artificial intelligence techniques, are a rising threat to content authenticity. This paper explores deepfake detection methods to discern manipulated media from genuine content. We evaluate two convolutional neural network (CNN) architectures, EfficientNetB4 and EfficientNetB4 with a multiple-head attention module, on the FaceForensics++ and DeepFake Detection Challenge datasets. Our key contributions are: 1) demonstrating that integrating attention enhances model performance, with EfficientNetB4_attention achieving superior accuracy over the baseline EfficientNetB4 in both intra-dataset and cross-dataset scenarios; 2) elucidating attention's efficacy in improving deepfake detection by concentrating on manipulated regions. Our experiments highlight attention's potential in advancing state-of-the-art deepfake detection. As deepfakes grow increasingly realistic, robust techniques such as attention become imperative for multimedia forensics. This paper provides valuable insights toward developing adaptable deepfake detection systems that preserve content integrity.


INTRODUCTION
The advent of deep learning technologies has ushered in an era of unprecedented advancements in multimedia manipulation, giving rise to the phenomenon known as deepfakes. Deepfakes, synthesized hyper-realistic media content created through sophisticated artificial intelligence (AI) techniques [9,10,13,18,19], pose a significant threat to the veracity of digital media. As the boundary between reality and fabrication becomes increasingly blurred, the need for effective deepfake detection methods becomes paramount. This paper explores the background of deepfakes, elucidating the underlying technology that enables their creation, and highlights the pressing necessity for robust deepfake detection mechanisms.
Deepfakes, a portmanteau of "deep learning" and "fake," encompass a wide range of manipulated media, including images, videos, and audio recordings. The proliferation of deepfake technology has been fueled by the exponential growth of AI, particularly deep neural networks, which excel at learning complex patterns and generating realistic content. While deepfakes have found applications in entertainment and creative endeavors, their malicious use for spreading misinformation, propaganda, and cyber threats has raised significant concerns. The potential consequences of undetected deepfakes in areas such as politics, journalism, and personal privacy underscore the urgency of developing reliable detection methods.
The imperative for deepfake detection is grounded in preserving trust and the integrity of digital content. With the increasing prevalence of deepfakes, the ability to discern authentic media from manipulated counterparts becomes crucial for maintaining the credibility of visual and auditory information. Deepfake detection not only safeguards the authenticity of media in professional and personal contexts but also acts as a critical defense against the potential harm caused by malicious actors exploiting manipulated content.
In the landscape of deepfake detection, various methodologies have been proposed and implemented. Current mainstream methods leverage a combination of traditional image and video analysis techniques alongside cutting-edge deep learning approaches. Techniques such as facial and voice recognition, analysis of facial microexpressions, and anomaly detection in audiovisual content are commonly employed. Additionally, deep learning architectures, including convolutional neural networks (CNNs) and attention mechanisms, have demonstrated efficacy in distinguishing between genuine and manipulated media. This paper evaluates two such models, EfficientNetB4 and EfficientNetB4 with attention mechanisms, against the FaceForensics++ and DeepFake Detection Challenge (DFDC) datasets, contributing to the ongoing discourse on effective deepfake detection methods.
Our primary contributions are as follows: (1) demonstrating that integrating attention enhances model performance, with EfficientNetB4_attention achieving superior accuracy over the baseline EfficientNetB4 in both intra-dataset and cross-dataset scenarios; (2) elucidating attention's efficacy in improving deepfake detection by concentrating on manipulated regions.

BACKGROUND

Deepfake
Deepfake [12] refers to the technique of using artificial intelligence (AI) and machine learning algorithms to create manipulated or synthesized videos or images that appear to be real but are actually fabricated, as shown in Figure 1. The term "deepfake" combines "deep learning" (a subset of machine learning) and "fake". Deepfake technology utilizes powerful algorithms, particularly generative adversarial networks (GANs), to generate highly realistic and convincing visual content. GANs consist of two components: a generator that produces fake content and a discriminator that tries to distinguish between real and fake examples. Through an iterative training process, the generator learns to create increasingly realistic deepfakes, while the discriminator improves its ability to detect them.
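As a toy illustration of this adversarial objective, the following NumPy sketch computes the binary cross-entropy losses for one discriminator/generator step under the standard non-saturating formulation; the function names and the example probabilities are illustrative, not taken from the paper.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy between predicted probabilities p and labels y."""
    eps = 1e-12  # avoid log(0)
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def gan_losses(d_real, d_fake):
    """Losses for one adversarial step.

    d_real: discriminator probabilities on real images
    d_fake: discriminator probabilities on generator outputs
    The discriminator is rewarded for labeling real as 1 and fake as 0;
    the (non-saturating) generator is rewarded when fakes are labeled 1.
    """
    d_loss = bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))
    g_loss = bce(d_fake, np.ones_like(d_fake))
    return d_loss, g_loss

# A confident discriminator (real -> 0.9, fake -> 0.1) has low loss, while the
# generator's loss is high, which is exactly what drives it to improve.
d_loss, g_loss = gan_losses(np.array([0.9]), np.array([0.1]))
```

Alternating gradient steps on these two losses is what produces the iterative improvement described above.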
While deepfake technology has gained attention for its potential use in entertainment and creative applications, it has also raised concerns due to its potential for misuse, such as spreading disinformation, defamation, or creating non-consensual explicit content. Deepfakes can be created by swapping faces, altering facial expressions, or even manipulating entire bodies, leading to the creation of convincing but entirely fabricated videos or images.
The implications of deepfakes extend beyond individual privacy concerns. They have the potential to undermine trust in visual media, making it more challenging to discern genuine content from manipulated or falsified content. As a result, there is a growing need for research, development, and countermeasures to detect and mitigate the impact of deepfakes.
Efforts are underway to develop deepfake detection methods, forensic techniques, and legislation to address the potential risks associated with deepfake technology. The goal is to strike a balance between the benefits of creative expression and the protection of individuals and society from the harmful consequences of deceptive and malicious use of deepfakes.

Deepfake Detection
The rise of deepfake technology has prompted the development of techniques for detecting and mitigating its harmful effects. Early detection methods primarily relied on identifying artifacts introduced during the manipulation process, such as inconsistencies in image quality or traces of digital manipulation [2,3,7,11,15]. These techniques often utilized steganalysis approaches and noise cues to uncover the presence of fake content.
However, as deepfake synthesis became more sophisticated and its quality improved, traditional detection methods became less effective. Consequently, researchers shifted their focus toward using deep learning techniques to tackle the detection problem directly. Deep neural networks, trained on large datasets of real and manipulated images, have shown promising results in discriminating between genuine and fake faces. By learning complex patterns and features indicative of deepfake content, these models demonstrated higher accuracy in detecting manipulated media.
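A minimal sketch of this learned-classifier idea, assuming synthetic two-cluster "features" as a stand-in for the CNN embeddings of face crops that a real detector would learn (all data and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in features: real and fake samples are two Gaussian
# clusters so the sketch stays self-contained; a real system would use
# learned features extracted from face images.
X = np.vstack([rng.normal(+1.0, 0.5, size=(50, 8)),    # "real" faces
               rng.normal(-1.0, 0.5, size=(50, 8))])   # "fake" faces
y = np.concatenate([np.ones(50), np.zeros(50)])

# Logistic-regression head trained with plain gradient descent on BCE.
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y                        # dBCE/dlogit
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
accuracy = float(((p > 0.5) == (y == 1)).mean())
```

On well-separated features the classifier converges quickly; the hard part in practice is learning features that separate real from fake at all, which is what the deep architectures discussed below provide.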

Related Concepts
EfficientNet: EfficientNet, introduced by Google AI in 2019, is a family of convolutional neural network (CNN) models [16]. These models aim to achieve a delicate equilibrium between high accuracy and computational efficiency. The primary motivation behind the development of EfficientNet was to address two prominent challenges in CNN architecture design: enhancing accuracy while maintaining computational efficiency. Traditional approaches often improved performance by increasing model size and depth, which led to heightened computational requirements. In contrast, EfficientNet strikes a balance between these factors by employing a compound scaling method that uniformly scales the network's depth, width, and resolution. These models are built on the MobileNetV2 backbone architecture and have demonstrated state-of-the-art performance across various computer vision tasks, including image classification, object detection, and semantic segmentation. Furthermore, EfficientNet has significantly influenced transfer learning, as its pretrained models are extensively utilized as feature extractors. In summary, EfficientNet offers efficient and accurate solutions for a range of computer vision tasks.

Multiple-Head Attention: Multiple-head attention is a mechanism commonly used in transformer-based models, such as the Transformer [20] architecture used in natural language processing tasks. It improves the model's ability to capture complex relationships and dependencies within the input data. In the original attention mechanism, a single attention head computes a weighted sum of the input representations based on their relevance to a given query. Using multiple attention heads, however, allows the model to attend to different parts of the input simultaneously and learn different aspects of the data. Each attention head operates independently and learns its own set of attention weights. By having multiple heads, the model can capture diverse patterns and dependencies in the data, leading to enhanced representation learning. The outputs of the multiple attention heads are typically concatenated or combined to form the final representation. The main advantage of using multiple attention heads is the ability to capture both local and global dependencies: local dependencies are fine-grained relationships between nearby elements, while global dependencies are broader relationships between elements further apart. By attending to different parts of the input simultaneously, the model can effectively capture both.
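The computation described above can be sketched in NumPy as scaled dot-product self-attention with several heads; the weight matrices, dimensions, and random inputs below are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Self-attention over x of shape (seq, d_model); each W is (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split the model dimension across heads: (heads, seq, d_head)
    q = (x @ Wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # Each head computes its own scaled dot-product attention weights
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ v                  # (heads, seq, d_head)
    # Concatenate head outputs and mix them with a final projection
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                              # 4 tokens, d_model = 8
W = [rng.normal(size=(8, 8)) * 0.1 for _ in range(4)]    # Wq, Wk, Wv, Wo
out = multi_head_attention(x, *W, n_heads=2)
```

Because each head has its own projections, the two heads here can learn different attention patterns over the same four tokens, which is the property the text above attributes to multi-head attention.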

EfficientNet with Multiple-Head Attention
We choose EfficientNetB4 as our base model, considering the balance between performance and computational cost. The structure of the vanilla EfficientNetB4 is shown in Figure 2. Recent advancements in attention mechanisms show significant promise for enhancing model performance across various computer vision tasks. Attention mechanisms enable models to concentrate on the most relevant parts of an image when making predictions, proving particularly advantageous for deepfake detection. This capability allows the model to focus on identifying artifacts and inconsistencies indicative of manipulated images and videos.
The attention mechanism offers an intuitive means for the model to zero in on irregularities, making it well suited for pinpointing manipulated sections within deepfakes. Notably, it introduces only a marginal computational overhead, rendering it feasible for integration into the architecture of EfficientNetB4. Furthermore, attention mechanisms have demonstrated their efficacy in diverse computer vision tasks, underscoring their suitability for the field.
The incorporation of attention into EfficientNetB4 holds the potential for improving detection accuracy. By concentrating on discrepancies, the model can more effectively differentiate between real and fake images and videos. Additionally, attention mechanisms may enhance the model's resilience against adversarial attacks aimed at deceiving deepfake detectors. Furthermore, visualizing attention maps could offer insights into the model's decision-making process, revealing where it focuses during input analysis and potentially leading to enhanced model understanding.
One enhancement to the basic attention mechanism is multi-head attention, which employs multiple parallel attention layers, or heads, to enable attending to different types of features simultaneously. For deepfake detection, multi-head attention could allow the model to focus concurrently on various manipulated aspects such as inconsistent edges, blurred textures, and mismatched facial expressions. Each attention head specializes in a distinct artifact type. This enables more comprehensive coverage of deepfake irregularities compared to standard single-head attention. Moreover, averaging the outputs from the diverse attention heads provides a robust aggregated representation of areas likely to be faked. Multi-head attention's ability to jointly capture multiple deepfake clues makes it well suited for improving EfficientNetB4's detection capabilities beyond what standard attention offers. The model can focus on a fuller range of manipulated facets to precisely determine the authenticity of media.
Given the aforementioned advantages of the attention mechanism, we have chosen to incorporate a multiple-head attention structure into EfficientNetB4. To be precise, we augment the vanilla EfficientNetB4 by appending a multiple-head attention layer at its conclusion, as depicted in Figure 3. In contrast to the attention mechanism utilized by EfficientNetB4 itself, namely the SE module, our attention mechanism exhibits the following distinctions: (1) While the SE module exclusively considers channel-wise relationships, our attention module can harness contextual information within the image to ascertain where attention should be directed. (2) Unlike the SE module, which applies attention across channels to accentuate informative ones, the attention module we employ focuses spatially on specific regions of the input image. (3) Whereas the SE block is crafted for channel re-calibration, our attention mechanism is tailored to concentrate on deepfake artifacts.
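To make the channel-versus-spatial contrast concrete, here is a toy NumPy sketch (all shapes and weights are hypothetical, not the paper's actual modules): SE-style gating produces one scalar per channel, while spatial attention produces one weight per location.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_gate(feat, W1, W2):
    """SE-style recalibration: one scalar gate per channel. feat: (C, H, W)."""
    z = feat.mean(axis=(1, 2))               # squeeze: global average pool -> (C,)
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0))  # excite: bottleneck MLP -> gates in (0,1)
    return feat * s[:, None, None]           # scale whole channels, not locations

def spatial_attention(feat, w):
    """Spatial attention: one weight per location. feat: (C, H, W), w: (C,)."""
    score = np.einsum('c,chw->hw', w, feat)  # per-pixel relevance from a channel mix
    a = np.exp(score - score.max())
    a = a / a.sum()                          # softmax over all H*W locations
    return feat * a[None, :, :]              # emphasize suspicious regions

rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 7, 7))           # toy feature map
gated = se_channel_gate(feat, rng.normal(size=(4, 16)), rng.normal(size=(16, 4)))
attended = spatial_attention(feat, rng.normal(size=16))
```

The spatial map `a` is what an attention visualization would display: ideally it concentrates on the manipulated regions of a face crop rather than re-weighting whole feature channels.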

EXPERIMENTS

Datasets
We test the proposed method on two different datasets: FaceForensics++ and DFDC. FaceForensics++ [14] is a forensics dataset that consists of 1000 original video sequences that have been manipulated with four automated face manipulation methods: Deepfakes, Face2Face, FaceSwap, and NeuralTextures. The data was sourced from 977 YouTube videos, and all videos contain a trackable, mostly frontal face without occlusions, which enables automated tampering methods to generate realistic forgeries. The dataset provides binary masks that can be used for image and video classification as well as segmentation. In addition, the dataset provides 1000 Deepfakes models to generate and augment new data. The dataset is useful for research in deepfake detection and manipulation detection.
The DeepFake Detection Challenge (DFDC) dataset [5] is a large-scale dataset for deepfake detection, consisting of more than 100,000 videos of face swapping. The videos were created with various methods, including Deepfake, GAN-based, and non-learned techniques. The dataset was designed to measure progress on deepfake detection technology and to accelerate the development of new ways to detect manipulated media. Facebook created the dataset with paid actors who agreed to the use and manipulation of their likenesses.

Evaluation Settings
In this experimental study, we employed two prominent datasets, FaceForensics++ and the DeepFake Detection Challenge (DFDC), to assess the performance of two distinct models, namely EfficientNetB4 and EfficientNetB4_attention. The datasets were partitioned into three subsets: a training set, a testing set, and a validation set. The training sets were utilized to train both models, allowing them to learn and adapt to the intricacies of the respective datasets. Subsequently, the trained models were evaluated on the validation sets to gauge their performance and generalization capabilities. By evaluating on both the FaceForensics++ and DFDC datasets, we aimed to provide a comprehensive assessment of each model's effectiveness in detecting manipulated content, considering the diverse characteristics and challenges posed by these datasets. The use of validation sets allowed for a robust evaluation, providing insights into the models' ability to discern between authentic and manipulated content while facilitating comparisons between the two model architectures.
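A minimal sketch of such a three-way partition, assuming a simple random split over sample indices (the fractions and function name are illustrative; for video datasets the split is typically made at the video level so that near-duplicate frames do not leak across subsets):

```python
import random

def split_indices(n, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle sample indices and cut them into train/val/test partitions."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)         # fixed seed -> reproducible splits
    a = int(n * train_frac)
    b = int(n * (train_frac + val_frac))
    return idx[:a], idx[a:b], idx[b:]

train, val, test = split_indices(1000)
```

The three lists are disjoint and jointly cover every index, so no sample is seen both during training and during evaluation.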

Evaluation Results
Table 1 and Table 2 present the results of our IID evaluation, where both the training set and the validation set originate from the same dataset, ensuring a consistent assessment environment. Our evaluation focused on the performance of three deepfake detection models, namely the baseline EfficientNetB4, Bonettini's method EfficientNetAutoAttB4 [4], and our method EfficientNetB4_attention, trained on the FaceForensics++ and DFDC datasets. The tables illustrate the efficacy of these models in discerning authentic content from manipulated media within the confines of a single dataset.
In Table 1, which details the evaluation on FaceForensics++, our models underwent rigorous training on the FaceForensics++ training set and were subsequently tested on the validation set from the same dataset. The results reveal the proficiency of all three models in detecting deepfakes within the FaceForensics++ dataset. Notably, EfficientNetB4_attention outperformed its baseline counterpart, demonstrating a superior ability to identify manipulated content within the IID context. Table 2 extends the evaluation to the DFDC dataset, where similar training and validation procedures were followed. The comparison of results indicates the robustness of our models in the context of DFDC, with EfficientNetB4_attention exhibiting a notable advancement in performance over both the baseline method, EfficientNetB4, and another existing approach, EfficientNetAutoAttB4. This suggests that integrating attention mechanisms enhances the models' discriminative power, contributing to superior deepfake detection accuracy in IID scenarios.

The non-IID (cross-dataset) evaluation, as depicted in Table 3 and Table 4, represents a comprehensive assessment of deepfake detection models under the scenario where the training set and validation set are derived from distinct datasets. This challenging setting aims to simulate real-world conditions where models must generalize across diverse datasets, showcasing their adaptability and robustness. In our study, we focused on the performance of three deepfake detection models: the baseline EfficientNetB4, Bonettini's method EfficientNetAutoAttB4, and our method EfficientNetB4_attention.
Table 4 details the non-IID evaluation for the FaceForensics++ dataset, wherein the models were trained on the FaceForensics++ training set and subsequently evaluated on the validation set of the DFDC dataset. The results demonstrate the models' capability to generalize across datasets with varying characteristics. Notably, EfficientNetB4_attention consistently outperformed both the baseline method, EfficientNetB4, and another existing approach, EfficientNetAutoAttB4, underscoring its efficacy in achieving superior performance in the challenging non-IID context.
Extending the evaluation to the DFDC dataset, a similar trend is observed. Here, the models trained on the DFDC training set were evaluated on the FaceForensics++ validation set, as shown in Table 3. Once again, EfficientNetB4_attention exhibited remarkable performance gains over its counterparts, showcasing its ability to effectively detect deepfakes even when faced with variations in dataset characteristics.

In summary, the non-IID evaluation results highlight the effectiveness of our EfficientNetB4_attention model in deepfake detection across diverse datasets. Its consistent outperformance of baseline methods, on both the FaceForensics++ and DFDC datasets, demonstrates the model's capacity to adapt and excel in real-world scenarios where the training and validation data originate from different sources, as shown in Table 3 and Table 4. These findings emphasize the potential of attention mechanisms in elevating the performance of deepfake detection models in addressing the challenges posed by cross-dataset generalization.

CONCLUSIONS
In conclusion, our study highlights the significance of attention mechanisms in advancing the state of the art in deepfake detection. The consistent outperformance of EfficientNetB4_attention across different evaluation scenarios underscores its effectiveness in addressing the challenges posed by intra-dataset variations and cross-dataset generalization. As the threat of deepfakes continues to evolve, our findings contribute valuable insights toward developing more robust and adaptable deepfake detection mechanisms.

Figure 1 :
Figure 1: The original picture (left) and the fake picture (right) generated by the deepfake technology.
The vanilla EfficientNetB4 consists of X layers. According to the experiments in [19], EfficientNet-B4 achieves an impressive top-1 accuracy of 83.8% on the ImageNet [30] dataset, utilizing 19 million parameters and 4.2 billion FLOPS.

Figure 2 :
Figure 2: The structure of the vanilla EfficientNetB4.

Figure 3 :
Figure 3: The structure of the EfficientNetB4 with the multiple-head attention module.

Table 1 :
Conducting training using the FaceForensics++ dataset and subsequently testing on the same FaceForensics++ dataset.

Table 2 :
Conducting training using the DFDC dataset and subsequently testing on the same DFDC dataset.

Table 3 :
Conducting training using the DFDC dataset and subsequently testing on the FaceForensics++ dataset.

Table 4 :
Conducting training using the FaceForensics++ dataset and subsequently testing on the DFDC dataset.