Enhancing Deepfake Detection: Spatial-Temporal Preprocessing and Self-Attention ResI3D Model

Deepfake technology uses deep learning to superimpose one individual's face onto a video of another. As deep learning advances rapidly, the proliferation of high-quality Deepfakes for malicious digital activities is notably on the rise, and growing concern about their misuse has increased the demand for deep learning-based methods to detect and counteract them. While Deepfake detection using deep learning has been studied before, prior approaches primarily rely on single images and hence do not utilize temporal information. Research combining CNNs and RNNs also has inherent limitations: because the CNN compresses the data before the RNN processes it, spatial information is lost and the pixel-level temporal characteristics of the video are not fully exploited. In this study, we propose a detection model that harnesses the inherent attributes of video data through self-attention on both the spatial and temporal axes, using the ResI3D model together with a Non-Local Block. We also conducted preprocessing experiments to validate and implement methods that help the model effectively learn both temporal and spatial information. As a result, our model demonstrated improved performance compared to existing Deepfake video detection models.


INTRODUCTION
Deep learning technology has been successfully applied in various fields. In computer vision in particular, significant advancements have been achieved through its development. However, as the technology advances, cases of abuse that threaten personal privacy and national security are emerging in proportion. One such technology is Deepfake. Deepfake is a portmanteau of 'deep learning' and 'fake', and it involves using artificial intelligence to superimpose the face of a specific individual onto footage of another person. While Deepfake technology has high industrial value and is used in fields such as movies and music videos, it has raised issues of truth distortion and misuse through video manipulation. Because it can create high-quality fake videos that are indistinguishable to the human eye, it is easily exploitable by malicious users, potentially leading to significant social problems or political threats.
The broad family of facial manipulation techniques referred to as 'Deepfake' encompasses both traditional computer-vision methods and deep learning. These Deepfakes can be broadly categorized into two main types. The first is 'Identity Swap', which replaces one person's face with another's, altering the identity; it is achieved by exchanging the faces between the source video and the target video. The second is 'Facial Reenactment', where one person's expressions or emotions are transferred onto another person's face; in this case, for example, the mouth shape from the source video is applied to the target video. Of the two, Identity Swap-based Deepfakes are the ones most often exploited in digital crimes. In particular, fake news created with Identity Swap Deepfakes has contributed to increased social anxiety when disseminated on social media platforms. The ability to determine the authenticity of videos is therefore of paramount importance.
In this study, we aim to propose a deep learning-based methodology for detecting Identity Swap-based Deepfake videos, which are at high risk of exploitation.
As deep learning technology advances rapidly, the misuse of high-quality Deepfakes has significantly increased, leading to active research into methodologies for detecting Deepfake videos. Early Deepfake detection research used methods that capture abnormal movements of individuals in videos [1] or assess the quality of the images themselves [2]. Subsequent research focused on image-level CNN (Convolutional Neural Network) classifiers applied to a single frame selected from the input video. However, this approach has limitations, as it does not take into account the temporal information between frames, which is crucial for determining video-level manipulations. To address this, researchers extracted frames, compressed them with a CNN, and fed the resulting features to an RNN (Recurrent Neural Network) to classify manipulations. Nevertheless, this method has its own limitations: compression can lose temporal information among pixels, the approach does not fully leverage the characteristics of the video data, and the complex computations involved in RNN training make optimization challenging.
In this study, we address Deepfake detection by using the video itself as input data. First, we structure the visual and temporal information of Deepfake videos for effective learning through preprocessing. We conducted experiments to determine which preprocessing techniques yield the best performance when extracting frames from the entire video and reconstructing it, leading to an improvement in Deepfake video detection performance. On the model side, we use the ResI3D model, a 3D version of ResNet50, and propose a new structure by incorporating a Non-Local Block into the architecture. Through this structure, the model attends to the entire input rather than only local pixel regions, applying self-attention [3] to address long-term dependency issues that occur during learning. By leveraging both the temporal information of the video and the visual information of each frame image, we achieve strong detection performance.
In summary, we make the following main contributions:
1. Through our experiments, we optimized preprocessing by defining regions containing critical visual information and extracting frames that preserve the video's temporal data, leading to improved Deepfake detection performance.
2. We introduce a model that applies self-attention to both the spatial and temporal dimensions of the input video, effectively capturing both temporal and visual information and enhancing Deepfake video detection performance.

RELATED WORK

Face Manipulation
Deepfake technology manipulates videos to create fake content. It can be broadly categorized into computer graphics-based methods and learning-based methods, with a further distinction based on what is altered: identity manipulation versus expression manipulation.
Commonly used methodologies include Deepfakes [4], Face2Face [5], FaceSwap [6], and NeuralTextures [7]; together they underlie the FaceForensics++ dataset [8]. Face2Face alters the facial expressions of a selected face from the source video using computer graphics-based methods. FaceSwap changes the face in the source video using facial landmark information, also relying on computer graphics-based techniques to generate lifelike content. NeuralTextures employs a learning-based approach to manipulate video textures, changing expressions and producing more natural manipulations than computer graphics-based methods. The learning-based Deepfakes method starts by detecting and extracting faces from input images, aligning them to the desired face positions, and then learning to recreate them; once training is complete, it merges the reconstructed face with the original, producing more convincing Deepfake videos than the FaceSwap method. Moreover, recent advances, such as NVIDIA's StyleGAN [9], now allow the creation of entirely new Deepfake videos, demonstrating the ongoing evolution of the technology.

Deepfake Detection
As Deepfake technology has advanced, the need for techniques that detect manipulated videos has grown. There are three main approaches to Deepfake detection. The first is based on detecting the abnormal movements unique to fake videos that arise during the manipulation process, identifying and classifying these movements. Notable techniques include those that focus on abnormal eye-blinking patterns [10], as well as those that simultaneously extract features from video and audio to check the consistency between lip movements and voice [11]. The second method is based on the quality of the images themselves. One such approach distinguishes features of original and manipulated faces with a PCA-LDA (Principal Component Analysis - Linear Discriminant Analysis) classifier [12]. Another technique in this category measures the quality drop in images during compression as a detection feature [13].
The two methods introduced above are limited in the current landscape, given the high-quality fake videos produced by advances in Deepfake generation technology. To overcome these limitations, various AI-based detection techniques have been developed. Initial methods primarily used CNNs with 2D convolutional filters as their basic structure, typically focusing on image-level rather than video-level detection. Research has explored Deepfake detection with Vision Transformers, adapting the Transformer concept from natural language processing to images [15]. Recent work has also used the feature maps obtained from convolution layers as input to Vision Transformers, instead of breaking images into small patches [16]; this combines the local features captured by the CNN with the global features captured by the Vision Transformer, improving Deepfake detection performance. However, these methods operate on single images and do not consider temporal information between frames, focusing solely on facial manipulation.
To address this limitation, recent research has focused on methodologies that combine CNN and RNN. These methods extract feature maps from consecutive frames with a CNN and feed them to an RNN to assess the authenticity of the video. An example is the work on Deepfake video detection using EfficientNet and Bi-LSTM [17]: EfficientNet acquires the spatial information of each frame, and the per-frame features are passed to a Bi-LSTM [18] to acquire temporal information, which is then used to evaluate authenticity. Research in this area thus emphasizes the importance of considering both frame-level visual information and the temporal information between frames to detect Deepfake videos effectively.

Human Action Classification
The field of human action classification studies methodologies for classifying the actions of individuals appearing in input videos, and it has a longer history than Deepfake video detection. To better learn the temporal structure of video, methodologies based on 3D CNNs [19] were developed, using 3D convolutional filters configured for video data. However, they come with a larger number of training parameters and higher computational complexity, making training challenging. When these methods were first researched, no standardized video dataset comparable to ImageNet [20] existed; due to this absence and the computational cost, efficient 2D CNN models such as ResNet [21] and Inception [22] could not be exploited directly. To overcome these limitations, the I3D model [23] was introduced. It extends efficient 2D CNN networks into a 3D structure by replicating ('inflating') convolutional filters along the temporal axis, resulting in a more computationally efficient structure than traditional 3D CNN models. It also enabled the reuse of weights pre-trained on the large-scale datasets used for 2D CNNs, laying the foundation for high-performing human action classification methodologies. In this study, we use the extensively researched I3D model to propose a Deepfake detection methodology that achieves even higher performance.
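The inflation step described above — replicating a pre-trained 2D filter along the temporal axis — can be sketched as follows. This is a minimal NumPy illustration, not the I3D authors' implementation; the rescaling by 1/T is the standard trick that makes a video of identical frames produce the same activations as the original 2D network.

```python
import numpy as np

def inflate_2d_filter(w2d: np.ndarray, t: int) -> np.ndarray:
    """Inflate a 2D conv filter (out_c, in_c, kH, kW) into a 3D one
    (out_c, in_c, t, kH, kW) by repeating it t times along a new
    temporal axis and dividing by t, so a temporally constant input
    yields the same response as the 2D filter."""
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t

# A 3x3 filter bank inflated to a temporal extent of 3 frames:
w2d = np.ones((64, 3, 3, 3))        # (out_c, in_c, kH, kW)
w3d = inflate_2d_filter(w2d, t=3)   # (64, 3, 3, 3, 3)
```

The pre-trained 2D weights can then initialize the 3D network, which is what allows I3D to bootstrap from ImageNet-scale 2D training.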

METHODOLOGY
In this study, we depart from the combined CNN-and-RNN methodologies commonly used in recent Deepfake video detection research. Instead, we propose a methodology that leverages both the temporal information between video frames and the spatial information within each frame. In the preprocessing phase, we extract specific frames from the entire video so that the input data effectively captures temporal information, and we separately extract the facial region to eliminate unnecessary parts while retaining essential spatial information. Within the proposed model structure, we use a 3D CNN together with Non-Local Blocks to learn both temporal and spatial information from the input data, selectively emphasizing features beneficial for Deepfake detection.

Preprocessing
The preprocessing stage comprises frame extraction, face detection and extraction, and normalization. To identify Deepfake manipulation, we follow the process illustrated in Figure 1: the video is broken into individual frames, and the extracted frames are reconstructed into video sequences used for assessing the presence of manipulation. We preprocess in this way because feeding entire Deepfake videos into training is resource-intensive and can lead to inefficient results, so we extract frames from the video for analysis. Notably, prior research in Deepfake video detection exhibits a significant issue with frame extraction. To distinguish between real and manipulated videos, it is crucial to utilize the flow of frames, yet the frame extraction used in prior studies typically takes the first 20 consecutive frames of the video. Benchmark datasets such as FaceForensics++ and Celeb-DF consist of videos at 30 FPS (frames per second), meaning 30 frames for every second of footage. This mismatch between the frame extraction approach and the natural FPS of the benchmark datasets can lead to suboptimal results.
In other words, building a Deepfake detection model from less than one second of video is akin to using single images as input data, as it fails to capture any significant differences between frames. As shown in Figure 2, there is no apparent visual distinction between consecutive frames, making it difficult to differentiate between real and manipulated videos. This study therefore proposes a preprocessing method that ensures significant differences exist between the frames of the input data. The approach maximizes the visual differences along the temporal flow between frames, enabling effective learning of the features that separate real from manipulated videos.
In the first proposed method, named 'Temporal', frames are extracted at regular time intervals, ensuring a sufficient time gap between frames to make their differences clear. Extraction intervals of 5, 10, 15, 20, and 30 frames were tested. The 10-frame interval was used as the standard because the videos run at 30 frames per second (FPS), so this spacing maximizes the visible change between sampled frames within each second; benchmark datasets such as FaceForensics++ and Celeb-DF often consist of videos averaging 300 frames, so this choice also matched the available data. The experimental results indicated that extracting frames at 10-frame intervals was the most suitable: if the time gap between frames is too wide, temporal continuity is lost, while if it is too narrow, there is no significant difference between frames, reducing the usefulness of the information.
The second proposed method, named '4by5', combines the two approaches to frame extraction. It extracts five consecutive frames, then skips 10 frames, repeating this pattern to create four sets of 5 frames each. This combination aims to balance capturing frame differences with preserving temporal continuity. From the results of [Experiment 1], it was concluded that extracting frames at 10-frame intervals effectively captured the visual changes in Deepfake videos over time, so that method was chosen as the final data preprocessing approach.
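The three extraction schedules compared in this section (Normal, Temporal, 4by5) can be expressed as simple frame-index selection rules. The sketch below is illustrative only; the exact frame counts and video length are assumptions chosen to match the 30 FPS, ~300-frame videos described above.

```python
def normal_indices(n=20):
    """Normal: the first n consecutive frames of the video."""
    return list(range(n))

def temporal_indices(total=300, step=10):
    """Temporal: one frame every `step` frames across the whole video."""
    return list(range(0, total, step))

def four_by_five_indices(run=5, gap=10, sets=4):
    """4by5: `sets` runs of `run` consecutive frames, each run
    separated from the next by a gap of `gap` frames."""
    idx, start = [], 0
    for _ in range(sets):
        idx.extend(range(start, start + run))
        start += run + gap
    return idx
```

For example, `four_by_five_indices()` yields four blocks starting at frames 0, 15, 30, and 45, so consecutive-frame detail and longer-range temporal change are both represented.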
In Deepfake videos, not every part of a frame is manipulated; only the face is replaced with a different identity. To exploit this, face detection technology is employed to identify and extract the facial regions in the frames. Advances in computer vision and deep learning have produced various high-performance convolutional neural network (CNN)-based detectors, notably MTCNN [24], BlazeFace [25], and RetinaFace [26]. In this study, RetinaFace, currently considered a state-of-the-art model in this field, was used to investigate which region-extraction method produced the best detection results. Three extraction methods were compared: Full (the entire video frame), Mask (the facial region detected by the face detection model), and Face (the detected face with an added margin). Since the frames obtained from these methods had varying sizes, they were resized to 224x224 pixels for consistency, using bilinear interpolation for effective downscaling of the 2D images. The results of [Experiment 2] showed that the Face method, which extracts the detected face with an added margin, was the most effective: it leverages the features present in the pixels around the manipulated face, outperforming the facial region alone, while using the entire video frame introduced noise from non-facial pixels that degraded performance.
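The margin-based Face crop reduces to bounding-box arithmetic: expand the detector's box by a fixed ratio, clip to the frame, then resize. The margin ratio and the (x1, y1, x2, y2) box format below are assumptions for illustration, since the paper does not specify them; the detector's output would be plugged in where the hard-coded box appears.

```python
import numpy as np

def crop_face_with_margin(frame: np.ndarray, box, margin: float = 0.3):
    """Crop a detected face box (x1, y1, x2, y2) expanded by
    `margin` times its width/height on each side, clipped to the
    frame boundaries so the crop never leaves the image."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    mw, mh = (x2 - x1) * margin, (y2 - y1) * margin
    x1 = max(0, int(x1 - mw)); y1 = max(0, int(y1 - mh))
    x2 = min(w, int(x2 + mw)); y2 = min(h, int(y2 + mh))
    return frame[y1:y2, x1:x2]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # dummy 720p frame
crop = crop_face_with_margin(frame, (500, 200, 700, 450))
# The crop would then be resized to 224x224 with bilinear interpolation.
```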
Furthermore, since the primary objective of this study is to distinguish Deepfake videos from real videos, we examined the benchmark datasets and found that Deepfake methods generate multiple fake videos from each real video, producing an imbalance between the numbers of Deepfake and real videos. To address this imbalance, the real videos were augmented.

Model
In this study, we depart from the conventional approach of combining CNN and RNN structures typically used in Deepfake video detection. While these methods employ CNN architectures for image processing and RNN structures for handling temporal information, they have limitations such as ignoring the temporal structure of videos and facing challenges in RNN training [27]. To overcome these limitations, we propose a methodology built on the I3D model structure studied in human action classification. However, the traditional I3D model has a drawback: it relies on convolutional filters, so each output depends only on a local neighborhood of the input. In other words, it inherits the persistent limitation of CNNs that not all information in the input data can be effectively used for learning. Our proposed model introduces a methodology that addresses these limitations and incorporates processes for improving model performance.
To address the inherent limitations of CNNs and to ensure effective learning of both temporal and spatial features in videos while keeping the I3D model as the base structure, we propose adding a Non-Local Block [28] to the I3D architecture. This structural enhancement allows all pixel information within the input data to be considered during training.
The Non-Local Block is an operator on feature maps, similar to a convolutional filter, and can be inserted into the middle of a CNN. With this method, each pixel is re-expressed according to its similarity to all other pixels, ensuring that all information within the input data participates in training. The operations of the Non-Local Block and the convolutional filter compare as follows: in Equation (1), the convolutional filter computes a dot product between one pixel and its neighboring pixels through a trained filter, whereas, as shown in Equation (2) and the process diagram in Figure 3, the Non-Local Block computes similarity-weighted dot products between one pixel and all other pixels. This enables learning that considers the entire region rather than only areas adjacent within the data. By performing self-attention, which compares all pixels, the method resolves long-term dependency issues in learning; consequently, it highlights key pixels along the spatial-temporal axes, allowing video manipulation to be detected during model training and inference.
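As a minimal illustration of the non-local operation, the NumPy sketch below implements the embedded-Gaussian form from [28]: every spatio-temporal position attends to every other position via a softmax over pairwise similarities. The tiny sizes and random projections are assumptions for the example; a real Non-Local Block would use learned 1x1x1 convolutions for the projections and add a residual connection back to the input.

```python
import numpy as np

def non_local(x, w_theta, w_phi, w_g):
    """Self-attention over all T*H*W positions of a flattened video
    feature map x of shape (positions, channels): each output position
    is a similarity-weighted sum over *all* positions, unlike a
    convolution, which only mixes a local neighborhood."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    sim = theta @ phi.T                        # pairwise similarities f(x_i, x_j)
    attn = np.exp(sim - sim.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)    # softmax normalization C(x)
    return attn @ g                            # y_i = sum_j attn_ij * g(x_j)

rng = np.random.default_rng(0)
n, c = 4 * 7 * 7, 16                           # T*H*W positions, channels
x = rng.standard_normal((n, c))
y = non_local(x, *(rng.standard_normal((c, c)) for _ in range(3)))
```

Because the attention matrix is (positions x positions), the block captures long-range dependencies across both space and time in a single step, which is exactly what the convolutional filter alone cannot do.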
The overall structure of the proposed model, designed for Deepfake video detection, is divided into two main stages, as illustrated in Figure 4.
1. In the preprocessing stage, we extract 20 frames from the input video following the Temporal approach, as determined by Experiment 1. Facial areas are detected in each frame with the RetinaFace face detection model using the margin-based Face method established in Experiment 2, and the crops are reconstructed into a new video. Subsequently, we rescale the pixel values of the input video by normalizing with the mean and standard deviation of the ImageNet dataset.
2. In the model stage, the video reconstructed during preprocessing serves as the input data. The model is based on ResI3D with the addition of a Non-Local Block. To effectively learn the temporal and spatial information between frame-level images in the preprocessed data, the combination of the I3D structure and the Non-Local Block is employed.
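The ImageNet normalization in the preprocessing stage above can be sketched as follows. The mean/std constants are the standard ImageNet channel statistics; applying them per channel after scaling pixels to [0, 1] matches the input domain of backbones pre-trained on ImageNet.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_clip(clip: np.ndarray) -> np.ndarray:
    """Normalize a video clip (frames, H, W, 3) of uint8 pixels:
    scale to [0, 1], then shift/scale each RGB channel with the
    ImageNet statistics."""
    return (clip.astype(np.float32) / 255.0 - IMAGENET_MEAN) / IMAGENET_STD

clip = np.full((20, 224, 224, 3), 128, dtype=np.uint8)  # 20 gray frames
norm = normalize_clip(clip)
```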

EXPERIMENTS
The experiments used benchmark datasets in the field of Deepfake detection: FaceForensics++ and Celeb-DF [29]. FaceForensics++ consists of 1,000 original videos collected from YouTube and 5,000 Deepfake videos generated with five different generation techniques: Deepfake, FaceSwap, FaceShifter, Face2Face, and NeuralTextures. These can be broadly categorized into identity manipulation (Deepfake, FaceSwap, and FaceShifter) and expression synthesis (Face2Face and NeuralTextures). Celeb-DF comprises 890 original interview videos of famous individuals and 5,640 Deepfake videos generated from them; unlike FaceForensics++, it is notable for containing high-quality Deepfake videos generated with a single method. Both datasets were used in this study. For FaceForensics++, only Deepfake videos created with identity manipulation techniques were selected, aligning with the research objectives. To address the imbalance between Deepfake and real videos, the real videos were augmented. The final experimental dataset contained 3,000 Deepfake videos from FaceForensics++ (identity manipulation) and 5,640 from Celeb-DF, with the number of real videos increased to 3,000 and 4,450, respectively, up to five times the original quantity. Both FaceForensics++ and Celeb-DF were used for the preprocessing-related experiments (Experiments 1, 2, and 3).

Experiment 1: Frame Extraction Experiment
In this experiment, video preprocessing extracted frames at a specific interval to reconstruct the videos. The objective was to find an extraction method that effectively captures the temporal information of the Deepfake data and the changing spatial information in the frame images. Three extraction methods were tested: the Normal method, the commonly used approach of taking frames from one continuous segment; the Temporal method, which extracts frames at a 10-frame interval; and the 4by5 method, a combination of the two, as described in the preprocessing section. The experiment was conducted with the base model, ResI3D, with the goal of identifying a method that can effectively distinguish between real and Deepfake videos. The results show that the Temporal method, which extracts frames at a regular interval of 10 frames, demonstrates stable performance. This can be attributed to its capturing the dataset's temporal patterns by considering temporal and spatial information together, making it an improved preprocessing method compared to the conventional approach.

Experiment 2: Area-Related Experiment
In this experiment, we focused on Deepfakes where the identity has been changed. Instead of using the videos directly, we applied preprocessing by detecting faces in the videos, with the objective of determining the best preprocessing method for optimal detection results. Three different area-extraction methods were employed: Full (the entire video frame), Mask (the face area detected by a face detection model), and Face (the detected face area with an added margin). The frames extracted by these methods had varying sizes, so they were resized to 224x224 for uniformity, using bilinear interpolation for effective downscaling of the 2D images. The experiments were conducted with the base model, ResI3D, to identify the most suitable area-extraction method. The results indicated that the model's performance was highest with the Face extraction method, which was therefore selected as the preprocessing method for area extraction.

Experiment 3: Data Augmentation
In this experiment, we used the Temporal method for frame extraction and the Face method for face-area preprocessing. To address the imbalance between real videos and Deepfake videos, we augmented the dataset. The objective was to evaluate results on the augmented dataset, where the Deepfake detection model should perform well not only on Deepfakes but also on real videos. The experiments were conducted with the base model, ResI3D. The results demonstrated a significant improvement in the model's performance, as indicated by the AUC value, when the augmented dataset was used; notably, the discriminative performance on real videos also improved. This suggests that addressing data imbalance through augmentation is crucial for enhancing overall detection performance.

Experiment 4: Non-Local Block
In the next experiment, we compared the model's performance with and without the Non-Local Block. This study proposed a model structure that overcomes the limitation of conventional I3D models, which tend to focus on regions adjacent to specific pixels; we introduced the Non-Local Block so that the model can learn from all information within the input video, leveraging both temporal and spatial information. The objective was to demonstrate improved performance by utilizing all the available information. We used the ResI3D model as the base and compared its performance with and without the Non-Local Block. In Experiments 4 and 5, we used data preprocessed with the methods selected in the previous experiments. The results showed an improvement across all metrics when the Non-Local Block was added to the model.

Experiment 5: Comparative Experiment
Finally, we conducted a comparative experiment to evaluate our proposed methodology against existing Deepfake detection models. We compared our approach with four different versions of the ensemble model that combines four EfficientNetB4 models [31], an efficient Deepfake detection model using EfficientNet and Bi-LSTM [17], a model trained by CVIT, the Multi-VIT model that uses three different scales of feature maps as input patches, and the base ResI3D model. The experimental data included FaceForensics++, Celeb-DF, and an additional Korean dataset provided by AIhub, included to address the lack of datasets featuring Asians and to validate Deepfake detection for Asian faces. The experimental setup used the final preprocessing pipeline: the Temporal frame-extraction method, the Face area-extraction method, and the augmented real-video data. Model training consistently employed the Adam optimization algorithm with binary cross-entropy loss for Deepfake detection, and ImageNet data normalization was applied throughout.
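The training objective stated above — binary cross-entropy on the real/fake label — can be sketched numerically. This is a NumPy illustration only; in practice the framework's built-in BCE-with-logits loss would be used alongside the Adam optimizer, and the stable form below is the standard reformulation that avoids overflow for large-magnitude logits.

```python
import numpy as np

def bce_loss(logits: np.ndarray, labels: np.ndarray) -> float:
    """Numerically stable binary cross-entropy from raw logits:
    max(z, 0) - z*y + log(1 + exp(-|z|)), averaged over the batch.
    Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]."""
    z, y = logits, labels
    per_sample = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(per_sample.mean())

# One batch of model outputs (logits) against real(0)/fake(1) labels:
logits = np.array([2.0, -1.5, 0.0, 3.0])
labels = np.array([1.0, 0.0, 1.0, 1.0])
loss = bce_loss(logits, labels)
```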
In the final performance comparison with existing and state-of-the-art models, our proposed model outperformed all other models across all of the data, with a particularly large margin on the AIhub data, an Asian dataset. This suggests that our model can effectively extract features from Deepfake and real videos regardless of race.

CONCLUSION
In this paper, a novel approach to Deepfake detection was introduced that departs from traditional methods combining CNN and RNN structures. Instead, a 3D CNN architecture was employed to extract spatiotemporal features with 3D convolution filters. The study leveraged the efficient 3D ResNet50 architecture, known for its lower computational demands and high performance across various domains. The addition of the Non-Local Block enabled learning over the entire input video region and emphasized key pixels along the spatiotemporal axes for manipulation detection. Various parameter comparisons determined the optimal facial-region setup and model structure. The proposed model's effectiveness was compared with existing Deepfake detection models, and it notably outperformed the current state-of-the-art model under two different experimental conditions, particularly on the FaceForensics++ and Celeb-DF datasets. This research has the potential to enhance the accurate detection of identity-swap Deepfake videos, addressing concerns about digital crimes involving Deepfakes.

Figure 1 :
Figure 1: Frame extraction process. The visualization illustrates the frame extraction methods from top to bottom: the Normal method, the proposed Temporal method, and the proposed 4by5 method.

Figure 2 :
Figure 2: Visualization of frame-by-frame images based on the extraction methods.

Figure 3 :
Figure 3: Calculation process of the Non-Local Block

Figure 4 :
Figure 4: Scheme of the proposed methodology

Figure 5 :
Figure 5: Comparison of AUC for Benchmark dataset with Deepfake models

Table 1 :
Comparison results for frame extraction methods

Table 2 :
Comparison results for the processing area

Table 3 :
Results of augmentation effects on real videos

Table 4 :
Results of adding the Non-Local Block