CLIPping the Deception: Adapting Vision-Language Models for Universal Deepfake Detection

The recent advancements in Generative Adversarial Networks (GANs) and the emergence of Diffusion models have significantly streamlined the production of highly realistic and widely accessible synthetic content. As a result, there is a pressing need for effective general purpose detection mechanisms to mitigate the potential risks posed by deepfakes. In this paper, we explore the effectiveness of pre-trained vision-language models (VLMs) when paired with recent adaptation methods for universal deepfake detection. Following previous studies in this domain, we employ only a single dataset (ProGAN) in order to adapt CLIP for deepfake detection. However, in contrast to prior research, which relies solely on the visual part of CLIP while ignoring its textual component, our analysis reveals that retaining the text part is crucial. Consequently, the simple and lightweight Prompt Tuning based adaptation strategy that we employ outperforms the previous SOTA approach by 5.01% mAP and 6.61% accuracy while utilizing less than one third of the training data (200k images as compared to 720k). To assess the real-world applicability of our proposed models, we conduct a comprehensive evaluation across various scenarios. This involves rigorous testing on images sourced from 21 distinct datasets, including those generated by GAN-based, Diffusion-based and Commercial tools.

These powerful tools have become accessible to a wider audience due to open-source availability.
In response to this, researchers have been actively proposing novel methods for automatic detection of synthetic content [5,7,32,45,47]. However, a major issue with existing deepfake detection models is their limited ability to generalize across different data distributions [5,29,32]. Deepfake detection is typically posed as a supervised learning problem, where a deep neural network model is trained to differentiate between authentic (real) and manipulated (fake) images [32,45,52]. However, a significant challenge arises: if the model is exclusively trained on a particular category of fake images, its performance may falter when confronted with novel types of manipulated images, i.e., the generalization dilemma [25].
In [32], Ojha et al. suggest that current detection models might be biased towards identifying certain types of fake images because they focus on easily detectable patterns found in those images. As a result, these models might miss out on the subtle features of real images, treating them as if they do not match the patterns learned from the fake images. In order to overcome this, the authors proposed to conduct classification using models that have been trained on a diverse range of images during their initial training, i.e., models that are not specifically trained for deepfake detection. They proposed to employ large vision-language models, in particular, CLIP (Contrastive Language-Image Pre-training) [37] as a feature extraction model, and train a linear classification head on top for detecting deepfakes. They also observed that CLIP, even without undergoing specific training for classifying real and fake images, exhibits remarkable capability right from the start in discerning between authentic and fake images. Refer to Figure 1 for details.
In [32], Ojha et al. adapted CLIP for deepfake detection using linear probing, and the results they achieved showed strong generalization capabilities as compared to the previous state-of-the-art [45] in detecting deepfakes. However, as highlighted in [26,50], adapting CLIP through linear probing does not exploit its language component, and only relies on visual features, which can lead to suboptimal performance. Our hypothesis is that by adapting CLIP using both the visual and text encoders, we can enhance detection performance, leading to a more effective and generalizable strategy for deepfake detection. To verify this hypothesis, we raise the following question: "Could combining CLIP's visual and textual capabilities further improve deepfake detection methods?" In pursuit of an answer, we delve into existing research literature focused on adapting Vision-Language Models (VLMs), specifically CLIP [37], for image classification tasks. For instance, Prompt Tuning [50] by Zhou et al. involves adapting a pre-trained CLIP model using language supervision. This method freezes the large CLIP model, and optimizes a small embedding treated as a prompt. In [13], CLIP-Adapter is introduced, which adds a lightweight linear layer inside the CLIP model. During training, the large CLIP model remains frozen, while the smaller linear layer is optimized. Surprisingly, these promising strategies have not been explored in detecting deepfakes. The primary focus of our study is thus to determine the most effective transfer learning strategy among various options for large vision-language models in the context of deepfake detection. Moreover, we also ask how various experimental conditions might impact the performance of the adopted strategies. This includes examining their ability to generalize to unseen data, performance when trained with limited real or fake image samples, robustness to different post-processing operations, and the impact of using a restricted amount of data for training.
To answer all these questions, we conduct an empirical analysis of the robustness of CLIP [37] when trained using these strategies, and evaluate the resulting models on data originating from varied distributions. Specifically, we take the pre-trained CLIP model, and train it for deepfake detection using four distinct strategies, including (1) Fine-tuning, (2) Linear Probing, (3) Prompt Tuning [50] and (4) training an Adapter Network [13]. Following [45] and [32], we employ ProGAN [20] as our training set. However, in contrast to these studies, we only use 200k images for training as compared to the 720k images used by these two studies. We analyze our models on an extensive test set comprising 21 different GAN-based, Diffusion-based and Commercial image generators. Our approach achieves high classification performance while using less training data as compared to previous approaches.
Our contributions can be summarised as follows:
• We conduct an extensive empirical investigation into four distinct transfer learning strategies aimed at enhancing the adaptability and robustness of CLIP for deepfake detection, while taking inspiration from recent research on adapting large VLMs.
• Through experimentation, we illustrate that our chosen transfer learning strategies, notably Prompt Tuning, beat the current state-of-the-art [32] by a clear margin.
• We carry out few-shot experiments, illustrating excellent performance of our models even when exposed to only 32 real/fake samples from each LSUN object category [49], highlighting the effectiveness of the selected lightweight transfer learning strategies.
• We conduct a robustness analysis in the presence of post-processing operations such as JPEG compression and Gaussian blurring.
• We analyze the impact of training set size, demonstrating that CLIP-based detectors achieve solid performance even when trained using a smaller amount of data (20k real/fake images).
• We plan on making the associated code and trained models open-source for the benefit of the research community.

This paper is organized as follows. In Section 2 we present a brief description of related works. In Section 3 we introduce the problem background, our proposed deepfake detection workflows, and the datasets that we employ for evaluation of our models. In Section 4 we describe in detail the experiments we carried out for this study, and discuss the achieved results. Finally, in Section 5 we conclude our study.

RELATED WORKS

2.1 Pre-trained Vision-Language Models
Recent advancements in large-scale pre-trained models, which integrate vision and language capabilities, have showcased notable success across a variety of tasks encompassing both images and text [1,18,37]. The primary rationale driving the extensive adoption of these models lies in their interesting zero-shot capabilities and robustness to distribution shifts.
Radford et al. proposed Contrastive Language-Image Pre-training (CLIP), a large-scale model that exhibits robust zero-shot performance on several downstream tasks including image classification, optical character recognition, image-text retrieval, and multiple other tasks [37]. CLIP was pre-trained on a large-scale dataset containing 400 million images and their associated text captions. It was pre-trained using a contrastive loss that aims to maximize the similarity between corresponding image and text pairs relative to non-matching pairs. Moving away from the expensive data cleaning process required by Radford et al., Jia et al. [18] utilized a large-scale noisy dataset containing one billion image-text pairs to pre-train their model. The model comprised a dual-encoder architecture, which was tasked with aligning visual and language representations of image-text pairs through a contrastive loss. They showed that a large enough dataset can compensate for its noise, resulting in state-of-the-art representations even with such a straightforward learning approach.
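The contrastive objective described above can be illustrated with a small NumPy sketch. This is not CLIP's actual training code: the feature arrays are random stand-ins for encoder outputs, and the function name, the fixed `temperature` value and the toy batch are assumptions made for illustration only.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of N matching pairs.

    Row i of image_feats is assumed to match row i of text_feats.
    The temperature value is an illustrative assumption.
    """
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) cosine similarities
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float(0.5 * (loss_i2t + loss_t2i))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))                   # toy stand-in features
aligned = contrastive_loss(feats, feats)          # every pair matches
shuffled = contrastive_loss(feats, feats[::-1])   # every pair is mismatched
```

In this toy setting the loss is low when each image feature is paired with its own text feature and high when the pairing is reversed, which is the behaviour the pre-training objective rewards.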

Transfer Learning
Vision and language models like CLIP [37] and ALIGN [18] offer interesting zero-shot capabilities on several different downstream tasks. Yet, to attain performance levels comparable to state-of-the-art models on these downstream tasks, these models require further fine-tuning on task-specific datasets. For example, even on a simple dataset like MNIST [27], the zero-shot CLIP model (ViT-B/16) which was tested in [26] achieved an accuracy of only 55%.
However, fine-tuning the full model on a downstream dataset affects its robustness to distribution shifts [37,48]. In response to this challenge, several studies have introduced lightweight techniques to adapt large vision and language models. In [50], Zhou et al. proposed Context Optimization (CoOp), a fine-tuning strategy to adapt vision-language models similar to CLIP for downstream image classification tasks. CoOp injects learnable vectors into a textual prompt's context (either at the front, middle or end), which are optimized during fine-tuning by minimizing the classification loss, whereas both the vision and text encoders of CLIP are kept frozen. Gao et al. introduced CLIP-Adapter [13], a bottleneck layer designed to learn new features during fine-tuning. Additionally, it employs a residual-style feature aggregation approach to integrate the originally pre-trained CLIP features with the newly acquired ones, all while keeping the CLIP model itself frozen.

Fake Image Generation and Detection
Deep learning models for fake image generation have been with us for quite some time. Goodfellow et al. initially introduced Generative Adversarial Networks (GANs), a neural network architecture for unconditional fake image generation [14]. Subsequent seminal works targeted, for example, improving the training process of GANs [16,21,42], improving the quality and diversity of the generated images [20,24], and conditional image synthesis [31,46].
In more recent times, text-to-image generation models have attracted interest following the introduction of Diffusion models [11,30]. Most of the recent Diffusion-based image synthesis models, including Stable Diffusion [39], SDXL [36], DALL-E [38] and Imagen [41], have demonstrated the ability to produce high quality images. Diffusion models also demonstrate the ability to generate images spanning a more diverse range of categories and scenes as compared to GANs.
With the widespread availability of powerful open-source fake image synthesis models, the necessity to develop models capable of detecting fake images has become more crucial than ever before. Numerous previously proposed deepfake image detection methods opted to learn a deep neural network classifier capable of classifying real vs fake images originating from the same generative model [40]. However, studies suggest that such classifiers do not generalize well to fake images coming from distributions other than the training one [25,52].
Wang et al. [45] proposed a simple yet effective solution to the challenge of detecting images generated by GANs. By training a well-known CNN architecture, ResNet-50 [17], on a single GAN-generated dataset (ProGAN [20]), along with augmentations like JPEG compression and blurring, they significantly improved the model's robustness. This approach performed well even on images generated by different GAN models. Building on this, Gragnaniello et al. [15] modified ResNet-50 for GAN image detection. They avoided down-sampling in the initial layers in order to preserve high-frequency GAN-related fingerprints, and applied intense augmentations during training, outperforming the previous method [45]. Corvi et al. [8] extended the work proposed in [15], training the same modified ResNet-50 on the dataset from [45]. They found their model excelled on GAN images but struggled with Diffusion models. Conversely, training on images from LDMs [39] yielded success on Diffusion-generated images but not on GAN ones. In a recent study, Ojha et al. [32] noted that previous techniques [45] fail on Diffusion model-generated images when initially trained on images generated by GAN models. They utilized a fixed CLIP encoder to train a linear classifier on CLIP features, achieving SOTA results for both GAN and Diffusion model-generated images by training their model only on GAN-generated images, the same as [15,45].

METHODOLOGY

3.1 Background
The ultimate objective of a deepfake detection system is to determine if any given image is (a) authentic: captured using a camera, or (b) fake: synthesized using a generative model (GAN or Diffusion). In this section, we outline the methodologies examined in this study for training our detection model, along with the datasets used to train and evaluate our model. However, we begin by first presenting the baseline [45] and current SOTA [32] approaches proposed recently to address this task. These studies effectively lead us towards our proposed solution.
Wang et al. in [45] trained a ResNet-50 [17] using cross-entropy loss to perform binary classification between real and fake images, using data they generated with the ProGAN model [20] after training it on 20 different object categories taken originally from LSUN [49]. For each of the 20 object categories, the authors generated 18k synthetic images, totaling 360k fake images. They incorporated real images from the LSUN dataset, amounting to 18k real images for each of the 20 object categories. Consequently, their training dataset contained 720k real and fake images. They demonstrated through comprehensive evaluation that a simple CNN, when trained with meticulous data augmentation techniques like compression and blurring, exhibits effective generalization for deepfake detection on previously unseen data. They evaluated their trained model on images synthesized by various GAN models, showing excellent results.
Following this, Ojha et al. [32] found that the work in [45] was not performing as expected when tested on images synthesized by Diffusion models. For instance, on images generated by models like Latent [39] and Guided [11] Diffusion models, the detection model's classification accuracy experiences a significant decline, reaching close to chance performance. This implies that during training, the model focuses solely on detecting the presence or absence of model-specific artifacts in an image, while overlooking other distinguishing features between real and fake images. As a consequence, the resulting model becomes biased towards a single class (real in this case), leading to the misclassification of fake images from a Diffusion model without GAN-specific artifacts as real.
To tackle this issue, the authors suggested that the classification process should occur in a feature space that has not been learned solely to discriminate between real and fake images. This approach was aimed at preventing bias towards recognizing specific artifacts from one class (Real, GAN, or Diffusion) disproportionately better than the other [32]. Additionally, the selected feature space must capture a wide range of images, ensuring a robust fake image detector that works reliably across various categories such as outdoor scenes, objects, faces and beyond. The authors identified that CLIP's [37] feature space possesses these desirable qualities: it was not initially trained for real vs fake classification, and has been exposed to a variety of images representing diverse objects and scenes.
To validate their hypothesis, the authors used CLIP's image encoder (ViT-Large) as a feature extractor, and trained a simple linear model on top. They used the same training dataset as in [45]. The obtained results supported their hypothesis: their simple approach achieved state-of-the-art performance on previously unseen images from both GAN and Diffusion models [32]. While [32] achieves excellent results on most datasets, it still seems to struggle on some datasets, including Guided Diffusion [11], LDM [39], Deepfakes [40], FaceSwap [40], and Commercial generators such as DALL-E 3, Adobe Firefly and Midjourney.

Transfer Learning
When applied to adapt vision-language models for downstream vision tasks, linear probing faces a significant drawback as it completely overlooks the language component. As noted in [50], a linear layer trained on visual features serves as a static set of weights exclusively representing visual concepts. Consequently, the semantics embedded in texts remain largely unexplored and play no role in this process. This limitation is exemplified in [32], where only the visual component of CLIP is utilized for deepfake detection, while the text encoder is completely neglected. We believe that leveraging both the visual and text encoders of CLIP [37] can lead to an improved strategy for real vs fake classification.
Based on this insight, we propose leveraging CoOp [50], a Prompt Tuning strategy, as our central approach to adapt CLIP [37] for deepfake detection. Prompt Tuning is particularly appealing as it integrates both the visual and language aspects of CLIP. To ensure a fair assessment of the robustness of various transfer learning strategies, we incorporate three additional methods besides Prompt Tuning for this task: (1) Linear Probing, (2) Full Fine-tuning and (3) training an Adapter Network [13]. A concise overview of each employed transfer learning strategy is presented in the following sections.

3.2.1
Linear Probing: Linear probing, a well-known transfer learning strategy, involves training a linear classifier on top of a frozen model (CLIP in our case). We follow the same approach as employed by Ojha et al. [32], i.e., we discard CLIP's text encoder while freezing its image encoder. We then train a single linear layer for classification on the frozen CLIP's image features, mapping the penultimate image features to logits for class predictions using the Sigmoid activation function. The optimization takes place using the binary cross-entropy loss. We illustrate the linear probing strategy in Figure 2.
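As a minimal sketch of the linear probing recipe above, the NumPy code below trains a single linear layer with a sigmoid output and binary cross-entropy loss. The "frozen features" are synthetic Gaussian clusters standing in for CLIP image features, and the learning rate, step count and feature dimension are illustrative assumptions, not the settings used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(probs, labels, eps=1e-7):
    # Binary cross-entropy averaged over the batch.
    probs = np.clip(probs, eps, 1 - eps)
    return float(-(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean())

def train_linear_probe(features, labels, lr=0.5, steps=200):
    """Gradient descent on one linear layer; the feature extractor is never touched."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = sigmoid(features @ w + b)
        grad = probs - labels                      # d(BCE)/d(logit)
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
# Synthetic stand-ins for frozen image features (label 0 = real, 1 = fake).
real = rng.normal(loc=-1.0, size=(100, 16))
fake = rng.normal(loc=+1.0, size=(100, 16))
features = np.vstack([real, fake])
labels = np.concatenate([np.zeros(100), np.ones(100)])

w, b = train_linear_probe(features, labels)
probs = sigmoid(features @ w + b)
loss = bce(probs, labels)
accuracy = float(((probs > 0.5) == labels).mean())
```

Only `w` and `b` are updated, which is what keeps the trainable parameter count of this strategy small.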

3.2.2
Fine-tuning: Fine-tuning in this context means training the whole CLIP model (ViT-Large) again on the downstream dataset, which in our case is the ProGAN dataset also used by [45] and [32]. Full fine-tuning requires significantly more compute resources, data, and training time since the entire model is retrained. Additionally, as model size increases, this strategy demonstrates instability and inefficiency [26]. During the training of our models, we encountered this issue, and mitigated it by utilizing an extremely small learning rate, 1 × 10⁻⁶. To fine-tune our model, we adhere to the procedure outlined in the pre-training of CLIP [37]. However, we introduce a modification: rather than utilizing entire text captions for each image, we provide only single-word captions, specifically either real or fake. A typical Fine-tuning pipeline for adapting CLIP is illustrated in Figure 2.
Table 1: This table showcases the statistics of the test datasets. Certain datasets include their own collection of real images. However, for datasets that lack their own real images, we utilize LAION's [43] images instead.

Prompt Tuning:
Initially introduced in the domain of natural language processing [28], Prompt Tuning is a relatively recent transfer learning strategy adopted by the computer vision community. This approach involves fine-tuning a pre-trained model like CLIP [37] by learning randomly initialized prompts (textual [50] and/or visual [19]) during training. The primary goal of Prompt Tuning is to adapt the model to specific downstream tasks by optimizing the prompts to align better with the target objectives.
In this study, we employ Context Optimization (CoOp), a transfer learning strategy introduced by Zhou et al. in [50], to fine-tune CLIP for the task of deepfake image detection. CoOp adds learnable vectors alongside the context words of a prompt. These learnable vectors can be initialized either with random values or with pre-trained word embeddings [50]. During training, the learnable vectors are optimized, whereas both the text and vision encoders of CLIP are kept frozen.
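For reference, the front-placed prompt form defined in the CoOp paper [50] can be written as follows. The notation is an assumption carried over from that paper: M learnable context vectors [V]_m, each with the same dimension as a word embedding, followed by the class token.

```latex
t = [V]_1 \, [V]_2 \, \dots \, [V]_M \, [\mathrm{CLASS}]
```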
[CLASS] refers to the class token of the dataset, e.g., real and fake in our case. The class token within each prompt t_i is replaced by the corresponding word embedding vector of the i-th class name. The prompt t is then fed through the text encoder, and optimized using cross-entropy loss during training. As evident from Eq. 1, the context tokens are placed in front of the class labels. While the CoOp paper explores other placement strategies, such as "end" and "middle", our findings indicate that placing the context tokens at the "front" yields comparatively better results. We show the Prompt Tuning (CoOp) based CLIP training strategy in Figure 2.
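The forward pass of this prompt-based classifier can be sketched as below. Everything here is a toy stand-in rather than CLIP itself: `frozen_text_encoder`, the embedding sizes, the logit `scale` and the random features are hypothetical and chosen only to show that the shared context vectors are the single trainable tensor, placed in front of each class token.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, N_CTX = 8, 4                      # toy sizes, not the paper's

# Frozen pieces (stand-ins for CLIP): class-name embeddings and a text encoder.
class_names = ["real", "fake"]
class_embeddings = rng.normal(size=(len(class_names), EMBED_DIM))
text_proj = rng.normal(size=(EMBED_DIM, EMBED_DIM))

def frozen_text_encoder(token_embeddings):
    """Hypothetical stand-in: mean-pool the tokens, then apply a fixed projection."""
    return token_embeddings.mean(axis=0) @ text_proj

# The only trainable tensor: context vectors shared by all classes.
context = rng.normal(scale=0.02, size=(N_CTX, EMBED_DIM))

def build_prompts(context, class_embeddings):
    """Front placement: the context vectors followed by one class token per prompt."""
    return [np.vstack([context, cls[None, :]]) for cls in class_embeddings]

def prompt_logits(image_feature, context, scale=100.0):
    prompts = build_prompts(context, class_embeddings)
    text_feats = np.stack([frozen_text_encoder(p) for p in prompts])
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    image_feature = image_feature / np.linalg.norm(image_feature)
    return scale * text_feats @ image_feature  # scaled cosine similarity per class

def cross_entropy(logits, label):
    shifted = logits - logits.max()
    return float(np.log(np.exp(shifted).sum()) - shifted[label])

image_feature = rng.normal(size=EMBED_DIM)   # stand-in for a frozen image-encoder output
prompts = build_prompts(context, class_embeddings)
logits = prompt_logits(image_feature, context)
loss = cross_entropy(logits, label=1)        # label 1 = "fake"
```

In an actual training loop, the gradient of `loss` would be propagated through the frozen text encoder to update `context` only; that step is omitted here because it needs an automatic differentiation framework.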

Adapter Network:
Stepping away from Prompt Tuning, Gao et al. introduced a simple yet effective alternative approach for fine-tuning vision-language models using feature adapters [13]. Specifically, the authors introduce CLIP-Adapter, an extra lightweight bottleneck layer which is optimized during training while the remainder of the CLIP model is kept frozen. Additionally, to remain robust against unseen data distributions, CLIP-Adapter integrates the original zero-shot visual or language embeddings with the corresponding fine-tuned feature embeddings through a ResNet-style residual connection [17]. This feature blending allows CLIP-Adapter to exploit both the knowledge stored in the original CLIP's feature space and the newly acquired knowledge from the downstream training examples simultaneously. CLIP-Adapter can be applied to either the visual or language branch. In our study, however, we only use the Adapter Network with the vision branch, and leave the language branch as is. See Figure 2 for reference.
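The residual feature blending described above can be sketched as follows. This is an illustrative bottleneck adapter in NumPy, not the CLIP-Adapter implementation: the class name, the `reduction` factor, the blending weight `alpha` and the initialisation scale are assumed values.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class FeatureAdapter:
    """Bottleneck adapter whose output is blended with the original feature."""

    def __init__(self, dim, reduction=4, alpha=0.2, seed=0):
        rng = np.random.default_rng(seed)
        hidden = dim // reduction
        # The two trainable projections; the backbone producing `feats` stays frozen.
        self.w_down = rng.normal(scale=0.02, size=(dim, hidden))
        self.w_up = rng.normal(scale=0.02, size=(hidden, dim))
        self.alpha = alpha

    def __call__(self, feats):
        adapted = relu(relu(feats @ self.w_down) @ self.w_up)
        # Residual blend: alpha weights the new features, (1 - alpha) the original ones.
        return self.alpha * adapted + (1.0 - self.alpha) * feats

rng = np.random.default_rng(1)
frozen_feats = rng.normal(size=(5, 16))       # stand-in for frozen visual features
blended = FeatureAdapter(dim=16)(frozen_feats)
unchanged = FeatureAdapter(dim=16, alpha=0.0)(frozen_feats)
```

With `alpha = 0` the adapter returns the original features untouched, and larger values shift weight towards the newly learned ones, which is the trade-off the residual connection is meant to expose.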

Generative Models Explored
In this paper, we conduct an in-depth investigation into four distinct transfer learning approaches for deepfake detection. Our analysis is aimed at assessing the robustness of these approaches, when coupled with the pre-trained CLIP [37] ViT-Large model, on unseen data coming from diverse deepfake generators including GANs and Diffusion models. We follow the same protocols outlined by Wang et al. [45] and Ojha et al. [32], and train our models using data coming from just one generative model, i.e., ProGAN [20]. However, for evaluation we incorporate an even broader spectrum of generative models in our analysis. This extension aims to align our evaluation more closely with real-world scenarios. In total, we assess our models across 21 distinct datasets, primarily categorized as GAN-based, Diffusion-based and commercial tools [8]. For detailed dataset statistics, please refer to Table 1.
Another minor deviation in the evaluation protocol we follow is that [32] employed three distinct configurations for image generation using Glide and LDMs, presenting their findings separately. In contrast, we include all images from the Glide and LDM subsets in our analysis but display averaged results in our tables due to space constraints.

EXPERIMENTS
In this section, we present performance scores achieved by CLIP ViT-Large [37] when trained using four distinct transfer learning strategies: (1) Linear Probing, (2) Fine-tuning, (3) Adapter Network [13] and (4) Prompt Tuning [50]. Additionally, we evaluate trained models released by [8,15,32,45] on the same test set on which we evaluate our own models. Our aim is to determine if our chosen transfer learning strategies offer superior generalization compared to previous studies. In subsequent sections, besides assessing generalization capabilities, we conduct further experiments to assess the performance of our models under various conditions, including smaller training set sizes, few-shot analysis, and robustness to post-processing operations.

Generalization Performance
We evaluate our model's performance by comparing it with four prior studies that aim to detect various types of deepfake images generated by different fake image generators. The initial study [45] in this field employed ResNet-50 [17] as the classifier. They trained their models on 720k real/fake images sourced from the ProGAN dataset which they generated for their study. They also employed image augmentations such as JPEG compression and Gaussian blurring, which made their models more robust towards post-processed images during evaluation. The second study [15] also employs ResNet-50, but with a simple adjustment to the original architecture to better preserve the low-level forensic traces present inside images. The proposed modified model was also trained for real/fake classification on the ProGAN dataset introduced in [45].
In [8], Corvi et al. use the same modified ResNet-50 [15] but train it again on two different datasets, i.e., ProGAN/LSUN and LatentDiffusion/LSUN, to better understand which generative model offers better generalization. The fourth study from Ojha et al. [32] attained state-of-the-art performance. They utilized the CLIP ViT-Large model as a feature extractor, and subsequently trained a linear network on top of it for real/fake classification. In Tables 2 and 3, we compare our models' performance with that of [8,15,32,45]. The studies [15,45] demonstrate strong performance on GAN-generated images but show mediocre results on images from Diffusion-based and Commercial generators. Conversely, [32] achieves good results on both GAN-based and Diffusion-based generators, although performance decreases on images from Commercial image generators (see Table 6) and the FaceForensics++ dataset [40], which utilizes an Auto-encoder based architecture for image synthesis.
Our four proposed CLIP adaptation approaches for deepfake detection demonstrate consistently better performance across all datasets, as is apparent from the numbers in Tables 2, 3 and 6. Moreover, as seen in Tables 2 and 3, the Prompt Tuning strategy [50] outperforms the other transfer learning strategies in terms of both mAP and average accuracy. Notably, Prompt Tuning optimizes only a fraction of the parameters (12k) compared to the Adapter Network and full Fine-tuning approaches, which optimize a larger number of parameters. Overall, we surpass the previous SOTA [32] by 5.01% in mAP and 6.61% in average accuracy across images from 18 distinct synthetic image generators.
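For readers unfamiliar with the two reported metrics, the sketch below computes average precision (AP) and accuracy for a single test set from detector scores; mAP is then the mean of the per-dataset AP values. The function names, the 0.5 decision threshold and the toy labels and scores are illustrative assumptions, not the evaluation code used in this study.

```python
import numpy as np

def average_precision(labels, scores):
    """Mean of precision@k taken at the rank of every positive (fake) sample."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    ranked = np.asarray(labels)[order]
    hits = np.cumsum(ranked)
    precision_at_k = hits / np.arange(1, len(ranked) + 1)
    return float((precision_at_k * ranked).sum() / ranked.sum())

def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = (np.asarray(scores) >= threshold).astype(int)
    return float((preds == np.asarray(labels)).mean())

# Toy example: label 1 = fake, 0 = real; scores are predicted fake probabilities.
labels = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])

ap = average_precision(labels, scores)
acc = accuracy(labels, scores)

# mAP over several test sets is the plain mean of their AP values.
mean_ap = float(np.mean([ap, average_precision(labels, labels.astype(float))]))
```

AP depends only on the ranking of the scores, whereas accuracy depends on the chosen threshold, which is why the two metrics can order detectors differently.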

Effect of Transfer Learning Strategy
In this section, we assess and compare the effectiveness of the transfer learning strategies trained on images from the ProGAN/LSUN datasets. Results are summarized in Tables 2 and 3. It is evident from the reported numbers that Prompt Tuning (CoOp) outperforms the other strategies. Although the margin is modest, this is noteworthy as Prompt Tuning optimizes only a small number of parameters (≈ 12k), compared to Linear Probing, Fine-tuning and Adapter Network, which optimize approximately ≈ 1.5k, ≈ 427M and ≈ 590k parameters respectively. Moreover, in the few-shot experiments shown in Table 5, Prompt Tuning also outperforms the other three strategies. However, in terms of robustness to post-processing operations, Linear Probing turns out to be the best performing strategy.

Effect of Training Set Size
We also conducted experiments with various training set sizes, and in this section, we report on the performance of the participating transfer learning strategies when trained with reduced numbers of real and fake images. Using ProGAN's [20] data, we create four smaller datasets containing 20k, 40k, 60k and 80k images. As shown in Table 4, we observe that while larger training datasets generally yield higher scores, the differences are not significant. Moreover, since the fake images in the training data are generated by a GAN model (ProGAN [20]), the impact of training data size is less pronounced when evaluating models on other GAN models in the test set compared to Diffusion models or Commercial tools. This analysis indicates that even with limited training resources, it is still possible to train robust detection models without a significant decline in generalization capabilities.

Robustness to Post-processing Operations
In real-world scenarios, images commonly undergo post-processing before being shared online, and research indicates that these operations significantly impact detection models' performance [9,32,45]. To assess how our models handle post-processing, following previous studies [32,45], we evaluate them on images subjected to two types of operations: (1) JPEG compression and (2) Gaussian blurring.
To gauge the impact of JPEG compression, we tested two quality settings: 75% and 50%. For blurring effects, we used sigma values of 1 and 2. The performance results of our models are depicted in Figure 5. As expected, performance declines as the blur sigma increases and the JPEG quality decreases, though it remains acceptable considering our models were not explicitly trained on compressed or blurred images. We notice that this decline is more pronounced for images generated by Commercial tools, except for the fully Fine-tuned model. Linear Probing outperforms the other adaptation strategies across the three different generative model families.
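To make the blurring operation concrete, here is a minimal NumPy sketch of a separable Gaussian blur applied at the two sigma values mentioned above. It is an assumption-laden illustration (single-channel image, reflect padding, kernel truncated at four sigma) rather than the preprocessing code used in the experiments; JPEG re-encoding is not shown because it requires an image codec library.

```python
import numpy as np

def gaussian_kernel(sigma, truncate=4.0):
    """1-D Gaussian kernel, normalised to sum to 1 (truncation is an assumption)."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Blur a 2-D (single-channel) float image with a separable Gaussian."""
    kernel = gaussian_kernel(sigma)
    pad = len(kernel) // 2
    padded = np.pad(image, pad, mode="reflect")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

rng = np.random.default_rng(0)
image = rng.random((32, 32))                  # toy stand-in for one image channel
blurred_sigma1 = gaussian_blur(image, sigma=1.0)
blurred_sigma2 = gaussian_blur(image, sigma=2.0)
```

A larger sigma removes more high-frequency detail, which is consistent with the stronger performance drop observed at sigma 2 than at sigma 1.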

Few-shot Analysis
We now conduct experiments to investigate how the participating transfer learning approaches perform when trained on extremely limited data, specifically only 640 images (320 real, 320 fake). Here, we present the results achieved by our models in a few-shot setting.
We train the CLIP (ViT-Large) model using four different transfer learning strategies, i.e., (1) Linear Probing, (2) Fine-tuning, (3) Adapter Network [13] and (4) Prompt Tuning [50], in a few-shot setting. We use only 32 (16 real and 16 fake) images from each of the object categories available in the LSUN [49] and ProGAN [20] datasets. In total, we train the models using 640 real/fake images. We present the achieved Average Precision (AP) and Accuracy (Acc) scores in Table 5. It is apparent from the results that Prompt Tuning outperforms the other transfer learning strategies by a clear margin on images sampled from GAN-based, Diffusion-based and Commercial image generators.
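The few-shot subset described above can be assembled with a sketch like the following, which draws 16 real and 16 fake images from each of the 20 object categories, giving 20 × 32 = 640 images. The directory layout, file names and the `sample_few_shot` helper are hypothetical placeholders, not the actual data pipeline.

```python
import random

def sample_few_shot(paths_by_category, shots_per_class=16, seed=0):
    """Pick `shots_per_class` real and fake images from every object category."""
    rng = random.Random(seed)
    subset = []
    for category in sorted(paths_by_category):
        for label in ("real", "fake"):
            chosen = rng.sample(paths_by_category[category][label], shots_per_class)
            subset.extend((path, label) for path in chosen)
    return subset

# Toy index with hypothetical file names: 20 categories, 100 images per label.
categories = [f"category_{i:02d}" for i in range(20)]
index = {
    c: {label: [f"{c}/{label}/{j:05d}.png" for j in range(100)]
        for label in ("real", "fake")}
    for c in categories
}

few_shot = sample_few_shot(index)
n_real = sum(1 for _, label in few_shot if label == "real")
n_fake = sum(1 for _, label in few_shot if label == "fake")
```

Sampling per category and per label keeps the subset balanced, so no single object category or class dominates the 640 training images.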

Performance on Commercial Tools
Besides evaluating the models on images sampled from a number of different GAN-based and Diffusion-based image generators, following [9] we also evaluate the baseline methods and the transfer learning strategies we employ on images generated by Commercial tools including Midjourney-V5, Adobe Firefly and DALL-E 3. We present the comparison of results in Table 6. The numbers clearly demonstrate that the transfer learning strategies utilized in this paper surpass previously proposed deepfake detection methods. Additionally, it is noteworthy that our models are trained using only 200k real/fake images, whereas the studies we compare against utilize 720k images for training.

Through our experiments, we illustrate that transfer learning strategies incorporating both the image and text components of CLIP consistently surpass the performance of simpler approaches like Linear Probing, which solely utilizes the visual aspect of CLIP. Our findings highlight Prompt Tuning's superiority over current baselines and SOTA methods, achieving significant margins of improvement despite its minimal number of trainable parameters. Additionally, we conduct few-shot experiments, analyze robustness under post-processing operations such as JPEG compression and Gaussian blurring, and demonstrate the consistent performance of our CLIP-based detectors even with a smaller training set size of 20k images.

Figure 1: Visualization of real (in red) and fake (in green) images using t-SNE in the feature space of various image encoders. The feature space of CLIP demonstrates superior separation of real and fake image features compared to the other two supervised models.

Figure 2: The four distinct transfer learning strategies explored for real/fake image classification. The number of trainable parameters for each approach is listed at the bottom right.

Figure 3: Average precision (AP) score distribution of participating transfer learning strategies on the test set comprised of images sourced from 18 different datasets, as given in Tables 2 and 3. The red dotted line represents chance performance.

Figure 4: Accuracy (Acc) scores achieved by participating transfer learning strategies on the test set comprised of images sourced from 18 different datasets, as given in Tables 2 and 3. The red dotted line represents chance performance.

Figure 5: This figure shows how different transfer learning strategies cope with post-processing operations including JPEG compression and Gaussian blurring. Our models perform well with GAN and Diffusion images but struggle with those from commercial tools like DALL-E 3 and Adobe Firefly. Surprisingly, the Fine-tuned CLIP model is more robust against compressed images sampled using Commercial tools as compared to GAN-based and Diffusion-based images. Linear Probing achieves optimal performance across all three datasets.

Table 2: Generalization performance. This table presents the average precision (AP) of different methods for distinguishing real and fake images. The studied adaptation approaches demonstrate significant improvements over the previous baselines and SOTA.

Table 3: Generalization performance. This table compares the accuracy (Acc) scores attained by our proposed techniques with various previous studies. The proposed CLIP adaptation strategies show noteworthy performance gains compared to previous baselines and SOTA techniques.

Table 4: This table presents scores achieved by our models trained using smaller datasets. Results are organized by the number of available training images: 20k, 40k, 60k and 80k. We keep an equal number of real and fake images, e.g., the 20k subset contains 10k real and 10k fake images.

Table 5: We present the results from our few-shot (32-shot) experiments, wherein we train CLIP using various transfer learning strategies on real/fake images from the ProGAN dataset. We then evaluate the trained models on images generated by GANs, Diffusion models and Commercial image generators.
Prompt Tuning (AP / Acc): GANs 98.61 / 89.88; Diffusion 95.97 / 84.76; Commercial 87.23 / 66.38; Average 93.94 / 80.34

Table 6: Robustness of transfer learning strategies across different families of generative models. Our study examines the robustness of CLIP in detecting deepfake imagery across diverse data distributions. We explore four distinct transfer learning strategies, including Fine-tuning, Linear Probing, Prompt Tuning and training an Adapter Network, using a diverse training set of 200k images from the ProGAN dataset. Our experiments encompass evaluation on a comprehensive test set comprising 21 different image generators.