Multimodal Data Augmentation for Image Captioning using Diffusion Models

Image captioning, an important vision-language task, often requires a tremendous number of finely labeled image-caption pairs to learn the underlying alignment between images and texts. In this paper, we propose a multimodal data augmentation method, leveraging a recent text-to-image model called Stable Diffusion, to expand the training set via high-quality generation of image-caption pairs. Extensive experiments on the MS COCO dataset demonstrate the advantages of our approach over several benchmark methods, in particular a significant boost when fewer training instances are available. In addition, models trained on our augmented datasets also outperform prior unpaired image captioning methods by a large margin. Finally, training efficiency and effectiveness can be further improved by filtering the generated data based on quality assessment.


Introduction
Image captioning aims to automatically generate textual descriptions of the visual content of an image, an important task at the intersection of natural language processing (NLP) and computer vision (CV) (Staniūtė and Šešok, 2019; Atliha and Šešok, 2020; Cornia et al., 2020; Turkerud and Mengshoel, 2021). While impressive results have been achieved, especially with emerging deep learning algorithms, most studies remain focused on model optimization, extracting informative features, or developing better training techniques. Data, as a critical dimension that can significantly affect model performance, is greatly under-explored, despite a recent increase in interest (Feng et al., 2021a; Turkerud and Mengshoel, 2021).
Having large amounts of finely labeled image-text pairs is often desired in supervised image captioning tasks (Hendricks et al., 2016; Zhu et al., 2022). Existing image captioning datasets, such as MS COCO (Lin et al., 2014), require human annotators to label images with descriptive sentences, which is laborious and time-consuming. Moreover, the collected images and annotated captions may be incomplete and lack variety, which limits the generalization ability of models trained on such datasets (Zhu et al., 2022).
To address the data issue, some researchers have attempted to leverage unpaired images and text, since they are easily obtained separately (Hossain et al., 2019; Kim et al., 2019; Laina et al., 2019; Zhu et al., 2022). This task is referred to as Unpaired Image Captioning (UIC) and has attracted considerable attention recently.

Our main contributions are summarized as follows:

• To the best of our knowledge, this is among the first attempts to apply augmentation to both images and texts simultaneously in image captioning tasks.
• We conduct extensive experiments, and the results show that high-quality synthetic data generated by large pre-trained text-to-image models can significantly improve the quality of image captions in zero- and few-shot scenarios. Our study also provides a successful application of the Stable Diffusion model.
• Finally, we intend to release the code and the augmented datasets to the community. With our effective data generation and quality assessment for image-caption pairs, we demonstrate that training datasets can be constructed without expensive human annotation efforts for supervised vision-language tasks.
Related Work

Data Issue in Image Captioning
In recent years, image captioning models have developed rapidly, benefiting from deep learning algorithms. The encoder-decoder architecture is one of the most common and effective frameworks, with Convolutional Neural Network (CNN) encoders to obtain image features and Recurrent Neural Networks (RNN) to decode them into natural language (Hossain et al., 2019). Attention mechanisms (Xu et al., 2015; Lu et al., 2017; Anderson et al., 2018; Huang et al., 2019) and Transformers (Li et al., 2019; Cornia et al., 2020) are also actively used and significantly boost performance.
Training image captioning models with fully supervised methods requires a tremendous amount of paired image-caption data. With a lack of training data, even state-of-the-art models rarely perform well. To tackle this issue of limited labeled data, unpaired image captioning (UIC) and data augmentation approaches have been proposed.

Unpaired Image Captioning
The UIC task aims to generate captions from models trained on unpaired images and captions, which are easily obtained separately from various sources (e.g., the web). It has attracted significant attention from researchers given the high cost of obtaining paired images and texts. For example, Gu et al. (2018) implemented language pivoting with extra Chinese caption information. Feng et al. (2019) proposed an unpaired image captioning framework by learning a visual concept detector. Laina et al. (2019) exploited large text corpora outside the dataset and learned shared multimodal embeddings for images and sentences. More recently, Zhu et al. (2022) tackled the visual concept recognition stage for UIC aided by only image-level class labels. On the other hand, semi-supervised learning methods can also assist: Chen et al. (2016) generated missing visual information based on textual data, and Kim et al. (2019) implemented GANs to assign pseudo-labels to unlabeled images. Most UIC studies have in common that they exploit data beyond the image captioning dataset itself and require some auxiliary information or expensive annotations.
Data augmentation aims to create additional training data with greater diversity. Augmentation techniques have been widely applied to images and texts separately in various machine learning tasks. However, augmentation is rarely used in vision-language tasks such as image captioning (Atliha and Šešok, 2020).
Most data augmentation studies in image captioning focus on either images or captions, rather than both simultaneously. Wang et al. (2018) and Katiyar and Borgohain (2021) used standard image transformations such as cropping, flipping, and mirroring to manipulate image data. However, these image augmentations may introduce noise when the transformations distort the semantics of the images. For caption augmentation, Atliha and Šešok (2020) applied synonym replacement and sentence paraphrasing using BERT (Devlin et al., 2018). Word permutation/replacement (Cui et al., 2018), back translation (Turkerud and Mengshoel, 2021), and other NLP augmentation techniques have also been used in prior literature.
However, multimodal augmentation, where both images and texts are modified at the same time, is a relatively underexplored direction (Hartmann et al., 2022). Feng et al. (2021b) introduced an approach combining CutMix (Yun et al., 2019) and caption editing, by inserting patches cut out from a different image and modifying the caption so that it correctly describes the new image. However, the authors did not implement this method, and no experimental results were provided.

Text-to-Image Synthesis
Text-to-image synthesis is a challenging multimodal task of generating a high-quality image conditioned on a descriptive text (Du et al., 2022; Yang et al., 2022). This field has long been dominated by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), Variational Autoencoders (VAEs) (Kingma and Welling, 2013), and other generative models. Nonetheless, these models suffer from drawbacks. For example, GANs are difficult to optimize (Mescheder et al., 2018) and are confined to data with limited variability (Brock et al., 2018; Karras et al., 2019); VAEs generate high-resolution images more efficiently, but the sample quality is not good enough (Rombach et al., 2022).
A recently emerged family of deep generative models, diffusion models (Ho et al., 2020; Song et al., 2021), have shown impressive power and beat GANs in text-to-image synthesis (Dhariwal and Nichol, 2021). While state-of-the-art models like DALL-E 2 (Ramesh et al., 2022) and Imagen (Saharia et al., 2022) are able to generate images of surprisingly high quality, their inference is expensive and time-consuming (Rombach et al., 2022). In this study, we adopt an improved Latent Diffusion Model, Stable Diffusion (Rombach et al., 2022), as our text-to-image component. Stable Diffusion can generate high-resolution images comparable to state-of-the-art models with affordable computational resources and time.
In image captioning tasks, generative models are rarely applied. Kim et al. (2019) used CycleGAN (Zhu et al., 2017) as a baseline for their unpaired image captioning, which proved to be less effective. In the most recent work, Li et al. (2022) demonstrated that the best caption for an image is the one that leads to the best reconstruction of the original image, using Stable Diffusion as the text-to-image model and Flamingo (Alayrac et al., 2022) as the inverse one. However, diffusion models have not been utilized to improve the quality of generated image captions, despite the impressive text-to-image results they have achieved.

Method
Our proposed image captioning system consists of two parts. In the first part, we implement our image-caption pair generation to construct the synthetic dataset, which can be further expanded via text or image augmentation methods. In the second part, we train and evaluate two selected image captioning models on the data constructed in the first part.

Multimodal Augmentation
We use MS COCO (Lin et al., 2014) as the base dataset for augmentation. MS COCO is a large and commonly used image captioning dataset, with 123,287 images and 616,767 captions in total.
In our multimodal augmentation, we first apply Stable Diffusion to generate one image for each COCO caption while discarding images with NSFW content. We then pair each generated image with its true COCO caption to form a base synthetic dataset (denoted by SD base).
Prior research has found that image captioning models trained on images with more diverse descriptions achieve better performance (Devlin et al., 2018; Atliha and Šešok, 2020). Thus, one caption per image in SD base may not be enough, and expanding the captions for each generated image is desirable. In this study, we expand captions using two strategies: (i) from ground-truth COCO captions, and (ii) from automatic caption generation via text augmentation (e.g., paraphrasing). The resulting datasets are denoted SD true and SD para, respectively. Figure 1 illustrates how the three augmented datasets are constructed from one example in the COCO data (i.e., one example consists of one image and its 5 corresponding captions).
• SD para is built upon SD base. Each true caption is augmented via k rounds of paraphrasing. Thus, SD para contains: (synthetic image 1, k synthetic captions), ..., (synthetic image 5, k synthetic captions). Note that we discard generated paraphrases that are exact duplicates. Therefore, the number of captions for each generated image in SD para is at most k + 1 (i.e., one true caption and k ≤ 5 augmented ones).
Next, we create several additional datasets to account for scenarios where the availability of labeled data (i.e., pairs) varies. A typical case is that only a small subset of COCO image-caption pairs is available while all captions are. In this case, our proposed multimodal augmentation can be applied to increase the number of pairs. Specifically, we first keep the provided image-caption pairs; for the remaining captions, we apply the augmentation used for SD base to create more image-caption pairs. Such a dataset is denoted n% COCO + SD base, where n is the percentage of original COCO pairs retained. For example, 10% COCO + SD base means 10% true pairs in COCO combined with SD base built from the remaining 90% of captions. Similarly, n% COCO + SD true and n% COCO + SD para can be obtained.
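As a rough illustration (not the exact implementation), the mixing procedure can be sketched as follows; here generate_synthetic_image is a hypothetical stand-in for the Stable Diffusion call described in the Experiment Setup section.

```python
import random

def build_mixed_dataset(coco_pairs, n_percent, generate_synthetic_image, seed=42):
    """Sketch of building an 'n% COCO + SD base' split.

    coco_pairs: list of (image_path, caption) ground-truth pairs.
    generate_synthetic_image: hypothetical callable mapping a caption to a
        synthetic image (e.g., a wrapper around a Stable Diffusion pipeline).
    """
    rng = random.Random(seed)
    pairs = list(coco_pairs)
    rng.shuffle(pairs)
    n_true = int(len(pairs) * n_percent / 100)

    # Keep n% of the true image-caption pairs as-is.
    mixed = pairs[:n_true]

    # For the remaining captions, discard the true image and synthesize one.
    for _, caption in pairs[n_true:]:
        synthetic_image = generate_synthetic_image(caption)
        mixed.append((synthetic_image, caption))
    return mixed
```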
In contrast to our multimodal augmentation method, there exist many uni-modal methods. For text augmentation, we adopt paraphrasing via a large language model, similar to Atliha and Šešok (2020). For image augmentation, prior studies use random flipping combined with random perspective transformation (Wang et al., 2018; Katiyar and Borgohain, 2021). Based on these, we create two more datasets: the generated paraphrases are added to the COCO dataset as additional captions, denoted COCO text, and the original COCO images are replaced with the augmented images to obtain COCO image.

Selected Image Captioning Models
We now introduce the two image captioning models on which we evaluate the effectiveness of the augmented datasets. Our objective is to investigate whether the experimental results are model-specific and whether our constructed datasets can improve image captioning models in general. To this end, we use the FC model, which is relatively simple and small, and a more advanced Transformer-based model for robustness.
FC Model is a frequently used model in many image captioning studies (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Rennie et al., 2017; Luo et al., 2018). In the FC model, a CNN is first employed to extract visual features for each image. The extracted feature embeddings are then processed by Long Short-Term Memory (LSTM) modules (Hochreiter and Schmidhuber, 1997) to generate captions.
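For readers unfamiliar with this family of models, the following is a minimal PyTorch sketch of an FC-style captioner (a pooled CNN feature conditioning an LSTM decoder); the dimensions and names are ours and do not reproduce the exact FC architecture of the cited works.

```python
import torch
import torch.nn as nn

class FCCaptioner(nn.Module):
    """Minimal FC-style captioner: a global CNN feature conditions an LSTM decoder."""

    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)   # project the CNN feature
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # vocabulary logits

    def forward(self, cnn_feat, captions):
        # cnn_feat: (B, feat_dim) pooled image feature; captions: (B, T) token ids.
        img = self.feat_proj(cnn_feat).unsqueeze(1)        # image as the first "token"
        words = self.embed(captions)                       # (B, T, embed_dim)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))
        # Logits at step t (after seeing the image and tokens < t) predict token t.
        return self.out(hidden[:, :-1, :])                 # (B, T, vocab_size)
```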
Transformer-based Model Recently, Transformer (Vaswani et al., 2017) models have boosted the performance of various deep learning tasks, including image captioning (Cornia et al., 2020; Luo et al., 2021; Zhou et al., 2022). In this study, we do not choose a specific Transformer-based image captioning model but rather a generic architecture: visual features are first extracted using a bottom-up approach (Anderson et al., 2018), and a basic Transformer module with self-attention decodes the visual features into textual captions.

Experiment Setup
In this study, we apply the Stable Diffusion model version 1-4, as implemented in Hugging Face, to generate synthetic images from the given captions. We keep the default settings, and model parameters are frozen during the generation process. NSFW images are automatically detected, and we rerun the image generation for them until the outputs are suitable for inclusion in our synthetic datasets. In total, 566,747 synthetic images are generated.
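The generation loop can be sketched with the diffusers library roughly as follows; the attribute names follow the StableDiffusionPipeline output and should be checked against the installed version, and the retry limit is our own choice for illustration.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load Stable Diffusion v1-4 with its default safety checker enabled.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

def caption_to_image(caption, max_retries=5):
    """Generate one image per caption, retrying when the NSFW filter triggers."""
    for _ in range(max_retries):
        out = pipe(caption)  # default scheduler, guidance scale, and resolution
        if not out.nsfw_content_detected[0]:
            return out.images[0]  # a PIL.Image
    return None  # give up after repeated NSFW flags

image = caption_to_image("A man is in a kitchen making pizzas.")
if image is not None:
    image.save("synthetic_0001.png")
```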
Synthetic caption expansion for SD para and text augmentation for COCO text use the same paraphrasing approach based on true COCO captions: a pre-trained T5 model (Raffel et al., 2020). For image augmentation, we apply the RandomHorizontalFlip and RandomPerspective transformations implemented in torchvision, with both transformation probabilities set to 0.5.
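A minimal sketch of both augmentation steps is given below; the T5 paraphrase checkpoint named here is illustrative (the paper only states that a pre-trained T5 model is used), and the decoding parameters are assumptions.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from torchvision import transforms

# Caption paraphrasing with a T5 paraphrase checkpoint (illustrative choice).
tok = T5Tokenizer.from_pretrained("Vamsi/T5_Paraphrase_Paws")
t5 = T5ForConditionalGeneration.from_pretrained("Vamsi/T5_Paraphrase_Paws")

def paraphrase(caption, k=5):
    """Return up to k distinct paraphrases of a COCO caption."""
    inputs = tok("paraphrase: " + caption, return_tensors="pt")
    outputs = t5.generate(**inputs, do_sample=True, top_k=120, top_p=0.95,
                          num_return_sequences=k, max_length=64)
    texts = [tok.decode(o, skip_special_tokens=True) for o in outputs]
    # Drop paraphrases identical to each other or to the source caption.
    return list(dict.fromkeys(t for t in texts if t != caption))

# Image augmentation for COCO image: random flip + perspective, p = 0.5 each.
image_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomPerspective(p=0.5),
])
```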
The implementation of feature extraction, model training, and model evaluation is mainly adapted from Luo et al. (2018). We train image captioning models on Karpathy's split (Karpathy and Fei-Fei, 2015) of the training set with the fully supervised method plus the more effective CIDEr score optimization (Luo et al., 2018). An early-stopping strategy is adopted, with maximum training epochs of 30 and 15 for the FC model and the Transformer-based model, respectively. For evaluation, we use standard image captioning metrics, including BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016). All evaluation metrics remain the same as in Luo et al. (2018), and all evaluations on the COCO dataset and the synthetic datasets are performed on Karpathy's test split. We set the seed to 42 and run all experiments on 2 RTX A4000 16G GPUs. In the following, we present three experiments and their results to demonstrate the effectiveness of our multimodal data augmentation for image captioning.
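For reference, the metrics above are standardly computed with the pycocoevalcap package; the snippet below is a hedged, standalone illustration of CIDEr scoring (in practice the captions are first run through the PTBTokenizer shipped with the same package, and the paper relies on the evaluation code of Luo et al. (2018) rather than this exact call).

```python
from pycocoevalcap.cider.cider import Cider

# References and one generated caption per image id (toy example).
gts = {0: ["a man is in a kitchen making pizzas",
           "a man making pizzas in a kitchen"]}
res = {0: ["a man cooking pizza in a kitchen"]}

# compute_score returns the corpus-level score and the per-image scores.
cider_score, per_image = Cider().compute_score(gts, res)
print(f"CIDEr: {cider_score:.3f}")
```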

Experiment 1: Fully Synthetic Dataset
In this experiment, we train models using the 3 datasets in which all images are synthetic: SD base, SD para, and SD true. We compare the performance of the image captioning models (i.e., FC and Transformer) with state-of-the-art UIC methods. Note that the UIC task assumes unpaired but true images are available, which is a more informed condition than our scenario where no true images are available. The experiment results are shown in Table 1. Since the results of the two models are fairly consistent, we show the results from the FC model here and include those from the Transformer-based model in the Appendix.
We have the following observations: (1) Training with our basic multimodal-augmented dataset (SD base) outperforms the UIC baselines on all metrics by a large margin. In other words, a pre-trained text-to-image model can be used to generate images from given captions to train image captioning models: the models can be trained in a simple fully supervised manner on synthetic image-caption pairs, instead of in a semi-/unsupervised way on true but unpaired image-caption data as in UIC tasks. Even without true images, the synthetic data performs surprisingly well on the downstream captioning task, which shows great potential for synthetic data generation. (2) Expanding captions with synthetic paraphrases (SD para) yields better results, and training on SD true achieves a further significant performance improvement. This may be because text obtained by paraphrasing is less diverse than real captions annotated by humans. This is also in line with previous studies (Atliha and Šešok, 2020; Turkerud and Mengshoel, 2021), which found that better model performance is achieved when images in the training set have more captions with higher diversity.

Experiment 2: Few-shot Learning
Next, we examine the performance of our multimodal data augmentation method when the ground-truth data is limited. We first sample the original COCO data at different percentages, ranging from 10% to 50%, and then apply text, image, and multimodal data augmentation to the samples. In addition, we assume that all captions in COCO are available, so multimodal data augmentation is used to complete the unavailable images. This assumption is reasonable since textual information alone is relatively easy to obtain. From the results (shown in Table 2), we find that when the amount of training instances is very limited (10%), basic uni-modal augmentation (images or text) has nearly no effect on image captioning performance (i.e., comparing rows 2 and 3 with row 1). By contrast, our multimodal data augmentation significantly improves performance (i.e., comparing the last 3 rows with row 1). Moreover, caption expansion for synthetic images further improves performance (i.e., comparing row 5: 10% + SD para with row 4: 10% + SD base), which is consistent with the findings in Experiment 1.
All experiment results from 10% to 50% COCO data are summarized in Figure 2. As the size of the ground-truth data increases, the performance gain from data augmentation gradually decreases. When the true data reaches 40%, the results of augmenting with SD base become worse than those with only 40% true data. One plausible explanation is that 40% of COCO is large enough to train a decent model, and adding synthetic images might introduce noise that hurts performance.
Among the three multimodal augmentation methods, true-caption expansion of synthetic images performs better than automatic paraphrasing expansion when the true data is very limited (10%). When the amount of true data increases (≥ 40%), however, true-caption expansion cannot match the n% COCO baseline, whereas automatic paraphrasing can still improve model performance. This may be due to novel words outside the COCO vocabulary introduced by the paraphrasing model.

Experiment 3: Synthetic Data Filtering
In this section, we attempt to further improve the performance with augmented datasets in the image captioning task.
We noticed that images generated by Stable Diffusion may mismatch their textual descriptions, e.g., omitting important objects or exhibiting unnatural distortions. Selecting high-quality images that align well with the text is therefore highly desirable. Previous studies have proposed various methods for this. For example, Salimans et al. (2016) introduced the Inception Score (IS) to evaluate GAN-generated images, and Heusel et al. (2017) presented the Fréchet Inception Distance (FID) as a standard measure of image quality. These metrics cannot be directly applied in our context, partly because they measure either the distortion of generated images relative to real ones or the difference between two distributions, whereas our objective is to find images that better match the text.
In addition, the evaluation should not depend on real reference images or text. Therefore, we consider the following three aspects when assessing image quality in our context:

• The quality of the image itself, measured by MUSIQ (Ke et al., 2021), a recently proposed no-reference image quality assessment (NR-IQA) metric.
• The similarity between the synthetic image and the corresponding input text. CLIPScore (Hessel et al., 2021) is a reference-free metric for measuring captioning performance based on CLIP embeddings, which can be used to measure the coherence of generated image-caption pairs.
• Whether the synthetic image reflects the important objects described in the caption. VIFIDEL (Madhyastha et al., 2019) is a recently proposed quality measure for image captioning that uses object detection models to recognize the main objects in images and then calculates the similarity between those objects and the descriptive text.
We use MUSIQ, CLIPScore, and VIFIDEL as our data filtering criteria and find that CLIPScore is the most effective of the three. Selecting the top 50% of data with the highest CLIPScore achieves results similar to using the full synthetic datasets of Experiment 1 (see Table 3). This indicates that it is unnecessary to train the model on the full synthetic data; data selection based on quality assessment can make training more efficient without sacrificing performance. Since data selection is most effective on SD true among the three augmented datasets, we further explore the impact of data filtering in limited-data situations with SD true. We find in Table 4 that data filtering does improve model performance, and a significant boost persists as the volume of true data grows larger, surpassing the improvement brought by SD para. This indicates that suitable data filtering can improve both training efficiency and image captioning performance when true labeled data is limited. However, the improvements for SD base and SD para are not significant, and image captioning performance even decreases in some cases for the Transformer-based model (see Table 11 in the Appendix). Therefore, CLIPScore and the other two criteria are not gold standards for selecting high-quality data for the captioning task, even though they have been shown to perform well in image-caption quality evaluation (Madhyastha et al., 2019; Hessel et al., 2021).
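A minimal sketch of the CLIPScore-based filtering follows; the 2.5 rescaling weight is from Hessel et al. (2021), while the specific CLIP checkpoint and the batch-free implementation are our own illustrative choices.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path, caption):
    """Reference-free CLIPScore = 2.5 * max(cos(image_emb, text_emb), 0)."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 2.5 * torch.clamp((img * txt).sum(), min=0).item()

def filter_top_half(pairs):
    """Keep the top 50% of (image_path, caption) pairs by CLIPScore."""
    scored = sorted(pairs, key=lambda p: clip_score(*p), reverse=True)
    return scored[: len(scored) // 2]
```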
A similar discrepancy between generated data quality and downstream task performance has been reported for an image classification task (Ravuri and Vinyals, 2019). The authors found that although GAN-generated images receive scores close to those of true images, a classification model trained on fully synthetic images has much lower accuracy than one trained on true images. Following this thread, we compute the three metrics for both the COCO data and the synthetic data and observe a similar pattern: the three quality measures on the synthetic data are close to those on the true data (see Table 5). However, when the true data is completely replaced with synthetic data for training (e.g., in Experiment 1), the performance is considerably lower than that obtained with true data (CIDEr score: 81.0 vs. 92.4). In this way, we extend the findings of Ravuri and Vinyals (2019) to the image captioning task.

Table 5: Comparison of multimodal data quality assessment using true vs. augmented image-caption pairs.

Conclusion
We developed a multimodal data augmentation method for the image captioning task, leveraging the power of diffusion models in image synthesis. It outperforms uni-modal methods on the MS COCO dataset for two typical image captioning models, especially when the amount of true labeled data is limited, and it also performs remarkably well compared to UIC methods despite using fully synthetic images.
Our study is an early attempt to combine the image and text modalities via two inverse processes: text-to-image and image-to-text. The effectiveness of synthetic multimodal data used as a training set was empirically verified, and better performance can be further achieved with data filtering. Though synthetic data cannot yet replace true data, it is worth exploiting the potential of multimodal data synthesis in various downstream applications in the future.

Limitations
Below we discuss three limitations of our multimodal data augmentation method. (1) The most noteworthy one is the quality of synthetic data, a common challenge in many data generation studies. We have explored some quality assessment metrics for image-caption pairs and tested their performance in the captioning task. Effective and generalized multimodal data quality assessment remains an open question for future research, which we believe is a valuable direction. (2) The quality and flexibility of synthetic images are bounded by the ability of the text-to-image models we apply. For example, if we intend to generate synthetic training data for face recognition, good performance is unlikely to be achieved, since Stable Diffusion is not good at drawing human faces. (3) Considerable computational resources are needed to generate a large synthetic image dataset, even with lightweight models like Stable Diffusion. In this study, generating all the synthetic COCO images took about 2 weeks with 2 RTX A4000 16G GPUs, which is demanding for practitioners and researchers.

A Synthetic Datasets
Descriptive statistics of our augmented datasets are shown in Table 6. Since COCO text does not change the images of the original COCO dataset and COCO image replaces all COCO images with transformed images, the number of images in both remains the same. The images in the three augmented SD datasets are generated from COCO captions, so the number of images equals the number of COCO captions. Textual augmentation is applied to COCO text and SD para, so the number of captions increases; meanwhile, novel words are introduced by the paraphrasing model, so the vocabulary sizes of these two datasets are also expanded. For synthetic image generation, we did not intentionally tune the prompts for Stable Diffusion to obtain higher quality, because prompt tuning is too time-consuming when generating such a huge synthetic dataset and image aesthetics is not our focus. Therefore, the COCO captions are used directly as input without any modifications.

COCO captions → paraphrases
A young boy standing in front of a computer keyboard. → A young boy standing before a computer keyboard.
A man is in a kitchen making pizzas. → In a kitchen, a man is making pizzas.
A woman eating vegetables in front of a stove. → A woman consuming vegetables in front of a cooker.
A toilet and a sink in small bathroom. → A bathroom with a toilet and a sink.
A city street filled with traffic and parking lights. → A city street crammed with traffic and parking lights.
Table 7: Examples of COCO captions and their corresponding paraphrases.
Most generated images show clear objects that are recognizable to humans (see the top two rows in Table 8), while others suffer from problems due to the limitations of text-to-image models. For example, Stable Diffusion is not good at drawing human hands and faces, and weird distortions of objects may occur (see the bottom two rows in Table 8). Nonetheless, we have already shown that image captioning models trained on these imperfect synthetic images can achieve quite good performance.
Since the Stable Diffusion model used to generate images and the T5 model used for paraphrasing are open-source, there are no copyright issues concerning the generated images and textual data. We also release our synthetic COCO dataset, which can be used freely for further research.

B Experimental Results
Here we list the detailed results of all three experiments.
Table 9 shows the results of both the FC model and Transformer-based model in Experiment 1, in which we trained captioning models with our three synthetic SD datasets.
In Experiment 2, we perform data augmentation on limited COCO data, mixing the sampled COCO dataset with our synthetic SD datasets. The baseline models are trained on these augmented datasets, and we compare the image captioning performance of our proposed method with the baseline data augmentation methods. The results of the FC model and the Transformer-based model are shown in Table 10 and Table 11, respectively. The performances of the COCO dataset combined with the selected synthetic datasets are also listed in Table 10 and Table 11 for convenient comparison.
In Experiment 3, we examined the relationship between synthetic data quality and downstream performance. The captioning performance of the two baseline models trained on the selected SD true datasets is presented in Table 12, compared with SD true and SD para. The results of data quality selection in scarce-data situations are also shown in Table 10 and Table 11.


C Captioning Examples
To give readers a more intuitive understanding of the improvement from the proposed multimodal data augmentation, we show some captioning examples in this section. The example images are sampled from the COCO test set, and the corresponding captions generated by the trained models are listed in the right column. We can observe that our method outperforms the baselines, and the quality of the captions generated by our method is close to that of human-written sentences.

Table 13: Captions generated by baseline models trained on the COCO dataset and our synthetic datasets. Red indicates wrong objects or poor word choice (e.g., repeating existing words); blue highlights correct objects or good word choice.

Figure 1: An illustrative example of how to construct the three synthetic datasets (SD base, SD true, and SD para) using text-to-image generation and caption expansion. COCO img/cap represents images/captions in the original COCO dataset; syn. img/cap represents synthetic images/captions.

Figure 2: Image captioning performance of the three multimodal augmented datasets with different portions of COCO data. The five groups correspond to 10% to 50% COCO data, respectively. Within each group, n% COCO only, n% COCO with SD base, n% COCO with SD para, and n% COCO with SD true are presented from left to right. The figure shows only part of the evaluation metrics; for more details, see Table 10 in the Appendix.


Table 1: Performance comparison between our multimodal augmentation and UIC. The first row shows the performance of the FC model trained on the original COCO dataset, which is regarded as a comparison benchmark; the middle rows list the performances of 4 commonly used UIC baselines reported in prior studies; the bottom rows are the performances of the FC model trained on our 3 synthetic datasets. "B4", "M", "R", "C", and "S" stand for "BLEU-4", "METEOR", "ROUGE", "CIDEr", and "SPICE", respectively.

Table 2: Performance of the FC model when 10% of the pairs in COCO are available.

Table 3: Performance of the FC model trained on the three augmented datasets under data filtering with the top 50% CLIPScore.

Table 4: Performance of the FC model in limited-data settings with synthetic data selection based on the top 50% CLIPScore criterion.

Table 6: Descriptive statistics of the COCO training set and its augmented datasets.
Table 7 lists a few examples of paraphrased captions.

Table 8: Examples of COCO and synthetic images. The COCO images are in the left column, and the right column shows the corresponding generated images.

Table 9: Captioning performance of the FC model and Transformer-based model trained on the three synthetic datasets.

Table 10: Data augmentation performance of the FC model with limited true COCO data, including the performance of data quality selection in the last row of each group.

Table 11: Data augmentation performance of the Transformer-based model with limited true COCO data, including the performance of data quality selection in the last row of each group.

Table 12: Image captioning performance of the FC model and Transformer-based model trained on fully synthetic datasets after selecting high-quality data based on the top 50% CLIPScore criterion.