PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting

Vision-language (VL) pre-training (VLP) has been shown to generalize VL models well over a wide range of VL downstream tasks, especially cross-modal retrieval. However, it hinges on a huge amount of image-text pairs, which requires tedious and costly curation. In contrast, weakly-supervised VLP (W-VLP) explores alternatives such as object tags generated from images by a pre-trained object detector (OD). Yet, these approaches still require paired information, i.e., images and object-level annotations, as supervision to train the OD. To further reduce the amount of supervision, we propose Prompts-in-The-Loop (PiTL), which prompts knowledge from large language models (LLMs) to describe images. Concretely, given the category label of an image, e.g., refinery, the knowledge extracted by LLMs, e.g., a refinery could be seen with large storage tanks, pipework, and ..., is used as the language counterpart. The knowledge supplements, e.g., the common relations among entities most likely appearing in a scene. With PiTL, we create IN14K, a new VL dataset of 9M images and 1M descriptions of 14K categories from ImageNet21K. Empirically, VL models pre-trained with PiTL-generated pairs are strongly favored over other W-VLP works on image-to-text (I2T) and text-to-image (T2I) retrieval tasks, with less supervision. The results reveal the effectiveness of PiTL-generated pairs for VLP.


Weakly-supervised VLP
While the success of VLP methods relying on huge amounts of image-text annotations has been proven, a less-explored research direction is emerging that pursues more data-efficient VLP. The data efficiency of a VL model can be viewed in terms of the amount of supervision.
That is, would a VL model pre-trained with less image-text data remain as performant in downstream tasks? This question leads to works [6,19,32,33,39] in weakly-supervised VLP (W-VLP), which aim at not relying on image-text pairs from, e.g., SBU Captions [26] and Conceptual Captions (CC) [5,29]. Without aligned images and texts, these works instead resort to a pre-trained object detector (OD) that generates object tags, i.e., the visual entities detected in a given image. The paired images and object tags offer weaker supervision than image-sentence pairs, but are still effective as a cross-domain bridge. However, training an OD relies on object-level annotations, which is still a form of supervision [33]. This deviates these W-VLP works from the fully unsupervised path, which aims to remove any kind of cross-modal supervision.

Towards Unsupervised VLP
Unsupervised VLP (U-VLP), which aims at learning a VL model without any supervision across modalities, remains a daunting challenge. As a step towards U-VLP, we introduce Prompts-in-The-Loop (PiTL), which generates highly effective image-text pairs for W-VLP without an OD. PiTL capitalizes on image-level supervision, i.e., a category label per image, instead of object-level supervision from object bounding boxes and the corresponding object categories. This leads to a much harder W-VLP setting, since much underlying information about an entity can no longer be inferred, such as the common co-occurrence of visual entities in a scene, e.g., a chair and a desk, and the relations among entities, e.g., a person usually sits on a bench.
Specifically, given images with category labels, e.g., a duck as shown in Fig. 1, we prompt large language models (LLMs), e.g., GPT-3 [3], to generate descriptions as external knowledge about the category labels of the images. Different prompts provide different focuses on each target category, e.g., one that emphasizes colors: "Describe the colors seen from a/an <category>?", and another that emphasizes relations with other entities: "Describe what could a/an <category> be seen with?". This encourages a VL model to associate all plausible visual traits, entities, actions, and scenes pertaining to the target categories.
The prompting paradigm is gaining traction. Recent works [25,27,36] reveal that textual prompts generated by LLMs lead to significant improvements in zero-shot image classification with VLP models like CLIP [28]. For instance, prompting a VL model with the LLM-generated description "Goldfish are small, orange fish with shiny scales." is more likely to find matches in the visual domain than the generic "A photo of a goldfish.". Likewise, our work explores whether LLM-generated descriptions are also useful in the W-VLP setting, even though some of the gathered image-text pairs may not be highly relevant, as they are not validated by a human.
Our contributions are summarized as follows. First, we propose PiTL, with which we generate an image-text dataset, IN14K, containing 9M images with 1M descriptions of 14K categories from the "Winter21" release of ImageNet-21K [8]. Second, trained with half of the samples in IN14K, our models are competitive with the state of the art on image-to-text (I2T) and text-to-image (T2I) retrieval tasks. With the full IN14K, our models significantly outperform them, e.g., on MSCOCO-5K [23] by 11% and 10% on I2T and T2I, respectively. Moreover, our models are comparable with VLP models trained on 4M aligned image-text pairs from, e.g., CC3M and SBU Captions. Lastly, PiTL not only requires the least cross-modal supervision among W-VLP works, but also leads to a small gap between W-VLP and VLP performance.

W-VLP WITH IMAGE-LEVEL SUPERVISION
VLP aims at learning VL alignments given a large number of image-sentence pairs. The methodology can be summarized as (1) learning shared semantics across the VL modalities, (2) learning cross-modal context, e.g., via masked modeling [9], and (3) learning to explicitly match images and texts. With the same aim, existing W-VLP methods leverage OD-generated tags to form VL pairs. What is usually neglected is the cost of pre-training such an OD, which usually requires 10+ object-level annotations to be effective. The proposed PiTL relaxes the requirement of having an OD by prompting LLMs [3] to generate descriptions for the object categories.

Forming Image-text Pairs via Prompting
PiTL elicits knowledge about an object category from an LLM with nine prompts of different perspectives. Five descriptions are collected for each prompt. Table 1 summarizes the nine prompts and their focuses. Some of them are more visually relevant (P1-6), some focus more on knowledge around the target category (P7-8), and some are more open-ended (P9). We study the effectiveness of the descriptions generated by each prompt later in Sec. 3.3.
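As an illustration, the following Python sketch shows how such a description pool could be gathered per category. Only the P1-P3 wordings follow Fig. 1; the remaining templates, the function names, and the `query_llm` callable are hypothetical placeholders standing in for whatever LLM API is used, not the authors' exact pipeline.

```python
from typing import Callable

# P1-P3 wordings follow Fig. 1; the remaining templates are hypothetical placeholders.
PROMPTS = {
    "P1": "Describe colors of a/an {category}.",
    "P2": "Describe a/an {category} in a scene.",
    "P3": "Describe what could a/an {category} be seen with?",
    # P4-P9: further visual, knowledge-oriented, and open-ended templates (see Table 1).
}

N_RESPONSES_PER_PROMPT = 5  # five descriptions are collected per prompt


def generate_descriptions(category: str,
                          query_llm: Callable[[str, int], list[str]]) -> dict[str, list[str]]:
    """Collect a PiTL-style description pool for one category label.

    `query_llm(prompt, n)` is a stand-in for any LLM API (e.g., a GPT-3
    completion call) that returns `n` sampled responses to `prompt`.
    """
    pool = {}
    for pid, template in PROMPTS.items():
        prompt = template.format(category=category)
        pool[pid] = query_llm(prompt, N_RESPONSES_PER_PROMPT)
    return pool


# Usage: descriptions = generate_descriptions("duck", query_llm=my_llm_client)
```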
Among PiTL-generated pairs, an image can be paired with different descriptions as long as they are of the same category.In pre-training, the positive pairs for the Image-Text Contrastive and Image-Text Matching losses (introduced later in Sec.2.3) are drawn from the images and descriptions of the same categories, and the negative pairs from those of the different categories.In this setup, pre-training with PiTL-generated pairs encourages the VL models to learn cross-modal alignment at the category level, i.e. images of a target category aligned to a group of descriptions, instead of instance level, i.e. an image aligned with a description as in other VLP works.As such, a given image would be associated with the plausible categorical visual traits, entities, actions, and scenes through the VL models.
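For concreteness, a minimal PyTorch sketch of this category-level supervision is given below: it builds the binary match matrix over a batch of images and descriptions, which the contrastive and matching losses of Sec. 2.3 can consume as ground truth. The function name and batch convention are our own illustration, not code from the paper.

```python
import torch


def category_match_matrix(image_labels: torch.Tensor,
                          text_labels: torch.Tensor) -> torch.Tensor:
    """Binary ground truth for category-level alignment.

    image_labels: (B_img,) integer category ids of the images in the batch.
    text_labels:  (B_txt,) integer category ids of the descriptions.
    Returns a (B_img, B_txt) {0, 1} matrix with 1 where image i and
    description j share the same category (positive pair), 0 otherwise.
    """
    return (image_labels.unsqueeze(1) == text_labels.unsqueeze(0)).float()
```

The contrastive-loss sketch in Sec. 2.3 normalizes each row of this matrix into a soft target distribution over the batch.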

VL Model Architecture
Our model architecture follows a state-of-the-art VL model, ALBEF [18], which has a multi-modal encoder fusing the representations generated by the visual and textual encoders. Indeed, any other VL model could also be used instead, as proposing a new VL architecture is not the focus of this work. Specifically, given an image $I$ and its paired text description $T$, the vision encoder $f_v(\cdot)$ follows ViT [10], consisting of a 12-layer Transformer that generates the image embedding $f_v(I) = (v_{cls}, v_1, \ldots, v_N)$. The text encoder $f_t(\cdot)$ is a 6-layer Transformer encoder that embeds the input text as $f_t(T) = (t_{cls}, t_1, \ldots, t_M)$, where $v_{cls}$ and $t_{cls}$ are the [CLS] tokens summarizing the image and the text, respectively, and $N$ and $M$ are the numbers of image patches and textual tokens, respectively. A fusion encoder $f_m(\cdot)$ consisting of a 6-layer Transformer learns the interaction across the VL modalities encoded as $f_v(I)$ and $f_t(T)$, and generates $f_m(f_v(I), f_t(T)) = (m_{cls}, m_1, \ldots, m_M)$.
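To make the layout concrete, here is a minimal PyTorch sketch of such a three-encoder model. The specific module choices (a timm ViT-B/16 for the vision encoder, a 6-layer BERT-style text encoder, and an `nn.TransformerDecoder` standing in for the cross-attention fusion encoder) are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn
import timm
from transformers import BertConfig, BertModel


class ThreeEncoderVLModel(nn.Module):
    """ALBEF-style sketch: vision encoder, text encoder, fusion encoder."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # 12-layer ViT-B/16 vision encoder; DEiT/BEiT weights could be loaded instead.
        self.vision_encoder = timm.create_model(
            "vit_base_patch16_224", pretrained=True, num_classes=0)
        # 6-layer BERT-style text encoder (in practice initialized from BERT-Base).
        self.text_encoder = BertModel(BertConfig(num_hidden_layers=6, hidden_size=dim))
        # 6-layer fusion encoder: text tokens attend to image tokens via cross-attention.
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.fusion_encoder = nn.TransformerDecoder(layer, num_layers=6)

    def forward(self, images, input_ids, attention_mask):
        img_tokens = self.vision_encoder.forward_features(images)   # (B, 1+N, dim)
        txt_tokens = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        fused = self.fusion_encoder(tgt=txt_tokens, memory=img_tokens)  # (B, M, dim)
        return img_tokens, txt_tokens, fused
```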

Pre-training Losses
Our PiTL VL models are pre-trained with four losses [18,35] that all contribute equally to the total loss $\mathcal{L} = \mathcal{L}_{ITC} + \mathcal{L}_{ITM} + \mathcal{L}_{MLM} + \mathcal{L}_{IMC}$, where each objective is described as follows.
Image-Text Contrastive (ITC) aims to retain high and low similarities between the positive and negative image-text pairs, respectively. To obtain the ITC loss, one first calculates
$$p^{i2t}_m(I) = \frac{\exp\big(s(I, T_m)/\tau\big)}{\sum_{m=1}^{Q}\exp\big(s(I, T_m)/\tau\big)}, \qquad p^{t2i}_m(T) = \frac{\exp\big(s(T, I_m)/\tau\big)}{\sum_{m=1}^{Q}\exp\big(s(T, I_m)/\tau\big)},$$
where $s(I, T) = g_v(v_{cls})^{\top} g_t(t_{cls})$ measures the dot-product similarity of an image-text pair $(I, T)$ through projection heads $g_v$ and $g_t$ on the [CLS] embeddings, $p^{i2t}(I)$ and $p^{t2i}(T)$ are the image-to-text and text-to-image similarities, respectively, $\tau$ is a learnable temperature parameter, and $Q$ is the size of the queues storing the image and textual class embeddings. The ITC loss is then defined as
$$\mathcal{L}_{ITC} = \frac{1}{2}\,\mathbb{E}_{(I,T)\sim D}\big[H\big(\mathbf{y}^{i2t}(I), \mathbf{p}^{i2t}(I)\big) + H\big(\mathbf{y}^{t2i}(T), \mathbf{p}^{t2i}(T)\big)\big],$$
where $D$ denotes the pool of image-text pairs, $\mathbf{y}^{i2t}$ and $\mathbf{y}^{t2i}$ are $Q$-dimensional binary vectors encoding the ground-truth similarity, and $H(\cdot,\cdot)$ refers to the cross-entropy function.
Image-Text Matching (ITM) aims to predict whether an image-text pair is matched. The [CLS] token embedding $m_{cls}$ of the fusion encoder predicts the binary classification probability $\mathbf{p}^{itm}$. The ITM loss is defined as
$$\mathcal{L}_{ITM} = \mathbb{E}_{(I,T)\sim D}\, H\big(\mathbf{y}^{itm}, \mathbf{p}^{itm}\big),$$
where $\mathbf{y}^{itm}$ is a binary vector indicating whether the pair is matched.
Masked Language Modeling (MLM) predicts the masked tokens in a sentence given an image and the unmasked textual tokens of the same sentence; the masking probability is set to 15%. The MLM loss [9] is denoted as $\mathcal{L}_{MLM}$.
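As a rough illustration, the ITC term could be computed in-batch as follows. This is a simplified sketch without the momentum queues; cosine-normalized similarities, the helper name, and the use of the category match matrix from Sec. 2.1 as soft targets are our own assumptions.

```python
import torch
import torch.nn.functional as F


def itc_loss(image_cls: torch.Tensor, text_cls: torch.Tensor,
             match: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """In-batch ITC sketch.

    image_cls, text_cls: (B, D) projected [CLS] embeddings.
    match: (B, B) binary category-level ground truth (see category_match_matrix).
    tau: temperature (a scalar tensor, possibly learnable).
    """
    sim = F.normalize(image_cls, dim=-1) @ F.normalize(text_cls, dim=-1).t() / tau
    # Turn the binary matches into per-row target distributions.
    t_i2t = match / match.sum(dim=1, keepdim=True).clamp(min=1.0)
    t_t2i = match.t() / match.t().sum(dim=1, keepdim=True).clamp(min=1.0)
    loss_i2t = -(t_i2t * F.log_softmax(sim, dim=1)).sum(dim=1).mean()
    loss_t2i = -(t_t2i * F.log_softmax(sim.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```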
Intra-Modal Contrastive (IMC) aims to differentiate the semantics between positive and negative pairs within the same modality [35], i.e., image-image and text-text pairs, with similarities
$$p^{i2i}_m(I) = \frac{\exp\big(s(I, I_m)/\tau\big)}{\sum_{m=1}^{Q}\exp\big(s(I, I_m)/\tau\big)}, \qquad p^{t2t}_m(T) = \frac{\exp\big(s(T, T_m)/\tau\big)}{\sum_{m=1}^{Q}\exp\big(s(T, T_m)/\tau\big)}.$$
The IMC objective is defined as
$$\mathcal{L}_{IMC} = \frac{1}{2}\,\mathbb{E}_{(I,T)\sim D}\big[H\big(\mathbf{y}^{i2i}(I), \mathbf{p}^{i2i}(I)\big) + H\big(\mathbf{y}^{t2t}(T), \mathbf{p}^{t2t}(T)\big)\big],$$
where $\mathbf{y}^{i2i}$ and $\mathbf{y}^{t2t}$ indicate whether the pair is matched or not. It is worth noting that $\mathcal{L}_{IMC}$ encourages the model to retain the unimodal semantics provided by the pre-trained weights of the vision and textual encoders, complementing $\mathcal{L}_{ITC}$, $\mathcal{L}_{ITM}$, and $\mathcal{L}_{MLM}$, all of which promote multi-modal alignments.
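Under the same in-batch assumptions as the ITC sketch, the IMC term could be sketched as below; the use of two views (or two samples of the same category) per modality and the argument names are illustrative only.

```python
import torch
import torch.nn.functional as F


def _directional_contrast(a: torch.Tensor, b: torch.Tensor,
                          match: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """Soft cross-entropy between similarities of `a` against `b` and targets."""
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t() / tau
    targets = match / match.sum(dim=1, keepdim=True).clamp(min=1.0)
    return -(targets * F.log_softmax(sim, dim=1)).sum(dim=1).mean()


def imc_loss(img_a, img_b, txt_a, txt_b, match_img, match_txt, tau):
    """Intra-modal contrast on image-image and text-text pairs.

    img_a/img_b (txt_a/txt_b): two views or samples per modality, (B, D) each.
    match_img/match_txt: (B, B) binary matrices marking same-category pairs.
    """
    return 0.5 * (_directional_contrast(img_a, img_b, match_img, tau)
                  + _directional_contrast(txt_a, txt_b, match_txt, tau))
```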

EXPERIMENTS

Settings
Our vision encoder is instantiated by ViT [10] and initialized with DEiT [31] or BEiT-B/16 [2] pre-trained weights. The textual encoder is initialized with BERT-Base [9]. The proposed PiTL W-VLP model is pre-trained on three subsets, IN1K, IN6K, and IN14K, created from ImageNet21K. Note that IN6K and IN14K are created in this work; IN14K contains IN6K, which in turn contains the IN1K samples. Each of the nine prompts shown in Table 1 generates five responses per category. Individual prompts are created for multiple synonyms under the same category, e.g., snorkeling and snorkel_diving. Statistics of IN1K, IN6K, and IN14K, along with other datasets used by other VLP methods, e.g., CC3M, BookCorpus (BC) [40], and VL-Full [32], are shown in Table 2 under the Pre-training Corpus column. We assess the pre-training quality on I2T and T2I with MSCOCO-5K and Flickr30K [26]. The retrieval models are evaluated with recall at rank K (R@K).
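For reference, R@K can be computed from a similarity matrix as in the following sketch; the function name and the binary ground-truth convention are our own illustration.

```python
import torch


def recall_at_k(sim: torch.Tensor, gt: torch.Tensor, k: int) -> float:
    """Fraction of queries with at least one true match in the top-k retrievals.

    sim: (num_queries, num_gallery) similarity scores.
    gt:  (num_queries, num_gallery) 0/1 ground-truth relevance matrix.
    """
    topk = sim.topk(k, dim=1).indices                     # (num_queries, k)
    hits = (gt.gather(1, topk) > 0).any(dim=1).float()    # per-query hit indicator
    return hits.mean().item()
```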

Quantitative Results
Table 2 shows the main results of PiTL and comparisons against the state-of-the-art VLP and W-VLP methods on the I2T and T2I tasks.
Effects of Initialization and Dataset Sizes. We initialize the image encoder with weights pre-trained with no supervision (i.e., the self-supervised BEiT†) and with image-level supervision (i.e., ViT, BEiT‡, and BEiT★). The best performances are obtained with BEiT★, whose weights contain the strongest visual semantics, on IN1K to IN14K. PiTL's results steadily improve with more images and descriptions. Compared to models without pre-training, BEiT† pre-trained on IN1K yields larger improvements in R@1 than the other initializations, i.e., 8.6% for I2T and 10.8% for T2I R@1 on MSCOCO-5K.
PiTL with BEiT-B/16† vs. W-VLP Works. To assess purely the generated image-text pairs, we mainly benchmark models initialized with self-supervised BEiT-B/16† weights. VLMixer starts out as a better model than PiTL when neither is pre-trained on any image-text pairs. However, once pre-trained, PiTL is strongly competitive. For instance, PiTL pre-trained on IN1K outperforms VLMixer pre-trained on CC3M across all the I2T metrics.

Figure 1: (a) Different VLP problem settings. The de facto VLP requires aligned images and the corresponding descriptions. W-VLP learns from multiple object-level annotations, such as object bounding boxes and labels. Our proposed W-VLP, Prompts-in-The-Loop (PiTL), comes with the least supervision, from image category labels only. (b) PiTL leverages the shared prompts (P1-3) on the category label and the responses (D1-D3) from a large language model to generate image-description pairs, e.g., P1: Describe colors of a duck. P2: Describe a duck in a scene. P3: Describe what a duck could be seen with. Please refer to Table 1 for the nine prompts used in this work. (VLP: Vision-language Pre-training. W-VLP: Weakly-supervised VLP.)

Table 1: Prompts and their focuses for <category>, e.g., duck.