Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment

Language models have been supervised with both language-only objectives and visual grounding in existing studies of visually grounded language learning. However, due to differences in the distribution and scale of visual-grounded datasets and language corpora, the language model tends to mix up the context of tokens that occur in the grounded data with those that do not. As a result, during representation learning, there is a mismatch between the visual information and the contextual meaning of the sentence. To overcome this limitation, we propose GroundedBERT, a grounded language learning method that enhances the BERT representation with visually grounded information. GroundedBERT comprises two components: (i) the original BERT, which captures the contextual representation of words learned from language corpora, and (ii) a visual grounding module, which captures visual information learned from visual-grounded datasets. Moreover, we employ Optimal Transport (OT), specifically its partial variant, to solve the fractional alignment problem between the two modalities. Our proposed method significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.


INTRODUCTION
Grounded language learning is concerned with learning the meaning of language as it applies to the real world. Humans, especially children, learn language not only from pure textual information but also from other modalities such as vision and audio, which contain rich information that cannot be captured by text alone [35, 44, 50]. However, many traditional language models are learned only from textual corpora [3, 11]. They are therefore limited in learning complex semantics that require combining signals across data through cross-referencing and synthesis.
Recently, there have been many studies trying to improve language representations with visual information [2, 9, 16, 21, 49, 51, 54-56]. In those attempts, the weights of the language encoder are updated using a visual objective together with the pure language-based objective during pretraining. However, there is usually a huge gap in the distribution and quantity of word tokens between visual datasets and language corpora. For example, as shown in Table 1, BookCorpus and Wikipedia, two conventional language corpora, contain billions of words with millions of unique tokens, while MS COCO, a common visual-grounded dataset, contains only 6 million words and 44 thousand unique tokens. Therefore, during visual-grounded learning, only the tokens from the visual datasets are updated, while the majority of tokens are not equipped with visual information. During pretraining, tokens with and without information from images are then mixed up in the same sentence context, confusing the contextual learning process.
Moreover, previous attempts compressed the entire image into one vector as a global representation and then matched it to the paired caption. However, as shown by the samples picked from the Visual Genome [19] dataset in Figure 1, many captions describe only local regions of the corresponding image. Thus, using a global representation vector can distract the encoder from capturing local information, making it difficult for the model to align the modalities. As a solution, we use the Vision Transformer [13] as the visual encoder to store local information in patch embeddings. Additionally, aligning information from different modalities is a crucial phase in vision-language representation learning because it determines how the two sources of information are combined. Existing studies have used Optimal Transport to solve this alignment problem: Uniter [7] and ViLT [17] used an OT-based distance as a pretraining objective, while Graph Optimal Transport [5] considered two OT distances, the Wasserstein distance (WD) and the Gromov-Wasserstein distance (GWD), for cross-domain alignment in Visual Question Answering. Nevertheless, the classical optimal transport problem seeks a transportation map that satisfies marginal constraints, requiring mass from all sources to be moved to all destinations. In some cases, we want only a fraction of the mass to be carried, making this requirement restrictive; for instance, as stated above, a caption may describe only part of the image. To obtain a more flexible alignment, we propose to adapt the Partial Optimal Transport variant to align the modalities.
Our contributions can be summarized as follows:
• We propose GroundedBERT, a grounded language representation that extends the BERT representation with visual information. The visual-grounded representation is first learned from text-image pairs and then concatenated with the original BERT representation to form a unified visual-textual representation.
• We use patch embeddings from the Vision Transformer to maintain local information of the image instead of a single global representation. We also adapt Partial Optimal Transport to align the two modalities.
• We conduct extensive experiments on various downstream language tasks on the GLUE and SQuAD datasets. Empirical results show that we significantly outperform the baselines on these tasks.
RELATED WORK

In recent years, many vision-and-language pretrained models have been proposed to build joint cross-modal representations and to target vision-and-language tasks such as visual question answering and natural language for visual reasoning [7, 23, 47]. While [24, 59] used only one cross-modal Transformer for learning, [29, 48] proposed to use two single-modal Transformers and one cross-modal Transformer. Pretraining tasks such as masked language modeling and masked visual-feature classification were used in those studies to learn the vision-and-language representation.
Advanced machine learning algorithms such as the contrastive learning framework have been applied to natural language processing and computer vision [31, 33, 34, 36, 37, 45]. Optimal Transport has also been used extensively in many natural language processing tasks and in the integration of vision and language, for example, cross-lingual abstractive summarization [32], machine translation [6], vision-and-language pretraining [7, 17], and Visual Question Answering [5]. Nevertheless, the variants of OT have received less attention in vision-and-language research.
Many works on grounded language learning [1, 15, 43] have been introduced in the past few years. On the other hand, there are few attempts to improve language representations with visual information. [21] introduced multimodal skip-gram models (MMSKIP-GRAM) that take visual information into account. [9] proposed IMAGINET, which consists of GRU networks and tries to predict both the visual representation and the next word in the sentence. [16] is similar to IMAGINET but uses a bidirectional LSTM as the sentence encoder; moreover, it aims to predict both the visual feature and the other captions given one caption. [2] proposed an intermediate space called the grounded space and learns the visual and textual representations with cluster and perceptual information. [49] introduced the concept of vokenization and pretrained the language model with an additional voken-classification task.

METHODOLOGY
In this section, we introduce the details of our proposed GroundedBERT. As shown in Figure 2, our model consists of two components: a language encoder and a visual grounding module. The complete framework is illustrated in Figure 3, where two objectives are introduced.

Language encoder
We use BERT [11] as the language encoder. Given an input sentence w = (w_1, ..., w_n), we use the pretrained BERT model to contextually embed the discrete tokens w_i into hidden-state vectors:

h_i^(l) = BERT(w)_i^(l),

where h_i^(l) is the hidden state of token i at layer l of the Transformer.

Visual grounding
Visual grounding module. The visual grounding module is a multi-layer perceptron that transforms the contextual representation of each token in the sentence into a (visual) ground embedding.
We take the hidden states of the k final Transformer layers, h_i^(L-k+1), h_i^(L-k+2), ..., h_i^(L), and concatenate them as the input of the visual grounding module:

g_i = VG([h_i^(L-k+1); h_i^(L-k+2); ...; h_i^(L)]),

where [·; ·] denotes the concatenation of the hidden states of token i from layer L-k+1 to L, and VG stands for Visual Grounding.
Visual-textual embedding. The textual embedding is the final hidden state of the language encoder. The ground embedding is concatenated with this textual embedding to form a unified visual-textual embedding of each token in the sentence:
e_i = [h_i^(L); g_i],

where e_i is the visual-textual embedding vector of the i-th token, which we take as the final representation of the token in our GroundedBERT model, and [h_i^(L); g_i] is the concatenation of the final hidden state h_i^(L) and the ground embedding g_i.
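The visual grounding module and the visual-textual embedding described above can be sketched as follows. This is a minimal NumPy illustration: the layer count k, the hidden sizes, and the randomly initialized MLP weights are our assumptions for demonstration, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

L, n, d_h, k, d_g = 12, 5, 768, 4, 64  # layers, tokens, hidden dim, last-k layers, ground dim

# Simulated BERT hidden states: one (n, d_h) matrix per layer (embeddings + L layers).
hidden_states = [rng.standard_normal((n, d_h)) for _ in range(L + 1)]

# Visual grounding module: an MLP with one hidden layer and ReLU (weights illustrative).
W1 = rng.standard_normal((k * d_h, 256)) * 0.01
W2 = rng.standard_normal((256, d_g)) * 0.01

def visual_grounding(states):
    # Concatenate the hidden states of the final k Transformer layers per token.
    x = np.concatenate(states[-k:], axis=-1)        # (n, k * d_h)
    return np.maximum(x @ W1, 0.0) @ W2             # (n, d_g) ground embeddings g_i

g = visual_grounding(hidden_states)
# Visual-textual embedding e_i: final hidden state concatenated with ground embedding.
e = np.concatenate([hidden_states[-1], g], axis=-1)  # (n, d_h + d_g)
print(e.shape)  # (5, 832)
```

Note that only the grounding MLP introduces new parameters; the BERT hidden states are used as-is, which matches the design of leaving the original language model untouched.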

Visual encoder
Patch embedding. Instead of using traditional convolution-based architectures for visual feature extraction [2, 16, 49], we use the Vision Transformer (ViT) [13]. Let x be an input image of size (c, w, h), which stands for the number of channels, width, and height of the image. Image x goes through the ViT to obtain a global feature vector ṽ_0 and m patch embeddings ṽ_1, ..., ṽ_m:

ṽ_0, ṽ_1, ..., ṽ_m = ViT(x).

Image projection. We use a multi-layer perceptron to project the feature vector ṽ_j of each patch into the grounded space, representing the visual context learned from visual features:

v_j = prj(ṽ_j),

where v_0 is the global embedding of the input image, the v_j's (j ≥ 1) are the patch embeddings, and prj stands for (image) projection.
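The projection step can be sketched as below. We stand in for the ViT output with random features; the patch count (196, i.e. a 224×224 image with 16×16 patches), feature dimension, and MLP weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_vit, d_g = 196, 768, 64  # patches, ViT feature dim, grounded-space dim (assumed)

# Stand-in for the ViT output: global feature in row 0, then m patch features.
v_tilde = rng.standard_normal((m + 1, d_vit))

# Image projection: an MLP mapping each feature (global or patch) to the grounded space.
W1 = rng.standard_normal((d_vit, 256)) * 0.01
W2 = rng.standard_normal((256, d_g)) * 0.01

def project(feats):
    return np.maximum(feats @ W1, 0.0) @ W2

v = project(v_tilde)              # (m + 1, d_g)
v_global, v_patches = v[0], v[1:]
print(v_global.shape, v_patches.shape)
```

The same projection is applied to the global vector and to every patch, so both live in the grounded space used for matching and alignment.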

Training
In this section, we introduce two different optimization objectives: Image-sentence matching for global matching and Optimal transport matching for alignment between local features.
Image-sentence matching. The image-sentence matching task is inherited from the image-text matching task used in the vision-and-language pretraining literature mentioned in Section 2. Learning to perform well on this task encourages the model to better relate the textual information to the visual signal in a global sense.
From each modality, we take one vector as its global representation. For the vision side, we use the global feature vector from the ViT. For the language side, we use the visual-textual embedding of the [CLS] token. We concatenate these two vectors before feeding them into a fully connected layer with sigmoid activation to make a binary prediction of whether the sentence describes the image.
Algorithm 2: Computing Partial Optimal Transport (Sinkhorn-style; all division operations are element-wise).

The negative pair is created by replacing the image with another randomly selected image from the training set. We apply the binary cross-entropy loss for optimization:

L_ISM = -[y log p + (1 - y) log(1 - p)],

where y is the binary indicator, y = 1 if the image matches the sentence and 0 otherwise, and p is the predicted matching probability.
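The matching head and its loss can be sketched as follows; the embedding dimensions and the randomly initialized classifier weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_e, d_g = 832, 64  # visual-textual dim and grounded dim (assumed values)

cls_embed = rng.standard_normal(d_e)   # visual-textual embedding of the [CLS] token
img_global = rng.standard_normal(d_g)  # projected global image embedding

w = rng.standard_normal(d_e + d_g) * 0.01  # fully connected layer -> scalar logit
b = 0.0

def match_prob(text_vec, image_vec):
    # Concatenate the two global vectors, apply the linear layer and a sigmoid.
    z = np.concatenate([text_vec, image_vec]) @ w + b
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    # Binary cross-entropy for one (sentence, image) pair.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

p = match_prob(cls_embed, img_global)
print(p, bce(p, 1), bce(p, 0))  # probability, loss as positive pair, loss as negative pair
```

In training, the positive pair uses the caption's own image and the negative pair a randomly drawn image, exactly as the paragraph above describes.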
Optimal transport for vision-language alignment.To solve the alignment between language and vision, we use Optimal Transport (OT), specifically the Partial Optimal Transport (POT) variant.
For each image, we have m patch embeddings V = (v_1, ..., v_m). For each sentence, we have n hidden representations of the words W = (w_1, ..., w_n). We consider these two collections as the supports of two empirical distributions with uniform weights, and use OT to estimate the distance between these distributions. Specifically, we compute the cost matrix C where C_ij = 1 - cos(v_i, w_j), the cosine distance between the corresponding patch and word embeddings. We also let a = 1_m / m and b = 1_n / n be the two uniform weight vectors.
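The cost matrix and uniform marginals can be computed as below; the sizes m, n, and d are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 4, 3, 8  # patches, words, embedding dim (illustrative)

V = rng.standard_normal((m, d))  # patch embeddings
W = rng.standard_normal((n, d))  # word hidden representations

# Cosine-distance cost matrix: C_ij = 1 - cos(v_i, w_j).
Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
C = 1.0 - Vn @ Wn.T              # (m, n), entries in [0, 2]

a = np.full(m, 1.0 / m)          # uniform weights over patches
b = np.full(n, 1.0 / n)          # uniform weights over words
print(C.shape, a.sum(), b.sum())
```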
The distance between the two modalities can then be defined using the OT-based distance:

D_OT(V, W) = min_{T ≥ 0, T 1_n = a, Tᵀ 1_m = b} Σ_{i,j} T_ij C_ij.

This formulation constrains all the mass from one distribution to be transported to the other. We find, however, that this constraint is restrictive for the problem at hand, where the sentence describes the corresponding image only partially. Therefore, it is intuitively more apt to use the POT variant, described as follows:
D_POT(V, W) = min_{T ≥ 0, T 1_n ≤ a, Tᵀ 1_m ≤ b, 1ᵀ_m T 1_n = s} Σ_{i,j} T_ij C_ij,

where s is the total amount of mass to be transported. In our implementation, s is set to the total mass of the uniform weight vector of the text. We use Sinkhorn-based algorithms to calculate the transportation plan T and the OT-based distance; Algorithm 1 and Algorithm 2 cover OT and POT, respectively. The average distance D for every matching pair of sentence and image is minimized, while the distance for non-matching pairs is maximized. Formally, the alignment loss is:

L_align = D(w, x⁺) - D(w, x⁻),

averaged over the sentences w in the given dataset, where x⁺ and x⁻ are the matching and non-matching images corresponding to the sentence w. The procedure to pick the negative image is the same as in the image-sentence matching task.
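Since the paper's Algorithms 1 and 2 are not reproduced in this extraction, the sketch below shows one standard Sinkhorn-based formulation: entropy-regularized OT, plus a partial-OT solver built on the well-known dummy-point construction (a zero-cost slack row/column absorbs the untransported mass, and a large cost on the dummy-dummy cell pins the transported mass to s). The regularization strength eps and iteration count are assumptions; the authors' exact procedure may differ.

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.05, iters=500):
    """Entropy-regularized OT via Sinkhorn iterations; returns (plan, cost)."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]
    return T, float((T * C).sum())

def partial_sinkhorn(C, a, b, s, eps=0.05, iters=500):
    """Partial OT via the dummy-point extension of the cost matrix."""
    m, n = C.shape
    A = np.zeros((m + 1, n + 1))
    A[:m, :n] = C
    A[m, n] = C.max() * 10 + 1.0        # forbid dummy-to-dummy transport
    a_ext = np.append(a, b.sum() - s)   # slack mass on each side
    b_ext = np.append(b, a.sum() - s)
    T_ext, _ = sinkhorn(A, a_ext, b_ext, eps, iters)
    T = T_ext[:m, :n]                   # keep only the real (patch, word) plan
    return T, float((T * C).sum())

# Tiny demo: two patches, two words, cheap diagonal matches.
C = np.array([[0.1, 1.0],
              [1.0, 0.1]])
a = b = np.array([0.5, 0.5])
T_full, d_full = sinkhorn(C, a, b)
T_part, d_part = partial_sinkhorn(C, a, b, s=0.5)  # move only half the mass
print(round(d_full, 3), round(d_part, 3))
```

With s equal to the full text mass the partial problem reduces to classical OT; with smaller s the plan is free to ignore image patches the sentence does not describe, which is exactly the flexibility argued for above.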

EXPERIMENTAL SETUP 4.1 Datasets
Training. We use the MS COCO [25] and Visual Genome [19] image captioning datasets as the training data for the image projection and visual grounding modules.

Evaluation tasks and metrics
All tasks are single-sentence or sentence-pair classification except STS-B, which is a regression task. MNLI has three classes; all other classification tasks are binary. The evaluation tasks are varied: question answering (QNLI, SQuAD), acceptability (CoLA), sentiment (SST-2), paraphrase (MRPC), and inference (MNLI, RTE, QNLI). The metric for each task is shown in Table 3. For MRPC, we report F1 score. For STS-B, we report Pearson correlation. For both SQuAD v1.1 and v2.0, we report exact match and F1 score.
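The SQuAD-style metrics mentioned above can be sketched compactly. This follows the official evaluation script's answer normalization (lowercasing, stripping articles and punctuation) in simplified form; it is an illustration, not the exact script.

```python
import re
from collections import Counter

def normalize(s):
    """Lowercase, drop articles and punctuation, collapse whitespace."""
    s = s.lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    s = re.sub(r"[^\w\s]", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    """Token-level F1 between predicted and gold answer spans."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))       # 1.0
print(round(f1("the tall Eiffel Tower", "Eiffel Tower"), 2))  # 0.8
```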

Implementation
We use BERT-base-uncased as the language model and vit_base_patch16_224 as the visual encoder. We load the BERT weights pretrained on BookCorpus and Wikipedia from the Hugging Face PyTorch framework, and load the ViT weights pretrained on ImageNet. The language encoder and the patch-embedding extraction are frozen; we train only the image projection and visual grounding modules on top of the contextual representations and image feature maps. Both modules are multi-layer perceptrons with one hidden layer and ReLU activation. We set the MLP output dimension to 64 or 128 to evaluate how visual information impacts the textual-visual representation in Section 6.1. Our model is trained with a learning rate of 1e-4 for 12 epochs using the AdamW [28] optimizer, with a batch size of 512 on one V100 GPU, and training takes 3-4 days.

Compared to the baseline models
The fine-tuning results on 9 different natural-language tasks are reported in Table 2. We compare our GroundedBERT, with BERT-base as the language encoder, to the BERT-base and Vokenization baselines. Our GroundedBERT outperforms the baselines on all downstream tasks. Specifically, we achieve improvements of 0.5 to 3.6 points over BERT-base. Compared to Vokenization, we also achieve higher scores on most tasks, except SQuAD v1.1. This shows that our grounded language representation can capture more useful information for language understanding without changing the original language model.

Compared to other vision-and-language pretrained models
To demonstrate the effectiveness of our proposed grounded language learning approach, we compare it with the following state-of-the-art vision-and-language pretrained models.
• LXMERT [48] consists of two single-modal Transformers and one cross-modal Transformer to connect vision and language semantics.
• VisualBERT [23] consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention.
• VL-BERT [47] uses the Transformer model as its backbone and extends it to take both visual and linguistic embedded features as input.
• ViLBERT [29] extends the BERT architecture to a multi-modal two-stream model that processes both visual and textual inputs.
• Oscar [24] uses object tags detected in images as anchor points to ease the learning of alignments between text and image.
We also fine-tune all models on 9 different natural-language tasks from the GLUE and SQuAD datasets. For a fair comparison, all models are initialized with the pretrained BERT weights, except LXMERT, which is pretrained from scratch. As shown in Table 4, the fine-tuning results of our model consistently outperform the other pretrained models on all tasks. These results suggest that fine-tuning the BERT weights on vision-language objectives causes the model to forget the original knowledge learned from the huge language corpus, whereas our approach leaves the original BERT intact.

ANALYSIS 6.1 The impact of visual grounding
To understand the impact of visual grounding on the text representation, we train GroundedBERT without using visual information, with the weights of the visual grounding module randomly initialized. The results in Table 5 show that the visual information contributes significantly to language grounding and is beneficial to the textual representation. We also study the contribution of visual grounding with different visual embedding dimensions. Since the dimension of the hidden representation of the language encoder (BERT) is fixed by its configuration, we can set the dimension of the additional visual information flexibly.

Different learning rates on GLUE
Following the setting of BERT [11] on the GLUE tasks, we conduct additional experiments with more runs at different learning rates, similar to the BERT paper. The learning rates are set to {2, 3, 4, 5} × 10^-5 for a fair comparison. The results in Table 6 show that our model consistently outperforms the baseline on all datasets for all learning rates.

Different training strategy
We conduct experiments on GroundedBERT trained with different combinations of the Optimal Transport and Classifier objectives. Table 7 reports the evaluation of our model on GLUE and SQuAD for 5 different approaches: only Classifier, only Classical OT (OT), only Partial OT (POT), Classifier + OT, and Classifier + POT. The results show that the combination of Classifier and Partial OT achieves the highest scores on most tasks, while Partial OT performs better than Classical OT both with and without the Classifier.

CONCLUSION
In this paper, we propose GroundedBERT, a grounded language learning model that incorporates visual information into the BERT representation. We introduce a visual grounding module to capture visual information, which is then joined with the text representation to create a unified visual-textual representation. Our model significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.

Figure 2 :
Figure 2: Implementation of our GroundedBERT. The model consists of two components, i.e., the language encoder and the visual grounding module. The new language-model representation combines the textual embedding and the visual embedding.

Figure 3: Implementation of our training framework. The framework consists of two parallel pipelines for vision and text; the whole model is trained with two objectives: image-sentence matching and optimal transport matching for alignment.

Algorithm 1: Computing Optimal Transport.

Table 1: Statistics of some common datasets used in the visual-grounded language learning task.

Table 2: Downstream task results of BERT, Vokenization [49], and our GroundedBERT; we conduct the experiments on the BERT-base architecture. MRPC results are F1 score, STS-B results are Pearson correlation, and SQuAD v1.1 and v2.0 results are exact match and F1 score. The best results are marked in bold; all are scaled to the range 0-100. The Δ_B and Δ_V columns show the difference between our model and the baseline and the Vokenization method, respectively.

Table 3 :
Task descriptions and statistics.

Table 4 :
Downstream task results of different vision-and-language pretrained models.

Table 5 :
Downstream task results and comparison of our GroundedBERT without training the visual grounding module. The first two rows report the fine-tuned results of our model without training on the visual-grounded datasets, while the last 4 rows show the results of our approaches with both OT and POT.

Table 6 :
Downstream task results of BERT and our GroundedBERT with different learning rates on GLUE.

Table 7 :
Downstream task results of different approaches to training the visual grounding module.