Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity and Relation Extraction

How can we better extract entities and relations from text? Multimodal extraction over images and text obtains more signals for entities and relations, and aligning them through graphs or hierarchical fusion aids extraction. Despite attempts at various fusions, previous works have overlooked the many available unlabeled image-caption pairs, such as NewsCLIPpings. This paper proposes innovative pre-training objectives for entity-object and relation-image alignment: we extract objects from images and align them with entity and relation prompts to obtain soft pseudo-labels. These labels serve as self-supervised signals for pre-training, enhancing the ability to extract entities and relations. Experiments on three datasets show an average 3.41% F1 improvement over the prior SOTA. Additionally, our method is orthogonal to previous multimodal fusions, and applying it to prior SOTA fusions yields a further 5.47% F1 improvement.


INTRODUCTION
Entity and relation extraction aims at detecting potential entities and their inter-relations from unstructured text, building structured <Entity, Relation, Entity> triples for downstream tasks such as QA [33, 34, 49] and web search [37]. Prior attempts relied solely on a single modality, e.g., text, for extraction [16, 21-23, 36]. Recent studies found that image representations aid triple extraction, prompting research on multimodal extraction [7, 17, 32, 63]. Fusion and alignment of image and text representations have thus become a research focus in the MM community. Zhang et al. [60] and Zheng et al. [62] attempted graph-based multimodal fusion for fine-grained object and entity information extraction. Chen et al. [7] tried a hierarchical multimodal fusion framework that removes irrelevant image objects, enhancing the robustness and effectiveness of entity extraction and achieving SOTA results. Overall, these works have been confined to training multimodal models solely on existing labeled datasets. Due to the scarcity of labeled multimodal data, their effectiveness has been significantly constrained.

Figure 1: Previous methods (above) utilize graphs or hierarchical multimodal fusion to combine text and image embeddings. Our method (below) obtains soft entity pseudo-label prompts by extracting potential objects and soft relation pseudo-label prompts through images. The obtained self-supervised pseudo-label signals are used to pre-train the multimodal fusion, enhancing its ability to extract entity- and relation-related information from images.
To address this, large-scale unsupervised image-caption pairs such as NewsCLIPpings [13] serve as an effective resource for pre-training multimodal models. Multimodal pre-training works such as CLIP [44], Oscar [31], and ALPRO [25] offered solutions by aligning text with images, but they neglect modeling the entity and relation information embedded within the modalities. We propose a multimodal pre-training method focusing on entity and relation information within text and images, using a vast amount of unlabeled image-caption pairs and novel alignment methods to obtain pseudo entity and relation labels for entity-object and relation-image pairs. These pairs help align the semantic spaces of images and text and benefit multimodal entity and relation extraction. As shown in Figure 1, the entity-object alignment task uses automated object recognition tools, such as YOLO [54], to extract potential objects from images. We then use automated entity detection tools, such as spaCy (https://spacy.io/), to extract potential entities from captions and align these entities with the potential objects using entity prompts, such as "This is an image of [Entity]", to obtain soft pseudo entity labels. For example, when the object is a girl, the ranked entity embedding similarities are: [Girl, 0.62; Face, 0.21; ...]. We store the pseudo entity labels corresponding to each object. Similarly, the relation-image alignment task collects a relation database from Wikidata and calculates the similarity between relation prompts, such as "The image shows the relation of [Relation]", and image embeddings. For instance, for the image in Figure 1, we obtain the pseudo relation labels: [Medal Awarded, 0.68; Award Received, 0.12; ...].
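The prompt-based pseudo-labeling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the example similarity scores, and the temperature value are our own assumptions.

```python
import numpy as np

def soft_pseudo_labels(sims, temperature=0.1):
    """Turn similarity scores between an object embedding and a set of
    entity-prompt embeddings into a soft pseudo-label distribution."""
    logits = np.asarray(sims, dtype=np.float64) / temperature
    logits -= logits.max()              # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Hypothetical similarities of a cropped "girl" object against the
# prompts "This is an image of {Girl, Face, Dog}".
probs = soft_pseudo_labels([0.62, 0.21, 0.05])
```

The soft distribution (rather than a hard argmax label) is what later smooths the impact of incorrect pseudo-labels during pre-training.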
After obtaining these pseudo entity and relation labels, we use them as self-supervised signals to guide the pre-training of the multimodal fusion. The soft-label mechanism smooths the impact of prediction errors, even if some pseudo labels are incorrect [42]. We employ a cross-entropy loss to enhance the multimodal fusion's ability to extract objects and their relations from images and to guide the text representations of entities and their relations. Taking Figure 1 as an example, previous methods incorrectly predicted the entity Oscar as the PERSON type, leading to an erroneous relation prediction of per/per/peer. In our approach, the multimodal fusion has already unleashed the power of alignment information between potential objects and entities from a massive amount of unlabeled image-caption pairs, so it easily predicts Oscar as the MISC type, resulting in the correct relation prediction of per/misc/awarded.
Our experiments on three public multimodal NER and RE datasets show a 3.41% average F1 improvement over previous SOTA methods. Our self-supervised pre-training is orthogonal to prior hierarchical and graph-based multimodal fusion techniques, allowing vast image-caption data to pre-train SOTA fusion modules: continued pre-training enhances F1 by 5.47% on the NER and RE tasks. Given the explosive popularity of ChatGPT, we have also experimented with Visual ChatGPT [57] on multimodal NER and RE tasks. In summary, our contributions are threefold.

RELATED WORK

Zhang et al. [60] proposed using region-based image features to represent objects, employing Transformer and visual encoders for fine-grained semantics. Numerous experiments showed that including objects from images can significantly improve entity extraction accuracy. However, most methods overlook error sensitivity. Sun et al. [50] proposed learning a text-image relation classifier for better multimodal fusion and less interference from irrelevant images. Chen et al. [7] attempted a hierarchical multimodal fusion framework to remove irrelevant objects. Wang et al. [55] used a retrieval approach to gather relevant text from Wikipedia for image- and text-based predictions.
Nonetheless, previous methods neglected the vast amount of easily accessible image-caption pair data, such as NewsCLIPpings. In this paper, we propose novel entity-object and relation-image alignment tasks, extracting soft pseudo entity and relation labels from image-caption pair data as self-supervised signals to enhance the extraction capabilities of multimodal fusion models.

Self-Supervised Multimodal Learning
In multimodal learning, models process and integrate data from multiple modalities [5, 6, 45], with applications in visual and language learning [43], video understanding [46, 47], and natural language understanding [29, 30, 35]. However, expensive human annotations are often required for effective training. Self-supervised learning [2, 46, 52, 64] addresses this by using one modality as a supervisory signal for another, such as masking elements in images or text and using information from the other modality to reconstruct the masked content [1, 3].

Figure 2: Overview of the multimodal pre-training. We utilize three contrastive losses: the contrastive image-text loss (L_cit), the contrastive object-entity loss (L_coe), and the contrastive image-relation loss (L_cir). These losses align the input text sequence with the corresponding image, the entities with the objects, and the image with the pre-defined relations, respectively. We pre-train the model on NewsCLIPpings [13], which includes both matched and mismatched image-caption pairs. In matched pairs, the caption accurately describes the image, resulting in similar representations in semantic space, and vice versa. Therefore, we additionally utilize an image-text matching loss (L_itm).
Self-Supervised Multimodal Learning (SSML) leverages multimodal data and self-supervised objectives to enhance multimodal models. Pairings between modalities can be used as input by SSML algorithms (e.g., when one modality supervises another [8, 24]) or as output (e.g., learning from unpaired data and inducing pairings as a byproduct [15, 56]). Classical SSML methods include coarse-grained and fine-grained alignment: coarse-grained alignment assumes alignment between whole images and captions [44], while fine-grained alignment refers to correspondence between caption words and image patches [25, 31]. However, separate embedding spaces for the modalities decrease effectiveness in modeling cross-modal interactions. In this paper, we employ entity and relation prompts for both alignments and use soft pseudo entity and relation labels to train the multimodal fusion, improving entity and relation extraction performance.

TASK FORMULATION
Multimodal named entity recognition (MNER) is defined as a sequence labeling task. Given an input text sequence T = {w_1, w_2, ..., w_n} and an associated image V, where n is the length of the sequence, the objective of the MNER task is to identify entities in the input sequence using the B, I, O format and categorize the recognized entities into predefined types.
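To make the B, I, O format concrete, the following sketch shows a BIO-labelled token sequence and a small decoder that recovers entity spans from it. The token/label example is hypothetical (loosely modeled on the paper's Oscar example), and the decoder is a generic illustration, not the paper's evaluation code.

```python
# A caption's tokens with BIO labels over entity types (e.g., PER, MISC).
tokens = ["Leo", "wins", "the", "Oscar"]
labels = ["B-PER", "O", "O", "B-MISC"]

def decode_entities(tokens, labels):
    """Collect (span_text, entity_type) pairs from a BIO-labelled sequence."""
    entities, current, etype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):            # a new entity begins
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)             # continue the current entity
        else:                               # O tag closes any open entity
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities
```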

MODEL ARCHITECTURE
We introduce two prompt-guided alignment mechanisms, one for entity-object alignment and the other for image-relation alignment, in addition to an image-caption consistency alignment module.These alignment modules are pre-trained on large-scale multimodal data and then fine-tuned on task-specific data for NER and RE.In the following sections, we describe each of these modules in detail.

Image-Caption Consistency Alignment
The goal of this module is to encode the input text and image such that their respective features are mapped to the same embedding space, allowing improved fusion between the two modalities. Given a batch of matched and mismatched image-caption pairs obtained from NewsCLIPpings, X = {(T_i, V_i, y_i)}, i = 1..B, where B denotes the batch size, T_i is the caption that describes the image V_i, and y_i is a binary label indicating whether the pair is matched (1) or not (0), we utilize a context encoder to obtain contextualized representations of the input caption. The context encoder can be a Transformer [53] or BERT [11]; we use BERT as an example. We feed the text sequence T_i into BERT to obtain the encoded representation H_i^t = BERT(T_i). To encode the visual content V_i, we first partition it into K disjoint small patches and apply a projection head to map each patch into a lower-dimensional space, obtaining a sequence of patch tokens. Then, we use a Vision Transformer (ViT) [12] pre-trained on ImageNet [10] to encode these tokens and generate a sequence of visual embeddings H_i^v, where the hidden sizes of the projection head and the Vision Transformer are d_p and d_v, respectively. As with text encoding, we obtain the visual representation from the output of the [CLS] token, denoted h_i^{v,cls}.
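The patch partitioning and linear projection step can be sketched as follows. The patch size, projection dimension, and the random matrix standing in for the learned projection head are illustrative assumptions, not the paper's values.

```python
import numpy as np

def patchify(image, patch):
    """Split an H x W x C image into non-overlapping flattened patches."""
    H, W, C = image.shape
    patches = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            patches.append(image[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(patches)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))     # a 224x224 RGB image
patches = patchify(img, 16)                  # 14*14 = 196 patches of dim 768
W_proj = rng.standard_normal((16 * 16 * 3, 128)) * 0.01
patch_tokens = patches @ W_proj              # projection to a lower-dim space
```

The resulting `patch_tokens` sequence is what a ViT-style encoder would consume.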
To obtain multimodal embeddings, we concatenate the text and visual embeddings obtained above and use a BERT model to fuse the embeddings from the two modalities: H_i^m = BERT([H_i^t; H_i^v]), where H_i^m denotes the output multimodal embeddings with hidden size d = 768.
First, we employ an image-text matching loss L_itm to aid the alignment between the images and captions, where a classifier (e.g., a linear layer followed by a softmax operation) predicts the match label y_i from the multimodal [CLS] representation. However, as the features of the different modalities exist in separate embedding spaces, the transformer-based cross-modal encoder and L_itm alone may not possess adequate alignment and fusion capabilities. To address this, we introduce a contrastive image-text loss L_cit. Different from conventional contrastive losses that measure similarity between embeddings directly by dot products, we adopt two projection heads g_t(.) and g_v(.) to project the embeddings of the two modalities into a shared low-dimensional space, and measure the similarity of an image and caption by the dot product of the projections. The similarity between the text and visual representations will be higher for matched pairs, since they express similar content and produce similar representations. Thus, given the matched image-caption pairs of a batch, of size B_m <= B and with y_i = 1, we minimize the contrastive image-text loss L_cit to strengthen the model's ability to align text-image pairs, where tau is the temperature coefficient. The text-to-image loss L_t2i encourages each text embedding to be similar to its paired image embedding, while the image-to-text loss L_i2t aims to align the images to the texts. Finally, we define the contrastive image-text loss L_cit as the average of L_t2i and L_i2t, which encourages bidirectional alignment of the text and image embeddings, leading to improved multimodal representations.
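A minimal sketch of the bidirectional contrastive image-text loss, assuming L2-normalised projections and a standard InfoNCE formulation; the paper's exact normalisation and batching details may differ.

```python
import numpy as np

def contrastive_image_text_loss(z_t, z_v, tau=0.07):
    """Symmetric InfoNCE over a batch of matched (text, image) pairs.
    z_t, z_v: (B, d) projected, L2-normalised embeddings; the pair with
    the same batch index is the positive."""
    sim = z_t @ z_v.T / tau                        # (B, B) similarity matrix

    def xent(logits):                              # diag entries are positives
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average of text-to-image and image-to-text directions.
    return 0.5 * (xent(sim) + xent(sim.T))

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 32))
z /= np.linalg.norm(z, axis=1, keepdims=True)
loss_aligned = contrastive_image_text_loss(z, z)              # perfect pairing
loss_shifted = contrastive_image_text_loss(z, np.roll(z, 1, axis=0))
```

With perfectly matched projections the loss is near zero; mis-paired projections are penalised.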

Prompt Guided Entity-Object Alignment
After establishing the model's image and text alignment capabilities using L_itm and L_cit, a more refined object-entity alignment ability is necessary to support the MNER task. To this end, this module generates soft entity pseudo-labels for fine-grained image regions, providing supervision to the pre-training model and improving its object-entity alignment capabilities. Soft pseudo-labels differ from hard pseudo-labels in that they are represented as a probability distribution over types. By generating soft entity pseudo-labels for fine-grained image regions, the model can utilize object information in the image to aid entity recognition and also improve its understanding of the entities that appear in the pre-training dataset. As with the encoders defined in Section 4.1, the soft entity pseudo-label generator consists of a textual encoder and a visual encoder, which separately encode the textual entity prompts and the visual objects obtained from the original image.
To obtain entities from the dataset and increase the model's entity recognition ability, we first segment the captions of the dataset using spaCy and extract the most frequent M nouns as the candidate entity set E_c. Next, we generate a series of prompts {P_e^(1), ..., P_e^(M)} based on the M candidate entities using a fixed template, for example "This is an image of {}", where the slot is filled with each entity e in E_c. We obtain the text representations of these prompts by sending them to the entity text encoder. Subsequently, we use YOLO to crop the input image V_i and generate an object O_i; we then partition O_i into K disjoint small patches and send them to the entity visual encoder to obtain its visual representation.
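The candidate-entity and prompt construction can be sketched as below. To keep the snippet self-contained we substitute a toy noun extractor for spaCy (the paper uses spaCy's linguistic analysis); the captions and `top_m` value are hypothetical.

```python
from collections import Counter

def candidate_entities(captions, noun_extractor, top_m=2):
    """Build the candidate entity set from the M most frequent nouns,
    then wrap each in the fixed entity-prompt template."""
    counts = Counter(n for cap in captions for n in noun_extractor(cap))
    entities = [noun for noun, _ in counts.most_common(top_m)]
    prompts = [f"This is an image of {e}" for e in entities]
    return entities, prompts

# Toy stand-in for spaCy noun extraction: keep capitalised words.
toy_nouns = lambda cap: [w for w in cap.split() if w[0].isupper()]

caps = ["Girl holds Medal", "Girl smiles", "Medal shines", "Dog runs"]
entities, prompts = candidate_entities(caps, toy_nouns, top_m=2)
```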

The resulting soft entity pseudo-label q_i lies in R^M, where tau is the temperature coefficient. The soft entity pseudo-label guides the pre-training by providing supervision for the model's object-entity alignment capabilities. Specifically, let H_i^{m,O} be the subset of multimodal embeddings whose image patches contain the object O_i; we use the average of these embeddings as the object's multimodal representation. Ultimately, we define the contrastive object-entity loss L_coe as the cross-entropy loss between the pseudo-label q_i and the classifier distribution p_i, averaged over the batch size B.
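A sketch of the contrastive object-entity loss, assuming the pseudo-label is a temperature-softmax over object/prompt similarities and the classifier output is another softmax distribution. The dimensions and synthetic embeddings are illustrative only.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=np.float64)
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def object_entity_loss(obj_emb, prompt_embs, cls_logits, tau=0.1):
    """Cross-entropy between the soft entity pseudo-label q (derived from
    object/prompt similarities) and the classifier distribution p."""
    q = softmax(obj_emb @ prompt_embs.T / tau)   # soft pseudo-label over M entities
    p = softmax(cls_logits)                      # classifier output distribution
    return -np.sum(q * np.log(p + 1e-12))

rng = np.random.default_rng(0)
prompts = rng.standard_normal((5, 16))           # M = 5 entity-prompt embeddings
obj = prompts[2] + 0.01 * rng.standard_normal(16)  # object closest to entity 2
loss_good = object_entity_loss(obj, prompts, prompts[2] @ prompts.T)
loss_bad = object_entity_loss(obj, prompts, prompts[0] @ prompts.T)
```

A classifier that agrees with the pseudo-label (peaked on the same entity) incurs a much smaller loss than one peaked elsewhere.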

Prompt Guided Image-Relation Alignment
After obtaining the entity-object alignment ability, which helps the model better discover entities, we need a relation-image alignment ability to enhance the model's recognition of the relations between entities, thereby improving performance on MRE tasks that require understanding the relations between entities in text. This module also contains a textual encoder and a visual encoder. We propose generating soft relation pseudo-labels that improve the model's relation-image alignment using a series of prompts based on pre-defined relation tags R = {r_j}, j = 1..N_r. The relation tags are obtained from Wikidata, with a focus on relations of type "data", which are the primary relations associated with entities. The use of these relation tags enhances the model's ability to generalize and accurately identify relations between entities. This mirrors our approach for generating soft entity pseudo-labels, which provided supervision for the object-entity alignment capabilities. Specifically, we generate a series of prompts from the relation tags using a fixed template, for example "The image shows the relation of {}".
As the image is randomly cropped to obtain objects, it is not guaranteed that the entity mentioned in the sentence corresponds to the object cropped from the image. Therefore, we use the entire image as the target for relation alignment. We define the soft relation pseudo-label q_i^r and the probability distribution p_i^r obtained from the classifier f_r(.), where we utilize the embedding of the special token [CLS] as the multimodal representation.
Finally, the contrastive image-relation loss L_cir is defined as the cross-entropy between the soft relation pseudo-label q_i^r and the classifier distribution p_i^r. Overall, the model is optimized through the joint optimization of four losses: (1) the image-text matching loss L_itm and (2) the contrastive image-text loss L_cit, both of which improve the model's ability to align image and text; (3) the contrastive object-entity loss L_coe, which enhances the model's ability to align objects with their corresponding entities; and (4) the contrastive image-relation loss L_cir, which improves the model's ability to align images with their corresponding relations. Our final loss function is L = lambda_itm * L_itm + lambda_cit * L_cit + lambda_coe * L_coe + lambda_cir * L_cir, (11) where the scalar hyper-parameters lambda_itm, lambda_cit, lambda_coe, and lambda_cir control the weights of the four losses.
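The joint objective is a weighted sum of the four losses; a trivial sketch (the parameter names are ours):

```python
def total_loss(l_itm, l_cit, l_coe, l_cir,
               lam_itm=1.0, lam_cit=1.0, lam_coe=1.0, lam_cir=1.0):
    """Weighted sum of the four pre-training objectives."""
    return (lam_itm * l_itm + lam_cit * l_cit
            + lam_coe * l_coe + lam_cir * l_cir)
```

Setting a lambda to zero recovers the corresponding ablation (e.g., "Ours w/o Relation-Image Alignment" corresponds to lam_cir = 0).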

Model Fine-tuning on MNER and MRE
After pre-training the model, we fine-tune it on the MNER and MRE tasks. For MNER, we pass the multimodal encoder output H^m to a conditional random field (CRF) to enforce structural correlations between labels in sequence decoding. For a specific predicted sequence y, the probability of the sequence is defined through potential functions over the pre-defined label set Y with the BIO tagging schema; the output label sequence y* is the one with maximum probability, and the model is trained with the corresponding negative log-likelihood. For MRE, given a batch of multimodal MRE data of batch size B, to process the sentence T = {w_1, w_2, ..., w_n} along with its corresponding entities e1 and e2, we follow the approach proposed by Soares et al. [48] and introduce four special tokens to represent the start and end of e1 and e2. These tokens, denoted [E1start], [E1end], [E2start], and [E2end], are injected into the original sentence T. We then utilize the modified token sequence as the input to the textual encoder. For the relation representation between e1 and e2, we concatenate the contextualized entity representations corresponding to the [E1start] and [E2start] tokens from the multimodal encoder, obtaining h_r in R^{2d}, where d is the hidden dimension of 768.
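The entity-marker injection of Soares et al. can be sketched as follows; the exact marker token strings are assumptions on our part, since they are garbled in the extracted text.

```python
def insert_entity_markers(tokens, span1, span2):
    """Wrap the two pre-extracted entities with start/end marker tokens,
    following the entity-marker scheme of Soares et al.
    span1/span2 are (start_index, end_index) pairs, inclusive."""
    (s1, e1), (s2, e2) = span1, span2
    out = []
    for i, tok in enumerate(tokens):
        if i == s1:
            out.append("[E1start]")
        if i == s2:
            out.append("[E2start]")
        out.append(tok)
        if i == e1:
            out.append("[E1end]")
        if i == e2:
            out.append("[E2end]")
    return out

tokens = ["Leo", "wins", "the", "Oscar"]
marked = insert_entity_markers(tokens, (0, 0), (3, 3))
```

The contextualized embeddings at the [E1start] and [E2start] positions are then concatenated to form the relation representation.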
After obtaining the fixed-length relation representation h_r, we pass it through the classifier f_r(.) to obtain the probability distribution p(r|T, V) = f_r(h_r) over the relation classes. The pre-trained model is fine-tuned using the cross-entropy loss function.

EXPERIMENTS AND ANALYSES

Experimental Setup
Dataset: In line with prior research [7, 9], we carry out experiments on the Twitter-2015 [61] and Twitter-2017 [38] datasets, containing 4,000/1,000/3,257 and 3,373/723/723 sentences in their train/dev/test sets, respectively.

Evaluation Metrics: For MNER, an entity is deemed accurately identified if both its span and entity type match the gold-standard answer. For MRE, the relation between two entities is correctly extracted when the predicted relation type aligns with the gold standard. We utilize the evaluation code provided by Chen et al. [7], which reports overall Precision, Recall, and F1-score.
Hyperparameters: In our study, we employed BERT as the text encoder, setting the maximum token length to 80, and a Vision Transformer as the visual encoder. Our optimization strategy involved the AdamW optimizer with a decay rate of 1e-3 and a learning rate of 1e-4. In addition, the loss-weight hyperparameters were configured as lambda_itm = 1, lambda_cit = 1, lambda_coe = 1, and lambda_cir = 1. We pre-trained our model on NewsCLIPpings using 8 NVIDIA 3090 GPUs with a batch size of 128 for 85,000 iterations and then fine-tuned the model with a batch size of 16.
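For reference, the reported setup can be collected into a single configuration sketch. The dictionary keys and the checkpoint name are our assumptions; the numeric values are those reported above.

```python
# Hypothetical training configuration mirroring the reported setup.
config = {
    "text_encoder": "bert-base-uncased",   # assumed checkpoint name
    "max_tokens": 80,
    "optimizer": "AdamW",
    "decay_rate": 1e-3,
    "learning_rate": 1e-4,
    "loss_weights": {"itm": 1, "cit": 1, "coe": 1, "cir": 1},
    "pretrain_batch_size": 128,
    "pretrain_iterations": 85_000,
    "finetune_batch_size": 16,
}
```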

Baselines
Following previous approaches [7], we concentrate on three categories of models for this comparison: text-based models, earlier MNER and MRE models, and multimodal pre-training methods.
Multimodal pre-training methods: 1) CLIP [44], 2) BLIP [26], 3) Oscar [31], and 4) U-VisualBERT [28], all of which use text to align images or objects. In addition, our pre-training method is orthogonal to previous MNER and MRE models, so we also apply our pre-training to the prior SOTA model to test whether its performance further increases.

Main Results
Table 1 shows the results of our multimodal pre-training approach and the baseline methods on three public datasets, with each test conducted three times. Comparing these methods, we conclude that multimodal and pre-training based methods outperform text-only ones. Results in Table 2 show that all pre-training methods can further enhance MMIB's F1 performance, with ours achieving a new record, increasing the initial performance by 6.93%, 5.24%, and 4.23% on the three datasets, respectively.

Analyses and Discussions
Ablation Study. We conducted ablation experiments on the three modules of the multimodal pre-training tasks to demonstrate their effectiveness. Ours w/o Entity-Object Alignment and Ours w/o Relation-Image Alignment denote the removal of the entity-object and relation-image alignment tasks during pre-training, which reduces how much the multimodal fusion module learns about entities or relations. From Table 1, the entity-object and relation-image alignment tasks contribute average F1 improvements of 3.01% and 2.47%, respectively, to the multimodal fusion module. Since entity-object alignment benefits both the NER and RE tasks, it has the more significant impact. Ours w/o Soft Pseudo Labels uses hard pseudo-labels instead, reducing the fusion module's error-smoothing ability; incorrect predictions directly affect the module, causing an average F1 drop of 1.51%.
Investigating the Consistency of Entity-Object and Image-Relation Representations in Semantic Space. We visualized entity-object and image-relation representations in a shared semantic space to examine whether information from the different modalities is well combined. We chose 50 test samples from the Twitter-2015 and MRE datasets and used YOLO [54] to obtain objects. We applied t-SNE [14] to reduce the text and image embeddings to 2D after modality fusion and plotted them in Figure 3. Our method aligns entity-object and image-relation semantics during pre-training using prompts, which significantly helps align data from different modalities in semantic space and better exploits multimodal data for entity and relation extraction. Although MMIB and BLIP used entity-object alignment training, their entity-object and image-relation alignment in semantic space remains sparse due to the absence of prompt templates during pre-training.
The Impact of Prompt Design. We designed various prompts to investigate their influence on the pre-trained models. We found that prompts in the active voice yield better results than those in the passive voice, though the difference is not significant, with the impact being within 1% F1. Our exploration of prompt engineering is limited, which presents an interesting direction for future research.
Here, the label set combines the entity types with B, I, or O, indicating whether a token is the beginning of an entity, inside an entity, or outside any entity, respectively. The output of the MNER task is a sequence of labels y = {y_1, y_2, ..., y_n}, where y_j belongs to Y = {B-X, I-X, O} for each entity type X, representing the label of the j-th input token. The multimodal relation extraction (MRE) task involves an input text sequence T = {w_1, w_2, ..., w_n} with two pre-extracted named entities e1 and e2, as well as visual content V, where n is the length of the text sequence. The objective of the MRE task is to classify the relation tag r in R = {r_1, r_2, ..., r_{N_r}} between e1 and e2, where the set R contains the pre-defined relation types and N_r represents the number of relations.
Here, H_i^t denotes the encoded representation of the input sequence with hidden embedding size d = 768, and we use the representation of the special token [CLS], h_i^{t,cls}, as the textual representation.
After obtaining the object visual representation and the text representations of the prompts, we use the normalized softmax score between the visual representation and all textual representations to generate the soft entity pseudo-label q_i, and send the object representation to a classifier f_e(.), which outputs the probability distribution p_i = f_e(h_i^{m,O}), (7) where q_i and p_i lie in R^M.

4.4.1 MNER. For a batch of multimodal MNER data X = {(T_i, V_i, y_i)}, i = 1..B, defined in Sec. 3, where B represents the batch size, we input the textual and visual data into the pre-trained model and utilize the output H^m = {h_1, h_2, ..., h_n} of the multimodal encoder to obtain a representation for each position in the sentence.

Figure 3: Consistency of Entity-Object and Image-Relation Representations in Semantic Space.

Figure 4: The F1 Score Improvements on Tail Relations and Entity Types.
For entity-object-aligned prompts, we used two types: (1) "This is an image of [ENTITY]." (2) "An image of [ENTITY] is shown here." For relation-image-aligned prompts, we employed three types: (A) "This image shows the relation of [RELATION]." (B) "The relation of [RELATION] is shown in this image." (C) "The relation between the objects in the image is [RELATION]." The F1 impact of using different prompts is shown in Table 3.

Table 2: The F1 performance obtained by applying multimodal pre-training to the state-of-the-art multimodal method.

Table 3: The F1 Impact of Combining Prompt Designs.

As all multimodal pre-training methods are orthogonal to multimodal fusion methods, we selected the best fusion method (MMIB) and continued multimodal pre-training on top of MMIB; the results are reported in Table 2.