Training Multimedia Event Extraction With Generated Images and Captions

Contemporary news reporting increasingly features multimedia content, motivating research on multimedia event extraction. However, the task lacks annotated multimodal training data and artificially generated training data suffer from distribution shift from real-world data. In this paper, we propose Cross-modality Augmented Multimedia Event Learning (CAMEL), which successfully utilizes artificially generated multimodal training data and achieves state-of-the-art performance. We start with two labeled unimodal datasets in text and image respectively, and generate the missing modality using off-the-shelf image generators like Stable Diffusion and image captioners like BLIP. After that, we train the network on the resultant multimodal datasets. In order to learn robust features that are effective across domains, we devise an iterative and gradual training strategy. Substantial experiments show that CAMEL surpasses state-of-the-art (SOTA) baselines on the M2E2 benchmark. On multimedia events in particular, we outperform the prior SOTA by 4.2% F1 on event mention identification and by 9.8% F1 on argument identification, which indicates that CAMEL learns synergistic representations from the two modalities. Our work demonstrates a recipe to unleash the power of synthetic training data in structured prediction.


INTRODUCTION
As a fundamental research topic in the domain of information extraction, event extraction aims to identify instances of events and their arguments from unstructured data [7,11,20,44,65]. An event refers to a specific incident that involves a change in state and is marked by a trigger, such as a verb. The arguments of an event include the time and place of the event occurrence and its participants, such as the initiator, the recipient, and the instrument.
As digital media quickly evolve, news reports today frequently present information with a combination of text and images, providing a more comprehensive view of events than text alone [9,55]. This has spurred the emergence of the multimedia event extraction (MEE) task [26], which aims to jointly extract both textual and visual events from multimedia news articles. Figure 1 shows an MEE instance. Interestingly, not all events in a multimedia news article are multimodal. For example, the event Transport-Movement is described by both the text and the image modalities, whereas the events Life-Die and Contact-Meet are contained in text only and in image only, respectively.
A major challenge posed by the MEE problem is the lack of multimodal training data. The M2E2 dataset provided by [26] contains only a test set. The labeled training datasets ACE2005 [52] and imSitu [68] contain event annotations in a single modality only. Despite recent progress [49,74], transferring the knowledge learned from unimodal annotations to multimodal test data remains difficult.
After the explosive success of image generation networks such as DALL-E 2 [41] and Stable Diffusion [45], a natural thought is to perform cross-modality data augmentation in order to bridge the modality gaps of MEE. That is, conditioned on existing unimodal data, we can generate training data for the missing modality. After that, we use the resultant multimodal data to train a network. As the generative models capture world knowledge learned from observing correlative patterns among natural images and their textual descriptions, it is probable that such knowledge can be distilled and used to inform the task of event extraction.
However, a naive cross-modal data augmentation approach faces two obstacles. First, it is difficult to precisely control the generative models and produce data that are relevant to the event label and free of hallucination. In Figure 2 (d), the generated image depicts gum chewing but the event label is about talking on the phone. In Figure 2 (h), the generated caption describes the people as holding signs, whereas the label is demonstrating. In Figure 2 (g), the caption hallucinates a trailer that does not exist in the image. Second, in the case of image generation, existing models still occasionally generate images with significant deformation and unnatural artifacts. For example, the soldier in Figure 2 (b) is shown with three hands. For these reasons, the distribution of the generated data likely diverges from that of real-world data. In practice, we find that directly training on generated data results in performance degradation (Table 2).
To fully utilize the power of generative models to augment existing unimodal training data, we propose Cross-modality Augmented Multimedia Event Learning (CAMEL). After generating synthetic multimodal data, CAMEL applies an iterative and gradual training strategy that learns robust representations under noisy data and distribution shifts. We train the networks using text coupled with synthetic visual data and images coupled with synthetic textual data. The network is gradually frozen from the bottom up during training. Experimentally, we show that this training technique offers substantial benefits over naive data augmentation. In particular, on multimedia events, we outperform the previous best network, UniCL [29], by 4.2% F1 on event mention identification and 9.8% F1 on argument role identification. In addition, the training strategy of CAMEL works robustly under different choices of image generation and captioning networks.
Our contributions can be summarized as follows:
• For multimodal event extraction, CAMEL utilizes synthetic data to fill in the missing modality in the unimodal ACE2005 and imSitu training datasets. To the best of our knowledge, this is the first work that successfully demonstrates the use of bidirectional cross-modality data augmentation (text-to-image and image-to-text) for multimodal learning. This results in superior data efficiency -with the unlabeled real-world multimodal VOA dataset [26]

RELATED WORK

Event Extraction
Event extraction [27] is a well-studied problem in information extraction. Many early works [8,17,30,34,58,63,75] focus on textual data and aim to identify event structures containing trigger words and arguments from unstructured text. Traditionally, textual event extraction is formulated as sequence labeling [12,42,61,64]. More recent studies also formulate the problem as question answering [13,59,75]. Similarly, visual event extraction [5,39,46,47,60,68], also referred to as situation recognition or visual semantic role labeling, aims to identify visual events and their participants. The earlier CRF-based methods [67,68] jointly predict the event type and the associated roles in one stage. [35] shows that identifying the action and the argument roles in two stages with separate networks offers performance gains. More recent methods, such as GSRFormer [3] and SituFormer [60], adopt the two-stage approach.
Several studies investigate the use of multimodal data in unimodal event extraction. For example, [49] and [74] retrieve images relevant to the events, which can assist with disambiguation. [25] leverages image captions as distant supervision to interpret events in the associated images. Although they operate on multimodal data, these methods are aimed at events present in one modality.
Multimedia event extraction is proposed to extract events and arguments from multimedia documents [2,26]. [26] tackles image-text documents while [2] focuses on video. WASE [26] uses weakly supervised learning to encode structured representations from textual and visual data into a shared embedding space. [29] introduces contrastive learning to bridge textual and visual modalities. Compared with these studies, our method is the first to directly learn from synthetic multimodal training data, which are generated from labeled unimodal data.

Cross-modality Generative Data Augmentation
Recent advances in generative models have propelled data augmentation research to a new level. On textual tasks, one approach is to generate additional textual training data [10,16,36,37,57,62,69]. Another, multimodal approach is to generate visual data to complement existing textual data [31,33,66,76], which improves performance on textual tasks. For example, [31] generates visual data for machine translation. [76] uses generated images to guide text generation tasks, such as text completion, story generation, and concept-to-text generation. In addition, [33,66] integrate synthetic images into language models to enhance the solution of plain language understanding tasks under low-resource settings. Unlike previous studies that address unimodal problems by synthesizing multimodal data, our work uses the generated data to tackle multimodal tasks. Doing so places a stringent requirement on the quality of generated data, as we need to train encoders in both modalities with the generated data. This necessitates overcoming the domain shifts between generated and real data. To the best of our knowledge, this is the first work to utilize bidirectional cross-modality data generation models for multimodal tasks.

TASK DEFINITION
Let $D = \langle M, S \rangle$ represent a multimedia document, which consists of a set of images $M = \{m_1, m_2, \ldots\}$ and a set of sentences $S = \{s_1, s_2, \ldots\}$. The multimedia event extraction task contains the following two components.
Event Mention Identification: Given a multimedia document $D$, the first goal is to identify a set of event mentions from $D$. An event mention $e$ belongs to one of the predefined event types, $y_e$, and is grounded on a trigger word $w$ or a trigger image $m$ or both. A multimedia event contains both a trigger word $w$ and a trigger image $m$, while a text-only or an image-only event contains only one type of trigger.
Argument Role Identification: The purpose of argument role identification is to extract, from the document $D$, all participants and attributes (i.e., arguments) of a given event $e$. For each event type, there is a predefined list of argument types. Each argument $a$ is classified into one argument type $y_a$ associated with the event type. The argument is grounded on a textual span $t$ in a sentence or on one or more object bounding boxes in the image. The algorithm for argument role identification must also identify the position of the textual span $t$ and the bounding boxes.
If $e$ is a multimedia event, it must be grounded on both a textual trigger and a visual trigger. The arguments of multimedia events could contain both textual spans and visual objects. For example, in Figure 1, the multimedia event Transport-Movement is grounded on both the trigger word "reached" and the trigger image on the left. It also has two textual arguments and one visual argument.
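The task definition above can be read as a small data model. The following is a minimal sketch of one possible representation, not a format prescribed by M2E2 or by this work; the class names, field names, index conventions, and the token index used in the example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


@dataclass
class Argument:
    role: str                                          # argument type, e.g. "Destination"
    text_span: Optional[Tuple[int, int, int]] = None   # (sentence idx, start token, end token)
    boxes: List[Tuple[int, BBox]] = field(default_factory=list)  # (image idx, box)


@dataclass
class EventMention:
    event_type: str
    trigger_word: Optional[Tuple[int, int]] = None     # (sentence idx, token idx)
    trigger_image: Optional[int] = None                # image idx
    arguments: List[Argument] = field(default_factory=list)

    @property
    def modality(self) -> str:
        """Classify the mention by which triggers ground it."""
        if self.trigger_word is not None and self.trigger_image is not None:
            return "multimedia"
        if self.trigger_word is not None:
            return "text-only"
        if self.trigger_image is not None:
            return "image-only"
        raise ValueError("an event mention needs at least one trigger")


# The Figure 1 event, with a made-up token index for "reached".
transport = EventMention("Transport-Movement", trigger_word=(0, 17), trigger_image=0)
print(transport.modality)
```

Keeping both triggers optional lets a single type cover text-only, image-only, and multimedia events, which mirrors how the definition treats the three cases.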

APPROACH
The proposed approach, CAMEL, is trained with multimedia data that are artificially generated from unimodal data (Section 4.1). The cross-modality generative data augmentation approach can be thought of as distilling event-related knowledge from large generative models to the event identification network.
We show an overview of CAMEL in Figure 3. In a dual-encoder architecture, CAMEL first extracts features from the two modalities separately using unimodal encoders. To perform feature fusion and allow the network to pick relevant features among possibly noisy input, we design a modality-shared adapter module that performs cross-attention between the modalities. Further, to cope with possible distribution shifts and learn robust and generalizable features, we employ an iterative and gradual training strategy (Section 4.4). After these steps, we feed the resultant representation to domain-specific classifiers to identify the event mentions and arguments.

Cross-modality Generative Data Augmentation
A major obstacle for multimedia event extraction is the lack of multimodal training data. In the commonly used setup, first proposed by [26], the training data contains event annotations on text (ACE2005 [52]) and event annotations on images (imSitu [68]). The unlabeled VOA [26] dataset is often used as auxiliary training data; it contains parallel image-text data but no event annotations.
To tackle this problem, we utilize large text-to-image and image-to-text generative models to perform cross-modality generative data augmentation. Specifically, to augment the labeled textual data, we generate images using a text-to-image model. In addition, to augment labeled image data, we use an image-to-text model to generate image captions. This procedure yields labeled parallel image-text data. For most of our experiments, we use Stable Diffusion v2.1 [45] for image generation and BLIP [24] for captioning. However, CAMEL can be applied to a range of generative models with little loss in performance, as demonstrated in Section 5.3.
Visual Data Augmentation. We perform visual data augmentation on the labeled textual dataset, ACE2005, which consists of textual news reports. In order to extract textual spans that are relevant to the event, we utilize the existing annotations of event arguments and trigger words. For each event, we find the shortest continuous textual span that includes all arguments and the trigger word, and use that as the textual input to image generation networks. We show one example of the extracted text span in the purple-lined box in Figure 3.
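The prompt extraction step can be illustrated with a short sketch. It assumes that triggers and arguments are given as token offsets with an exclusive end; this offset convention, the function name, and the offsets in the example are illustrative assumptions rather than the ACE2005 annotation format.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # token offsets [start, end), end exclusive (assumed)


def event_prompt(tokens: List[str], trigger: Span, arguments: List[Span]) -> str:
    """Return the shortest contiguous token span covering the trigger and all arguments."""
    spans = [trigger] + list(arguments)
    start = min(s for s, _ in spans)
    end = max(e for _, e in spans)
    return " ".join(tokens[start:end])


tokens = "Nearly 1,200 migrants have died at sea trying to cross the Mediterranean .".split()
# Hypothetical offsets: trigger "died", arguments "migrants" and "the Mediterranean".
prompt = event_prompt(tokens, trigger=(4, 5), arguments=[(2, 3), (10, 12)])
print(prompt)  # migrants have died at sea trying to cross the Mediterranean
```

Because the span must be contiguous, it is fully determined by the leftmost start and the rightmost end among the annotated spans.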
The image generation process is stochastic. Thus, we generate several images for each textual event in ACE2005 to cover different possible visual appearances and spatial arrangements. The number of images is a hyperparameter, which we set to four.
Textual Data Augmentation. To augment the visual dataset imSitu with the textual modality, we utilize an off-the-shelf image-to-text model to generate image captions. To generate diverse and detailed captions, we adopt nucleus sampling [15]. At each time step, the technique iteratively adds the most probable remaining word to the candidate list until the total probability of the candidates exceeds a pre-defined threshold. After that, the probabilities of the candidates are normalized and one word is sampled accordingly. We generate one caption for each image in imSitu.
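The sampling procedure described above can be sketched for a single decoding step as follows. This is a minimal stand-alone illustration over a toy next-word distribution, not the decoding code of BLIP or of any captioning library; the function names and the toy probabilities are assumptions.

```python
import random
from typing import Dict, Optional


def nucleus_candidates(probs: Dict[str, float], p: float = 0.9) -> Dict[str, float]:
    """Keep the most probable words until their total mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        total += prob
        if total >= p:
            break
    return {word: prob / total for word, prob in kept}


def nucleus_sample(probs: Dict[str, float], p: float = 0.9,
                   rng: Optional[random.Random] = None) -> str:
    """Sample one word from the renormalized nucleus."""
    rng = rng or random.Random()
    cand = nucleus_candidates(probs, p)
    return rng.choices(list(cand), weights=list(cand.values()), k=1)[0]


toy = {"boat": 0.5, "truck": 0.3, "trailer": 0.15, "sign": 0.05}
nucleus = nucleus_candidates(toy, p=0.9)   # "sign" falls outside the nucleus
word = nucleus_sample(toy, p=0.9, rng=random.Random(0))
```

Truncating the low-probability tail before sampling keeps captions diverse while avoiding very unlikely words.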

Model Architecture
Feature Extractors. CAMEL utilizes two pretrained Transformer encoders to extract unimodal features separately. Using the hidden states of the last network layer, the text encoder obtains a $d$-dimensional vector representation $h^{\text{text}}_i$ for each word $w_i$. Similarly, each patch of the image is encoded into a $d$-dimensional vector $h^{\text{img}}_j$. We denote the set of all text representations as $H^{\text{text}}$ and the set of all visual representations as $H^{\text{img}}$. We also prepend CLS tokens to the input of the two encoders. The corresponding encodings $h^{\text{text}}_{\text{CLS}}$ and $h^{\text{img}}_{\text{CLS}}$ can be understood as representing information from the entire sentence or image.
Feature Fusion. We devise a cross-attention network module, commonly used in Transformer networks, to fuse textual and visual features. The network consists of multi-head cross attention, layer normalization, and some linear layers. The detailed architecture is shown in Figure 4.
We refer to this module as the Adapter network. For simplicity, we denote the input to the Adapter as the query vector $q$, the key matrix $K$, and the value matrix $V$. The overall network is denoted as the function $f = \mathrm{Adapter}(q, K, V)$. We make repeated use of the same Adapter module in the identification of event mentions and arguments, but change $q$, $K$, and $V$ depending on the exact task. Most parameters are shared across tasks. However, the parameters in the linear projection layer are specific to each of the four tasks (textual event mention, textual argument role, visual event mention, visual argument role).
The design of the Adapter module is motivated by the characteristics of multimedia documents, which usually do not explicitly indicate the correspondence between images and the main text. When we try to identify a textual event and its arguments, we do not know which image is relevant to this event. The cross-attention mechanism allows the network to distinguish relevant images. Similarly, when extracting visual events, the network relies on the Adapter to select relevant portions of the text to facilitate its prediction.
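To make the selection behaviour of cross-attention concrete, the sketch below implements only its core step: a query from one modality attends over encodings from the other. It is a deliberately simplified stand-in, not the Adapter itself: it uses a single head and omits the learned projections, layer normalization, and linear layers, and all names and toy vectors are assumptions.

```python
import math
from typing import List

Vector = List[float]


def cross_attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention of one query over keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]          # softmax over the candidates
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# A word encoding queries the CLS encodings of two images (toy 2-d vectors).
word_encoding = [1.0, 0.0]
image_cls = [[10.0, 0.0], [0.0, 10.0]]
fused = cross_attention(word_encoding, image_cls, image_cls)
classifier_input = word_encoding + fused        # concatenation fed to a linear classifier
```

In this toy case nearly all attention weight falls on the first image, which is the behaviour the design relies on when only some images in a document are relevant to a given word.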
Textual Event Extraction. The first sub-problem in textual event extraction is to identify the trigger word. This is word-level classification: the trigger word should be classified into the correct event type, whereas other words should be classified as non-triggers.
For the classification of the $i$-th word, we first take its encoding from the textual encoder, $h^{\text{text}}_i$. After that, we feed $h^{\text{text}}_i$ to the Adapter network as the query vector. We use the CLS token encodings of all images in the entire multimedia document, denoted as $H^{\text{all-img}}$, as $K$ and $V$ in cross attention, which gives $f^{\text{text}}_i = \mathrm{Adapter}(h^{\text{text}}_i, H^{\text{all-img}}, H^{\text{all-img}})$.
After that, we concatenate $h^{\text{text}}_i$ and $f^{\text{text}}_i$ and feed them through a linear classifier. The loss is cross-entropy.
The second sub-problem is the identification of event arguments. Following the convention in the literature [26,29], we use the ground-truth list of entities for both training and inference. Each entity is a textual span that describes a person, an organization, a location and so on. We take the encoding of the first word in that entity as the entity feature $h^{\text{text-ent}}$, and feed it to the Adapter to obtain $f^{\text{text-ent}} = \mathrm{Adapter}(h^{\text{text-ent}}, H^{\text{all-img}}, H^{\text{all-img}})$. We then concatenate $h^{\text{text-ent}}$, $f^{\text{text-ent}}$, and the textual encoding of the trigger word, and feed them through a linear classifier, which classifies the entity into the argument classes. Though the types of valid arguments change depending on the event, here we do not exploit this fact for further performance improvement.
During training, if the ACE2005 sentence contains an event, we generate several positive images from the event text prompt (see Section 4.1). In addition, we also include some negative images generated for other events in $H^{\text{all-img}}$. This trains the network to distinguish between relevant images and irrelevant images. However, if the ACE2005 sentence does not contain any event, we would not be able to extract the event prompt using the method in Section 4.1. Instead, we randomly sample generated images from other text and use their encodings as $H^{\text{all-img}}$.
Visual Event Extraction. Similar to the textual modality, visual event extraction has two sub-problems: the classification of images into event types or non-events, and the identification of objects as event arguments. For image event classification, we take the encoding of the image CLS token, $h^{\text{img}}_{\text{CLS}}$. Using the Adapter network again, we acquire an aggregated feature from the text modality, which we denote as $f^{\text{img}} = \mathrm{Adapter}(h^{\text{img}}_{\text{CLS}}, H^{\text{all-text}}, H^{\text{all-text}})$, where the matrix $H^{\text{all-text}}$ contains the textual CLS token encodings of all sentences in the same batch. We feed the concatenation of $h^{\text{img}}_{\text{CLS}}$ and $f^{\text{img}}$ to a linear classifier.
For the second sub-problem, event argument identification, we first extract all objects in an image using an off-the-shelf object detector. For each object bounding box, we identify the three patches that contain its top-left corner, its center, and its bottom-right corner respectively. After that, we take the average of the three patch encodings, which we denote as $h^{\text{img-obj}}$. The object feature extraction process is illustrated in Figure 5. Once again, we apply feature fusion using the Adapter network to obtain $f^{\text{img-obj}} = \mathrm{Adapter}(h^{\text{img-obj}}, H^{\text{all-text}}, H^{\text{all-text}})$. Finally, we concatenate three feature vectors, $h^{\text{img-obj}}$, $f^{\text{img-obj}}$, and $h^{\text{img}}$, into a single vector and feed it through a linear classifier.
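The object feature extraction step can be sketched as follows. The sketch assumes a square input image, patches stored in row-major order without the CLS token, and boxes given in pixel coordinates of the resized image; the default image size of 224 and all names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed


def patch_index(x: float, y: float, image_size: int, patch_size: int) -> int:
    """Row-major index of the patch containing pixel (x, y)."""
    per_side = image_size // patch_size
    col = min(int(x) // patch_size, per_side - 1)   # clamp points on the far border
    row = min(int(y) // patch_size, per_side - 1)
    return row * per_side + col


def object_feature(box: Box, patch_encodings: Sequence[Sequence[float]],
                   image_size: int = 224, patch_size: int = 16) -> List[float]:
    """Average the encodings of the patches at the box's top-left, center, bottom-right."""
    x0, y0, x1, y1 = box
    points = [(x0, y0), ((x0 + x1) / 2, (y0 + y1) / 2), (x1, y1)]
    idx = [patch_index(x, y, image_size, patch_size) for x, y in points]
    dim = len(patch_encodings[0])
    return [sum(patch_encodings[i][d] for i in idx) / 3 for d in range(dim)]


# Toy 32x32 image with a 2x2 patch grid and 1-d patch encodings.
toy_patches = [[0.0], [1.0], [2.0], [3.0]]
feature = object_feature((0, 0, 32, 32), toy_patches, image_size=32, patch_size=16)
```

In the toy example the center and the bottom-right corner fall in the same patch, so that patch is counted twice in the average; the sketch keeps this behaviour rather than deduplicating.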

Multimedia Event Extraction
For multimedia events, we need to resolve the coreference between text events and visual events. Given a multimedia document, we compute the similarity of each sentence-image pair. Following [26,29], we treat a textual event and a visual event as the same event if and only if they have the same event type and the similarity between the sentence and image is greater than a threshold. We calculate the cosine similarity of the sentence-image pair using the CLIP model [40]. The multimedia event inherits all textual event arguments and visual event arguments as its own arguments.
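The linking rule above can be written down in a few lines. In the sketch below, the embeddings are toy vectors standing in for CLIP sentence and image embeddings, and the threshold is left as a parameter because its value is not stated here; the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Event = Tuple[str, Sequence[float]]  # (event type, embedding of its sentence or image)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_events(text_events: List[Event], image_events: List[Event],
                threshold: float) -> List[Tuple[int, int]]:
    """Pair a textual and a visual event iff their types match and similarity > threshold."""
    links = []
    for i, (t_type, t_emb) in enumerate(text_events):
        for j, (v_type, v_emb) in enumerate(image_events):
            if t_type == v_type and cosine(t_emb, v_emb) > threshold:
                links.append((i, j))
    return links


text_events = [("Transport-Movement", [1.0, 0.0]), ("Life-Die", [0.0, 1.0])]
image_events = [("Transport-Movement", [0.9, 0.1]), ("Contact-Meet", [0.0, 1.0])]
links = link_events(text_events, image_events, threshold=0.5)
```

Only the Transport-Movement pair is linked in this example; the Life-Die sentence and the Contact-Meet image have identical toy embeddings but different event types, so they remain separate events.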

Training Strategy
Robust representation learning is key to the success of cross-modality data augmentation and multimedia event extraction. As discussed in the introduction and shown in Figure 2, the automatically generated multimodal data often contain noise, such as inconsistency with the event label, hallucination, unnatural image artifacts, and so on. The discrepancy between the generated data distribution and the real-world data distribution may cause generalization difficulties. In addition, the M2E2 task itself poses a transfer learning problem because the training data, ACE2005 and imSitu, have different distributions from the test set. Hence, we need to learn robust feature representations that generalize well.
We propose an iterative and gradual training strategy, shown in the right column of Figure 3. We divide the training into three stages. In the first stage, we first train on visual event mention identification, followed by textual event mention identification. The separation is a simple method to alleviate the well-known problem that different modalities learn at different speeds [56]. In the first stage, all network parameters are trained except the feature extractor corresponding to the generated synthetic data. For example, when training on real text data and generated image data, the text encoder is trainable but the image encoder is frozen. The rationale is to prevent the gigantic feature extractors (with hundreds of millions of parameters) from overfitting the low-level feature distributions of the augmented training data, which are likely idiosyncratic (e.g., soldiers with three hands) and not generalizable. However, we postulate that the high-level features extracted by the encoders are not heavily affected by shifts in lower-level feature distributions, so we train all the parameters after the encoders.
In the second stage, we again train the network on visual event mention identification, followed by textual event mention identification. Both encoders are frozen and only the Adapter and classifiers are trained. The design rationale is to allow visual classifiers to adapt to changes in the textual encoder in the first stage, and vice versa. In the third stage, we freeze all network parameters but finetune the visual event mention classifier using balanced event data. This technique mitigates the negative effects of imbalanced class proportions in the visual event mentions [19]. Finally, we separately finetune the visual encoder for visual event argument identification, and finetune the text encoder for textual event argument identification. This creates two models specialized for argument identification.
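The three stages of event mention training can be summarized as a freezing schedule. The sketch below encodes the schedule as plain data; the module names are hypothetical labels for the components described above, and the final argument finetuning step is left out because the text does not spell out which of the remaining modules stay frozen there.

```python
from typing import FrozenSet, List, NamedTuple

# Hypothetical module names for the components of the network.
MODULES = ("text_encoder", "image_encoder", "adapter",
           "textual_mention_classifier", "visual_mention_classifier")
ENCODERS = frozenset({"text_encoder", "image_encoder"})


class Step(NamedTuple):
    stage: int
    task: str
    frozen: FrozenSet[str]


SCHEDULE = [
    # Stage 1: freeze only the encoder that would see synthetic input.
    Step(1, "visual event mention", frozenset({"text_encoder"})),
    Step(1, "textual event mention", frozenset({"image_encoder"})),
    # Stage 2: both encoders frozen; Adapter and classifiers keep training.
    Step(2, "visual event mention", ENCODERS),
    Step(2, "textual event mention", ENCODERS),
    # Stage 3: only the visual event mention classifier is finetuned on balanced data.
    Step(3, "visual event mention (balanced data)",
         frozenset(MODULES) - {"visual_mention_classifier"}),
]


def trainable(step: Step) -> List[str]:
    """Modules whose parameters may be updated in this step."""
    return [m for m in MODULES if m not in step.frozen]


for step in SCHEDULE:
    print(step.stage, step.task, trainable(step))
```

Note that a classifier listed as trainable in a step for the other modality receives no gradient in practice, since it is not part of that task's loss. Read top to bottom, the schedule shows the set of trainable modules shrinking from the encoders upward.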

EXPERIMENTS
In this section, we extensively evaluate CAMEL by comparing against existing SOTA approaches, against ablated versions of CAMEL, and against different choices for the image generators and image captioning networks.

Experimental Setting
Datasets and Evaluation. We evaluate on the M2E2 benchmark, a large-scale multimedia event extraction dataset with 8 event types and 15 argument types. It contains 245 multimedia documents with 6,167 sentences and 1,014 images. There are 1,297 textual events and 391 visual events, among which 192 textual event mentions and 203 visual event mentions are aligned into 309 multimedia events.
Since M2E2 does not provide training data, we follow previous work [26,29] and use ACE2005 [52] and imSitu [68] (with the grounding information from [39]) for training. ACE2005 is a text dataset annotated with 33 event types, which include the 8 event types in M2E2. The image dataset imSitu is annotated with 504 activity verbs and 1,788 semantic roles. To utilize this dataset for 8-class classification, we follow [26] and map 98 activity verbs to the 8 event types of M2E2. Following previous works on event extraction [26,30,63], we use precision (P), recall (R), and F1 score (F1) as the default evaluation metrics.
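For reference, the three metrics can be computed from sets of predicted and gold items as in the sketch below. Treating each prediction as a hashable tuple that must exactly match a gold item is an assumption of this sketch; the benchmark's precise matching criteria are not restated here.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(predicted: Set[Hashable], gold: Set[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 under exact matching of predicted and gold items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: items are (sentence index, token index, event type) tuples.
predicted = {(0, 17, "Transport-Movement"), (1, 4, "Life-Die"),
             (2, 3, "Conflict-Attack"), (3, 8, "Contact-Meet")}
gold = {(0, 17, "Transport-Movement"), (1, 4, "Life-Die"), (4, 2, "Justice-Arrest")}
p, r, f1 = precision_recall_f1(predicted, gold)
```

Here two of four predictions are correct and two of three gold items are found, so precision is 1/2, recall is 2/3, and F1 is 4/7.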
Baselines. Following [29], we compare CAMEL with eight baselines for multimodal or unimodal event extraction.
Multimodal event extraction techniques can extract both textual events and visual events. WASE [26] first trains on different modalities independently and uses weakly supervised learning to align the two modalities. Two variations exist: WASE att locates the visual arguments using an attention heat map, whereas WASE obj leverages an object detection model. Flat att and Flat obj [26] are the simplified versions of WASE att and WASE obj respectively; they remove the graph convolution networks and concatenate features of different modalities for classification. UniCL [29] is the state of the art on M2E2; it incorporates visual knowledge into textual event extraction but uses two separate modality-specific models for event extraction.
Unimodal event extraction methods extract only textual or only visual events, but not both. JMEE [30] is a state-of-the-art textual event extraction technique which utilizes an attention-based Graph Convolution Network. GAIL [73] is a reinforcement learning method for textual event extraction where rewards are estimated by a Generative Adversarial Network. VAD [74] augments textual documents with images retrieved from the Internet to improve textual event extraction. Clip-Event [25] utilizes the pretrained CLIP network to perform visual event extraction. WASE-T and WASE-V are the WASE models trained on ACE2005 only and on imSitu only, respectively. The latter has two further variations, WASE-V att and WASE-V obj [26].
Hyperparameters. During cross-modality data augmentation, for each event in ACE2005, we perform one-time generation of 4 images at 512×512 resolution with 100 denoising steps. In addition, we use nucleus (top-$p$) sampling [15] for image captioning with a probability threshold $p$ of 0.9. We generate one caption for each original image.
For fair comparisons with the SOTA baseline [29], we use the same 12-layer BERT Large as the text encoder, and the same 12-layer Transformer CLIP model [18] as the visual encoder with 16×16 patch size. To detect objects for visual argument roles, we leverage the pretrained YOLOv8 [50] as the object detector.
When training on visual event extraction, our batch size is set to 64 and the learning rate to $10^{-4}$. For textual event extraction, the batch size is set to 10 and the learning rate to $10^{-4}$. We employ the AdamW optimizer [32] with a weight decay coefficient of $10^{-2}$ and the cosine learning rate schedule. In the first round of training, we train on the visual modality for 10 epochs and on the textual modality for 5 epochs. In the remainder of training, only one epoch is used for any modality. The maximum text input length is 200.
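As an illustration of the learning rate schedule, the sketch below computes a cosine-decayed rate starting from the base rate of $10^{-4}$. The absence of warmup, the decay to zero, and the step count in the example are assumptions, since only the schedule type is stated.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 1,000 optimizer steps.
rates = [cosine_lr(s, 1000) for s in (0, 500, 1000)]
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches the minimum at the end of training.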

Main Results
Table 1 presents the performance of our proposed method CAMEL and several state-of-the-art baselines. The results show that CAMEL significantly improves the event extraction performance over baseline methods. On textual events, we surpass UniCL by 1.7% F1 for event mention and 0.4% F1 for argument role. On visual events, we surpass UniCL by 0.9% F1 for event mention and 9.2% F1 for argument role. We speculate that the improvement on textual argument roles is relatively small because some textual arguments are pronouns (e.g., she) or proper nouns (e.g., Saudi Arabia), which are not straightforward for the image generators to visualize.
Interestingly, the biggest performance boost appears on multimedia event extraction. We outperform the prior SOTA by 4.2% F1 on event mention identification and by 9.8% F1 on argument identification. This suggests CAMEL effectively learns synergistic representations from the two modalities.

Ablation Study
In order to investigate the effects of different components in CAMEL, we create ablated systems by removing each of the components. First, we create two variations in the training strategy. In the combined training baseline, we merge the textual event task and the visual event task as one training set and train the model in one stage without freezing any model parameters. In the one-round training baseline, we separate the training of the textual event task and the visual event task. We freeze the visual encoder when training on real textual data and generated visual data, and freeze the textual encoder when training on real visual data and generated textual data. However, we only apply one stage of training and remove the two later stages.
Next, in the w/o augmentation baseline, we remove all generated multimodal training data and train the network on unimodal data alone. For example, in textual event mention identification, we train the textual encoder and the classifier; the Adapter is removed as well. Finally, the w/o Adapter ablation retains multimodal training data but removes the Adapter network. The cross-attention scores are computed as cosine similarity. For example, in text mention identification, we compute the cosine similarity between each word encoding $h^{\text{text}}_i$ and the visual image encoding $h^{\text{img}}_{\text{CLS}}$. The similarity scores are normalized and used to compute a convex combination of image features, denoted as $f^{\text{text}}_i$. The concatenation of $h^{\text{text}}_i$ and $f^{\text{text}}_i$ is used for classification. The results are shown in Table 2. The most interesting finding is that the w/o augmentation, unimodal baseline outperforms the simplistic combined training strategy by large margins (up to 12.9% F1 on multimedia event mentions). This clearly demonstrates the

Choice of Generative Models
We test whether CAMEL can work with other large pretrained generative models. By default, CAMEL leverages Stable Diffusion v2.1 [45] as the image generator and BLIP [24] as the image captioning model. In this experiment, we test three different image generators: Stable Diffusion v1.5 and v2 (SDv1.5 and SDv2) and the Kandinsky model [1]. For image captioning, we try BLIPv2 [23], GIT [53], OFA [54], and VIT-GPT2 [21]. Table 3 shows the results. We observe that, while the default setting works well, it often does not achieve the best F1 scores compared to other combinations. In addition, many model combinations outperform UniCL, the previous SOTA model. This demonstrates the generality of the CAMEL technique.

CONCLUSIONS
In this paper, we study the problem of multimedia event extraction and investigate the use of image generative networks and image captioning networks to complement existing unimodal training data. The automatically generated multimodal data often contain noise, such as inconsistency with the event label, hallucination, and unnatural image artifacts, creating challenges for training. We propose a network, CAMEL, and a specialized training strategy to cope with augmented multimodal training data. CAMEL surpasses the prior SOTA by 4.2% F1 on event mention identification and by 9.8% F1 on argument identification. An ablation study shows that the design of the network structure, the shared adapter, and the iterative training strategy in our method significantly improve performance. We also test the generality of the benefits of our approach to other cross-modality generative models.
Figure 1: Multimedia Event Extraction: three events are extracted from a multimedia news article. The multimedia event (Green) Transport-Movement is triggered by the word 'reached' and the image on the left. The textual event Life-Die (Orange) is triggered by the word 'died' only, and the visual event Contact-Meet (Purple) is solely triggered by the image on the right.
Annotated text in Figure 1: According to the United Nations, more than 43,000 migrants, mostly from sub-Saharan Africa [Arg: Origin], have reached [Trigger: Transport-Movement] European shores [Arg: Destination] this year from Libya and a surge in migrant crossings is expected this spring and summer. Nearly 1,200 migrants [Arg: Victim] have died [Trigger: Life-Die] at sea trying to cross the Mediterranean [Arg: Place]. They also want for the EU to help support economic development in communities along the border.
Figure 2: Examples of cross-modality augmented data. The red boxes indicate noise in the generated data, including inconsistency with the event label, hallucination, and unnatural image artifacts.
Text recovered from the Figure 2 panels. Text prompts with event labels: "we go to war in Iraq" (Conflict-Attack); "I'm chewing gum and talking on the phone" (Contact-PhoneWrite); "a 30-foot Cuban patrol boat with four heavily armed men landed on American shores" (Movement-Transport). Generated captions: "there are two men in the group holding signs"; "there is a white truck towing a boat into a trailer"; "there is a person giving some money at the cash register".

Figure 3: An overview of the CAMEL network architecture and its training strategy.

Figure 5: Extracting features for objects in images.


Table 1: Main results on event mention and argument role extraction for three types of events.

Table 2: Ablation results of CAMEL on the M2E2 dataset.