Abstract
As a crucial part of natural language processing, the event-centered commonsense inference task has attracted increasing attention. Given an observed event, the intentions and reactions of the people involved in the event must be inferred with artificial intelligence algorithms. To solve this problem, sequence-to-sequence methods have been widely studied, where the event is first encoded into a specific representation and then decoded to generate the results. However, all existing methods learn the event representation only from textual information, ignoring the visual information that is actually helpful for commonsense inference. In this article, we first define a new task of multi-modal commonsense inference with both textual and visual information. A new event-centered multi-modal dataset is also provided. We then propose a multi-source knowledge reasoning graph network to solve this task, where three kinds of relational knowledge are considered. Multi-modal correlations are learned to obtain the event’s multi-modal representation from a global perspective. Intra-event object relations are explored to capture fine-grained event features with an object graph. Inter-event semantic relations are also explored through external knowledge to understand the semantic associations among events with an event graph. We conduct extensive experiments on the new dataset, and the results show the effectiveness of our method.
1 INTRODUCTION
Recently, event-centered commonsense inference has attracted increasing attention. Given a daily event, the commonsense inference task aims to reason about the intentions and reactions of the people involved in the event. For example, if the event “\(X\) votes for \(Y\)” is observed, we can infer the most plausible facts about the event. In terms of the intention, \(X\) may want to give support to \(Y\). As for the reactions to the event, \(X\) is likely to feel proud and \(Y\) is likely to feel grateful. It is natural for humans to acquire this inference ability on the foundation of commonsense knowledge learned throughout their lives, with which people can easily understand a movie or a story spanning several months. It is also necessary for intelligent systems to reason like humans. For example, an ideal dialogue system should give a logical response that matches the user’s thoughts and emotions based on his/her experiences. A story generation system must understand the cause and effect of the leading context in order to generate a reasonable story. However, it is still challenging for today’s AI systems to obtain such inference ability. This may be because most existing AI systems learn on task-related datasets in a data-driven manner, which leads to models that are effective at finding task-specific correlations but have limited capability in simple and explainable commonsense reasoning.
To support research on event-related commonsense inference, Rashkin et al. [36] proposed the Event2mind dataset, which focuses on modeling the stereotypical intentions and reactions of people. Given an event described in short free-form text, this dataset annotates the likely intentions and emotional reactions of the people who cause or are affected by the event. After that, Sap et al. [37] proposed the ATOMIC dataset, which aims to conduct if-then reasoning. This dataset expands the three tasks in Event2mind to nine tasks that describe the causes, effects, intentions, and participant characteristics of the event. Based on these datasets, much work has been done on commonsense inference. Rashkin et al. [36] treated it as a multi-task problem and proposed an encoder-decoder framework, where one encoder and three decoders are adopted at the same time to infer the intentions and reactions of the involved people. Sap et al. [37] adopted a similar approach to resolve the nine sub-tasks. In addition to the information in the dataset, external knowledge has also been considered to help the inference. Bosselut et al. [5] took advantage of transfer learning to generate explicit knowledge based on the implicit knowledge from deep pre-trained language models. Du et al. [9] proposed a context-aware variational autoencoder to learn background information to guide the reasoning.
Although much progress has been made, there is still room for improvement in the commonsense inference task. One weakness of existing methods is that they focus on learning a better representation for the event by either choosing a proper textual encoding method or introducing more external knowledge, while the visual information of the event is not considered [49]. For humans, it is easy to imagine the visual scenes related to the event. These scenes carry important visual clues for event-related commonsense reasoning, such as where the event happens and whether other objects are related to the event. Compared with a single textual sentence, the visual information provides complementary descriptions of the event, which is beneficial to the inference. For example, as shown in Figure 1, “cup” and “bowl” can be detected from the images of “Person\(X\) sits at the kitchen table”. These objects can be easily connected to “eat” and “drink” with commonsense knowledge, which is helpful for predicting Person\(X\)’s intention “to eat” or “to drink a cup of coffee”.
Fig. 1. Illustration on the importance of visual information in commonsense inference.
Unlike existing methods, in this article, we propose a new task of multi-modal commonsense inference, which aims to reason out the intention of and reaction to the event based on the multi-modal input of a textual description and images. To solve the new task, we propose a multi-source knowledge reasoning graph network, which explores three kinds of relational knowledge to conduct the multi-modal commonsense inference. In the first branch of the trident framework, we extract the text feature and visual feature of the event and combine them to get the multi-modal representation, by which the multi-modal correlations between the visual event and textual event are captured from a global perspective. In the second branch of the trident framework, we build an object graph based on the event-related objects and learn the intra-event object relations with graph neural networks. Objects are the vital units of events, reflecting various information about the event-related persons and scenes. Learning their relations can promote the exploration of fine-grained features, thus benefiting the event representation and understanding. In the third branch of the trident framework, we build an event graph with external knowledge to learn the inter-event semantic relations. The learning of event relations brings more background knowledge to assist the inference process.
The contributions of our work can be summarized as follows:
• We define a new problem of multi-modal commonsense inference with the textual description and visual images of an event as input. We also provide a new event-centered multi-modal dataset to support research on this new task.
• To solve the new multi-modal commonsense inference task, we propose a multi-source knowledge reasoning graph network, where three kinds of relational knowledge are considered. Multi-modal correlations are learned to get the event’s multi-modal representation from a global perspective. Intra-event object relations are explored to capture the fine-grained event feature with an object graph. Inter-event semantic relations are also explored through external knowledge to understand the semantic associations among events with an event graph.
• We conduct extensive experiments on the proposed dataset, and the results demonstrate the effectiveness of our method.
2 RELATED WORK
2.1 Event-Centered Commonsense Inference
Event-centered commonsense inference aims to understand the relations among events and conduct event prediction based on commonsense [16, 17, 46, 48]. It is of vital importance for the development of artificial intelligence and can be used in many advanced intelligent systems. In terms of the relation type, the commonsense inference task can be divided into two directions: temporal-relation-based inference and causal-relation-based inference.
Temporal-relation-based inference tries to understand temporal relations between events and complete related event prediction tasks, such as script event prediction, which requires the model to choose the correct subsequent event among candidates [45]. To deal with this problem, Granroth et al. [13] proposed a model that simultaneously learns the event representation and predicts the strength of association between events. Pichotta et al. [33] input the event sequence into an LSTM to predict the probability of the next event. However, the above methods make predictions based on either event pairs or event chains, which do not fully utilize the dense event connections. As an improvement, Li et al. [24] constructed a narrative event evolutionary graph based on event chains to describe event evolutionary principles and patterns. They also proposed a scaled graph neural network to model the event interactions and thus learn better representations for the events.
In comparison, causal-relation-based inference aims to learn the causes and effects of the event and thus make logical predictions based on them. Mostafazadeh et al. [29] proposed the ROCStories corpus as well as a new task called the “Story Cloze Test”, which requires the model to choose the correct ending for a story. This task requires understanding the story context, and the generated ending should be logically consistent with the context. Based on the same corpus, Rashkin et al. [35] proposed to reason about the causes and effects of the mental state changes of characters in the story, which can be explained as the psychology of story characters. Rashkin et al. [36] further proposed to conduct commonsense inference by learning the intents and reactions for a given event. They released a dataset called Event2mind, which has 25,000 event phrases covering a wide range of everyday events and situations. The likely intents and reactions of the participants are predicted with an encoder-decoder framework. Event2mind was then extended into nine sub-tasks for “if-then” reasoning in the ATOMIC dataset [37], which contains 877K textual descriptions of inferential knowledge, where the wants, intents, needs, effects, reactions, and attributes of the participants are all annotated and required to be predicted.
Based on Event2mind and ATOMIC, many models have been proposed to solve the inference task. Rashkin et al. [36] tried different encoders, such as CNNs and RNNs, to get a proper representation of the input event. Considering the advantage of self-attention [42] in modeling long-term dependencies and its successful use in many NLP tasks, Bosselut et al. [5] proposed a commonsense transformer to generate diverse commonsense descriptions. However, the above methods ignore the relations between different commonsense sub-tasks (xIntent, xReaction, oReaction, and the like), which makes the generated results logically inconsistent and unreasonable. To solve these issues, Yuan et al. [50] proposed a Chain Transformer where a chain of decoders is used to reason on different sub-tasks following the logical chain. Although these methods have achieved significant progress, they conduct inference on the textual event, ignoring its visual representation in images. Besides, the event is usually encoded from an integral perspective, which loses fine-grained information. The difference between our method and existing methods lies in two aspects. First, we take advantage of both textual and visual information to learn the multi-modal representation of the event, while existing methods only utilize the textual information. Second, we learn both the intra-event object relations and inter-event semantic relations, which are not considered by other methods.
2.2 Commonsense Knowledge Graphs
Commonsense knowledge plays an important role in inference- and reasoning-related tasks since it contains abundant relationships whose exploration can benefit the understanding of the context and thus help the reasoning. Many representative knowledge bases have been used in related tasks, such as OpenCyc, ConceptNet, and ASER. OpenCyc [1] is a large knowledge base which consists of 239,000 concepts and 2,039,000 facts in LISP-style logic. ConceptNet [41] represents commonsense knowledge with a graph where the nodes are concepts and the edges are relations from a fixed type set. Similar knowledge graphs include DBpedia, NELL, and so on. These knowledge graphs are mainly constructed on the concepts, categories, and properties of things or objects.
They support research on many semantic understanding and reasoning tasks. For example, Lukovnikov et al. [25] and Huang et al. [18] use the facts in knowledge graphs to complete question answering. Wang et al. [44] propose a knowledge graph network for recommendation, which explicitly models the high-order connectivities in the KG.
However, these knowledge graphs can be insufficient in tasks where not only knowledge about things or objects but also knowledge about activities and events is required. As a result, event-centered knowledge graphs have also been widely studied in recent years. FrameNet [4] is the earliest such knowledge graph, defining 27,691 events and their relations. PropBank [31] and NomBank [27] further extend the size of event KGs, with 112,917 and 114,576 events, respectively. Up to now, the largest event KG, ASER [51], has reached the scale of 194 million events. It is extracted from 11 billion tokens of unstructured textual data. It contains multiple activities, states, events, and relations, which can be used in many real-world applications, such as pronoun resolution.
Apart from the direct use of knowledge bases, pre-training on an external knowledge corpus provides another way to obtain commonsense knowledge. Bosselut et al. [5] demonstrate that implicit knowledge from deep pre-trained language models can be transferred to generate explicit knowledge in a commonsense knowledge graph. They propose a Commonsense Transformer which is trained on existing knowledge tuples to generate new knowledge based on the learned representation. Du et al. [9] propose a context-aware variational autoencoder to effectively learn background knowledge and thus guide the if-then reasoning. Specifically, they train the model on an auxiliary dataset, which contains three narrative story corpora and rich event background knowledge. The model is then trained on the task-specific dataset to adapt the background knowledge to the commonsense inference task. The difference between our method and existing methods mainly lies in the way we utilize the commonsense knowledge. Here we extract the event-related sub-graphs from KGs and use GNNs to learn their representations to help the inference, while in other methods commonsense knowledge is used to pre-train the networks.
2.3 Graph Neural Network
Recently, Graph Neural Networks (GNNs) have shown great advantages in processing structural data [3, 22, 47, 52]. According to the calculation form, GNNs can be categorized into four groups: Recurrent GNNs (RecGNNs), Convolutional GNNs (ConvGNNs), Graph AutoEncoders (GAEs), and Spatial-Temporal GNNs (STGNNs). As the early work on GNNs, RecGNNs [11, 38] apply recurrent architectures to learn node representations, where message passing is conducted repeatedly with nodes’ neighborhoods until the node representations are stable. Then, inspired by the success of CNNs, the convolution operation was also introduced to graph data in both spectral [8, 15, 21] and spatial ways [2, 12]. The spectral approaches adapt spectral graph theory to design a graph convolution, while the spatial approaches inherit the message passing idea of RecGNNs but differ in getting node representations by stacking multiple convolutional layers. GAEs [6, 43] are used to learn graph embeddings by reconstructing structural information such as the adjacency matrix of the graph. STGNNs [20, 23, 39] aim to model both the spatial and temporal dependencies of data and learn representations of spatial-temporal graphs, which have advantages in related tasks such as human action recognition.
There has been much work on utilizing graph neural networks to conduct reasoning tasks. For example, Chen et al. [7] introduced a Global Reasoning unit that implements relation reasoning via graph convolution on a graph in the interaction space. The unit can be easily plugged into CNNs and used for multiple tasks, such as image classification and semantic segmentation. Lv et al. [26] performed commonsense question answering with graph-based reasoning, where evidence is extracted from a heterogeneous knowledge source and used to build graphs. The graphs are then learned with GCNs to predict the final answer. Huang et al. [19] proposed a Graph-based Temporal Reasoning Module to learn the relations among multiple action segments and thus complete action segmentation. Each action segment is represented as a node and the relations are learned with two GCNs. From the above work, we can conclude that using GNNs to learn the relations among nodes is an effective way to capture the interaction structure of a graph. Therefore, in this article, we also propose to utilize GNNs to model the relations among events and thus complete the commonsense inference task.
3 METHOD
In this article, we propose a multi-source knowledge reasoning graph network to solve the multi-modal commonsense inference task. As shown in Figure 2, three branches are adopted to obtain different kinds of knowledge. In the first branch, text representation and visual representation are considered simultaneously to obtain the multi-modal representation for the input event. Then the intra-event object relations are learned with the object graph through GNN. Finally, the inter-event semantic relations are explored with the event graph built based on the external knowledge graph and learned with GNN.
Fig. 2. Overview of the proposed multi-source knowledge reasoning graph network.
3.1 Multimodal Representation
In this section, we introduce how to jointly utilize the text and images of the event to get its multi-modal representation. Note that the visual modality can provide complementary information to the text modality. Taking “PersonX goes to the restaurant” as an example, the text feature can capture some semantic information of the words based on their embeddings. However, the text feature cannot capture the environment in which the event happens or the specific appearance of the “restaurant”. Compared with text, images can describe the event more intuitively and comprehensively, and are thus beneficial for the event representation.
More specifically, we first extract the text feature and visual feature, respectively, and then combine them to get the multi-modal representation. For the text feature, each word in the event sentence is represented with an embedding. The event \(e\) can be noted as \(\langle e_1,e_2,\ldots ,e_n\rangle \in \mathbb{R}^{n \times d}\), where \(e_1, \ldots , e_n\) are embeddings of words in the event and \(n\) is the word number. Then, considering the great performance of the Transformer in many NLP tasks, such as machine translation, we utilize a Transformer to encode the event sentence. The Transformer encoder is composed of a stack of blocks, and each block contains two layers: a multi-head self-attention layer and a feed-forward layer. The event is input into the multi-head self-attention layer to capture the dependencies among words, and the output is further fed into the feed-forward network and layer normalization: (1) \(\begin{equation} x^e = MultiHead(e), \end{equation}\) (2) \(\begin{equation} e^{\prime } = LayerNorm(x^e + FFN(x^e)), \end{equation}\) where \(e \in \mathbb{R}^{n \times d}\) is the event matrix composed of word embeddings, \(MultiHead\) is the multi-head self-attention layer, \(FFN\) is the feed-forward network introduced in [42], and \(LayerNorm\) represents layer normalization. After the encoding, the event can be represented as \(e^{\prime } \in \mathbb{R}^{n \times d_{model}}\).
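The encoder computation in Eqs. (1)-(2) can be sketched as follows. This is a single-head, NumPy-only simplification with random stand-in weights; the actual model uses multi-head attention with learned parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (one row per token) to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(e, Wq, Wk, Wv):
    # Scaled dot-product self-attention over the n word embeddings
    # (single-head simplification of MultiHead in Eq. (1)).
    q, k, v = e @ Wq, e @ Wk, e @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def encoder_block(e, p):
    # x^e = Attention(e);  e' = LayerNorm(x^e + FFN(x^e))   (Eqs. (1)-(2))
    x = self_attention(e, p["Wq"], p["Wk"], p["Wv"])
    ffn = np.maximum(0.0, x @ p["W1"]) @ p["W2"]   # 2-layer FFN with ReLU
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
n, d = 4, 8   # toy sizes: 4 words, embedding dimension 8
params = {k: rng.normal(size=(d, d)) * 0.1 for k in ["Wq", "Wk", "Wv", "W1", "W2"]}
e = rng.normal(size=(n, d))
out = encoder_block(e, params)
print(out.shape)   # one d-dimensional vector per word
```

Stacking several such blocks, as the paper describes, simply feeds each block's output into the next.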
For the visual modality, images are first input into a pre-trained ResNet to get the visual features. Then, to better combine the images, we use an attention mechanism to determine the importance of each image: (3) \(\begin{equation} \beta _i = \frac{exp(q \cdot v_i)}{\sum _{j}exp(q \cdot v_j)}, \end{equation}\) where \(\beta _i\) is the attention score for the \(i\)th image, \(q\) is the query vector to be learned, and \(v_i\) is the visual feature of the \(i\)th image. Then, the visual representation \(v_e\) can be calculated as the combination of all image features: (4) \(\begin{equation} v_e = \sum _{i} \beta _i v_i. \end{equation}\) Finally, an average pooling is conducted between the visual representation and each word embedding in \(e^{\prime }\) to get the multi-modal representation \(m_e \in \mathbb{R}^{n \times d_{model}}\).
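A minimal sketch of the attention-based image fusion in Eqs. (3)-(4), using random vectors in place of real ResNet features (all dimensions are chosen only for illustration):

```python
import numpy as np

def attend_images(q, V):
    # Eq. (3): softmax attention scores beta_i over the image features v_i.
    logits = V @ q
    logits -= logits.max()            # numerical stability
    beta = np.exp(logits)
    return beta / beta.sum()

rng = np.random.default_rng(1)
d = 8
V = rng.normal(size=(5, d))           # stand-in features of 5 retrieved images
q = rng.normal(size=d)                # learned query vector
beta = attend_images(q, V)
v_e = beta @ V                        # Eq. (4): weighted visual representation

# Average-pool v_e with each encoded word embedding to get the
# multi-modal representation m_e (one fused vector per word).
E = rng.normal(size=(4, d))           # stand-in encoded word embeddings
m_e = (E + v_e) / 2.0
print(m_e.shape)
```

Broadcasting adds the single visual vector `v_e` to every row of `E`, so each word embedding is blended with the same event-level visual feature.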
3.2 Intra-Event Object Relation Learning
It is noted that an event phrase may contain multiple objects. Compared with words in the event sentence, objects also reflect significant information and play important roles in event reasoning. Understanding the relations among objects is beneficial for learning more effective event representations. However, most current methods encode the event phrase with only word embeddings, ignoring the important relations among objects. The structure of the commonly used RNN encoder also has limited ability to learn the object relations. In view of this fact, we propose to extract objects in the event and explicitly learn their representations and relations.
Specifically, we not only extract objects from the event phrases, but also detect the related objects in the images. Compared with the objects directly mentioned in the event phrase, objects from images provide more background information such as “where does the event take place” and “how does this situation look”. We take advantage of Faster R-CNN to detect objects from each image in the event-related image set. All of the detected objects plus objects in the event phrase finally form a set \(O_e = \lbrace o_1,\ldots ,o_t\rbrace\), where \(t\) is the object number.
After extracting the event-related objects, the object relations are then learned to obtain the semantic interactions among them. However, directly extracting and learning the object relations is difficult since the event phrase is too short. To better understand the relations, we leverage an external knowledge corpus, ConceptNet [41], which has abundant object and relation knowledge, to conduct the relation learning. ConceptNet is a large-scale commonsense knowledge graph with 8 million nodes and 21 million edges. The nodes represent concepts and the edges reflect the semantic relations. Figure 3(a) shows node and edge examples of ConceptNet.
Fig. 3. Node and edge examples of the external knowledge graph ConceptNet and ASER.
Specifically, given the object set \(O_e\) of an event phrase \(e\), we first find the corresponding node in ConceptNet for each object. Then, to obtain more object-related knowledge, we extend the object set by adding the one-hop connected nodes in ConceptNet. More formally, for each object \(o_i\) in \(O_e\), we extract all the relation triples containing \(o_i\) from ConceptNet, and add all objects connected by these triples to the object set. The extended object set is noted as \(O^{\prime }_e = \lbrace o_1,\ldots ,o_T\rbrace\), where \(T\) is the size of the extended object set. We also extract the relations among these objects according to the edges in ConceptNet and finally get an object graph \(\mathcal {G}_e^o=(\mathcal {V}_e^o, \mathcal {E}_e^o)\). The nodes in \(\mathcal {G}_e^o\) are objects either extracted from the event phrase and images or related to the event according to ConceptNet. The edges denote the semantic relations of objects, and each edge has a weight reflecting the frequency of this relation.
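The one-hop expansion can be illustrated with a toy in-memory triple store standing in for ConceptNet; the triples, relation names, and object names below are invented for illustration only:

```python
# Toy triple store standing in for ConceptNet (hypothetical example triples).
triples = [
    ("cup", "UsedFor", "drink"),
    ("bowl", "UsedFor", "eat"),
    ("kitchen_table", "AtLocation", "kitchen"),
    ("cup", "AtLocation", "kitchen_table"),
]

def expand_one_hop(objects, triples):
    # Add every node connected by a triple that touches a seed object,
    # and keep the touched triples as the edges of the object graph.
    nodes, edges = set(objects), []
    for h, r, t in triples:
        if h in objects or t in objects:
            nodes.update([h, t])
            edges.append((h, r, t))
    return sorted(nodes), edges

seed = {"cup", "bowl"}                  # objects detected in the text and images
nodes, edges = expand_one_hop(seed, triples)
print(nodes)
```

Note how “drink” and “eat” enter the graph only through the knowledge triples, mirroring the paper's motivating example of inferring intentions from detected objects.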
After building the object graph, we propose to learn the object relations with graph neural networks. The network contains \(l_o\) layers and each layer conducts a message passing process. Specifically, we initialize the node representations with word embeddings, which are then updated by their neighborhoods with the following calculation: (5) \(\begin{equation} v^{\prime }_i = W^x v_i + W^r \sum _{j \in \mathcal {N}(i)} e_{ij} f(v_i, r_{ij}, v_j), \end{equation}\) where \(v_i\) is the embedding of the \(i\)th node, \(W^x\) and \(W^r\) are learnable matrices, \(\mathcal {N}(i)\) are the indexes of the neighborhood nodes, \(e_{ij}\) is the edge weight and \(r_{ij}\) is the edge embedding between nodes \(o_i\) and \(o_j\), and \(f\) is a 2-layer MLP that takes the concatenation of the three vectors as input and projects them into one vector. After that, each object node gets information from its neighborhoods, thus enriching its representation.
After the calculations of \(l_o\) layers, all the node embeddings are updated. Then, we can get the object graph representation based on them. Specifically, each node corresponds to an object related to the event and reflects different aspects of the event. However, the nodes have different contributions to the final graph representation. Therefore, we take advantage of the attention mechanism to determine the node importance and compute the graph representation: (6) \(\begin{equation} g_e^o = \sum _{i} \beta _i v^{\prime }_i, \end{equation}\) (7) \(\begin{equation} \beta _i = \frac{exp(q \cdot v^{\prime }_i)}{\sum _{j}exp(q \cdot v^{\prime }_j)}, \end{equation}\) where \(g_e^o \in \mathbb {R}^{d_{model}}\) is the representation for the object graph, \(\beta _i\) is the relative importance given to each node when blending all nodes together, and \(q\) is a trainable vector used as query.
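A compact sketch of one message-passing layer (Eq. (5)) followed by the attention readout (Eqs. (6)-(7)). All weights are random stand-ins and the graph is a toy example:

```python
import numpy as np

def mlp_f(vi, rij, vj, Wm1, Wm2):
    # f in Eq. (5): a 2-layer MLP on the concatenation [v_i; r_ij; v_j].
    h = np.maximum(0.0, np.concatenate([vi, rij, vj]) @ Wm1)
    return h @ Wm2

def gnn_layer(V, R, Wx, Wr, Wm1, Wm2, edges):
    # Eq. (5): v'_i = W^x v_i + W^r * sum_j e_ij * f(v_i, r_ij, v_j)
    Vp = V @ Wx
    for i, j, w in edges:
        Vp[i] = Vp[i] + w * (mlp_f(V[i], R[(i, j)], V[j], Wm1, Wm2) @ Wr)
    return Vp

def attention_readout(Vp, q):
    # Eqs. (6)-(7): attention-weighted sum of the updated node embeddings.
    logits = Vp @ q
    logits -= logits.max()
    beta = np.exp(logits)
    beta /= beta.sum()
    return beta @ Vp

rng = np.random.default_rng(2)
d, T = 6, 4                                         # embedding size, node count
V = rng.normal(size=(T, d))                         # initial node embeddings
edges = [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0)]     # (i, j, weight) toy edges
R = {(i, j): rng.normal(size=d) for i, j, _ in edges}   # edge embeddings
Wx, Wr = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Wm1, Wm2 = rng.normal(size=(3 * d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
Vp = gnn_layer(V, R, Wx, Wr, Wm1, Wm2, edges)
g_o = attention_readout(Vp, rng.normal(size=d))     # object-graph representation
print(g_o.shape)
```

Running \(l_o\) such layers in sequence (feeding `Vp` back in as `V`) gives the multi-layer message passing the paper describes.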
3.3 Inter-Event Semantic Relation Learning
It is known that commonsense inference requires understanding the intention and reaction of the event, which are actually types of relations between events. Therefore, to better conduct the inference, it is necessary to learn the event relations. Event relations are usually contained in long texts that describe complicated situations or stories. Much work has been done on extracting event relations from unstructured text and building event-centered knowledge graphs based on them. In this article, we propose to leverage an external event-centered knowledge graph to assist the event relation learning. As far as we know, ASER [51] is the largest knowledge graph whose primitive semantic units are eventualities and whose edges are relations. It contains 15 relation types belonging to 5 categories, 194 million unique eventualities, and 64 million unique edges among them. Its quality and effectiveness have been demonstrated by intrinsic and extrinsic evaluations. Figure 3(b) shows node and edge examples of ASER.
To utilize the event knowledge graph, we first need to match events in dataset with nodes in ASER. However, most events in ASER start with “I”, “He”, or “She”, while the events in our dataset always start with “PersonX” or “PersonY”. Therefore, the same event may have different forms in these two corpora and a more sophisticated event matching scheme is needed. Our matching algorithm is illustrated below:
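The paper's full matching algorithm is given in its own listing. Purely as an illustration of the underlying idea, one could canonicalize person mentions on both sides before comparing surface forms. Everything below (the placeholder token, pronoun list, and example events) is hypothetical and is not the authors' actual procedure:

```python
import re

# Hypothetical sketch: map "PersonX"/"She"/... to one shared placeholder so
# that surface forms from the two corpora become comparable.
PRONOUNS = r"\b(i|he|she|him|her|you|we|they|personx|persony|personz)\b"

def canonical(event):
    # Lowercase, replace person mentions, and normalize whitespace.
    s = re.sub(PRONOUNS, "person", event.lower())
    return re.sub(r"\s+", " ", s).strip()

def match(dataset_event, aser_nodes):
    key = canonical(dataset_event)
    return [n for n in aser_nodes if canonical(n) == key]

aser_nodes = ["I vote for him", "She sits at the table"]
print(match("PersonX sits at the table", aser_nodes))
```

A real matcher would also need to handle inflection and paraphrase, which simple string canonicalization does not cover.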
After the matching process, each event in our dataset can get one or more corresponding nodes in ASER, which can be noted as \(A_e = \lbrace a_1, \ldots , a_m\rbrace\). Then we expand the node set \(A_e\) with their one-hop neighborhood nodes in ASER and get the new set \(A^{\prime }_e = \lbrace a_1, \ldots , a_M\rbrace\) with \(M\) nodes in total. The edges among these nodes are also extracted to get a new event graph \(\mathcal {G}_e^a=(\mathcal {V}_e^a, \mathcal {E}_e^a)\). Each node in \(\mathcal {G}_e^a\) represents an event, and the edges represent their semantic correlations.
Then, we adopt a graph convolutional network to learn the event relations. The network has \(l_a\) layers and each layer propagates information among events through message passing. Specifically, we initialize the node embeddings and edge embeddings with the GRU mentioned above. The embedding for each type of edge is learnable during the training process and the node embeddings are updated as follows: (8) \(\begin{equation} v^{\prime }_i = W^a v_i + W^e \sum _{j \in \mathcal {N}(i)}e_{ij} f(v_i, r_{ij}, v_j), \end{equation}\) where \(v_i\) is the representation of the node \(a_i\), \(W^a\) and \(W^e\) are trainable matrices, \(\mathcal {N}(i)\) are the indexes of the neighborhood nodes, \(e_{ij}\) is the edge weight, \(r_{ij}\) is the edge embedding between nodes \(a_i\) and \(a_j\), and \(f\) is a 2-layer MLP that takes the concatenation of the three vectors as input and projects them into one vector. During this process, each event node gets information from both its neighborhoods and its connected edges, thus better learning the semantic correlations among events. Finally, we extract the updated embeddings of all event nodes in the node set \(A_e\) and apply an average pooling to get the final event graph representation \(g^a_e \in \mathbb {R}^{d_{model}}\).
3.4 Collaborative Inference
As illustrated in the above sections, we learn three kinds of relational knowledge and get three representations for each event. These representations can reflect different aspects of the event and are complementary to each other. Next, we will introduce how to efficiently combine them to conduct the commonsense inference.
First, early fusion methods can be used to get a final representation to input to the decoder, including average pooling, concatenation, and dot multiplication. An attention mechanism can also be applied to determine the coefficients when summing the three vectors up. Since the multi-modal representation \(m_e\) is in \(\mathbb {R}^{ n \times d_{model}}\) while the object graph representation \(g^o_e\) and event graph representation \(g^a_e\) are in \(\mathbb {R}^{d_{model}}\), the early fusion method is actually performed on \(g^o_e\), \(g^a_e\), and each row vector (also in \(\mathbb {R}^{d_{model}}\)) of \(m_e\). The final event representation is noted as \(e_f\).
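The early fusion options can be sketched as follows, with random stand-ins for \(m_e\), \(g^o_e\), and \(g^a_e\); the `mode` names are ours, introduced only for this illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 4, 8
m_e = rng.normal(size=(n, d))     # multi-modal representation (one row per word)
g_o = rng.normal(size=d)          # object-graph representation
g_a = rng.normal(size=d)          # event-graph representation

def early_fusion(m_e, g_o, g_a, mode="mean"):
    # Fuse the two graph vectors with every row of m_e (broadcast over rows).
    if mode == "mean":
        return (m_e + g_o + g_a) / 3.0                       # average pooling
    if mode == "concat":
        tiled = np.tile(np.concatenate([g_o, g_a]), (len(m_e), 1))
        return np.concatenate([m_e, tiled], axis=1)          # concatenation
    if mode == "mul":
        return m_e * g_o * g_a                               # element-wise product
    raise ValueError(mode)

e_f = early_fusion(m_e, g_o, g_a, "concat")
print(e_f.shape)
```

Note that concatenation triples the feature width, so a decoder consuming `e_f` must be sized accordingly, whereas averaging and multiplication preserve \(d_{model}\).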
As illustrated in [36], the commonsense inference task can be formulated as either an n-gram re-ranking or a sequence generation task. However, the sequential decoder-based generation model performs better than n-gram re-ranking in most cases. Therefore, we also regard the inference process as a sequence generation task, with a Transformer used as the decoder. The adopted Transformer-based decoder is composed of a stack of blocks, and each block consists of a masked multi-head self-attention layer, a multi-head context-attention layer, and a feed-forward layer as in [42].
At each decoding step \(t\), the masked multi-head self-attention is first applied to get the dependencies among the generated words. Then, the multi-head context-attention is used to capture the relations between the encoder output and the generated words, whose output is then sent to the feed-forward layer and the next block. Finally, the output hidden state for the \(t\)th step \(h_x^t\) is fed to a linear layer to generate a new word: (9) \(\begin{equation} P(s_x^t|s_x^{i\lt t}, e_f) = softmax(h_x^tW_f), \end{equation}\) where \(s_x^{i\lt t}\) are the words generated before the \(t\)th step, \(s_x^t\) is the word generated at the \(t\)th step, \(e_f\) is the fused event representation, and \(W_f\) is a learnable matrix projecting \(h_x^t\) to the vocabulary distribution. Then, to get diverse results, we adopt beam search [10] as in other methods to generate \(k\) results as well as their confidence scores, which are used for evaluation later.
Besides early fusion, we also try late fusion methods, which combine the generated results of the decoders. Specifically, the three representations, instead of the fused representation \(e_f\), are input separately to train three different decoders. At test time, each decoder generates \(k\) results with their confidence scores using beam search [10]. All the generated results of the different decoders are then sorted by their scores, and the top \(k\) are selected as the final predictions.
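The late fusion step thus reduces to pooling and re-ranking; a sketch (function name hypothetical):

```python
def late_fusion(decoder_outputs, k):
    """Pool the beam-search outputs of the separate decoders and keep the
    top-k (sequence, score) pairs by confidence score."""
    pooled = [pair for outputs in decoder_outputs for pair in outputs]
    pooled.sort(key=lambda p: p[1], reverse=True)  # highest confidence first
    return pooled[:k]
```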
The overall framework of our method is optimized by minimizing the cross-entropy loss between the generated results and the ground-truth phrases. Multi-task learning is employed to minimize the losses for all three decoders at the same time.
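The multi-task objective is the sum of the three decoders' cross-entropy losses; a NumPy sketch under that reading (function names hypothetical):

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy: logits (T x V), targets (T,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    logZ = np.log(np.exp(shifted).sum(axis=1))            # log partition per step
    return float((logZ - shifted[np.arange(len(targets)), targets]).mean())

def multitask_loss(per_decoder):
    """Sum the losses of the xIntent, xReact, and oReact decoders.
    `per_decoder` is a list of (logits, targets) pairs, one per decoder."""
    return sum(cross_entropy(l, t) for l, t in per_decoder)
```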
4 EXPERIMENT
4.1 Dataset
We build a new event-centered multi-modal dataset for the multi-modal commonsense inference task. The dataset is built on the event corpus Event2mind, which contains 25,000 events described in short free-form text, covering a wide range of daily events and situations.
To obtain event-related images, we use the event phrase as keywords to search for images on Bing and choose the top 20 images as the image set of the event. Specifically, the events in Event2mind can be divided into three categories: Blank events (events containing non-instantiated arguments), Idiom events (events from the Wiktionary idiom list), and 2+People events (events containing multiple different person variables). For the first two types, we can directly input the whole phrase for searching. For events involving persons, Event2mind uses “PersonX” and “PersonY” to refer to the participants, which are difficult for the search engine to understand, so we extract only the verb phrase and delete the participants before searching.
Considering that there may be noise in the returned images, we follow [30] and filter out noisy images by their noise scores, where an image's score is calculated by summing the pairwise distances of all images retrieved for one event. The image features are extracted with a pre-trained ResNet [14], and the distance is computed with the Euclidean metric. Images with a noise score greater than a threshold \(\theta _n\) are removed, and the remaining images form the image set of the event.
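The filtering step can be sketched as follows. The max-normalization of the noise scores is our assumption (made so that a threshold like \(\theta_n = 0.9\) lies in \([0, 1]\)); in the paper the `features` come from a pre-trained ResNet.

```python
import numpy as np

def filter_noisy_images(features, theta_n):
    """Drop retrieved images whose noise score exceeds theta_n. An image's
    noise score is the sum of its Euclidean distances to all other images
    retrieved for the same event; the [0, 1] normalization is an assumption."""
    diffs = features[:, None, :] - features[None, :, :]  # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))           # Euclidean distance matrix
    noise = dists.sum(axis=1)                            # per-image noise score
    noise = noise / noise.max()                          # assumed normalization
    keep = noise <= theta_n
    return features[keep], keep
```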
The proposed dataset contains 25,000 events in total. Each event has 20 corresponding images which reflect its visual content. Given one event and its images, the multi-modal commonsense inference task aims to generate three kinds of results: PersonX’s intent, PersonX’s reaction, and others’ reaction. We divide the dataset into training/dev/test sets with an 80/10/10% split as in [36].
4.2 Implementation Details
Following [36], we perform the inference in an encoder-decoder manner. The Transformer is used as the backbone of the encoder and decoder in our framework. Besides the Transformer, an RNN is also used for a fair comparison with existing methods. For the word embeddings, we use 300-dimensional skip-gram word embeddings pre-trained on Google News [28]. [email protected] is used to evaluate performance as in [36]; it measures the percentage of cases in which the gold answer falls within the top 10 decoded results. We also use perplexity [40] and the BLEU score [32] to evaluate our model, as in [9] and [50]. Perplexity measures the probability the model assigns to the exact targets and is suitable for the one-to-many problem, while the BLEU score evaluates the accuracy of the generations. Specifically, we generate \(k\) sequences for each inference sub-task (i.e., prediction of xIntent, xReact, or oReact) and compute the average scores of the three metrics over the \(k\) sequences. \(k\) is set to 10 following [36]. As for the other parameters, the representation dimension \(d_{model}\) is set to 100. The numbers of encoder blocks and decoder blocks are both set to 1. \(\theta _s\) in Section 3.3 is set to 0.8 and \(\theta _n\) in Section 4.1 is set to 0.9. The graph neural network layer numbers \(l_o\) and \(l_a\) are both set to 2. All these parameters are determined by cross-validation.
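The [email protected] metric described above can be computed as in this sketch; exact-match comparison between predictions and gold answers is our assumption.

```python
def recall_at_k(gold_sets, topk_predictions, k=10):
    """Percentage of test cases where at least one gold answer appears
    among the top-k decoded sequences."""
    hits = sum(
        1 for golds, preds in zip(gold_sets, topk_predictions)
        if any(g in preds[:k] for g in golds)
    )
    return 100.0 * hits / len(gold_sets)
```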
4.3 Comparison with Existing Methods
We mainly compare our model with existing methods which report results on the Event2mind dataset in the conventional commonsense inference task.
• Max-pool uses the max-pooling of word embeddings as the event representation and an RNN as the decoder to generate the sentence.
• ConvNet uses a convolutional neural network to encode the event and an RNN as the decoder to generate the sentence.
• Seq2Seq uses a GRU-based sequence-to-sequence model to generate the results.
• CWVAE [9] uses a context-aware variational autoencoder to learn the background information of events.
• Transformer [42] is the widely used self-attention-based encoder-decoder model.
• GPT [34] is a Transformer-based pre-trained language model.
• Ours-RNN uses a GRU as the backbone of the encoder and decoder.
• Ours-Transformer uses the Transformer as the backbone of the encoder and decoder.
As shown in Table 1, our models outperform all the other methods on the three metrics. When using the same RNN encoder and decoder, our RNN-based method performs better than the existing ones, because we learn better event representations by considering three kinds of relational knowledge, while existing methods simply encode the intra-event textual information. As for the pre-trained language model, GPT performs better than the other previous methods. However, our model still outperforms GPT on all metrics, which may be because the data used to pre-train GPT has a relatively large discrepancy with the data of the commonsense inference task, and the three kinds of relational knowledge, which have been proved helpful for the task, are not utilized by GPT. The CWVAE method obtains relatively better results than the Seq2Seq model since it learns the background knowledge of the event, which reflects the event relations to a certain extent. However, it lacks the multi-modal information and thus performs worse than our Transformer-based method. Among our methods, the Transformer-based model performs better than the RNN-based model, which may be because the Transformer has more advantages in encoding long-range dependencies.
| Method | [email protected] | BLEU | PPL | ||||||
|---|---|---|---|---|---|---|---|---|---|
| xIntent | xReact | oReact | xIntent | xReact | oReact | xIntent | xReact | oReact | |
| Max-pool [36] | 36.01 | 38.30 | 65.60 | 5.21 | 3.60 | 4.01 | 43.54 | 32.87 | 16.03 |
| ConvNet [36] | 37.33 | 41.02 | 66.00 | 6.02 | 3.97 | 4.21 | 39.47 | 30.01 | 15.97 |
| Seq2Seq [36] | 36.21 | 41.10 | 66.30 | 5.48 | 4.02 | 4.47 | 42.62 | 29.37 | 15.76 |
| CWVAE [9] | - | - | - | 7.36 | 5.52 | 5.33 | 31.32 | 24.07 | 11.37 |
| Transformer [42] | 40.26 | 42.20 | 67.83 | 7.53 | 4.44 | 5.85 | 30.26 | 25.68 | 11.54 |
| GPT [34] | 41.36 | 43.07 | 68.50 | 8.50 | 5.13 | 6.27 | 30.03 | 24.56 | 11.34 |
| Ours-RNN | 38.16 | 41.74 | 67.06 | 6.23 | 4.25 | 5.26 | 36.71 | 27.03 | 13.26 |
| Ours-Transformer | 42.33 | 43.50 | 68.69 | 8.79 | 5.69 | 6.33 | 29.37 | 24.02 | 11.24 |
Table 1. Comparison with Existing Methods in the Multi-modal Commonsense Inference Task
4.4 Ablation Study
We also conduct ablation studies to examine the effectiveness of the three kinds of relational knowledge. The variant Ours-visual only uses only the visual information to conduct the commonsense inference. The variants Ours\(\backslash\)v, Ours\(\backslash\)m, Ours\(\backslash\)o, and Ours\(\backslash\)e omit the visual information, the multi-modal representation, the object relation learning, and the event relation learning, respectively. Both the RNN-based and the Transformer-based methods are considered. Results are shown in Table 2.
| Method | [email protected] | BLEU | PPL | ||||||
|---|---|---|---|---|---|---|---|---|---|
| xIntent | xReact | oReact | xIntent | xReact | oReact | xIntent | xReact | oReact | |
| Ours-RNN-visual only | 34.32 | 38.68 | 64.10 | 4.78 | 3.54 | 4.01 | 44.31 | 33.54 | 17.62 |
| Ours-RNN\(\backslash\)v | 37.27 | 41.32 | 66.70 | 6.07 | 4.10 | 4.88 | 38.12 | 27.31 | 13.97 |
| Ours-RNN\(\backslash\)m | 36.96 | 41.16 | 66.41 | 5.89 | 4.07 | 4.48 | 39.72 | 27.65 | 14.58 |
| Ours-RNN\(\backslash\)o | 38.04 | 41.50 | 66.80 | 6.10 | 4.10 | 4.84 | 37.33 | 28.10 | 14.01 |
| Ours-RNN\(\backslash\)e | 37.77 | 41.50 | 66.62 | 6.03 | 4.13 | 4.53 | 38.44 | 27.08 | 14.32 |
| Ours-RNN | 38.16 | 41.74 | 67.06 | 6.23 | 4.25 | 5.26 | 36.71 | 27.03 | 13.26 |
| Ours-Transformer-visual only | 36.74 | 41.08 | 66.13 | 5.97 | 4.11 | 4.33 | 39.64 | 28.53 | 14.37 |
| Ours-Transformer\(\backslash\)v | 41.56 | 42.70 | 68.34 | 8.13 | 5.21 | 6.12 | 29.60 | 24.53 | 11.32 |
| Ours-Transformer\(\backslash\)m | 40.93 | 42.30 | 68.01 | 7.84 | 4.67 | 5.97 | 29.67 | 24.67 | 11.28 |
| Ours-Transformer\(\backslash\)o | 41.76 | 43.10 | 68.50 | 8.31 | 5.21 | 6.21 | 29.74 | 24.37 | 11.33 |
| Ours-Transformer\(\backslash\)e | 41.35 | 42.72 | 68.32 | 8.01 | 5.03 | 6.13 | 30.10 | 25.02 | 11.45 |
| Ours-Transformer | 42.33 | 43.50 | 68.69 | 8.79 | 5.69 | 6.33 | 29.37 | 24.02 | 11.24 |
Table 2. Ablation Study Results
It can be seen that each module in our method plays an important role in the final results. The complete model performs better than each variant in both the RNN-based and the Transformer-based methods, which demonstrates that the multi-modal representation, the intra-event object relations, and the inter-event relational knowledge all provide necessary information for the event representation. The models that use only visual information, Ours-RNN-visual only and Ours-Transformer-visual only, both yield inferior results, because there is a large semantic gap between the visual data and the target labels (i.e., human intentions or reactions) of the multi-modal commonsense inference task. Comparing Ours\(\backslash\)v with the full model, the performance of Ours-RNN\(\backslash\)v and Ours-Transformer\(\backslash\)v drops considerably on all three sub-tasks and metrics, which demonstrates the effectiveness of the visual information. Besides, the event graph learning also plays an important part in the representation, which may be because the event graph provides extra information about the event intention and reaction.
4.5 Comparison on Different Combination Methods
Recall that three kinds of representations are learned for an event in our method, so a proper combination method is needed to obtain better representations for reasoning. As described in the method section, we use several early fusion methods, i.e., concatenation, average pooling, and dot multiplication, to combine the representations. An attention mechanism is also used to determine the importance of each representation, which serves as the weight when summing the representations. A late fusion method, which combines the generated results of the decoders, is also evaluated. Here, we explore the effect of the different combination methods, with the Transformer used as the backbone of the encoder and decoder. Results are shown in Table 3.
| Method | [email protected] | BLEU | PPL | ||||||
|---|---|---|---|---|---|---|---|---|---|
| xIntent | xReact | oReact | xIntent | xReact | oReact | xIntent | xReact | oReact | |
| Transformer | 40.26 | 42.20 | 67.83 | 7.53 | 4.44 | 5.85 | 30.26 | 25.68 | 11.54 |
| Concatenation | 41.73 | 42.54 | 68.16 | 8.04 | 4.95 | 5.90 | 30.01 | 25.65 | 11.36 |
| Average pooling | 41.03 | 42.23 | 67.56 | 7.68 | 4.61 | 5.81 | 30.23 | 25.62 | 11.66 |
| Attention | 41.54 | 42.40 | 68.05 | 7.81 | 4.82 | 5.87 | 30.01 | 25.01 | 11.27 |
| Dot multiplication | 37.32 | 39.69 | 66.11 | 5.56 | 3.97 | 4.12 | 36.67 | 30.14 | 14.83 |
| Late fusion | 42.33 | 43.50 | 68.69 | 8.79 | 5.69 | 6.33 | 29.37 | 24.02 | 11.24 |
Table 3. Comparison Results of Different Combination Methods
It can be seen that late fusion achieves the best result. This may be because different decoders can well preserve the information of the different representations and generate more varied answers. Among the early fusion methods, concatenation obtains the best result, since it retains the information from the multiple representations in a relatively complete manner. Average pooling and the attention mechanism also bring some improvement over the basic Transformer model, which uses only the textual feature. Moreover, the attention mechanism, which determines the weights when summing the different representations, performs better than average pooling, since it combines them more flexibly and accounts for their different contributions. Dot multiplication brings no improvement over the basic Transformer and even degrades the final results, which demonstrates the importance of a proper combination method.
4.6 Parameter Analysis
Here we investigate the influence of the representation dimension \(d_{model}\). We vary it from 50 to 300 and keep the other settings fixed. The results are shown in Figure 4. We can see that a dimension of 100 gives the best result. As \(d_{model}\) increases further, the performance drops on the three metrics to different degrees. This may be because a large dimension brings more parameters, which are difficult to train.
Fig. 4. Parameter analysis.
4.7 Qualitative Analysis
Here we visualize some important results of our method. Specifically, we show the event-related images, the object graph, and the event graph learned by our model in Figure 5. For the first event, “PersonX sits at the kitchen table”, the object graph extracts the related objects of the event, such as “cup”, “bowl”, and “table”. These objects come either from the event sentence or from the images. Their relations are also extracted with the external knowledge graph ConceptNet, such as “cup-AtLocation-table” and “cup-RelatedTo-drinking”. With the constructed object graph, more event-related concepts are included, such as “eating” and “drinking”. These concepts are closely related to PersonX’s intents “to eat” and “to drink a cup of coffee”, and are thus helpful for the inference. For the second example, “PersonX apologizes to PersonY”, since this event is relatively abstract, the extracted objects are less related to the event. However, the event relations contribute a lot to the prediction. In the event graph, the nodes with solid lines are related events retrieved from ASER, and the nodes with dashed lines are events extended from ASER. It can be seen that the extended event “I do not want to hit her” is closely related to the xIntent “regretful” and the xReact “forgiven”, and is thus also helpful for the prediction.
Fig. 5. Visualization of the object graph and event graph.
5 CONCLUSION
In this article, we propose a new multi-modal commonsense inference task in which the textual description and visual images are both utilized. An event-centered multi-modal dataset is constructed to support the new task. To solve it, we further propose a multi-source knowledge reasoning graph network in which three kinds of relational knowledge are considered. Multi-modal correlation knowledge is first captured to obtain the multi-modal representation of the event. Then the intra-event object relations are explored to obtain fine-grained event information, where the external knowledge graph ConceptNet is introduced to build the object graph. Finally, inter-event semantic relations are learned for a better understanding of the semantic associations among events, where the external event knowledge graph ASER is introduced to build the event graph. We conducted extensive experiments on the newly collected dataset, and the results demonstrate the effectiveness of our method. In the future, we will consider more effective ways to utilize the visual information for the multi-modal commonsense inference task.
REFERENCES
- [1] 1995. CYC: A large-scale investment in knowledge infrastructure. Commun. ACM 38, 11 (1995), 33–38.
- [2] 2016. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems. 1993–2001.
- [3] 2019. G3raphGround: Graph-based language grounding. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
- [4] 1998. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, Montreal, Quebec, Canada, 86–90.
- [5] 2019. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317 (2019).
- [6] 2016. Deep neural networks for learning graph representations. In 30th AAAI Conference on Artificial Intelligence.
- [7] 2019. Graph-based global reasoning networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 433–442.
- [8] 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
- [9] 2019. Modeling event background for if-then commonsense reasoning using context-aware variational autoencoder. arXiv preprint arXiv:1909.08824 (2019).
- [10] 2017. Beam search strategies for neural machine translation. arXiv preprint arXiv:1702.01806 (2017).
- [11] 2010. Graph echo state networks. In 2010 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
- [12] 2019. I know the relationships: Zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs. In AAAI Conference on Artificial Intelligence.
- [13] 2016. What happens next? Event prediction using a compositional neural network model. In AAAI Conference on Artificial Intelligence, Vol. 30.
- [14] 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
- [15] 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015).
- [16] 2020. Joint commonsense and relation reasoning for image and video captioning. In National Conference on Artificial Intelligence.
- [17] 2020. Visual-textual hybrid sequence matching for joint reasoning. IEEE Transactions on Cybernetics PP, 99 (2020), 1–14.
- [18] 2019. Knowledge graph embedding based question answering. In 12th ACM International Conference on Web Search and Data Mining. 105–113.
- [19] 2020. Improving action segmentation via graph-based temporal reasoning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14024–14034.
- [20] 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
- [21] 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
- [22] 2021. Adaptive hierarchical graph reasoning with semantic coherence for video-and-language inference. In IEEE/CVF International Conference on Computer Vision. 1867–1877.
- [23] 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926 (2017).
- [24] 2018. Constructing narrative event evolutionary graph for script event prediction. arXiv preprint arXiv:1805.05081 (2018).
- [25] 2017. Neural network-based question answering over knowledge graphs on word and character level. In 26th International Conference on World Wide Web. 1211–1220.
- [26] 2020. Graph-based reasoning over heterogeneous external knowledge for commonsense question answering. In AAAI Conference on Artificial Intelligence, Vol. 34. 8449–8456.
- [27] 2004. The NomBank project: An interim report. In Workshop Frontiers in Corpus Annotation at HLT-NAACL 2004. 24–31.
- [28] 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
- [29] 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 839–849.
- [30] 2017. Multi-modal knowledge representation learning via Webly-supervised relationships mining. In 25th ACM International Conference on Multimedia. 411–419.
- [31] 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics 31, 1 (2005), 71–106.
- [32] 2002. BLEU: A method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
- [33] 2016. Learning statistical scripts with LSTM recurrent neural networks. In AAAI Conference on Artificial Intelligence, Vol. 30.
- [34] 2018. Improving language understanding by generative pre-training. (2018).
- [35] 2018. Modeling naive psychology of characters in simple commonsense stories. arXiv preprint arXiv:1805.06533 (2018).
- [36] 2018. Event2mind: Commonsense inference on events, intents, and reactions. arXiv preprint arXiv:1805.06939 (2018).
- [37] 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI Conference on Artificial Intelligence, Vol. 33. 3027–3035.
- [38] 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61–80.
- [39] 2018. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing. Springer, 362–373.
- [40] 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI Conference on Artificial Intelligence, Vol. 31.
- [41] 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI Conference on Artificial Intelligence, Vol. 31.
- [42] 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
- [43] 2016. Structural deep network embedding. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1225–1234.
- [44] 2019. KGAT: Knowledge graph attention network for recommendation. In 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 950–958.
- [45] 2017. Integrating order information and event relation for script event prediction. In 2017 Conference on Empirical Methods in Natural Language Processing. 57–67.
- [46] 2020. Multi-level knowledge injecting for visual commonsense reasoning. IEEE Transactions on Circuits and Systems for Video Technology PP, 99 (2020), 1–1.
- [47] 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems (2020).
- [48] 2022. Improving visual grounding with visual-linguistic verification and iterative reasoning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9499–9508.
- [49] 2018. Exploring visual relationship for image captioning. In European Conference on Computer Vision (ECCV). 684–699.
- [50] 2020. Logic enhanced commonsense inference with chain transformer. In 29th ACM International Conference on Information & Knowledge Management. 1763–1772.
- [51] 2020. ASER: A large-scale eventuality knowledge graph. In The Web Conference 2020. 201–211.
- [52] 2020. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering (2020).