Abstract
Humans can easily infer the motivations behind human actions from only visual data by comprehensively analyzing the complex context information and utilizing abundant life experiences. Inspired by humans’ reasoning ability, existing motivation prediction methods have improved image-based deep classification models using the commonsense knowledge learned by pre-trained language models. However, the knowledge learned from public text corpora is probably incompatible with the task-specific data of the motivation prediction, which may impact the model performance. To address this problem, this paper proposes a dual scene graph convolutional network (dual-SGCN) to comprehensively explore the complex visual information and semantic context prior from the image data for motivation prediction. The proposed dual-SGCN has a visual branch and a semantic branch. For the visual branch, we build a visual graph based on scene graph where object nodes and relation edges are represented by visual features. For the semantic branch, we build a semantic graph where nodes and edges are directly represented by the word embeddings of the object and relation labels. In each branch, node-oriented and edge-oriented message passing is adopted to propagate interaction information between different nodes and edges. Besides, a multi-modal interactive attention mechanism is adopted to cooperatively attend and fuse the visual and semantic information. The proposed dual-SGCN is learned in an end-to-end form by a multi-task co-training scheme. In the inference stage, Total Direct Effect is adopted to alleviate the bias caused by the semantic context prior. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance.
1 INTRODUCTION
Thanks to the success of deep learning [14, 27, 41, 45, 60], image-based action recognition approaches [43, 63], which aim to make computers understand actions as humans do, have achieved great progress. In addition to action recognition, predicting the motivation behind a human action is also very important due to its potential applications in security monitoring and health management. However, it is challenging to predict motivations by learning deep models on visual data: merely recognizing the primary actions or objects in the image is not enough, and the model must be able to answer the question of why rather than what. For example, the persons (marked with red boxes) shown in Figures 1(a) and 1(b) perform the same action, riding bike, but it is difficult to predict their motivations from only the visual appearance of the persons and the interacted objects.
Fig. 1. Examples of motivations behind human actions. The red box indicates the target person in the image.
Humans can effortlessly infer different motivations behind similar actions from only visual data because we can comprehensively analyze the context information and utilize abundant experiences and commonsense knowledge from daily life. For example, the person in Figure 1(a) is leading a little dog and looking at a girl. The scene reflected in the image is an open space in front of a house. We can thus infer that the man in Figure 1(a) wants to have fun. In Figure 1(b), the street and pedestrians indicate that the person is riding on a busy street. In addition, the uniform indicates that the person is a police officer who is probably answering an emergency call. To simulate humans' lifetime experience and knowledge in the process of motivation prediction, Vondrick et al. [54] propose to create a factor graph to transfer knowledge from unlabeled text into visual classifiers. The factor graph model [54] is implemented based on structured SVM [18, 40], which can flexibly incorporate the co-occurrence potentials of different concepts estimated from language models pre-trained on large-scale public text corpora.
Although the knowledge-enhanced motivation prediction model [54] has obtained much better performance than deep classification models (e.g., VGG-16 [45]), it has a non-negligible limitation: the commonsense knowledge learned by the pre-trained language models is probably incompatible with the task-specific data of the motivation prediction. To address this problem, in this paper, we focus on comprehensively exploring the relations among objects contained in the image and learning context prior from the task-specific training data to help motivation prediction. It is not trivial to achieve this goal. On the one hand, we need to identify important concepts (i.e., objects, scenes) and their relations in the image to capture the complex visual information. On the other hand, we need to capture the semantic context prior, which can help to learn the task-specific semantic commonsense and filter out unnecessary outputs. For example, as shown in Figure 1(c), based on the concepts in the scene, e.g., bowl, glass bottle, and cake, it is not difficult to infer that the motivation of the person is to eat. It is worth noting that we must use this kind of context prior carefully, since it may mislead the motivation prediction model, especially for some long-tail labels. For example, as shown in Figure 1(d), paying more attention to the concepts of baseball uniform, baseball glove, and field has a negative impact on the prediction of the correct motivation to talk to the mascot.
Scene graph [20] is a kind of data structure which represents images as directed acyclic graphs, where nodes are objects and edges represent relationships between objects. Scene graph has been successfully applied in image retrieval [20], image captioning [50, 65], visual question answering [9], and image generation [24, 47, 64, 70]. As a highly structured representation of the image, scene graph has advantages in capturing the complex semantic dependencies between different objects detected from the image and reflecting the global scene information, which are useful for motivation prediction. However, until now, the application of scene graph to motivation prediction is still under-explored.
Based on the above considerations, this paper proposes a dual scene graph convolutional network (dual-SGCN) to predict the action motivation by comprehensively exploring the visual and semantic context information among different objects detected from the image. We first extract objects and their relations with high prediction confidences from images to construct scene graphs by a pre-trained scene graph generator. Then, the dual-SGCN captures the complex visual information and semantic context prior for motivation prediction based on the generated scene graphs. The two branches (a visual branch and a semantic branch) of the dual-SGCN use input graphs with the same structure, where nodes and edges correspond to objects and relations in the scene graph, respectively. The only difference between the input graphs of the two branches lies in the node and edge representations. More specifically, for the visual branch, we build a visual graph where object nodes are represented by the visual features of the corresponding image regions and the relation of two objects is represented by the visual feature of the union of the corresponding boxes. For the semantic branch, we build a semantic graph where nodes and edges are directly represented by the word embeddings of the object and relation labels. In each branch, we adopt node-oriented and edge-oriented message passing to propagate interaction information between different nodes and edges. In order to cooperatively capture the complex visual information and semantic context prior, we design a multi-modal interactive attention mechanism to attend to features of different object nodes and relation edges learned from both branches to build a global representation. The proposed dual-SGCN is learned with a multi-task co-training scheme, where we use three independent classifiers to predict the scene, action, and motivation labels based on the global representation learned from the visual graph and the semantic graph.
In the inference stage, to prevent the dual-SGCN from relying too heavily on the semantic context prior, especially for some long-tail labels, we use Total Direct Effect [48] to compute unbiased results by eliminating the negative effects of undesirable semantic knowledge.
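As a hedged illustration of this debiasing idea (the concrete procedure is detailed in Section 3.6), Total Direct Effect subtracts a counterfactual prediction, which retains only the possibly biased semantic context while the visual evidence is replaced by a placeholder, from the full prediction. The logits below are made-up numbers, not results from the model:

```python
import numpy as np

# Illustrative sketch of Total Direct Effect (TDE) debiasing in the spirit
# of Tang et al. [48]; the logits are made-up numbers. The counterfactual
# prediction keeps the (possibly biased) semantic context prior while the
# visual evidence is replaced by a placeholder; subtracting it from the
# full prediction removes the purely context-driven bias.
logits_full = np.array([2.0, 1.5, 0.1])            # prediction from the full model
logits_counterfactual = np.array([0.5, 1.2, 0.0])  # semantic-prior-only prediction
unbiased = logits_full - logits_counterfactual     # Total Direct Effect
print(unbiased)                # [1.5 0.3 0.1]
print(int(unbiased.argmax()))  # 0
```

Note how the second class, favored by the context prior alone, loses most of its advantage after the subtraction.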
The contributions of this paper are summarized as follows.
• To the best of our knowledge, we are the first to predict action motivations based on scene graphs. We propose a dual scene graph convolutional network (dual-SGCN) to comprehensively explore the complex visual information and semantic context prior for motivation prediction.
• We propose a multi-modal interactive attention mechanism to cooperatively attend to the important objects and relations and capture the global structure information from scene graphs.
• We propose a simple unbiased motivation inference scheme to alleviate the bias introduced by the semantic context prior.
• Extensive experimental results on a public dataset demonstrate that the proposed dual-SGCN performs better than existing methods.
The rest of this paper is organized as follows. In Section 2, we review the related work on motivation prediction, scene graph generation, graph neural network, and co-attention mechanism. Section 3 describes our method including overall framework, visual relation modeling, semantic relation modeling, multi-modal interaction attention, multi-task learning, and unbiased inference. Experimental results are reported in Section 4. Finally, we conclude the paper with future work in Section 5.
2 RELATED WORK
2.1 Motivation Prediction
Humans can easily infer the underlying motivation of an action thanks to outstanding cognitive skills built on life experiences [51, 57]. However, it is difficult to make the computer understand the human motivation behind an action. Vondrick et al. [54] are the first to study motivation prediction based on visual data. Unfortunately, the inferior performance of their CNN model demonstrates that relying only on the visual information in an image may not be sufficient to solve the task. Considering that humans predict motivations behind actions based on daily experiences, Vondrick et al. [54] propose to learn a vision-based model that can access external knowledge learned from web corpora. More specifically, a factor graph model is proposed to consider the semantic relevance among different concepts (i.e., scene, action, motivation) when predicting the motivation. The semantic relevance is learned by pre-trained natural language models and is incorporated into the vision-based model through structured SVM [18, 40]. Although leveraging external textual knowledge improves the performance of the vision-based model, the knowledge learned from the web corpora may be incompatible with the task-specific data of the motivation prediction task. Different from the existing work, in our approach, we comprehensively exploit the complex visual information and semantic context prior from the training data to learn the commonsense knowledge that is more related to the motivation prediction.
The task of human intention prediction is also closely related to our work. Joo et al. [21] introduce the problem of visual persuasion, which focuses on predicting persuasive intents of a politician in an image. To infer the intents, multiple types of syntactical attributes are developed to capture facial displays, gestures, and image backgrounds. Huang et al. [16] extend [21] by exploring a wider range of features, including body poses, the setting of the image, and improved deep features. Xi et al. [59] develop an automated methodology for learning how political ideology is conveyed through social media images. A recent work, Intentonomy [17], studies human intent prediction for large-scale user-generated social media images based on visual information. The main difference between our method and these existing methods lies in that we focus on predicting the motivation of the action performer in the image, while the existing work considers the intent of image posters or photographers. A number of works focus on predicting human intention based on videos. For example, Wu et al. [58] propose an on-wrist motion-triggered sensing system for anticipating daily intentions from videos. Sekhon et al. [44] propose a spatial context attentive network that jointly predicts trajectories for multiple pedestrians in the video for a future time window. Although these studies have achieved significant progress, the video data are much more difficult to collect than images, which restricts the scale of the data and the variety of the intention categories. Some work focuses on inferring intentions from textual [5, 68, 69] or acoustic data [10], which can be utilized in human-computer interaction. However, these methods cannot be directly applied to visual data.
2.2 Scene Graph Generation
The scene graph was originally proposed by Johnson et al. [20] to represent the image content as a directed acyclic graph, where nodes are objects and edges are relations between objects. The key challenge of scene graph generation (SGG) is how to detect objects and recognize the relationships among objects. Xu et al. [61] propose an SGG model based on iterative message passing. Li et al. [31] leverage mutual connections across different semantic levels of image understanding to help generate scene graphs. Zellers et al. [67] analyze the role of motifs (i.e., regularly appearing substructures in scene graphs) in the Visual Genome dataset and propose to model these intra-graph interactions by LSTMs. Newell et al. [37] propose an end-to-end learning framework based on convolutional networks to predict scene graphs from pixels, which combines object detection and relation recognition into one stage. More recently, Tang et al. [48] propose a counterfactual reasoning method that significantly improves the SGG performance. In this work, we use the SGG framework proposed by Tang et al. [48] to generate scene graphs from images. The scene graph can provide additional useful relation information when dealing with complex scenes, and thus it has been successfully applied in image retrieval [20], image captioning [50, 65], visual question answering [9], and image generation [24, 47, 64, 70]. However, as far as we know, there has been no study on applying the scene graph to the motivation prediction task. Moreover, we utilize a dual-SGCN to learn both the complex visual information and the semantic context prior based on the scene graph, while existing methods ignore the multi-modal information.
2.3 Graph Convolutional Network
Graph Convolutional Network (GCN) is a kind of Graph Neural Network (GNN) proposed to learn better representations on graphs via message propagation and aggregation. There are two main categories of GCNs: spectral-based approaches and spatial-based approaches.
Spectral-based approaches work with a spectral representation of the graph (usually the graph Laplacian matrix). The original spectral-based GCN aggregates local information from neighbors and encodes it into vectors with stacked graph convolutional layers [26]. Inspired by the attention mechanism, GAT [53] models the different relationships between the central node and its neighbors by updating the central node with the attended features of its neighbors. In addition, some work focuses on K-hop neighbors to enlarge the convolution receptive field and gather more node information [1, 6]. DGCN [49] extends the K-hop methods to directed graphs.
Spatial-based approaches conduct convolutions directly on the graph, operating on spatially close neighbors. The major challenge of spatial approaches is defining the convolution operation for differently sized neighborhoods while maintaining local invariance. The Diffusion Convolutional Neural Network (DCNN) [15] learns diffusion-based representations from graph-structured data that serve as an effective basis for node classification. Gated recurrent units and modern optimization techniques are employed in Gated Graph Sequence Neural Networks (GGSNN) [32] to modify and extend previous work to output sequences. PSCN [38] extracts locally connected regions from graphs and uses the normalized neighborhood graphs as the receptive fields of a CNN. GraphSAGE [11] is an inductive learning approach that generates representations for unseen nodes by aggregating features from a sampled set of neighbor nodes. In DGCNN [39], the representations of nodes in dynamic 3D point clouds are learned with local neighborhood information and then used in classification and segmentation. MatchGNN [55] learns the similarity between heterogeneous graphs with a path-relevant neighbor search. EGNN [22] adapts an edge-labeling graph for few-shot learning, enabling the evolution of an explicit clustering by iteratively updating the edge labels with direct exploitation of both intra-cluster similarity and inter-cluster dissimilarity. Our work is most related to the spatial-based GCN proposed in [19], which can update messages on directed graphs with high-dimensional node and edge features. However, that work is designed only for the image generation task and cannot capture the multi-modal information of the visual content and the semantic prior.
2.4 Self-Attention
Self-attention [52] was first proposed to find semantic or syntactic input-output alignments under an encoder-decoder framework, and it is especially effective in handling long-term dependencies. Due to its simplicity and effectiveness, it has been applied in a wide range of textual tasks, e.g., machine translation [2], sentence summarization [42], text generation [29], and question answering [28], as well as various visual tasks, e.g., image classification [36, 46] and image generation [8]. Self-attention has also been used to fuse multi-modal information in vision and language tasks, e.g., visual question answering [23, 35, 66] and image captioning [62]. For example, Lu et al. [35] propose the first co-attention learning framework that can alternately attend to the visual features of the image and the text features of the question. Kim et al. [23] propose bilinear attention networks (BAN) that can learn cooperative attention distributions to utilize the given vision-language information. The dense co-attention model [66] delivers significantly better VQA performance by establishing the complete interaction between each question word and each image region. It jointly models the self-attention of questions and images, as well as the question-guided attention of images, using a modular composition of basic attention units. Wiles et al. [56] design a spatial co-attention module to learn correspondences between a pair of images. Feng et al. [7] embed a co-attention mechanism to realize the parallel update of multi-modal features. More recently, Transformer-based foundation models, e.g., Oscar [30], Uniter [4], and ViLT [25], directly use tokens from different modalities as a concatenated sequence to capture cross-modal interactions via self-attention. To the best of our knowledge, no existing work focuses on jointly modeling the visual and semantic information of both objects and relations.
3 METHOD
To predict the motivation behind the action of an input image, we propose dual-SGCN to exploit the complex visual content and the semantic context prior by leveraging scene graphs. Figure 2 shows an overview of our model. We first extract scene graphs from images with a pre-trained scene graph generation network \(f_{SGG}\), which takes an image I as input and outputs a scene graph \(G=f_{SGG}(I)\). More details are in Section 3.1. Then, the dual-SGCN captures the visual and semantic structure information through multiple layers of node-oriented and edge-oriented message passing. The details of modeling the visual and semantic relations are illustrated in Sections 3.2 and 3.3, respectively. Next, we use a multi-modal interactive attention module to find the important object and relation features to build a global representation (Section 3.4). We use multi-task learning to train the model in an end-to-end manner (Section 3.5). Finally, we adopt unbiased motivation inference to alleviate the bias caused by the semantic context prior (Section 3.6).
Fig. 2. Overview of the proposed dual scene graph convolutional network (dual-SGCN). The scene graph is first generated from the input image with a pre-trained SGG model (Section 3.1). Then, visual and semantic graphs are created to conduct the node-oriented and edge-oriented message passing by dual GCN layers to capture the interaction relations between objects and relations (Sections 3.2 and 3.3). Next, the multi-modal interaction attention mechanism is adopted to comprehensively exploit the correlation between visual and semantic information and find the pivotal objects or relations in motivation prediction (Section 3.4). Finally, to alleviate the bias caused by the semantic context prior, the unbiased inference is adopted to predict the motivation labels (Section 3.6). The overall framework is learned in an end-to-end manner by a multi-task co-training scheme (Section 3.5).
3.1 Scene Graph Generation
In order to generate informative and comprehensive visually grounded scene graphs, we use the widely used method proposed by Tang et al. [48]. In this section, we briefly introduce the SGG method to make this paper self-contained.
A scene graph can be formally defined as a directed graph \(G=(O,E)\), which is a structured representation of the semantic content of an image. \(O=\lbrace o_1, \ldots ,o_n\rbrace\) is the set of nodes where each node \(o_i\) corresponds to an object detected from the image. The edge set \(E=\lbrace r_{ij}\rbrace\) consists of relationships among objects. Each edge \(r_{ij}\) denotes the relationship label between node \(o_i\) and node \(o_j\). In order to conveniently connect the scene graph with the image, the objects are always accompanied by bounding boxes \(B= {\lbrace b_1, \ldots ,b_n\rbrace }\) where \(b_{i}\in \mathbb {R}^{4}\) denotes the location coordinate and size of the corresponding object \(o_i\) in the image. Given an input image I, the SGG method aims to detect the bounding boxes of objects and recognize their categories and relationships. The inference process of the SGG model [67] can be formally defined as: (1) \(\begin{equation} Pr(G\mid I) = Pr(B\mid I) Pr(O\mid B,I) Pr(E\mid B,O,I). \end{equation}\) Here, the probabilities \(Pr(B\mid I)\) of object bounding boxes are predicted by a detector, i.e., Faster R-CNN [41]. To predict object labels \(Pr(O \mid B,I)\), visual features of bounding boxes are sequentially input into a bi-LSTM to obtain the contextualized representation of each box. Then, the object labels can be predicted by a decoding LSTM. To predict the probabilities \(Pr(E\mid B,O,I)\) of relations, the contextualized representation of each pair of boxes is computed based on the visual feature and the semantic label embedding of each box by another bi-LSTM. Relationships between objects are finally predicted based on visual features of objects, visual features of union boxes, and contextualized representations of object pairs. The major advantage of the adopted scene graph generation method [48] is that it can improve the conventional relation prediction by subtracting the counterfactual prediction result with intervention.
3.2 Visual Relation Modeling
We now introduce how to capture the structured visual content from the image based on the generated scene graph. Specifically, we construct a visual graph \(G^v=(F^o, F^r)\), which has the same structure as the scene graph G. The nodes in \(G^v\) are represented by the visual features of the corresponding objects: \(F^o=\lbrace f^o_1, \ldots ,f^o_n\rbrace\). \(f^o_i\in \mathbb {R}^{d_{in}}\) is the representation of the \(i^{th}\) object. More concretely, visual features of nodes are calculated with ROIAlign [13] according to the feature maps and bounding boxes produced by the object detector (i.e., Faster R-CNN [41]). The edge features are denoted as \(F^r=\lbrace f^r_{ij}\rbrace\). \(f^r_{ij}\in \mathbb {R}^{d_{in}}\) is the visual feature of the object pair \((o_i,o_j)\), which is represented by the ROIAlign [13] feature of the union box. Next, we will use GCN layers to capture the complex visual information contained in the visual graph \(G^v\). Each GCN layer consists of node-oriented message passing and edge-oriented message passing, which are designed to update the embeddings for nodes and edges, respectively.
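As a minimal sketch of how the edge features are grounded in the image (illustrative code, not the authors' implementation), the union box of an object pair can be computed from the two bounding boxes; its ROIAlign feature then represents the relation edge:

```python
# Minimal sketch (not the authors' code): the relation edge between two
# objects is represented by the ROIAlign feature of their union box,
# which can be computed from the two bounding boxes as follows.
def union_box(b1, b2):
    """Boxes are (x1, y1, x2, y2); the union box tightly covers both."""
    return (min(b1[0], b2[0]), min(b1[1], b2[1]),
            max(b1[2], b2[2]), max(b1[3], b2[3]))

print(union_box((10, 10, 50, 60), (40, 5, 90, 55)))  # (10, 5, 90, 60)
```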
Node-oriented Message Passing. The node-oriented message passing focuses on learning node embeddings by aggregating important information from neighborhood nodes and edges. Different neighbors play different roles and show different importance in generating node embeddings. For this reason, we update the embedding for each node by both node neighbors and edge neighbors and distinguish their influence with different transformations. For the convenience of introducing the node-oriented message passing, we first give the explicit definition of the neighborhood sets. For a node \(f^o_i\), we define its neighborhood sets as \(\mathcal {E}_i^{out},\mathcal {E}_i^{in},\mathcal {N}_i^{out},\mathcal {N}_i^{in}\). Specifically, \(\mathcal {E}_i^{out}\) is the set of edges that start with \(f^o_i\), and \(\mathcal {N}_i^{out}\) is the set of nodes that are connected to \(f^o_i\) by the edges in \(\mathcal {E}_i^{out}\). \(\mathcal {E}_i^{in}\) is the set of edges that end with \(f^o_i\), and \(\mathcal {N}_i^{in}\) is the set of nodes that are connected to \(f^o_i\) by the edges in \(\mathcal {E}_i^{in}\). Taking the input scene graph in Figure 3 as an example, the neighbor sets for node \(f^o_1\) are \(\mathcal {E}_1^{out}=\lbrace f^r_{1,3},f^r_{1,4}\rbrace\), \(\mathcal {E}_1^{in}=\lbrace f^r_{2,1},f^r_{5,1}\rbrace\), \(\mathcal {N}_1^{out}=\lbrace f^o_3,f^o_4\rbrace\), \(\mathcal {N}_1^{in}=\lbrace f^o_2,f^o_5\rbrace\).
Fig. 3. A toy example of the node-oriented and edge-oriented message passing in the visual branch of the dual-SGCN. The input scene graph consists of five nodes ( \(f^o_1\) , \(f^o_2\) , \(f^o_3\) , \(f^o_4\) , \(f^o_5\) ) and five edges ( \(f^r_{2,1}\) , \(f^r_{1,3}\) , \(f^r_{1,4}\) , \(f^r_{5,1}\) , \(f^r_{4,3}\) ). We take the node \(f^o_1\) and the edge \(f^r_{4,3}\) as examples. In node-oriented message passing, the node \(f^o_1\) is updated by in-degree and out-degree neighboring nodes and edges. In edge-oriented message passing, the edge \(f^r_{4,3}\) is updated by the in-degree and out-degree neighboring nodes. The node-oriented and edge-oriented message passing in the semantic branch is conducted in a similar way.
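The four neighborhood sets can be derived directly from the directed edge list of the scene graph. The toy sketch below (illustrative code, not the authors' implementation) reproduces the Figure 3 example:

```python
# Toy sketch (not the authors' implementation): deriving the four
# neighborhood sets of a node from the directed edge list of a scene
# graph. The edge list reproduces the Figure 3 example.
def neighbor_sets(edges, i):
    """Return (E_out, E_in, N_out, N_in) for node i."""
    e_out = [(s, t) for (s, t) in edges if s == i]  # edges starting at i
    e_in = [(s, t) for (s, t) in edges if t == i]   # edges ending at i
    n_out = [t for (_, t) in e_out]                 # out-degree neighbor nodes
    n_in = [s for (s, _) in e_in]                   # in-degree neighbor nodes
    return e_out, e_in, n_out, n_in

edges = [(2, 1), (1, 3), (1, 4), (5, 1), (4, 3)]
e_out, e_in, n_out, n_in = neighbor_sets(edges, 1)
print(e_out, n_in)  # [(1, 3), (1, 4)] [2, 5]
```

The returned sets match the example in the text: node 1 reaches nodes 3 and 4 through its outgoing edges, and is reached from nodes 2 and 5 through its incoming edges.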
For a node \(f^o_i\), new messages \(f^r_{out},f^o_{out},f^r_{in},f^o_{in}\) propagated from neighbors \(\mathcal {E}_i^{out},\mathcal {N}_i^{out},\mathcal {E}_i^{in},\mathcal {N}_i^{in}\) are formulated as follows: (2) \(\begin{equation} f^r_{out} = \frac{1}{|\mathcal {E}_i^{out}|}\sum _{r_{ij} \in \mathcal {E}_i^{out}}(W^c_{out}f^o_i + W^r_{out}f^r_{ij}), \end{equation}\) (3) \(\begin{equation} f^o_{out} = \frac{1}{|\mathcal {N}_i^{out}|}\sum _{o_j\in \mathcal {N}_i^{out}}(W^c_{out}f^o_i + W^o_{out}f^o_j), \end{equation}\) (4) \(\begin{equation} f^r_{in} = \frac{1}{|\mathcal {E}_i^{in}|}\sum _{r_{ji}\in \mathcal {E}_i^{in}}(W^c_{in}f^o_i + W^r_{in}f^r_{ji}), \end{equation}\) (5) \(\begin{equation} f^o_{in} = \frac{1}{|\mathcal {N}_i^{in}|}\sum _{o_j\in \mathcal {N}_i^{in}}(W^c_{in}f^o_i + W^o_{in}f^o_j). \end{equation}\) where \(W^c_{out}\) and \(W^c_{in} \in \mathbb {R}^{d_{g} \times d_{in}}\) are learnable weight matrices, which linearly transform the feature of the central node into different spaces to receive information from neighbors that have in or out connections with the central node. \(W^r_{out}\), \(W^o_{out}\), \(W^r_{in}\), \(W^o_{in} \in \mathbb {R}^{d_{g} \times d_{in}}\) are learnable weight matrices that are used to compute the update information from diverse neighbors. Finally, the new embedding of the node \(o_i\) can be computed by aggregating information from diverse neighbors: (6) \(\begin{equation} \tilde{f}^o_i = \sigma \left(\tilde{W}^c \left(f^r_{out} + f^o_{out} + f^r_{in} + f^o_{in}\right) \right). \end{equation}\) where \(\sigma\) denotes the activation function (i.e., LeakyReLU), and \(\tilde{W}^c \in \mathbb {R}^{d \times d_{g}}\) is a learnable weight matrix.
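Eqs. (2)-(6) can be sketched in NumPy as follows. The feature dimensions, random initialization, and dictionary-based graph representation are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Hedged NumPy sketch of the node-oriented message passing (Eqs. (2)-(6)).
# Weight names mirror the paper; dimensions, initialization, and the
# dictionary-based graph representation are illustrative assumptions.
rng = np.random.default_rng(0)
d_in = d_g = d = 8
W = {k: rng.standard_normal((d_g, d_in)) * 0.1
     for k in ("c_out", "c_in", "r_out", "o_out", "r_in", "o_in")}
W_c_tilde = rng.standard_normal((d, d_g)) * 0.1

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def update_node(f_o, f_r, edges, i):
    """Update node i from its four neighborhood sets (Eqs. (2)-(6))."""
    e_out = [e for e in edges if e[0] == i]   # edges starting at i
    e_in = [e for e in edges if e[1] == i]    # edges ending at i
    # Eq. (2): mean of messages from outgoing edges
    m_r_out = np.mean([W["c_out"] @ f_o[i] + W["r_out"] @ f_r[e] for e in e_out], axis=0)
    # Eq. (3): mean of messages from out-degree neighbor nodes
    m_o_out = np.mean([W["c_out"] @ f_o[i] + W["o_out"] @ f_o[j] for (_, j) in e_out], axis=0)
    # Eq. (4): mean of messages from incoming edges
    m_r_in = np.mean([W["c_in"] @ f_o[i] + W["r_in"] @ f_r[e] for e in e_in], axis=0)
    # Eq. (5): mean of messages from in-degree neighbor nodes
    m_o_in = np.mean([W["c_in"] @ f_o[i] + W["o_in"] @ f_o[j] for (j, _) in e_in], axis=0)
    # Eq. (6): aggregate and apply the LeakyReLU activation
    return leaky_relu(W_c_tilde @ (m_r_out + m_o_out + m_r_in + m_o_in))

edges = [(2, 1), (1, 3), (1, 4), (5, 1), (4, 3)]           # the Figure 3 graph
f_o = {i: rng.standard_normal(d_in) for i in range(1, 6)}  # node features
f_r = {e: rng.standard_normal(d_in) for e in edges}        # edge features
print(update_node(f_o, f_r, edges, 1).shape)  # (8,)
```

A practical implementation would also need to handle nodes whose in- or out-neighborhood is empty; node 1 of the Figure 3 graph has both kinds of neighbors, so the sketch omits that case.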
Edge-oriented Message Passing. In the edge-oriented message passing, the neighbors of an edge \(r_{i,j}\) are defined as its start node \(o_i\) and end node \(o_j\). Taking the input scene graph in Figure 3 as an example, the out-degree neighbor of the edge \(f^r_{4,3}\) is the node \(f^o_3\) and the in-degree neighbor is the node \(f^o_4\). Similar to the node-oriented message passing, the new messages propagated from neighbors for edge \(r_{i,j}\) are computed as follows: (7) \(\begin{equation} f^o_{in} = W^r f^r_{ij} + W^o_{in}f^o_i, \end{equation}\) (8) \(\begin{equation} f^o_{out} = W^r f^r_{ij} + W^o_{out}f^o_j. \end{equation}\) where \(W^r,W^o_{in},W^o_{out} \in \mathbb {R}^{d_{g} \times d_{in}}\) are learnable weight matrices. Then, we aggregate messages from neighbors to compute the new embedding of the edge \(r_{ij}\) as follows: (9) \(\begin{equation} \tilde{f}^r_{ij} = \sigma (\tilde{W}^r(f^o_{in} + f^o_{out})). \end{equation}\) where \(\tilde{W}^r \in \mathbb {R}^{d \times d_{g}}\) is the learnable weight matrix of the linear transform and \(\sigma\) is an activation function (i.e., LeakyReLU).
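Eqs. (7)-(9) can likewise be sketched in NumPy; dimensions and random initialization are illustrative assumptions:

```python
import numpy as np

# Hedged NumPy sketch of the edge-oriented message passing (Eqs. (7)-(9)).
# Dimensions and random initialization are illustrative assumptions.
rng = np.random.default_rng(1)
d_in = d_g = d = 8
W_r = rng.standard_normal((d_g, d_in)) * 0.1      # transforms the edge feature
W_o_in = rng.standard_normal((d_g, d_in)) * 0.1   # transforms the start node
W_o_out = rng.standard_normal((d_g, d_in)) * 0.1  # transforms the end node
W_r_tilde = rng.standard_normal((d, d_g)) * 0.1

def update_edge(f_r_ij, f_o_i, f_o_j, slope=0.2):
    """Update edge (i, j) from its start node o_i and end node o_j."""
    m_in = W_r @ f_r_ij + W_o_in @ f_o_i    # Eq. (7)
    m_out = W_r @ f_r_ij + W_o_out @ f_o_j  # Eq. (8)
    z = W_r_tilde @ (m_in + m_out)          # Eq. (9), before activation
    return np.where(z > 0, z, slope * z)    # LeakyReLU

# e.g., updating the edge f^r_{4,3} of Figure 3 from nodes f^o_4 and f^o_3
f_r_43, f_o_4, f_o_3 = (rng.standard_normal(d_in) for _ in range(3))
print(update_edge(f_r_43, f_o_4, f_o_3).shape)  # (8,)
```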
3.3 Semantic Relation Modeling
The semantic relations of objects contained in scene graphs can provide complementary information to the visual structure information captured in Section 3.2. Therefore, we propose to construct a semantic graph to explicitly model the semantic relations in scene graphs. The semantic graph has the same structure as the visual scene graph but different representations of nodes and edges. The nodes and edges are represented by the pre-trained semantic embeddings [19] of the corresponding object labels and relation labels, respectively. With this scheme, we can capture the essential context prior from the high-level semantic representations of the objects and relationships for predicting the action motivation.
More specifically, we construct the semantic graph \(G^s=(L^o,L^r)\) based on the structure of the scene graph \(G=(O,E)\). Each node \(l^o_i \in L^o\) denotes the label embedding of the corresponding object \(o_i\). Each edge \(l^r_{ij} \in L^r\) denotes the label embedding of the corresponding relationship \(r_{ij}\) between two objects \(o_i\) and \(o_j\). We leave more details of computing the semantic embeddings to the experiment section. Next, we take GCN layers as in Section 3.2 to capture the semantic context information in the scene graph. Finally, we obtain the new embedding \(\tilde{l}_i^o\) of the node (object) \(o_i\) and the new embedding \(\tilde{l}_{ij}^r\) of the edge (relation) \(r_{ij}\). Note that the GCNs used in the visual and semantic relation modeling have the same network architecture but use independent parameters.
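A toy sketch of initializing the semantic graph \(G^s\) follows; the vocabularies, the embedding dimension, and the random embedding tables are hypothetical stand-ins for the pre-trained label embeddings used in the paper [19]:

```python
import numpy as np

# Toy sketch of initializing the semantic graph G^s = (L^o, L^r). The
# vocabularies and random embedding tables below are hypothetical
# stand-ins for the pre-trained label embeddings used in the paper [19].
rng = np.random.default_rng(3)
obj_vocab = {"man": 0, "dog": 1, "street": 2}
rel_vocab = {"leading": 0, "on": 1}
dim = 300  # a common word-embedding size; an assumption here
E_obj = rng.standard_normal((len(obj_vocab), dim))
E_rel = rng.standard_normal((len(rel_vocab), dim))

# A hypothetical scene graph: man -leading-> dog, man -on-> street
nodes = ["man", "dog", "street"]
edges = {(0, 1): "leading", (0, 2): "on"}
L_o = {i: E_obj[obj_vocab[name]] for i, name in enumerate(nodes)}  # node embeddings
L_r = {e: E_rel[rel_vocab[rel]] for e, rel in edges.items()}       # edge embeddings
print(len(L_o), len(L_r))  # 3 2
```

The resulting graph shares the structure (nodes and directed edges) of the visual graph, differing only in how the features are initialized.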
3.4 Multi-Modal Interaction Attention
After the visual and semantic relation modeling in Sections 3.2 and 3.3, we obtain the updated multi-modal features (i.e., visual and semantic node features) of objects and relations. Considering that some pivotal objects or relations may play more important roles in motivation prediction, we introduce a Multi-Modal Interaction Attention (MMIA) to comprehensively explore the complex correlations between objects and relations and to more effectively represent the visual and semantic information in the image. As shown in Figure 4(a), the core of the proposed MMIA is the Cross-Modal Self-Attention (CMSA) module, which is designed to calculate the attended features based on the interaction among multi-modal features. In addition, we design a Modality-Aware Multi-Head (MAMH) mechanism to combine multi-modal features learned from multiple representation sub-spaces. Layer normalization and feed-forward layers are adopted as in conventional self-attention [52].
Fig. 4. Architecture of Multi-Modal Interaction Attention (MMIA). The MMIA has two important modules: Cross-Modal Self-Attention (CMSA) and Modality-Aware Multi-Head (MAMH). The former is designed for calculating the attended features based on the interaction among multi-modal features. The latter is designed to combine multi-modal features learned from multiple representation sub-spaces.
For the convenience of explanation, we use \({\bf X}=[\tilde{F}^o,\tilde{L}^o,\tilde{F}^r,\tilde{L}^r]\) to denote all the multi-modal features of objects and relations. Here, \(\tilde{F}^o=[\tilde{f}^o_1; \ldots ;\tilde{f}^o_n]\in \mathbb {R}^{n\times d}\) and \(\tilde{L}^o=[\tilde{l}^o_1; \ldots ;\tilde{l}^o_n]\in \mathbb {R}^{n\times d}\) are visual and semantic features of objects obtained in previous sections. \(\tilde{F}^r=[\tilde{f}^r_1; \ldots ;\tilde{f}^r_m]\in \mathbb {R}^{m\times d}\) and \(\tilde{L}^r=[\tilde{l}^r_1; \ldots ;\tilde{l}^r_m]\in \mathbb {R}^{m\times d}\) are visual and semantic features of relations. Then we can formally define CMSA as follows: (10) \(\begin{equation} CMSA({\bf X},i)=Attention({\bf X}W^Q,X_i W^K,X_i W^V). \end{equation}\) where \({\bf X}\in \mathbb {R}^{(2n+2m)\times d}\) is the concatenation of multi-modal features, and i represents the \(i^{th}\) modality and \(X_i\in \lbrace \tilde{F}^o,\tilde{L}^o,\tilde{F}^r,\tilde{L}^r\rbrace\). \(W^Q\in \mathbb {R}^{d\times d^{\prime }}\), \(W^K\in \mathbb {R}^{d\times d^{\prime }}\) and \(W^V\in \mathbb {R}^{d\times d^{\prime }}\) are learnable parameter matrices for the linear projections. \(d^{\prime }\) is the output dimension for attended features. Attention is the scaled dot-product attention [52] which is the base module in self-attention. It uses a query and a set of key-value pairs to compute the weights and obtain a set of weighted sum of the values. More specifically, it computes the dot products of each query with all keys, and applies a scaling factor and a Softmax function to compute the weights on the values to obtain the attended outputs: (11) \(\begin{equation} Attention(Q,K,V)=Softmax\left(\frac{QK^T}{\sqrt {d^{\prime }}}\right)V. \end{equation}\) The detailed calculation process of CMSA is shown in Figure 4(b).
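The scaled dot-product attention of Equation (11) and the cross-modal query/key split of Equation (10) can be sketched in a few lines of NumPy. Dimensions are toy values and all weight matrices are random stand-ins for the learned projections:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (11)."""
    d_prime = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_prime)) @ V

def cmsa(X, X_i, W_Q, W_K, W_V):
    """Cross-modal self-attention, Eq. (10): queries come from the
    concatenated multi-modal features X, keys/values from modality X_i."""
    return attention(X @ W_Q, X_i @ W_K, X_i @ W_V)

rng = np.random.default_rng(0)
n, m, d, d_prime = 3, 2, 16, 8          # toy sizes (d = 128 in the paper)
F_o, L_o = rng.normal(size=(n, d)), rng.normal(size=(n, d))
F_r, L_r = rng.normal(size=(m, d)), rng.normal(size=(m, d))
X = np.concatenate([F_o, L_o, F_r, L_r], axis=0)   # (2n+2m, d)

W_Q, W_K, W_V = (rng.normal(size=(d, d_prime)) for _ in range(3))
out = cmsa(X, L_o, W_Q, W_K, W_V)  # attend X to the semantic-object modality
print(out.shape)                   # (10, 8), i.e. (2n+2m, d')
```

Each of the four modalities supplies keys and values for one such call; the full set of queries always comes from the concatenated features.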
Next, we use the MAMH mechanism to combine the attended multi-modal features, which is conducive to jointly attending to multi-modal features from different representation sub-spaces. It concatenates the outputs of h independent attention functions, each with its own projected Q, K and V, and combines them with a learned projection: (12) \(\begin{equation} \bar{\bf X}=MAMH({\bf X})=Concat[CMSA({\bf X},1), \ldots ,CMSA({\bf X},h)] W^H. \end{equation}\) where h is the number of modalities (h is 4 in this work), \(W^H\in \mathbb {R}^{hd^{\prime }\times d_{h}}\) is a learnable weight matrix, and \(\bar{\bf X} \in \mathbb {R}^{(2n+2m)\times d_{h}}\).
As shown in Figure 4(a), after the multi-head operation, we perform the same point-wise feed-forward layer as in conventional self-attention. The feed-forward layer is a fully connected feed-forward network consisting of two linear transformations and a ReLU activation: (13) \(\begin{equation} {\bf X}^{\prime }=FFN(\bar{\bf X})=ReLU(\bar{\bf X}W_1+b_1)W_2+b_2. \end{equation}\) where \(W_1 \in \mathbb {R}^{d_{h} \times d_{h^{\prime }}}\) and \(W_2 \in \mathbb {R}^{d_{h^{\prime }} \times d_{out}}\) are learnable matrices, and \(b_1\) and \(b_2\) are bias vectors. After applying layer normalization on \({\bf X}^{\prime }\), we obtain the output of the MMIA.
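To make Equations (12) and (13) concrete, the following NumPy sketch combines four pre-computed per-modality attended outputs (random placeholders standing in for the CMSA results) via concatenation, projection, the feed-forward layer, and layer normalization. Toy dimensions; the paper sets \(d^{\prime}=d_h=d_{h^{\prime}}=d_{out}=128\):

```python
import numpy as np

def mamh(cmsa_outputs, W_H):
    """Modality-aware multi-head, Eq. (12): concatenate the h per-modality
    attended features and project them with W^H."""
    return np.concatenate(cmsa_outputs, axis=-1) @ W_H

def ffn(X, W1, b1, W2, b2):
    """Point-wise feed-forward layer, Eq. (13)."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def layer_norm(X, eps=1e-5):
    mu = X.mean(-1, keepdims=True)
    var = X.var(-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
N, d_prime, d_h, d_hp, d_out = 10, 8, 16, 32, 16  # toy sizes
h = 4                                             # one head per modality
heads = [rng.normal(size=(N, d_prime)) for _ in range(h)]  # stand-ins for CMSA(X, i)

W_H = rng.normal(size=(h * d_prime, d_h))
W1, b1 = rng.normal(size=(d_h, d_hp)), np.zeros(d_hp)
W2, b2 = rng.normal(size=(d_hp, d_out)), np.zeros(d_out)

X_bar = mamh(heads, W_H)                          # (N, d_h)
X_prime = layer_norm(ffn(X_bar, W1, b1, W2, b2))  # MMIA output, (N, d_out)
print(X_prime.shape)
```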
3.5 Multi-task Learning
Considering the co-occurrence relationship between concepts of scene, action and motivation, we rely on multi-task co-training to learn the model in an end-to-end form. By jointly predicting the scene, the action of the person and the motivation behind the action in an image, the multi-task co-training mechanism can explore the prior knowledge of the co-occurrence relationship between these three kinds of concepts. Specifically, we employ three classifiers (i.e., \(\phi _s\), \(\phi _a\), and \(\phi _m\)) based on the output of the MMIA to get the corresponding prediction results of scene, action and motivation, respectively. The independent classifiers \(\phi _s\), \(\phi _a\), and \(\phi _m\) are defined as a fully connected layer with Softmax activation. We employ cross-entropy loss to learn the parameters of the overall framework: (14) \(\begin{equation} \mathcal {L}=\lambda _s \mathcal {L}_{ce}\big (\phi _s(\delta ({\bf X}^{\prime })),y_s\big) + \lambda _a \mathcal {L}_{ce}\big (\phi _a(\delta ({\bf X}^{\prime })),y_a\big) + \lambda _m \mathcal {L}_{ce}\big (\phi _m(\delta ({\bf X}^{\prime })),y_m\big). \end{equation}\) where \(\lambda _s\), \(\lambda _a\) and \(\lambda _m\) are balance factors. \(\mathcal {L}_{ce}\) represents the cross-entropy loss. \(y_s\), \(y_a\) and \(y_m\) are ground-truth labels of the scene, action and motivation, respectively. \(\delta\) denotes the average pooling operation which aggregates the multi-modal features into a single vector with the dimension of \(d_{out}\).
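A minimal sketch of the multi-task loss of Equation (14), with toy class counts and randomly initialized linear classifiers standing in for \(\phi_s\), \(\phi_a\), \(\phi_m\) (the dataset has 100 scenes, 100 actions, and 256 motivations; the balance factors follow Section 4.3):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def cross_entropy(logits, y):
    """Cross-entropy of the true class y under softmax(logits)."""
    return -np.log(softmax(logits)[y] + 1e-12)

rng = np.random.default_rng(0)
N, d_out = 10, 16
X_prime = rng.normal(size=(N, d_out))   # MMIA output features
delta = X_prime.mean(axis=0)            # average pooling delta

# Three independent linear classifiers (toy class counts).
n_classes = {"s": 5, "a": 5, "m": 7}
phi = {k: rng.normal(size=(d_out, n)) for k, n in n_classes.items()}
y = {"s": 1, "a": 2, "m": 3}            # toy ground-truth labels
lam = {"s": 0.3, "a": 0.4, "m": 0.5}    # balance factors from Sec. 4.3

loss = sum(lam[k] * cross_entropy(delta @ phi[k], y[k]) for k in "sam")
print(float(loss))
```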
3.6 Unbiased Inference
The above sections have introduced how to exploit both visual and semantic structure information to predict motivations by leveraging scene graphs. As illustrated in the introduction, the semantic information can reflect a context prior which is usually conducive to motivation prediction. However, the semantic information contained in the object labels or relation labels may also introduce a harmful long-tailed bias.
To remove the long-tailed bias, a straightforward way is to use conventional debiasing methods, e.g., re-sampling [3, 12] and re-weighting [33]. However, these conventional methods may also hinder learning the semantic context priors that benefit motivation prediction. In this work, inspired by the unbiased predicting approach in [48], we use Total Direct Effect [48] as the unbiased inference result of the motivation prediction. Specifically, for an input image I, we define the output logits from the motivation classifier \(\phi _m\) as \(Pr_m(I)\), where the visual features of the visual scene graph and the semantic features of the semantic scene graph are both used in prediction. Then, we introduce the counterfactual alternative \(Pr_m(\bar{I})\), where the visual features of objects and relations in the visual scene graph are masked to zero vectors instead. We remove the harmful long-tailed bias by computing the difference between the conventional likelihood and the counterfactual prediction: (15) \(\begin{equation} \widetilde{Pr}_{m}(I)=Pr_m(I)-\gamma _m Pr_m(\bar{I}). \end{equation}\) where \(\gamma _m\) represents the weighting factor that controls the relative importance of the counterfactual term. For the prediction of actions and scenes, we apply the same counterfactual debiasing mechanism with factors \(\gamma _a\) and \(\gamma _s\) to obtain the unbiased results.
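The Total Direct Effect computation of Equation (15) can be sketched as follows. The linear `model` is a hypothetical stand-in for the motivation classifier over concatenated visual and semantic features; only the subtraction of the masked-input counterfactual is the point being illustrated:

```python
import numpy as np

def tde_logits(model, visual_feats, semantic_feats, gamma):
    """Unbiased inference via Total Direct Effect, Eq. (15): subtract a
    weighted counterfactual prediction in which the visual features are
    masked to zeros while the semantic context is kept."""
    factual = model(visual_feats, semantic_feats)                       # Pr(I)
    counterfactual = model(np.zeros_like(visual_feats), semantic_feats) # Pr(I-bar)
    return factual - gamma * counterfactual

# Hypothetical classifier: a linear map over concatenated features.
rng = np.random.default_rng(0)
d, n_classes = 8, 5
W = rng.normal(size=(2 * d, n_classes))
model = lambda v, s: np.concatenate([v, s]) @ W

v, s = rng.normal(size=d), rng.normal(size=d)
unbiased = tde_logits(model, v, s, gamma=0.1)  # gamma_m = 0.1 in Sec. 4.3
print(unbiased.shape)
```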
4 EXPERIMENTS
In this section, we first describe the experimental setting and implementation details, then present the results on motivation prediction, followed by detailed ablation studies and parameter analysis.
4.1 Dataset
We conduct the experiments on the Motivation dataset [54], which consists of 10,191 images selected from Microsoft COCO [34]. Each image in this dataset is annotated with three phrases describing the scene in the image, the action being performed, and the believed motivation of the person, respectively. We follow the experiment setting in [54], where actions and scenes are clustered into 100 classes, and motivations are clustered into 256 classes. We split the dataset into training and testing sets with a ratio of 75%:25%.
4.2 Evaluation Metrics
To fairly evaluate motivation prediction models, we should take into consideration that the dataset has a long-tail issue and that the person in an image may have multiple prospective motivations. Therefore, we follow Vondrick et al. [54] and take the median rank as the evaluation metric. For each image, it finds the rank of the ground-truth label in the list of motivation predictions sorted by prediction score; the median rank is then the median of these ranks over all images in the dataset.
In addition, we also employ another metric mA@K (mean A@K) to obtain more comprehensive evaluation results. A@K is the top-K accuracy, i.e., the fraction of images whose ground-truth label is included among the K predicted labels with the highest confidence scores. mA@K computes A@K on the test images of each category and takes the average over all categories; we report mA@1, mA@5, and mA@10. It is worth noting that we use the same evaluation metrics for predicting scenes, actions, and motivations.
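The two metrics above can be implemented in a few lines of NumPy; the toy score matrix below is illustrative only:

```python
import numpy as np

def median_rank(scores, labels):
    """Rank (1-based) of the ground-truth label in the score-sorted
    prediction list, reported as the median over all images."""
    order = np.argsort(-scores, axis=1)
    ranks = [np.where(order[i] == y)[0][0] + 1 for i, y in enumerate(labels)]
    return float(np.median(ranks))

def mean_a_at_k(scores, labels, k):
    """mA@K: per-category top-K accuracy (A@K), averaged over categories."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = np.array([y in topk[i] for i, y in enumerate(labels)], float)
    cats = np.unique(labels)
    return float(np.mean([hits[labels == c].mean() for c in cats]))

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 2, 2])
print(median_rank(scores, labels))        # 1.0 (ranks are 1, 3, 1)
print(mean_a_at_k(scores, labels, k=1))   # 0.75 (class 1: 1.0, class 2: 0.5)
```

Averaging per category first, as mA@K does, prevents head categories from dominating the score under the long-tail distribution.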
4.3 Implementation Details
In our approach, to generate scene graphs from images, we adopt the scene graph generation method proposed in [48], which uses a Faster R-CNN with a ResNeXt-101-FPN backbone as the object detector. To save storage space and computing cost, we only retain the objects with the top 20 confidence scores and the relations with the top 30 confidence scores. To ensure that the selected objects and relations compose a complete graph, we further remove the relations that are associated with unselected objects. For the visual scene graph, we map high-dimensional visual features (i.e., 4096 for VGG [45], 2048 for ResNet [14] and ResNeXt [60]) to 128-dimensional representations (i.e., \(d_{in}=128\)) by a linear transformation. For the semantic scene graph, we compute the semantic representations of the object and relation labels based on the pre-trained 128-dimensional word embeddings [19]. For the two branches of the dual-SGCN, the number of GCN layers is set to 3, and the dimension of the hidden layer \(d_{g}\) is set to 512. We initialize the network parameters according to the first three layers of the pre-trained scene graph-based GCN model [19]. The output dimension d of the dual-SGCN is set to 128. In MMIA, the dimensions \(d^{\prime }\), \(d_h\), \(d_{h^{\prime }}\) and \(d_{out}\) are all set to 128. The balance weights \(\lambda _m\), \(\lambda _a\) and \(\lambda _s\) of the loss function are set to 0.5, 0.4 and 0.3, respectively. The weighting factors \(\gamma _m\), \(\gamma _a\), and \(\gamma _s\) in the unbiased inference are all set to 0.1. To train our model, we use the Adam optimizer and set the batch size to 1. The learning rate is set to 1e-5. For the baseline models, e.g., CNN*-VGG16, we set the batch size to 32 and the learning rate to 1e-3 to obtain the best results.
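The graph-pruning step described above (keep the highest-confidence objects and relations, then drop relations whose endpoints were pruned) can be sketched as follows; the tuple-based data structures are hypothetical stand-ins for the detector output:

```python
def prune_scene_graph(objects, relations, max_obj=20, max_rel=30):
    """objects: list of (obj_id, score); relations: list of
    (subj_id, obj_id, score). Returns the pruned, consistent graph."""
    kept_obj = sorted(objects, key=lambda o: -o[1])[:max_obj]
    kept_ids = {o[0] for o in kept_obj}
    kept_rel = sorted(relations, key=lambda r: -r[2])[:max_rel]
    # Remove relations associated with unselected objects so that the
    # retained objects and relations compose a complete graph.
    kept_rel = [r for r in kept_rel if r[0] in kept_ids and r[1] in kept_ids]
    return kept_obj, kept_rel

objs = [(0, 0.9), (1, 0.8), (2, 0.1)]
rels = [(0, 1, 0.7), (0, 2, 0.6)]
print(prune_scene_graph(objs, rels, max_obj=2, max_rel=2))
# object 2 is pruned, so relation (0, 2) is dropped as well
```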
4.4 Result Analysis
4.4.1 Comparison with Existing Methods.
We report the prediction results of our approach and the baselines in Table 1. We compare our model with the previous methods Random, CNN-VGG16, and V&T. Random (RM in Table 1) [54] denotes a random classifier that assigns an input sample to a class with likelihood proportional to the sample size of the class. CNN-VGG16 represents the Visual-Only method proposed in [54], which trains a linear classifier on the visual features extracted by a pre-trained VGG16 [45] network. For fair comparison, we improve the CNN-VGG16 model by fine-tuning it on the training data. Specifically, the first 15 layers of CNN-VGG16 are initialized with the parameters pre-trained on ImageNet. The last layer of CNN-VGG16 is replaced with three randomly initialized linear classifiers for motivation, action, and scene prediction, respectively. We fine-tune the whole model with the multi-task co-training mechanism (Section 3.5). The improved model is denoted as CNN*-VGG16. Moreover, we replace the backbone of CNN*-VGG16 with ResNet101 [14] and ResNeXt101 [60] to obtain the baseline methods CNN*-R101 and CNN*-RX101. V&T [54] represents the Visual+Text method proposed in [54], which infers the motivation by using pre-trained natural language models to mine knowledge stored in massive public text corpora. To apply the intent recognition method proposed in Intentonomy [17] to motivation prediction, we adopt the localization loss to localize important person and object regions in a weakly-supervised manner with the help of CAM [71], as in [17]. For a fair comparison, we use the visual backbone of ResNeXt101 as in our model.
| Model | Motivation | Action | Scene | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MR | mA@1 | mA@5 | mA@10 | MR | mA@1 | mA@5 | mA@10 | MR | mA@1 | mA@5 | mA@10 | |
| RM | 62 | 0.40 | 1.98 | 3.95 | 29 | 1.00 | 5.00 | 10.00 | 15 | 1.00 | 5.00 | 10.00 |
| CNN-VGG16 [54] | 39 | - | - | - | 17 | - | - | - | 4 | - | - | - |
| CNN*-VGG16 | 32 | 2.84 | 10.90 | 18.30 | 16 | 6.87 | 20.25 | 31.92 | 5 | 7.71 | 26.15 | 40.35 |
| CNN*-R101 | 33 | 3.18 | 11.06 | 17.68 | 17 | 5.79 | 19.92 | 30.43 | 6 | 7.93 | 25.24 | 37.68 |
| CNN*-RX101 | 33 | 3.14 | 11.05 | 18.28 | 18 | 5.91 | 20.19 | 30.34 | 6 | 8.25 | 27.19 | 40.64 |
| V&T [54] | 27 | - | - | - | 14 | - | - | - | 3 | - | - | - |
| Intentonomy [17] | 32 | 3.40 | 11.64 | 18.76 | 17 | 7.48 | 21.08 | 32.13 | 5 | 8.93 | 27.93 | 41.08 |
| dual-SGCN | 14 | 6.74 | 22.20 | 33.21 | 8 | 13.52 | 37.24 | 50.99 | 3 | 13.32 | 40.09 | 54.63 |
There are 256 motivations, 100 actions, and 100 scenes.
Table 1. Comparison of Median Rank (MR, Lower is Better) and Mean A@K (mA@K, Higher is Better) Results
Because the number of images in each category is different, the RM [54] model tends to classify samples into the classes with the largest sample sizes. As shown in Table 1, its classification results are therefore better than a completely uniform random guess. It is worth noting that the existing work V&T [54] only uses the metric of median rank (MR) to evaluate model performance on the three tasks. Since its authors do not provide the code or the text corpus, it is difficult to reproduce mA@K results for V&T; therefore, we only show the MR results of the V&T model for fair comparison. Nevertheless, the significant improvement on MR still demonstrates the effectiveness of our approach. For the CNN-VGG16 [54] model, we find that there is a difference between the MR result reported by Vondrick et al. [54] and the one reproduced by our implementation (i.e., CNN*-VGG16). Therefore, we show both the original and the reproduced results for fair comparison. Compared with CNN-VGG16, CNN*-VGG16 improves the performance by 7 and 1 under the MR metric on the motivation and action prediction tasks, respectively, which verifies the effectiveness of the multi-task co-training mechanism. In addition, the slight performance drop in scene prediction indicates that the motivation and action prediction tasks cannot well complement scene prediction. The CNN*-VGG16 fails to outperform V&T [54], which demonstrates the limitation of leveraging only visual information in motivation prediction. The performance of CNN*-VGG16 is competitive with CNN*-R101 and CNN*-RX101, which shows that simply using more sophisticated deep models to exploit the visual information is ineffective for motivation prediction. The Intentonomy [17] model improves the MR and mA@10 results by 1 and 0.48%, respectively, compared with CNN*-RX101 in motivation prediction, which shows the effectiveness of the localization loss proposed in [17].
Our approach outperforms the V&T [54] model by a margin of 13 under the MR metric in motivation prediction, and outperforms the Intentonomy model [17] by 18. In action prediction, our approach achieves an MR of 8, outperforming the V&T model by a margin of 6. In scene prediction, the MR result of our approach is the same as that of V&T. However, our method outperforms the Intentonomy model by 4.39% under the metric of mA@1 in scene prediction. The above results demonstrate the effectiveness of modeling the visual and semantic structured information in scene graphs.
4.4.2 Ablation Studies.
To further evaluate the effectiveness of our model, we conduct ablation studies as shown in Table 2. Here, we mainly consider the necessity of the major components in our model: the GCN module, the MMIA module, the unbiased inference module, and dual architecture (i.e., semantic scene graph and visual scene graph).
| Method | Motivation | Action | Scene | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MR | mA@1 | mA@5 | mA@10 | MR | mA@1 | mA@5 | mA@10 | MR | mA@1 | mA@5 | mA@10 | |
| No GCN | 16 | 5.47 | 19.98 | 30.59 | 9 | 13.04 | 34.74 | 48.48 | 3 | 12.50 | 38.08 | 54.32 |
| S-SG Only | 20 | 5.05 | 15.70 | 23.98 | 11 | 11.39 | 28.52 | 40.26 | 4 | 9.95 | 31.17 | 45.88 |
| V-SG Only | 16 | 5.56 | 18.86 | 28.81 | 9 | 11.51 | 34.88 | 47.76 | 3 | 11.6 | 34.75 | 50.71 |
| No MMIA-Avg | 16 | 7.04 | 21.47 | 31.15 | 9 | 12.93 | 36.49 | 50.00 | 3 | 13.29 | 39.09 | 53.86 |
| No MMIA-LF | 16 | 5.27 | 18.04 | 28.32 | 9 | 12.56 | 33.28 | 46.06 | 3 | 11.69 | 33.77 | 49.24 |
| Biased | 15 | 6.36 | 21.37 | 32.09 | 9 | 13.45 | 35.71 | 50.43 | 3 | 13.20 | 39.77 | 54.18 |
| Focal | 15 | 6.20 | 20.69 | 31.05 | 9 | 12.95 | 34.86 | 49.33 | 3 | 13.66 | 38.93 | 52.64 |
| Reweight | 15 | 6.36 | 20.89 | 31.79 | 9 | 12.53 | 35.38 | 49.72 | 3 | 13.17 | 36.95 | 53.34 |
| Resample | 16 | 6.48 | 21.83 | 32.58 | 9 | 13.68 | 36.55 | 50.24 | 3 | 13.07 | 40.08 | 53.99 |
| dual-SGCN | 14 | 6.74 | 22.20 | 33.21 | 8 | 13.52 | 37.24 | 50.99 | 3 | 13.32 | 40.09 | 54.63 |
Table 2. Ablation Results of Median Rank (MR, Lower is Better) and Mean A@K (mA@K, Higher is Better)
No GCN: To evaluate the effectiveness of the GCN layers, we remove the GCN operation in both the semantic and visual branches. The semantic features of objects and relations are represented as the initial embedding vectors. Both visual and semantic features are first mapped to the same dimension through linear projections and then used in the MMIA module. This variant does not explicitly reason about the interaction between different objects within a single modality, but blindly combines the features of different objects with the structure information of the scene. In motivation prediction, our model achieves an improvement of 2 under the MR metric, as well as an improvement of 1.27% on mA@1, compared with the model without the GCN module. The results in Table 2 demonstrate the advantage of the adopted GCN layers.
V-SG Only and S-SG Only independently use the visual scene graph and the semantic scene graph, respectively. In these two variant models, we predict the classification results from the output features of one branch of the GCN layers without considering the interaction between different modalities. We remove the MMIA module and simply combine the output of the GCN layers with average pooling. The experimental results demonstrate that the visual scene graph consistently yields better results than the semantic scene graph. The semantic context prior learned from the object and relation labels is useful for inferring the motivation based on the global structure information of the scene, but the semantic prior alone is not enough to capture the complex details contained in the image, and thus results in inferior performance. Specifically, in motivation prediction, the V-SG Only model outperforms the S-SG Only model by 4 on MR and by 3.16% on mA@5. However, the V-SG Only model still has a performance gap of 2 on MR compared with our full approach.
No MMIA-Avg and No MMIA-LF: To evaluate whether the MMIA module can effectively capture the interaction between the multi-modal features learned from the visual graph and the semantic graph, we design two variant models. No MMIA-Avg denotes that we remove the MMIA module and directly predict the classification results by combining the visual and semantic features learned by the dual GCN layers with average pooling. No MMIA-LF denotes that we simply replace the MMIA module with a late fusion scheme. This fusion mechanism is analogous to soft voting, which sums the output logits of the V-SG Only and S-SG Only models. We observe that the No MMIA-Avg model performs worse than our model by 2.06% on mA@10 in motivation prediction. In addition, No MMIA-LF performs worse than our model by 1.47% on mA@1, and even worse than V-SG Only, which demonstrates that the simple late fusion scheme is not effective in capturing the complementary information contained in the visual and semantic scene graphs.
Biased: This variant model directly uses the conventional likelihood (i.e., output logits) to predict the classification results, and performs worse than our model by 1.12% on mA@10 in motivation prediction. The results demonstrate that the adopted unbiased motivation inference mechanism (Section 3.6) is more effective because it alleviates the harmful long-tailed bias caused by the semantic context prior.
Focal [33], Reweight and Resample [3]: We also compare our model with three conventional debiasing methods. Focal loss is defined as \(L_{fl}=-(1-p_t)^\gamma \log (p_t)\), which modifies the cross-entropy loss to automatically down-weight well-learned samples and pay more attention to hard ones. We follow [33] to set the hyper-parameters as \(\gamma = 2.0\) and \(\alpha = 0.25\). Weighted cross-entropy loss is another widely used debiasing method; for the three tasks, the inverse sample fractions are assigned to the categories as weights. For the Resample method, each sample in the training dataset is resampled according to the inverse sample fractions, so rare categories are up-sampled. Our method achieves an improvement of 2.16% on mA@10 in motivation prediction compared with the Focal model. The Reweight model performs worse than ours by 1.42% on mA@10 in motivation prediction, and the Resample model performs worse than ours by 2 on MR. The results demonstrate that the adopted counterfactual debiasing method is more effective.
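For reference, the focal loss baseline above (with the \(\alpha\)-balancing term of [33]) reduces to a one-line function of the probability \(p_t\) assigned to the true class:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """Focal loss [33] on the true-class probability p_t:
    L_fl = -alpha * (1 - p_t)^gamma * log(p_t)."""
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# A well-classified sample (p_t = 0.9) is down-weighted far more strongly
# than a hard one (p_t = 0.1), focusing training on hard examples.
easy, hard = focal_loss(0.9), focal_loss(0.1)
print(easy < hard)   # True
```

With \(\gamma = 0\) and \(\alpha = 1\), this recovers the standard cross-entropy loss.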
4.5 Parameter Analysis
4.5.1 GCN Layers.
Firstly, we analyze the impact of the number of GCN layers. In Figure 5, we show the performance of our model with different numbers of GCN layers on the three prediction tasks. From the results, we can see that the model achieves the best performance when the number of GCN layers \(L=3\), because shallower GCNs cannot well capture the relationships among the objects within the scene graph, while deeper GCNs suffer from the over-smoothing problem.
Fig. 5. Performance of our model with different numbers of GCN layers.
4.5.2 Balance Weights.
Secondly, we consider the impact of the balance weights \(\lambda _m, \lambda _a, \lambda _s\). As shown in Figure 6, we visualize the influence of the balance weights on model performance. When one weight changes, the other two weights are fixed at the best choices (\(\lambda _m=0.5, \lambda _a=0.4, \lambda _s=0.3\)). Median Rank allows for multiple predictions and reflects the average predictive effectiveness across all samples; therefore, we mainly rely on the Median Rank results to set the balance weights. We can observe that the performance of our model under the metric of mA@K is relatively stable compared with the Median Rank.
Fig. 6. Impact of balance weights \((\lambda _m, \lambda _a, \lambda _s)\) in the loss function Equation (14). For the convenience of comparison, we average the Median Rank and mA@K results of the scene, action and motivation labels.
4.5.3 Debiasing Factors.
Finally, we consider the weighting factors \(\gamma _m\), \(\gamma _a\), and \(\gamma _s\) used in the unbiased inference. In Figure 7, we show the performance of our model with different debiasing factors on the three prediction tasks. Because \(\gamma _m\), \(\gamma _a\), and \(\gamma _s\) are independent of each other, we visualize the median rank and mA@K influenced by the three factors in Figures 7(a), 7(b), and 7(c) for the three tasks, respectively. We can observe that a too large or too small value of the debiasing factor reduces performance: a too small value cannot effectively improve the conventional likelihood prediction, while a too large value may discard some useful semantic prior.
Fig. 7. Impact of debiasing factors \((\gamma _m, \gamma _a, \gamma _s)\) under the metrics of Median Rank and mA@K.
4.6 Further Remarks
4.6.1 Aggregation Methods.
As illustrated in Section 3.5, we adopt a pooling operation \(\delta\) to aggregate the multi-modal features learned by the dual GCNs into a single vector. Here, we analyze the influence of different pooling methods. The results are shown in Table 3, which demonstrate that the average-pooling method performs better than the max-pooling method in aggregating multi-modal features learned from the scene graph.
| Method | Motivation | Action | Scene | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MR | mA@1 | mA@5 | mA@10 | MR | mA@1 | mA@5 | mA@10 | MR | mA@1 | mA@5 | mA@10 | |
| max | 16 | 6.45 | 21.51 | 30.48 | 9 | 12.33 | 35.53 | 47.35 | 3 | 12.57 | 38.48 | 53.80 |
| avg | 14 | 6.74 | 22.20 | 33.21 | 8 | 13.52 | 37.24 | 50.99 | 3 | 13.32 | 40.09 | 54.63 |
Table 3. Performance of our Model with Different Aggregation Methods
4.6.2 Multi-task Co-training.
We also analyze the influence of the multi-task co-training mechanism (MTCT) proposed in Section 3.5, where the model is required to simultaneously predict the scene, the action of the person, and the motivation behind the action. The model parameters can be collaboratively learned with the loss functions of the three tasks. In Table 4, we show the prediction results of the baseline model (CNN*-ResNeXt) and our model jointly trained for the three tasks with multi-task co-training (MTCT) or independently trained for each task. The results indicate that the co-training method obtains better results on all three tasks compared with the independent training scheme.
| Model | MTCT | Motivation | Action | Scene | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MR | mA@1 | mA@5 | mA@10 | MR | mA@1 | mA@5 | mA@10 | MR | mA@1 | mA@5 | mA@10 | ||
| CNN*-ResNeXt | \(\checkmark\) | 33 | 3.14 | 11.05 | 18.28 | 18 | 5.91 | 20.19 | 30.34 | 6 | 8.25 | 27.19 | 40.64 |
| \(\times\) | 37 | 2.47 | 8.99 | 15.73 | 19 | 4.78 | 18.54 | 27.46 | 6 | 8.31 | 25.69 | 39.19 | |
| dual-SGCN | \(\checkmark\) | 14 | 6.74 | 22.20 | 33.21 | 8 | 13.52 | 37.24 | 50.99 | 3 | 13.32 | 40.09 | 54.63 |
| \(\times\) | 15 | 6.22 | 20.99 | 31.90 | 9 | 13.34 | 36.67 | 49.66 | 3 | 12.35 | 36.59 | 50.39 | |
Table 4. Impact of the Multi-task Co-training (MTCT) Mechanism
4.6.3 Qualitative Analysis.
To further analyze the effectiveness of our model, we show qualitative examples of our approach in Figure 8. We visualize the ground-truth labels and the results predicted by the CNN*-VGG16 model and our approach. Our model tends to predict results that are more consistent with the scene of the image, such as baseball field instead of soccer field in Figure 8(a) and sitting at a table instead of holding a plate in Figure 8(d). These results indicate that scene graphs play an important role in motivation prediction. On the other hand, our model prefers more precise results, such as street instead of city in Figure 8(e) and holding a controller instead of standing in Figure 8(g), which confirms the effectiveness of the unbiased inference module.
Fig. 8. Qualitative comparison results. The ground truth labels and predicted labels are shown below each image, which follow the order of scene, action, and motivation from top to bottom. Correct predictions are marked in blue; reasonable but inaccurate predictions are marked in orange; incorrect predictions are marked in red.
To demonstrate the ability of the MMIA module to find the pivotal objects or relations that play more important roles in motivation prediction, we visualize the attention maps calculated by MMIA in Figure 9. We show two example images and visualize the average attention weights of the objects and relationships learned from the semantic and visual scene graphs. As shown in Figure 9(a), the MMIA tends to focus on mouth in both the semantic and visual modalities. For the semantic modality, the objects hand1, pizza, and eye also obtain high attention weights. For the relationships, pizza in hand2 and hand2 holding pizza obtain more attention in the semantic modality and the visual modality, respectively. In Figure 9(b), the object wave obtains more attention in the semantic modality, while the object board obtains more attention in the visual modality. The relationship man riding wave, which has a large attention weight in both modalities, is closely related to the motivation label. These results demonstrate the effectiveness of the proposed MMIA module.
Fig. 9. Visualization of the attention weights of objects and relationships learned from the semantic and visual graphs by our model.
5 CONCLUSION
Considering the advantages of scene graphs in capturing complex semantic dependencies among the objects and relations detected in an image, in this paper we proposed a scene graph-based motivation prediction model, named dual scene graph convolutional network (dual-SGCN). It can comprehensively capture the complex visual information and semantic context prior from the image data, which is more effective than learning external commonsense knowledge from public text corpora by language models. In the dual-SGCN, we first extract scene graphs from images with a pre-trained scene graph generation model. Then, we build a visual graph and a semantic graph based on the generated scene graph and conduct message passing among different objects and relations with GCN layers. Next, we design a multi-modal interaction attention to combine the updated visual and semantic features of the nodes and edges. Finally, the dual-SGCN is learned by multi-task co-training and predicts motivations with the help of an unbiased inference mechanism. Extensive experimental results demonstrate the effectiveness of the proposed method. In future work, we will consider applying our model to video data.
REFERENCES
- [1] . 2020. N-GCN: Multi-scale graph convolution for semi-supervised node classification. In Uncertainty in Artificial Intelligence. PMLR, 841–851.
- [2] . 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- [3] . 2015. Influence of resampling on accuracy of imbalanced classification. In Eighth International Conference on Machine Vision (ICMV 2015), Vol. 9875. International Society for Optics and Photonics, 987521.
- [4] . 2020. UNITER: Universal image-text representation learning. In European Conference on Computer Vision. Springer, 104–120.
- [5] . 2012. A model of intentional communication: AIRBUS (asymmetric intention recognition with Bayesian updating of signals). In Proceedings of SemDial 2012. 149–150.
- [6] . 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852. arXiv:1606.09375.
- [7] . 2021. Encoder fusion network with co-attention embedding for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15506–15515. arXiv:2105.01839.
- [8] . 2015. DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning. PMLR, 1462–1471.
- [9] . 2021. AGQA: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11287–11297. arXiv:2103.16002.
- [10] . 2017. Speech intention classification with multimodal deep learning. In Canadian Conference on Artificial Intelligence. Springer, 260–271.
- [11] . 2017. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1025–1035.
- [12] . 2009. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21, 9 (2009), 1263–1284.
- [13] . 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969.
- [14] . 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778. arXiv:1512.03385.
- [15] . 2015. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
- [16] . 2016. Inferring visual persuasion via body language, setting, and deep features. In IEEE Conference on Computer Vision and Pattern Recognition Workshops. 778–784.
- [17] . 2020. Intentonomy: A dataset and study towards human intent understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12986–12996. arXiv:2011.05558.
- [18] . 2009. Cutting-plane training of structural SVMs. Machine Learning 77, 1 (2009), 27–59.
- [19] . 2018. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1219–1228. arXiv:1804.01622.
- [20] . 2015. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3668–3678.
- [21] . 2014. Visual persuasion: Inferring communicative intents of images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 216–223.
- [22] . 2019. Edge-labeling graph neural network for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11–20.
- [23] . 2018. Bilinear attention networks. In Advances in Neural Information Processing Systems. 1564–1574. arXiv:1805.07932.
- [24] . 2020. 3-D scene graph: A sparse and semantic representation of physical environments for intelligent agents. IEEE Transactions on Cybernetics 50, 12 (2020), 4921–4933.
- [25] . 2021. ViLT: Vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334.
- [26] . 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- [27] . 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 84–90.
- [28] . 2016. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning. PMLR, 1378–1387.
- [29] . 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.
- [30] . 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision. Springer, 121–137.
- [31] . 2017. Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE International Conference on Computer Vision. 1270–1279. arXiv:1707.09700.
- [32] . 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
- [33] . 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
- [34] . 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (LNCS, Vol. 8693). Springer, 740–755. arXiv:1405.0312.
- [35] . 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems. 289–297.
- [36] . 2014. Recurrent models of visual attention. In Advances in Neural Information Processing Systems. 2204–2212.
- [37] . 2017. Pixels to graphs by associative embedding. In Advances in Neural Information Processing Systems. 2172–2181. arXiv:1706.07365.
- [38] . 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. PMLR, 2014–2023.
- [39] . 2018. DGCNN: A convolutional neural network over large-scale labeled graphs. Neural Networks 108 (2018), 533–543.
- [40] . 2013. Dual coordinate solvers for large-scale structural SVMs. arXiv preprint arXiv:1312.1743.
- [41] . 2017. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 6 (2017), 1137–1149. arXiv:1506.01497.
- [42] . 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- [43] . 2019. Still image action recognition by predicting spatial-temporal pixel evolution. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 111–120.
- [44] . 2021. SCAN: A spatial context attentive network for joint multi-agent intent prediction. arXiv preprint arXiv:2102.00109.
- [45] . 2015. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR). arXiv:1409.1556.
- [46] . 2014. Deep networks with internal selective attention through feedback connections. arXiv preprint arXiv:1407.3068.
- [47] . 2020. Object-centric image generation from layouts. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 2647–2655. arXiv:2003.07449.
- [48] . 2020. Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3713–3722. arXiv:2002.11949.
- [49] . 2020. Directed graph convolutional network. arXiv preprint arXiv:2004.13970.
- [50] . 2021. SG2Caps: Revisiting scene graphs for image captioning. arXiv preprint arXiv:2102.04990.
- [51] . 1996. People as flexible interpreters: Evidence and issues from spontaneous trait inference. Advances in Experimental Social Psychology 28 (1996), 211–279.
- [52] . 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5999–6009. arXiv:1706.03762.
- [53] . 2018. Graph attention networks. In 6th International Conference on Learning Representations (ICLR). arXiv:1710.10903.
- [54] . 2016. Predicting motivations of actions by leveraging text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2997–3005. arXiv:1406.5472.
- [55] . 2019. Heterogeneous graph matching networks: Application to unknown malware detection. In 2019 IEEE International Conference on Big Data (Big Data). IEEE, 5401–5408.
- [56] . 2020. Co-attention for conditioned image matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15920–15929. arXiv:2007.08480.
- [57] . 1983. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children's understanding of deception. Cognition 13, 1 (1983), 103–128.
- [58] . 2017. Anticipating daily intention using on-wrist motion triggered sensing. In Proceedings of the IEEE International Conference on Computer Vision. 48–56. arXiv:1710.07477.
- [59] . 2020. Understanding the political ideology of legislators from social media images. In Proceedings of the 14th International AAAI Conference on Web and Social Media (ICWSM 2020), Vol. 14. 726–737. arXiv:1907.09594.
- [60] . 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). 5987–5995. arXiv:1611.05431.
- [61] . 2017. Scene graph generation by iterative message passing. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). 3097–3106. arXiv:1701.02426.
- [62] . 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. PMLR, 2048–2057.
- [63] . 2020. Pose-guided human animation from a single image in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15039–15048. arXiv:2012.03796.
- [64] . 2020. Image-to-image retrieval by learning similarity between scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 10718–10726. arXiv:2012.14700.
- [65] . 2020. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graph. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 3208–3216. arXiv:2006.16934.
- [66] . 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6274–6283. arXiv:1906.10770.
- [67] . 2018. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5831–5840. arXiv:1711.06640.
- [68] . 2020. Deep open intent classification with adaptive decision boundary. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14374–14382. arXiv:2012.10209.
- [69] . 2020. Discovering new intents with deep aligned clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14365–14373. arXiv:2012.08987.
- [70] . 2020. Layout2image: Image generation from layout. International Journal of Computer Vision 128 (2020), 2418–2435.
- [71] . 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2921–2929.