IMF: Interactive Multimodal Fusion Model for Link Prediction

Link prediction aims to identify potentially missing triples in knowledge graphs. To get better results, some recent studies have introduced multimodal information into link prediction. However, these methods utilize multimodal information separately and neglect the complicated interactions between different modalities. In this paper, we aim at better modeling the inter-modality information and thus introduce a novel Interactive Multimodal Fusion (IMF) model to integrate knowledge from different modalities. To this end, we propose a two-stage multimodal fusion framework that preserves modality-specific knowledge while taking advantage of the complementarity between different modalities. Instead of directly projecting different modalities into a unified space, our multimodal fusion module keeps the representations of different modalities independent while leveraging bilinear pooling for fusion and incorporating contrastive learning as an additional constraint. Furthermore, the decision fusion module delivers a learned weighted average over the predictions of all modalities to better exploit their complementarity. Our approach has been demonstrated to be effective through empirical evaluations on several real-world datasets. The implementation code is available online at https://github.com/HestiaSky/IMF-Pytorch.


INTRODUCTION
Knowledge Graphs (KGs) store rich knowledge and are essential for many real-world applications, such as question answering [14,41,52], urban computing [46,49] and recommendation systems [6,35,36]. Typically, a KG consists of relational triples, which are represented as <head entity, relation, tail entity> [24]. Nevertheless, KGs are inevitably incomplete due to the complexity, diversity and mutability of knowledge. To bridge this gap, the problem of link prediction is studied, which aims to predict potentially missing triples [4].
Traditional link prediction models, including translation-based [4,38] and neural network methods [21,23], suffer from the structural bias problem among triples. Recently, some studies [26,28,39] addressed this problem by enriching the datasets and proposing new models to capture multimodal information for link prediction. However, the performance of such studies was limited as they projected all modalities into a unified space with the same relations to capture the commonality, which might fail to preserve the specific information of each modality. As a result, they could not effectively model the complicated interactions between modalities to capture the complementarity.
To address the above issue, we aim to learn the knowledge comprehensively rather than separately, which is similar to how humans think. Take the scenario in Figure 1 as an example: a model relying only on graph structure might make the wrong prediction that LeBron James playsFor Golden State Warriors based on his similarity with Stephen Curry through the common bornIn relation to Akron, Ohio. Meanwhile, it is difficult for visual features to express fine-grained semantics, and the only conclusion they support is that LeBron James is a basketball player. The textual description might also lead to the outdated prediction of Cleveland Cavaliers due to 'played' in the second sentence (more consistent with playsFor than 'joined' in the third sentence). Nevertheless, by integrating the knowledge, it is easy to get the correct answer Los Angeles Lakers through the interaction between the complementary structural, visual and textual information highlighted in Figure 1. Since the knowledge learned from different modalities is diverse and complex, it is very challenging to effectively integrate multimodal information.
In this paper, we propose a novel Interactive Multimodal Fusion Model (IMF) for multimodal link prediction over knowledge graphs. IMF can learn the knowledge separately in each modality and jointly model the complicated interactions between different modalities with a two-stage fusion, similar to the natural recognition process of human beings introduced above. In the multimodal fusion stage, we employ a bilinear fusion mechanism with contrastive learning to fully capture the complicated interactions between multimodal features. For the basic link prediction model, we utilize the relation information as context to rank the triples as predictions in each modality. In the final decision fusion stage, we integrate the predictions from different modalities and make use of the complementary information to make the final prediction. The contributions of this paper are summarized as follows:
• We propose a novel two-stage fusion model, IMF, that is effective in integrating complementary information of different modalities for link prediction.
• We design an effective multimodal fusion module that captures bilinear interactions with contrastive learning to jointly model the commonality and complementarity.
• We demonstrate the effectiveness and generalization of IMF with extensive experiments on four widely used datasets for multimodal link prediction.

METHODOLOGY
Formally, a knowledge graph is defined as G = ⟨E, R, T⟩, where E and R denote the sets of entities and relations, respectively. T = {(h, r, t) | h, t ∈ E, r ∈ R} represents the relational triples of the KG. In multimodal KGs, each entity is represented by multiple features from different modalities. Here, we define the set of modalities K = {s, v, t, m}, where s, v, t, m denote the structural, visual, textual and multimodal modality, respectively. Due to the complexity of real-world knowledge, it is almost impossible to take all triples into account. Therefore, given a well-formulated KG, the Link Prediction task aims at predicting missing links between entities. Specifically, link prediction models learn a score function over relational triples to estimate the likelihood of a triple, which is commonly formulated as f : E × R × E → ℝ.

Overall Architecture
In order to fully exploit the complicated interactions between different modalities, we propose a two-stage fusion model instead of simply considering the multimodal information separately in a unified vector space. As shown in Figure 2, IMF consists of four key components: (1) the Modality-Specific Encoders extract structural, visual and textual features as the input of the multimodal fusion stage; (2) the Multimodal Fusion Module, the first fusion stage, effectively models bilinear interactions between different modalities based on Tucker decomposition and contrastive learning; (3) the Contextual Relational Model calculates the similarity of contextual entity representations to produce triple scores as modality-specific predictions for the decision fusion stage; (4) the Decision Fusion Module, the second fusion stage, takes all the similarity scores from the structural, visual, textual and multimodal models into account to make the final prediction.

Modality-Specific Encoders
In this subsection, we first introduce the pre-trained encoders used for different modalities. These encoders are not fine-tuned during training and we treat them as fixed feature extractors to obtain the modality-specific entity representations. Note that IMF is a general framework and it is straightforward to replace them with other up-to-date encoders or add ones for new modalities into IMF.

Structural Encoder.
For the most basic view, the structural information of the KG, we employ a Graph Attention Network (GAT) 1 [33] trained with a TransE loss. Specifically, our GAT encoder takes the L1 distance of neighbor-aggregated representations as the energy function of triples, i.e., f(h, r, t) = ‖h + r − t‖₁. In the training process, we minimize the following hinge loss:

L_s = Σ_{(h,r,t)∈T} Σ_{(h′,r,t′)∈T′} max(0, γ + f(h, r, t) − f(h′, r, t′)),   (1)

where γ is a margin hyper-parameter and T′ denotes the set of negative triples derived from T. T′ is created by randomly replacing the head or tail entities of triples in T:

T′ = {(h′, r, t) | h′ ∈ E} ∪ {(h, r, t′) | t′ ∈ E}.   (2)

Visual Encoder.
Visual features are greatly expressive and provide a different view of knowledge from traditional KGs. To effectively extract visual features, we utilize VGG16 2 pre-trained on ImageNet 3 to obtain image embeddings of the corresponding entities following [20]. Specifically, we take the outputs of the last hidden layer before the softmax operation as visual features, which are 4096-dimensional vectors.
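The structural encoder's TransE-style objective can be sketched numerically. The following is a minimal NumPy illustration of the energy function and hinge loss above (function names are ours, and this stands in for, rather than reproduces, the authors' GAT-based implementation):

```python
import numpy as np

def transe_energy(h, r, t):
    # f(h, r, t) = ||h + r - t||_1, the L1 energy of a triple
    return np.abs(h + r - t).sum(axis=-1)

def hinge_loss(pos_energy, neg_energy, margin=1.0):
    # sum of max(0, gamma + f(positive) - f(negative)) over triple pairs
    return np.maximum(0.0, margin + pos_energy - neg_energy).sum()
```

A triple whose head plus relation lands exactly on the tail has zero energy, and the hinge loss only penalizes negative triples that are not at least `margin` worse than the positives.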

Textual Encoder.
Entity descriptions contain much richer but more complex knowledge than pure KGs. To fully extract this knowledge, we employ BERT [11] as the textual encoder, which is expressive enough to obtain description embeddings of the corresponding entities. The textual features are 768-dimensional vectors, i.e., the pooled outputs of the pre-trained BERT-Base model 4.

Multimodal Fusion
The multimodal fusion stage aims to obtain multimodal representations that fully capture the complex interactions between different modalities. Many existing multimodal fusion methods have achieved promising results on tasks like VQA (Visual Question Answering). However, most of them aim at finding the commonality to get more precise representations via modality projection [9,12] or cross-modal attention [25]. Such methods suffer from the loss of unique information in each modality and cannot achieve sufficient interaction between modalities.
To this end, we propose to employ bilinear models, which have a strong ability to realize full parameter interaction, as the cornerstone of our multimodal fusion. Specifically, we extend the Tucker decomposition, which decomposes a tensor into a core tensor transformed by a matrix along each mode, to 4-mode factors as expressed in Equation (3):

T = P ×₁ W_s ×₂ W_v ×₃ W_t ×₄ W_o,   (3)

where W_k denotes the transformation matrix of each mode and P ∈ ℝ^{d×d×d×d} denotes a smaller core tensor. In such a situation, entity embeddings are first projected into a low-dimensional space and then fused with the core tensor P. Following [3], we further reduce the computational complexity by decomposing the core tensor P to merge the representations of all modalities into a unified space with the element-wise product. The detailed calculation is expressed as Equation (4):

e_m = W_o(ẽ_s ∘ ẽ_v ∘ ẽ_t),   (4)

where ẽ_k = ReLU(e_k M_k) ∈ ℝ^{d_c} denotes the latent representation, e_k ∈ ℝ^{d_k} is the original embedding and M_k ∈ ℝ^{d_k×d_c} is the decomposed transformation matrix for each modality k ∈ {s, v, t}.

However, the multimodal bilinear fusion has no bound limitation, and the gradient produced by the final prediction can only implicitly guide parameter learning. To alleviate this problem, we add constraints that force the correlation between different modality representations of the same entity to be stronger. Therefore, we further leverage contrastive learning [7,16,42] between different entities and modalities as an additional learning objective for regularization. In our contrastive setting, we take pairs of representations of the same entity in different modalities as positive samples and pairs of representations of different entities as negative samples. As shown in Figure 3, we aim at constraining the distance of negative samples to be larger than that of positive samples to enhance multimodal fusion, which is:

d(g(e_i)⁺, g(e_i)⁺) < d(g(e_i)⁺, g(e_j)⁻),   (5)

where d(·, ·) denotes the distance measure and g(·) denotes the embedding function; the superscripts +, − represent positive and negative samples, respectively. Specifically, we randomly sample N entities from the entity set as a minibatch and define the contrastive learning loss upon it. The positive pairs are naturally obtained from the same entities, while the negative pairs are constructed by negative sharing [8] over all other entities. We take the latent representations ẽ_k = ReLU(e_k M_k) ∈ ℝ^{d_c} and use the negative cosine similarity d(u, v) = −uᵀv/(‖u‖‖v‖) as the distance measure. Then we have the following contrastive loss for each entity i:

L_cl^i = Σ_{(a,b)∈M} −log( exp(−d(ẽ_i^a, ẽ_i^b)) / Σ_{j=1}^{N} exp(−d(ẽ_i^a, ẽ_j^b)) ),   (6)

where M = {(s, v), (s, t), (v, t)} is the set of modality pairs.
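The low-rank bilinear fusion of Equation (4) and the distance measure used in the contrastive objective can be sketched as follows. This is an illustrative NumPy sketch under our own naming (`fuse`, `cosine_distance`); the paper's actual implementation uses learned PyTorch parameters:

```python
import numpy as np

def fuse(e_s, e_v, e_t, M_s, M_v, M_t, W_o):
    # Project each modality into a shared latent space, apply ReLU,
    # take the element-wise (Hadamard) product, then apply the output
    # projection that replaces the full Tucker core tensor.
    relu = lambda x: np.maximum(x, 0.0)
    z = relu(e_s @ M_s) * relu(e_v @ M_v) * relu(e_t @ M_t)
    return z @ W_o

def cosine_distance(u, v):
    # d(u, v) = -u.v / (||u|| ||v||): smaller values mean more similar,
    # so pulling positives together minimizes this distance.
    return -(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

The element-wise product is what makes the fusion bilinear in any pair of modality embeddings while keeping the parameter count linear in the latent dimension.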

Contextual Relational Model
After obtaining the representations of each modality and the multimodal fusion, we design a contextual relational model, which takes the relations in triples as contextual information for scoring, to get the predictions. Note that this relational model can be easily replaced by any scoring function such as TransE. Due to the variety and complexity of relations in KGs, we argue that improving the degree of parameter interaction [32] is crucial for better modeling relational triples. The degree of parameter interaction is the calculation ratio of each parameter with all other parameters; for example, the dot product achieves a degree of 1/n while the cross product achieves (n − 1)/n. Based on this assumption, we propose to use the bilinear outer product between entity and relation embeddings to incorporate contextual information into entity representations. Instead of taking relations as input as in previous studies, our contextual relational model utilizes relations to provide context in the transformation matrix of entity embeddings. Then, entity embeddings are projected by the contextual transformation matrix to obtain contextual embeddings, which are used to calculate the similarity with all candidate entities. The learning objective is to minimize the binary cross-entropy loss. For each modality k ∈ K, the computation is shown in Equation (7) to Equation (9):

W_r = W ×₃ r,   (7)
ŷ_k = σ(ê_k E_kᵀ + b), with ê_k = e_k W_r,   (8)
L_k = −(1/|E|) Σ_{i=1}^{|E|} (y_i log ŷ_{i,k} + (1 − y_i) log(1 − ŷ_{i,k})),   (9)

where e_k and ê_k are the original and contextual entity embeddings, respectively; W_r denotes the contextual transformation matrix obtained by matrix multiplication of the weight tensor W and the relation vector r, while b is a bias vector; σ is the sigmoid function and ŷ_k = [ŷ_{1,k}, ŷ_{2,k}, ..., ŷ_{|E|,k}] is the final prediction of modality k.
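The contextual scoring step can be sketched as follows. We assume here that W is a d_e × d_e × d_r tensor contracted with the relation vector to form the contextual transformation matrix; this is an illustrative NumPy sketch with our own function name, not the paper's exact parameterization:

```python
import numpy as np

def contextual_scores(e_h, r, W, E, b):
    # Contract the shared weight tensor W (d_e x d_e x d_r) with the
    # relation vector r to obtain the contextual transformation matrix.
    W_r = np.tensordot(W, r, axes=([2], [0]))   # shape (d_e, d_e)
    e_ctx = e_h @ W_r                           # contextual head embedding
    logits = e_ctx @ E.T + b                    # similarity to every candidate entity
    return 1.0 / (1.0 + np.exp(-logits))        # sigmoid scores in (0, 1)
```

Each output component is the model's estimated probability that the corresponding candidate entity completes the triple, which matches the binary cross-entropy objective over all entities.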

Decision Fusion
Existing multimodal approaches mainly focus on projecting different modality representations into a unified space and predicting with the commonality between modalities, which fails to preserve modality-specific knowledge. We alleviate this problem in the decision fusion stage by jointly learning and combining the predictions of different modalities to further leverage the complementarity. Under the multimodal setting, we assign a separate contextual relational model to each modality and utilize its own results for training in different views. Recalling the contrastive learning loss in Equation (6), the overall training objective is to minimize the joint loss shown in Equation (10):

L = Σ_{k∈K} α_k L_k + L_cl,   (10)

where L_k denotes the binary cross-entropy loss of modality k as in Equation (9) and α_k is a learned weight parameter.
To better illustrate the training process of IMF, we describe it via the pseudo-code of the optimization algorithm. As shown in Algorithm 1, we first obtain the pre-trained structural, visual and textual encoders and utilize them to produce entity embeddings (lines 3-5). Since the pre-trained models are much larger and more complex than IMF, they are not fine-tuned and their outputs are directly used as inputs of IMF. The multimodal embeddings are obtained by multimodal fusion while contrastive learning is applied to further enhance the fusion stage (lines 9-11). During training, each modality delivers its own prediction and loss via the modality-specific scorers (line 12), and then the joint prediction and loss are computed based on all modalities including the multimodal one (line 13).

8:  for each entity in the batch do
9:    Obtain the structural, visual, textual embeddings e_s, e_v, e_t of the entity
10:   Compute the multimodal fused embedding e_m with Equation (4)
11:   Compute the contrastive learning loss L_cl with Equation (6)
12:   Compute the losses L_s, L_v, L_t, L_m with the modality-specific scorers via Equations (7)-(9)
13:   Compute the joint loss L from L_s, L_v, L_t, L_m and L_cl via Equation (10)
14:   Update the model parameters of M by minimizing L
15:  end for
16: end while
17: return M

For inference, we jointly consider the predictions of each modality as well as the multimodal one. Specifically, the overall prediction is shown in Equation (11):

y = Σ_{k∈K} α_k y_k,   (11)

where α_k denotes the weight of modality k as in Equation (10) and the values in y lie in [0, 1].
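The inference-time decision fusion of Equation (11) amounts to a weighted average of per-modality score vectors. Below is a minimal NumPy sketch; normalizing the weights so the fused scores stay in [0, 1] is our assumption for the illustration, since in the model the α_k are learned jointly:

```python
import numpy as np

def decision_fusion(preds, alphas):
    # preds: list of per-modality score vectors over candidate entities
    # alphas: per-modality weights (learned in the model; normalized here
    # so the fused scores remain in [0, 1] - an assumption of this sketch)
    alphas = np.asarray(alphas, dtype=float)
    alphas = alphas / alphas.sum()
    return sum(a * p for a, p in zip(alphas, np.asarray(preds, dtype=float)))
```

With equal weights this reduces to a plain average, so a candidate favored by several modalities outranks one favored by a single biased modality.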

EXPERIMENTAL SETUP 3.1 Datasets
In this paper, we use four public datasets to evaluate our model. All the datasets consist of three modalities: structural triples, entity images and entity descriptions. The DB15K, FB15K and YAGO15K datasets are obtained from MMKG 5 [20], which is a collection of multimodal knowledge graphs. Specifically, we utilize the relational triples as structural features and the entity images as visual features, and we extract the entity descriptions from Wikidata [34] as textual features. FB15K-237 6 [31] is a subset of FB15K, so the visual and textual features of FB15K can be directly reused. Each dataset is split into 70%, 10% and 20% for training, validation and test. The detailed statistics are shown in Table 1.
In the process of evaluation, we consider four metrics over the valid entities to measure model performance, following previous studies: (1) mean rank (MR); (2) mean reciprocal rank (MRR); (3) hits ratio at 1 (Hits@1); and (4) hits ratio at 10 (Hits@10).
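These ranking metrics are all derived from the rank of the correct entity among the candidates for each test query; a short NumPy sketch (our own helper, for illustration):

```python
import numpy as np

def ranking_metrics(ranks):
    # ranks: 1-indexed rank of the correct entity in each test query
    r = np.asarray(ranks, dtype=float)
    return {
        "MR": r.mean(),                 # mean rank (lower is better)
        "MRR": (1.0 / r).mean(),        # mean reciprocal rank (higher is better)
        "Hits@1": (r <= 1).mean(),      # fraction ranked first
        "Hits@10": (r <= 10).mean(),    # fraction ranked in the top 10
    }
```

Note that MR is dominated by a few badly ranked queries, which is why MRR and Hits@k are usually reported alongside it.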

Baselines
To demonstrate the effectiveness of our model, we choose two types of methods for comparison: monomodal methods and multimodal methods. For monomodal models, we take the following baselines:
• The model of [27] captures the complex interactions between entities and relations for prediction.
• RotatE [29] introduces rotation operations between entities to represent relations in the complex space and infer symmetry, antisymmetry, inversion and composition relation patterns.
• QuatE [43] extends rotation-based knowledge graph embeddings from the complex space into the quaternion space to obtain more degrees of freedom.
• KBAT [21] leverages a Graph Attention Network (GAT) [33] as encoder to aggregate neighbors and employs ConvKB as decoder to compute triple scores.
• TuckER [1] applies Tucker decomposition to capture the high-level interactions between entity and relation embeddings.
• HAKE [45] projects entities into a polar coordinate system to model hierarchical structures for incorporating semantics.
For multimodal models, we take the following baselines:
• IKRL [39] utilizes the TransE energy function as the scoring function on each pair of modalities for joint prediction.
• MKGC [28] extends IKRL with a combination of different modalities to explicitly deliver alignment between modalities.
• MKBE [26] employs DistMult [40] as the scoring function and designs a Generative Adversarial Network (GAN) [13] to predict missing modalities.
For the ablation study, we design three variants of IMF: IMF (w/o MF) utilizes only structural information; IMF (w/o DF) simply takes multimodal representations for training and inference without decision fusion; IMF (w/o CL) removes the contrastive learning loss.

Implementation Details
The experiments are implemented on a server with an Intel Xeon E5-2640 CPU, 188GB of RAM and four NVIDIA GeForce RTX 2080Ti GPUs using PyTorch 1.6.

Table 2: Evaluation results on the multimodal DB15K, FB15K and YAGO15K datasets from MMKG. "*" indicates statistically significant improvements (i.e., two-sided t-test with p < 0.05) over the best baseline.

The model parameters are initialized with Xavier initialization and are optimized using the Adam [15] optimizer. The evaluation is conducted under the RANDOM setting [30], where the correct triples are placed randomly in the test set and negative sampling is correctly employed without test leakage. For DB15K, FB15K and YAGO15K, we obtain the results by running all the baselines with their released codes. For FB15K-237, we directly take the results of TransE, ConvE, ConvKB, CapsE, RotatE, KBAT and TuckER from the re-evaluation work [30] and run QuatE, HAKE, IKRL, MKGC and MKBE with their released codes.

Overall Performance
As shown in Table 2 and Table 3, we can observe that:
• IMF significantly outperforms all the baselines. The performance gain is at most 42% for MRR on DB15K and more than 20% on average across all evaluation metrics.
• State-of-the-art monomodal methods employ a variety of complex models to improve expressiveness and capture latent interactions. However, the results illustrate that their performance is highly limited by the structural bias inherent in the knowledge graph itself. Although these methods have already achieved promising results, IMF easily outperforms them by a significant margin with a much simpler model structure, which amply demonstrates its effectiveness.
• In comparison with multimodal methods that treat the features of different modalities separately, our IMF jointly learns from different modalities with the two-stage fusion, which is beneficial in modeling the commonality and complementarity simultaneously.
Overall, our proposed IMF can model more comprehensive interactions between different modalities, covering both commonality and complementarity, thanks to the effective fusion of multimodal information, and thus achieves significant improvements in link prediction on KGs.

Table 3: Evaluation results on FB15K-237. "*" indicates statistically significant improvements (i.e., two-sided t-test with p < 0.05) over the best baseline.

Table 4 shows the evaluation results of leveraging different modality information on FB15K-237, where s denotes structural information, v denotes the visual information of images and t denotes the textual information of descriptions. We can see that introducing visual or textual information significantly improves performance. The significant performance gain brought by the multimodal fusion module not only demonstrates the effectiveness of our approach, but also indicates the potential of integrating multimodal information in KGs.

Ablation Study
To verify the effectiveness of decision fusion, we choose the case <LeBron James, playsFor, ?> and visualize the prediction scores of each modality in Figure 4. Due to the biases in each modality, the monomodal predictions are inevitably error-prone. The results in Table 2 and Table 3 also demonstrate the effectiveness of applying decision fusion to ensemble the specific latent features of each modality. Besides, the performance comparison between IMF (w/o CL) and IMF in Table 2 and Table 3 illustrates the necessity of contrastive learning for more robust results, especially in scenarios with fewer training samples and relation types.
From the results shown above, we can see that each component of our proposed IMF contributes significantly to the overall performance and is beneficial for capturing the commonality and complementarity between different modalities.

Generalization
In order to evaluate the generalization of our proposed approach, we simply replace the scoring function (contextual relational model) with existing methods such as TransE, ConvE and TuckER. The results in Figure 5 illustrate that our proposed two-stage fusion framework is general enough to be applied to any link prediction model for further improvement. Figure 6 shows the influence of the embedding size on the performance of IMF. From the figure, we can see that the embedding size plays an important role in model performance. Meanwhile, it is worth noting that a larger embedding size does not always result in better performance due to overfitting, especially on datasets with fewer relation types like YAGO15K. Considering both performance and efficiency, the best choices of embedding size for the three datasets are 256, 256 and 128, respectively.

Case Study
In order to illustrate the effectiveness of our IMF model in a more intuitive way, we apply t-SNE to reduce the dimension and visualize the contextual entity representations of basketball players in five different basketball teams. We can see in Figure 7 that the representations of basketball players are mixed up under monomodal information due to the biases. However, with the help of interactive multimodal fusion, IMF can effectively capture the complicated interactions between different modalities.

RELATED WORK 5.1 Knowledge Embedding Methods
Knowledge embedding methods have been widely used in graph representation learning and have achieved great success on knowledge base completion (a.k.a. link prediction). Translation-based methods aim at finding the transformation relationships from source to target entities. TransE [4], the most representative translation-based model, projects entities and relations into a unified vector space and minimizes the energy function of triples. Following this route, many translation-based methods have emerged: TransH [38] formulates the translating process on relation-specific hyperplanes, and TransR [19] projects entities and relations into separate spaces. Recently, some neural network methods have shown promising results in this task. ConvE [10] and ConvKB [23] utilize Convolutional Neural Networks (CNN) to increase parameter interaction between entities and relations. KBAT [21] employs Graph Neural Networks (GNN) as the encoder to aggregate multi-hop neighborhood information.
However, all the methods above utilize only structural information, which is not sufficient for the more complicated situations in the real world. By incorporating multimodal information in the training process, our approach is able to improve the representations with external knowledge.

Multimodal Methods
Leveraging multimodal information has yielded extraordinary results in many NLP tasks [3]. DeViSE [12] and Imagined [9] propose to integrate multimodal information via modality projection, which learns a mapping from one modality to another. FiLM [25] extends the cross-modal attention mechanism to extract textual-attentive features in visual models. MuRel [5] utilizes pairwise bilinear interactions between modalities and regions to fully capture the complementarity. IKRL [39] is the first attempt at multimodal knowledge representation learning, utilizing image data of entities as extra information based on TransE. MKGC [28] combines textual and visual features extracted by domain-specific models as additional multimodal information compared to IKRL. MKBE [26] creates multimodal knowledge graphs by adding images, descriptions and attributes, and employs DistMult [40] as the scoring function.
Although these approaches incorporate multimodal information to improve performance, they cannot take full advantage of it, as they fail to effectively model the interactions between modalities.

CONCLUSION
In this paper, we study the problem of link prediction over multimodal knowledge graphs. Specifically, we aim at improving the interaction between different modalities. To reach this goal, we propose IMF, a two-stage framework that enables effective fusion of multimodal information by (i) utilizing bilinear fusion to fully capture the complementarity between different modalities and contrastive learning to strengthen the correlation between different modality representations of the same entity; and (ii) employing an ensembled loss function to jointly consider the predictions of all modality representations. Experimental results on several benchmark datasets demonstrate the effectiveness of our proposed model. Besides, we conduct an in-depth exploration to illustrate the generalization of our proposed method and the potential to apply it in real applications.
However, there are still some limitations of IMF, which are left to future work. For example, IMF requires the availability of all modalities, so an additional component that predicts missing modalities may help tackle this limitation. Besides, designing appropriate components to support more kinds of modalities, or proposing a more lightweight fusion model to replace the bilinear model for better efficiency, are also promising directions.