Abstract
Image-text retrieval aims to take a text (image) query and retrieve the semantically relevant images (texts), which is fundamental and critical in search systems, online shopping, and social networks. Existing works have shown the effectiveness of visual-semantic embedding and of exploiting unimodal knowledge (e.g., textual knowledge) in connecting the image and text. However, they neglect the implicit multimodal knowledge relations between these two modalities when the image contains information that is not directly described in the text, which hinders their ability to connect the image and text through implicit semantic relations. For instance, an image may show a person next to a “tap” while the paired text description only includes the word “wash,” omitting the washing tool “tap.” The implicit semantic relation between the image object “tap” and the text word “wash” can help connect this image and text. To sufficiently utilize the implicit multimodal knowledge relations, we propose a Multimodal Knowledge enhanced Visual-Semantic Embedding (MKVSE) approach that builds a multimodal knowledge graph to explicitly represent the implicit multimodal knowledge relations and injects it into visual-semantic embedding for the image-text retrieval task. The contributions of this article can be summarized as follows: (1) A Multimodal Knowledge Graph (MKG) is proposed to explicitly represent the implicit multimodal knowledge relations between the image and text as intra-modal semantic relations and inter-modal co-occurrence relations. Intra-modal semantic relations provide synonymy information that is implicit in unimodal data such as a text corpus, and inter-modal co-occurrence relations characterize the co-occurrence correlations (such as temporal, causal, and logical) that are implicit in image-text pairs. These two relations help establish reliable image-text connections in a higher-level semantic space.
(2) A Multimodal Graph Convolution Network (MGCN) is proposed to reason on the MKG in two steps so as to sufficiently utilize the implicit multimodal knowledge relations. In the first step, MGCN focuses on the intra-modal relations to distinguish each entity from the others in the semantic space. In the second step, MGCN focuses on the inter-modal relations to connect multimodal entities based on co-occurrence correlations. This two-step reasoning sufficiently utilizes the implicit semantic relations between the entities of the two modalities to enhance the image and text embeddings. Extensive experiments are conducted on two widely used datasets, Flickr30k and MSCOCO, demonstrating that the proposed MKVSE approach achieves state-of-the-art performance. The code is available at https://github.com/PKU-ICST-MIPL/MKVSE-TOMM2023.
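To make the two-step reasoning concrete, the sketch below shows one plausible reading of it as two successive graph-convolution passes over a shared node-feature matrix: the first pass propagates only over intra-modal (semantic) edges, the second only over inter-modal (co-occurrence) edges. All names, shapes, and the toy graph (three word entities, two image-object entities, echoing the “wash”/“tap” example) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def normalize_adj(A):
    """Row-normalize an adjacency matrix with self-loops: D^-1 (A + I)."""
    A = A + np.eye(A.shape[0])
    return A / A.sum(axis=1, keepdims=True)

def two_step_mgcn(X, A_intra, A_inter, W1, W2):
    """Hypothetical two-step reasoning: an intra-modal GCN pass,
    then an inter-modal GCN pass over the same set of entities."""
    # Step 1: propagate over intra-modal semantic relations (e.g., synonymy).
    H = np.maximum(normalize_adj(A_intra) @ X @ W1, 0.0)
    # Step 2: propagate over inter-modal co-occurrence relations.
    Z = np.maximum(normalize_adj(A_inter) @ H @ W2, 0.0)
    return Z

# Toy multimodal graph: nodes 0-2 are word entities, nodes 3-4 image objects.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                    # initial entity embeddings
A_intra = np.zeros((5, 5))
A_intra[0, 1] = A_intra[1, 0] = 1.0                # "wash" <-> "clean" (synonymy)
A_inter = np.zeros((5, 5))
A_inter[0, 3] = A_inter[3, 0] = 1.0                # word "wash" <-> image object "tap"
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1
Z = two_step_mgcn(X, A_intra, A_inter, W1, W2)     # enhanced entity embeddings
```

Keeping the two edge types in separate passes (rather than one mixed adjacency) mirrors the stated design: entities are first positioned relative to their own modality before cross-modal connections are drawn.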
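The enhanced embeddings feed a visual-semantic embedding objective. A common formulation for such objectives, shown here only as a hedged reference point (the abstract does not state MKVSE's exact loss), is the hinge-based triplet ranking loss with hardest in-batch negatives popularized by VSE++:

```python
import numpy as np

def hardest_negative_triplet_loss(img, txt, margin=0.2):
    """Hinge triplet ranking loss with hardest in-batch negatives,
    a standard visual-semantic embedding objective (VSE++-style)."""
    # Cosine similarity between L2-normalized image and text embeddings.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    S = img @ txt.T                       # S[i, j] = sim(image_i, text_j)
    pos = np.diag(S)                      # matched pairs lie on the diagonal
    mask = np.eye(S.shape[0], dtype=bool)
    S_neg = np.where(mask, -np.inf, S)    # exclude the positive pair
    hard_txt = S_neg.max(axis=1)          # hardest negative text per image
    hard_img = S_neg.max(axis=0)          # hardest negative image per text
    loss_i2t = np.maximum(0.0, margin + hard_txt - pos)
    loss_t2i = np.maximum(0.0, margin + hard_img - pos)
    return (loss_i2t + loss_t2i).mean()
```

With perfectly aligned, well-separated embeddings (e.g., identical one-hot vectors for each pair) the loss is zero; it grows as the hardest mismatched pair approaches the matched pair within the margin.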
MKVSE: Multimodal Knowledge Enhanced Visual-semantic Embedding for Image-text Retrieval