
MKVSE: Multimodal Knowledge Enhanced Visual-semantic Embedding for Image-text Retrieval

Published: 16 March 2023

Abstract

Image-text retrieval takes a text (image) query and retrieves the semantically relevant images (texts); it is fundamental and critical to search systems, online shopping, and social networks. Existing works have shown the effectiveness of visual-semantic embedding and of exploiting unimodal knowledge (e.g., textual knowledge) in connecting images and texts. However, they neglect the implicit multimodal knowledge relations between the two modalities when the image contains information that is not directly described in the text, which hinders their ability to connect images and texts through implicit semantic relations. For instance, an image may show a person next to a “tap” while the paired text description only includes the word “wash,” omitting the washing tool “tap.” The implicit semantic relation between the image object “tap” and the text word “wash” can help to connect this image and text. To fully exploit such implicit multimodal knowledge relations, we propose a Multimodal Knowledge enhanced Visual-Semantic Embedding (MKVSE) approach, which builds a multimodal knowledge graph to explicitly represent the implicit multimodal knowledge relations and injects it into the visual-semantic embedding for the image-text retrieval task. The contributions of this article can be summarized as follows: (1) A Multimodal Knowledge Graph (MKG) is proposed to explicitly represent the implicit multimodal knowledge relations between image and text as intra-modal semantic relations and inter-modal co-occurrence relations. Intra-modal semantic relations provide synonymy information that is implicit in unimodal data such as text corpora, and inter-modal co-occurrence relations characterize the co-occurrence correlations (such as temporal, causal, and logical) that are implicit in image-text pairs. These two kinds of relations help establish reliable image-text connections in a higher-level semantic space. (2) Multimodal Graph Convolution Networks (MGCN) is proposed to reason over the MKG in two steps so as to fully exploit the implicit multimodal knowledge relations. In the first step, MGCN focuses on the intra-modal relations to distinguish entities in the semantic space. In the second step, MGCN focuses on the inter-modal relations to connect multimodal entities based on co-occurrence correlations. This two-step reasoning sufficiently exploits the implicit semantic relations between entities of the two modalities to enhance the image and text embeddings. Extensive experiments on two widely used datasets, Flickr30k and MSCOCO, demonstrate that the proposed MKVSE approach achieves state-of-the-art performance. The code is available at https://github.com/PKU-ICST-MIPL/MKVSE-TOMM2023.
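To make the two-step reasoning concrete, the following is a minimal sketch, not the authors' released implementation: it builds a toy multimodal knowledge graph with intra-modal semantic edges and inter-modal co-occurrence edges, then applies a graph convolution (using the symmetric normalization of Kipf and Welling) over each edge type in turn. All class names, dimensions, and the toy graph are illustrative assumptions.

```python
# Minimal illustrative sketch (NOT the authors' implementation).
# Entities are image objects and text words; edges are intra-modal
# semantic relations and inter-modal co-occurrence relations.
import torch
import torch.nn as nn


def normalize_adj(adj: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization D^{-1/2}(A + I)D^{-1/2} (Kipf & Welling)."""
    adj = adj + torch.eye(adj.size(0))
    d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0
    d = torch.diag(d_inv_sqrt)
    return d @ adj @ d


class TwoStepGCN(nn.Module):
    """Step 1 propagates over intra-modal edges, step 2 over inter-modal edges."""

    def __init__(self, dim: int):
        super().__init__()
        self.intra = nn.Linear(dim, dim)  # weights for the intra-modal step
        self.inter = nn.Linear(dim, dim)  # weights for the inter-modal step
        self.act = nn.ReLU()

    def forward(self, x, adj_intra, adj_inter):
        x = self.act(normalize_adj(adj_intra) @ self.intra(x))  # step 1: semantic relations
        x = self.act(normalize_adj(adj_inter) @ self.inter(x))  # step 2: co-occurrence relations
        return x


# Toy graph: entities 0-1 are image objects ("tap", "sink"),
# entities 2-3 are text words ("faucet", "wash").
dim = 8
x = torch.randn(4, dim)                    # initial entity embeddings
adj_intra = torch.zeros(4, 4)
adj_intra[0, 1] = adj_intra[1, 0] = 1.0    # image: tap -- sink
adj_intra[2, 3] = adj_intra[3, 2] = 1.0    # text: faucet -- wash (synonym-like)
adj_inter = torch.zeros(4, 4)
adj_inter[0, 3] = adj_inter[3, 0] = 1.0    # co-occurrence: image "tap" -- text "wash"

enhanced = TwoStepGCN(dim)(x, adj_intra, adj_inter)
print(enhanced.shape)                      # torch.Size([4, 8])
```

In the abstract's example, the inter-modal edge between the image object "tap" and the text word "wash" lets the second propagation step pull their embeddings together, which illustrates the kind of implicit connection MKVSE aims to exploit.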

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 5 (September 2023), 262 pages.
ISSN: 1551-6857; EISSN: 1551-6865; DOI: 10.1145/3585398
Editor: Abdulmotaleb El Saddik
Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 5 August 2022
• Revised: 13 December 2022
• Accepted: 3 January 2023
• Online AM: 19 January 2023
• Published: 16 March 2023
