Research Article

Cross-modal Graph Matching Network for Image-text Retrieval

Published: 4 March 2022

Abstract

Image-text retrieval is a fundamental cross-modal task whose core is learning image-text matching. Depending on whether the two modalities interact during retrieval, existing image-text retrieval methods can be classified into independent representation matching methods and cross-interaction matching methods. Independent representation matching methods generate the embeddings of images and sentences independently, so retrieval reduces to a hand-crafted matching measure (e.g., cosine similarity or Euclidean distance) and is therefore efficient. Cross-interaction matching methods improve accuracy by introducing interaction-based networks for inter-relation reasoning, but suffer from low retrieval efficiency. This article aims to develop a method that retains the cross-modal inter-relation reasoning of cross-interaction methods while being as efficient as the independent methods. To this end, we propose the Cross-modal Graph Matching Network (CGMN), which explores both intra- and inter-relations without introducing network interaction. In CGMN, graphs are used for both visual and textual representation, achieving intra-relation reasoning across image regions and across words, respectively. Furthermore, we propose a novel graph node matching loss to learn fine-grained cross-modal correspondence and to achieve inter-relation reasoning. Experiments on the benchmark datasets MS-COCO, Flickr8K, and Flickr30K show that CGMN outperforms state-of-the-art methods in image retrieval, and that CGMN is much more efficient than state-of-the-art methods that use interactive matching. The code is available at https://github.com/cyh-sj/CGMN.
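The abstract's two key ideas, intra-relation reasoning over a per-modality graph and node-level cross-modal matching without an interaction network, can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustrative assumption, not the authors' implementation: gcn_layer, node_matching_score, the softmax affinity adjacency, and all dimensions are hypothetical stand-ins for the corresponding components of CGMN.

    import torch
    import torch.nn.functional as F

    def gcn_layer(nodes, adj, weight):
        # One graph-convolution step: aggregate neighbor features along the
        # row-normalized adjacency, then linearly project.
        # nodes: (n, d); adj: (n, n); weight: (d, d)
        return F.relu(adj @ nodes @ weight)

    def node_matching_score(img_nodes, txt_nodes):
        # Node-level cosine similarities between two independently embedded
        # graphs: each region keeps its best-matching word and vice versa,
        # averaged into one global score. No cross-modal network is run,
        # which is what keeps retrieval cheap.
        img = F.normalize(img_nodes, dim=-1)   # (n_regions, d)
        txt = F.normalize(txt_nodes, dim=-1)   # (n_words, d)
        sim = img @ txt.t()                    # (n_regions, n_words)
        return 0.5 * (sim.max(dim=0).values.mean()
                      + sim.max(dim=1).values.mean())

    # Toy usage: 36 detected regions, 12 words, 256-d features; the softmax
    # affinity graph is a hypothetical stand-in for a learned relation graph.
    regions, words = torch.randn(36, 256), torch.randn(12, 256)
    adj_v = torch.softmax(regions @ regions.t(), dim=-1)
    adj_t = torch.softmax(words @ words.t(), dim=-1)
    w_v, w_t = torch.randn(256, 256) * 0.02, torch.randn(256, 256) * 0.02
    score = node_matching_score(gcn_layer(regions, adj_v, w_v),
                                gcn_layer(words, adj_t, w_t))
    print(float(score))

Because images and sentences are embedded separately in this scheme, a gallery can be encoded once offline and ranking reduces to the cheap score above; this is the efficiency argument the abstract makes against interactive matching.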


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 4 (November 2022), 497 pages.
  ISSN: 1551-6857; EISSN: 1551-6865; DOI: 10.1145/3514185
  Editor: Abdulmotaleb El Saddik


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 May 2021
• Revised: 1 September 2021
• Accepted: 1 November 2021
• Published: 4 March 2022
