Abstract
Image-text retrieval is a fundamental cross-modal task whose core is learning image-text matching. Depending on whether the two modalities interact during retrieval, existing image-text retrieval methods can be classified into independent representation matching methods and cross-interaction matching methods. Independent representation matching methods generate the embeddings of images and sentences independently and are thus convenient for retrieval with hand-crafted matching measures (e.g., cosine similarity or Euclidean distance). Cross-interaction matching methods improve accuracy by introducing interaction-based networks for inter-relation reasoning, but suffer from low retrieval efficiency. This article aims to develop a method that retains the cross-modal inter-relation reasoning of cross-interaction methods while being as efficient as the independent methods. To this end, we propose the Cross-modal Graph Matching Network (CGMN), which explores both intra- and inter-relations without introducing network interaction. In CGMN, graphs are used for both visual and textual representations to achieve intra-relation reasoning across regions and words, respectively. Furthermore, we propose a novel graph node matching loss to learn fine-grained cross-modal correspondence and to achieve inter-relation reasoning. Experiments on the benchmark datasets MS-COCO, Flickr8K, and Flickr30K show that CGMN outperforms state-of-the-art methods in image retrieval. Moreover, CGMN is much more efficient than state-of-the-art methods that use interactive matching. The code is available at https://github.com/cyh-sj/CGMN.
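The efficiency advantage of independent representation matching comes from the fact that, once images and sentences are embedded separately, retrieval reduces to a single similarity computation over pre-computed vectors. The following is a minimal sketch of this idea using cosine similarity; the embedding dimension, the random embeddings, and the variable names are illustrative assumptions, not part of the CGMN method itself.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-normalize both sets of embeddings; a single matrix product
    # then yields all pairwise cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Hypothetical pre-computed embeddings: 4 images and 6 captions, 256-d each.
rng = np.random.default_rng(0)
img_emb = rng.standard_normal((4, 256))
txt_emb = rng.standard_normal((6, 256))

sims = cosine_sim(img_emb, txt_emb)   # (4, 6) image-to-caption similarity matrix
best_caption = sims.argmax(axis=1)    # top-1 caption index for each image
```

Because no cross-modal network interaction is needed at query time, the gallery embeddings can be indexed once and reused for every query, which is what makes this family of methods fast in practice.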
Cross-modal Graph Matching Network for Image-text Retrieval