Abstract
With the rapid development of information technology, image and text data have grown dramatically. Image-text matching techniques enable computers to understand information from both the visual and textual modalities and to match them by semantic content. Existing methods focus on co-occurrence statistics of visual and textual objects and learn coarse-level associations; however, the lack of intramodal semantic inference prevents fine-level association between modalities. Scene graphs can capture the interactions between visual and textual objects and model intramodal semantic associations, which are crucial for understanding the scenes depicted in images and text. In this article, we propose a novel scene graph semantic inference network (SGSIN) for image and text matching that effectively learns fine-level semantic information in vision and text, helping to bridge the cross-modal discrepancy. Specifically, we design two matching modules and construct a scene graph within each module to aggregate neighborhood information, refine the semantic representation of each object, and achieve fine-level alignment between the visual and textual modalities. Extensive experiments on Flickr30K and MSCOCO achieve state-of-the-art results, validating the advantages of the proposed approach.
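The neighborhood aggregation the abstract describes is in the spirit of GCN-style message passing over a scene graph: each object node refines its representation using the features of its graph neighbors. A minimal illustrative sketch follows; the function name, the toy graph, and the mean-aggregation rule are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def aggregate_neighbors(features, adjacency):
    """One round of scene-graph message passing: each node's
    representation is refined by averaging its own features
    with those of its neighbors (self-loop + mean aggregation)."""
    n = adjacency.shape[0]
    a_hat = adjacency + np.eye(n)            # add self-loops
    degree = a_hat.sum(axis=1, keepdims=True)  # per-node neighbor counts
    return (a_hat @ features) / degree       # mean over the neighborhood

# Toy scene graph with 3 objects and edges (0-1) and (1-2),
# e.g. "man -- rides -- horse -- stands in -- field".
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
refined = aggregate_neighbors(feats, adj)
```

After one round, node 0's representation blends in node 1's features, so objects connected by a relation move closer in feature space, which is the intuition behind using graph structure for fine-level alignment.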
Index Terms
Scene Graph Semantic Inference for Image and Text Matching