Abstract
Image–text retrieval is a vital task in computer vision that has received growing attention because it connects data across modalities. Its central challenges are learning unified representations and bridging the large gap between the visual and textual domains. Although many works have made significant progress in image–text retrieval, they still struggle with incomplete text descriptions of images, i.e., with fully learning the correlations between relevant region–word pairs under semantic diversity. In this article, we propose a novel semantic completion and filtration (SCAF) method to alleviate this issue. Specifically, a text semantic completion module generates a complete semantic description of an image from multi-view text descriptions, guiding the model to fully explore the correlations of relevant region–word pairs. Meanwhile, an adaptive structural semantic matching module filters out irrelevant region–word pairs according to the relevance score of each region–word pair, encouraging the model to focus on learning the relevance of matching pairs. Extensive experiments on the Flickr30K and MSCOCO datasets show that SCAF outperforms existing methods, demonstrating the superiority of the proposed approach.
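To make the pair-filtering idea concrete, the following is a minimal NumPy sketch of relevance-based region–word filtering: each region–word pair is scored by cosine similarity, pairs below a threshold are discarded, and the image–text similarity is aggregated over the retained pairs. The feature shapes, the fixed threshold, and the mean aggregation are illustrative assumptions, not the paper's actual SCAF formulation.

```python
import numpy as np

def region_word_relevance(regions, words, threshold=0.3):
    """Score region-word pairs by cosine similarity and filter out
    pairs below a relevance threshold (illustrative sketch).

    regions: (num_regions, dim) region features
    words:   (num_words, dim) word embeddings
    """
    # L2-normalize so the dot product is cosine similarity
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = r @ w.T                    # (num_regions, num_words) relevance scores
    mask = sim >= threshold          # keep only sufficiently relevant pairs
    filtered = np.where(mask, sim, 0.0)
    # image-text similarity: mean over the retained pairs
    kept = mask.sum()
    score = filtered.sum() / kept if kept > 0 else 0.0
    return sim, mask, score

# Toy usage: two regions, two words in a 2-D feature space
regions = np.array([[1.0, 0.0], [0.0, 1.0]])
words = np.array([[1.0, 0.0], [0.7, 0.7]])
sim, mask, score = region_word_relevance(regions, words)
```

In this toy example, the pair (region 1, word 0) is orthogonal, falls below the threshold, and is masked out, so it does not dilute the aggregated score — the same intuition the filtration module builds on, though the actual method learns the relevance scores rather than thresholding raw cosine similarity.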
Semantic Completion and Filtration for Image–Text Retrieval