Abstract
Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task.
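The word-region matching idea can be made concrete with a short sketch. The following Python/PyTorch snippet is a minimal illustration, assuming cosine similarities between region and word features and a max-over-regions, sum-over-words pooling of the alignment matrix into a single global image-sentence score; the function name and the exact pooling configuration are illustrative assumptions, not necessarily the paper's precise setup.

```python
# Minimal sketch of a fine-grained word-region alignment score.
# Assumptions: cosine similarities; max over regions, sum over words.
import torch
import torch.nn.functional as F

def global_similarity(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """regions: (n_regions, d) image-region features.
    words: (n_words, d) word features.
    Returns a scalar image-sentence similarity score."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    # (n_regions, n_words) word-region alignment matrix
    alignments = regions @ words.t()
    # For each word, keep its best-matching region, then sum over words.
    return alignments.max(dim=0).values.sum()
```

Note that supervision acts only on this pooled global score, so the per-pair alignments emerge without any word-region annotation.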
Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links between the two pipelines would make it impossible to extract visual and textual features separately, as required by the online search and offline indexing steps of large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way toward effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% on the Recall@1 metric for the image and the sentence retrieval tasks, respectively. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN.
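To illustrate why the separated pipelines matter for scalability, the sketch below (reusing `global_similarity` from the previous snippet) shows a hypothetical late-fusion retrieval flow: image features are computed and indexed offline, and at query time only the text pipeline runs before the cheap final alignment step. The encoder names and shapes are assumptions for illustration, standing in for TERAN's two transformer-encoder branches.

```python
# Sketch of the late-fusion retrieval flow that separated pipelines enable.
# visual_encoder / text_encoder are placeholders for the two branches.
import torch

@torch.no_grad()
def index_images(images, visual_encoder):
    """Offline: encode and store region features once per image."""
    return [visual_encoder(img) for img in images]  # each (n_regions, d)

@torch.no_grad()
def search(query_sentence, text_encoder, index, k=10):
    """Online: encode only the query, then score it against the stored
    visual features; k must not exceed the number of indexed images."""
    words = text_encoder(query_sentence)  # (n_words, d)
    scores = torch.stack([global_similarity(r, words) for r in index])
    return scores.topk(k).indices  # indices of the best-matching images
```

With a cross-attention model, by contrast, every image in the index would have to be re-encoded jointly with each incoming query, which is prohibitive at scale.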