research-article

Fine-Grained Visual Textual Alignment for Cross-Modal Retrieval Using Transformer Encoders

Published: 12 November 2021

Abstract

Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences (i.e., image regions and words, respectively) to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task.
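
As a rough illustration of the fine-grained matching idea, the sketch below scores an image-sentence pair from its word-region similarity matrix and supervises training only through the pooled, global scores. It is a minimal PyTorch sketch under assumed choices (cosine similarities, max-over-regions then mean-over-words pooling, a VSE++-style hard-negative hinge loss); the exact TERAN pooling and loss configuration are those described in the paper and repository.

```python
import torch
import torch.nn.functional as F

def alignment_score(words, regions):
    """Global image-sentence score from fine-grained word-region alignments.

    words:   (n_words, d)   word embeddings from the textual transformer encoder
    regions: (n_regions, d) region embeddings from the visual transformer encoder

    The pooling used here (max over regions per word, then mean over words)
    is one plausible choice, not necessarily the exact TERAN configuration.
    """
    words = F.normalize(words, dim=-1)
    regions = F.normalize(regions, dim=-1)
    sim = words @ regions.t()             # (n_words, n_regions) cosine similarities
    per_word = sim.max(dim=1).values      # best-matching region for each word
    return per_word.mean()                # single image-sentence similarity

def hard_negative_triplet_loss(scores, margin=0.2):
    """Hinge loss with in-batch hard negatives, applied to global scores only.

    scores: (B, B) matrix where scores[i, j] is the similarity between
    image i and sentence j; positive pairs lie on the diagonal.
    """
    pos = scores.diag().view(-1, 1)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)       # image vs. wrong sentences
    cost_im = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)  # sentence vs. wrong images
    return cost_s.max(dim=1).values.sum() + cost_im.max(dim=0).values.sum()
```

Note that the loss only ever sees the pooled image-sentence scores, so the word-region alignments emerge from global supervision alone, as stated above.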

Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links between the modalities would make it impossible to extract visual and textual features separately, as required by the offline indexing and online search steps of large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way for research on effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% on the Recall@1 metric for the image and the sentence retrieval tasks, respectively. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN.
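
To make the indexing argument concrete, here is a hedged sketch of the offline/online split that the separated pipelines allow: region features for the whole collection are computed and stored ahead of time, and at query time only the textual encoder runs. The names `visual_encoder`, `textual_encoder`, `build_index`, and `search` are hypothetical stand-ins for illustration, not the actual TERAN interfaces.

```python
import torch
import torch.nn.functional as F

# Offline indexing: encode every image once and store its region features.
# `visual_encoder` is a hypothetical stand-in for the visual pipeline
# (region detector + transformer encoder); no text is involved at this stage.
def build_index(images, visual_encoder):
    return [F.normalize(visual_encoder(img), dim=-1) for img in images]   # each (n_regions, d)

# Online search: encode the query sentence once, then score it against the
# precomputed region features; the two modalities meet only in this last step.
def search(query, index, textual_encoder, k=10):
    words = F.normalize(textual_encoder(query), dim=-1)                   # (n_words, d)
    scores = torch.stack([
        (words @ regions.t()).max(dim=1).values.mean()                    # same pooling as in the sketch above
        for regions in index
    ])
    return scores.topk(min(k, len(index))).indices                        # indices of the top-k images
```

With cross-attention between the two pipelines, a function like `build_index` could not exist: every query would require re-processing every image in the collection together with the query sentence.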

        • Published in

          ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 4
          November 2021
          529 pages
          ISSN: 1551-6857
          EISSN: 1551-6865
          DOI: 10.1145/3492437

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 12 November 2021
          • Revised: 1 February 2021
          • Accepted: 1 February 2021
          • Received: 1 July 2020
          Published in TOMM Volume 17, Issue 4

          Qualifiers

          • research-article
          • Refereed
