
Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching


Abstract

Image-sentence matching is a challenging task at the intersection of language and vision that aims to measure the similarity between images and sentence descriptions. Most existing methods independently map the global features of images and sentences into a common space and compute the image-sentence similarity there. However, the similarity obtained in this way can be coarse because (1) an intermediate common space is introduced to implicitly match the heterogeneous features of images and sentences at a global level, and (2) only the inter-modality relations between images and sentences are captured while the intra-modality relations are ignored. To overcome these limitations, we propose a novel Cross-Modal Hybrid Feature Fusion (CMHF) framework that directly learns the image-sentence similarity by fusing multimodal features while incorporating both inter- and intra-modality relations. CMHF robustly captures the high-level interactions between visual regions in images and words in sentences, using flexible attention mechanisms to generate effective attention flows within and across the two modalities. A structured objective with a ranking-loss constraint is formulated to learn the image-sentence similarity from the fused fine-grained features of the two modalities, bypassing the intermediate common space. Extensive experiments and comprehensive analysis on two widely used datasets, Microsoft COCO and Flickr30K, demonstrate the effectiveness of the hybrid feature fusion framework, with CMHF achieving state-of-the-art matching performance.
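The abstract describes a model that attends within each modality, attends across modalities, and then scores the fused features under a ranking loss instead of comparing embeddings in a shared space. The sketch below is a minimal PyTorch illustration of that general pattern only, not the paper's actual architecture: the module names, dimensions, pooling choices, and the use of nn.MultiheadAttention are assumptions made purely for exposition.

```python
import torch
import torch.nn as nn

class HybridFusionSketch(nn.Module):
    """Toy intra-/inter-modality attention fusion (illustrative only)."""

    def __init__(self, dim=1024, heads=8):
        super().__init__()
        # Intra-modality relations: self-attention within image regions and within words.
        self.intra_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intra_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modality relations: image regions attend to sentence words.
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Regress a similarity score directly from the fused features,
        # rather than projecting both modalities into a common space.
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, regions, words):
        # regions: (B, R, dim) region features; words: (B, W, dim) word features.
        v, _ = self.intra_img(regions, regions, regions)   # intra-image attention flow
        t, _ = self.intra_txt(words, words, words)         # intra-sentence attention flow
        v2t, _ = self.inter(v, t, t)                       # inter-modality attention flow
        fused = torch.cat([v2t.mean(dim=1), t.mean(dim=1)], dim=-1)
        return self.score(fused).squeeze(-1)               # (B,) paired similarities


def ranking_loss(sim, margin=0.2):
    """Bidirectional hinge ranking loss on a (B, B) image-sentence similarity matrix."""
    pos = sim.diag().view(-1, 1)
    cost_s = (margin + sim - pos).clamp(min=0)        # image -> sentence retrieval
    cost_im = (margin + sim - pos.t()).clamp(min=0)   # sentence -> image retrieval
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_s.masked_fill(eye, 0).sum() + cost_im.masked_fill(eye, 0).sum()
```

In a training loop one would score every image-sentence pair in a mini-batch to obtain the B-by-B matrix fed to ranking_loss; the margin value and the mean pooling used here are placeholders, and variants such as hard-negative mining are common in this line of work.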



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 4
  November 2021
  529 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3492437

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 12 November 2021
        • Accepted: 1 March 2021
        • Revised: 1 February 2021
        • Received: 1 November 2020
Published in TOMM Volume 17, Issue 4


        Qualifiers

        • research-article
        • Refereed
