Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching

Abstract
Image-sentence matching is a challenging task at the intersection of language and vision, which aims to measure the similarity between images and sentence descriptions. Most existing methods independently map the global features of images and sentences into a common space and compute image-sentence similarity there. However, the similarity obtained this way may be coarse because (1) an intermediate common space is introduced to implicitly match the heterogeneous features of images and sentences at a global level, and (2) only the inter-modality relations between images and sentences are captured, while the intra-modality relations within each modality are ignored. To overcome these limitations, we propose a novel Cross-Modal Hybrid Feature Fusion (CMHF) framework that directly learns the image-sentence similarity by fusing multimodal features with both inter- and intra-modality relations incorporated. It robustly captures the high-level interactions between visual regions in images and words in sentences, using flexible attention mechanisms to generate effective attention flows within and across the two modalities. A structured objective with a ranking-loss constraint is formulated in CMHF to learn the image-sentence similarity from the fused fine-grained features of the two modalities, bypassing the use of an intermediate common space. Extensive experiments and comprehensive analysis on two widely used datasets, Microsoft COCO and Flickr30K, demonstrate the effectiveness of the hybrid feature fusion framework, with CMHF achieving state-of-the-art matching performance.
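To make the two ingredients of the abstract concrete, the sketch below illustrates, in PyTorch, how intra- and inter-modality attention flows can fuse region and word features into a direct pairwise similarity score, and how a hinge-based ranking loss with hard-negative mining (in the style of VSE++) can train it. This is a minimal sketch under our own assumptions: the names (`attend`, `RegionWordFusion`, `ranking_loss`), shapes, and margin value are hypothetical and do not reproduce the authors' implementation.

```python
# Minimal sketch, assuming PyTorch; all names, shapes, and hyperparameters
# here are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def attend(query, key, value):
    """Scaled dot-product attention: each query gathers context from key/value pairs."""
    scores = query @ key.transpose(-2, -1) / key.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ value


class RegionWordFusion(nn.Module):
    """Fuses region and word features via intra- and inter-modality attention flows,
    then scores each image-sentence pair directly, with no intermediate common space."""

    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, regions, words):
        # regions: (B, R, D) visual region features; words: (B, W, D) word features
        regions = regions + attend(regions, regions, regions)  # intra-modality (visual)
        words = words + attend(words, words, words)            # intra-modality (textual)
        r2w = attend(regions, words, words)                    # inter-modality flow
        fused = torch.cat([regions, r2w], dim=-1)              # fuse fine-grained features
        return self.score(fused).mean(dim=1).squeeze(-1)       # (B,) similarity scores


def ranking_loss(sim, margin=0.2):
    """Hinge-based triplet ranking loss on an (N, N) similarity matrix whose diagonal
    holds matched pairs, mining the hardest negative per row/column (VSE++-style)."""
    pos = sim.diag().unsqueeze(1)                              # (N, 1) matched scores
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))                 # exclude matched pairs
    cost_s = F.relu(margin + neg - pos).max(dim=1).values      # image -> sentence
    cost_i = F.relu(margin + neg - pos.t()).max(dim=0).values  # sentence -> image
    return (cost_s + cost_i).mean()


# Toy usage: score every image against every sentence in a batch of 4 pairs.
model = RegionWordFusion(dim=512)
regions = torch.randn(4, 36, 512)  # e.g., 36 detected regions per image
words = torch.randn(4, 12, 512)    # e.g., 12 words per sentence
sim = torch.stack([model(regions, words[j].expand(4, -1, -1)) for j in range(4)], dim=1)
loss = ranking_loss(sim)           # diagonal of sim = matched image-sentence pairs
```

The design point this sketch mirrors is that the fusion module emits a scalar similarity per image-sentence pair, so the ranking objective operates on the fused fine-grained features themselves rather than on distances in a shared embedding space.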