Abstract
Image captioning, which automatically describes an image in natural language, is regarded as a fundamental challenge in computer vision. In recent years, significant advances have been made in image captioning through improved attention mechanisms. However, most existing methods build attention on a single type of visual feature, such as patch features or object features, which limits the accuracy of the generated captions. In this article, we propose a Bidirectional Co-Attention Network (BCAN) that combines multiple visual features to provide information from different aspects. Different features are associated with predicting different words, and prior relations exist between these visual features. Based on this, we further propose a bottom-up and top-down bi-directional co-attention mechanism to extract discriminative attention information. Furthermore, most existing methods lack an effective multimodal integration strategy, generally combining features by simple addition or concatenation. To address this, we adopt the Multivariate Residual Module (MRM) to integrate multimodal attention features. We further propose a vertical MRM to integrate features of the same category and a horizontal MRM to combine features of different categories, which balances the contributions of the bottom-up and top-down co-attention. In contrast to existing methods, the BCAN obtains complementary information from multiple visual features via the bi-directional co-attention strategy and integrates multimodal information via the improved multivariate residual strategy. We conduct a series of experiments on two benchmark datasets (MSCOCO and Flickr30k), and the results indicate that the proposed BCAN achieves superior performance.
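To make the multivariate-residual idea concrete, the sketch below shows one common form of residual multimodal fusion: a Hadamard (element-wise) interaction between projected visual and textual features, wrapped in a residual shortcut. This is a minimal illustration of the general technique, not the paper's exact MRM; the dimensions, projection matrices (`W_v`, `W_h`), and activation choice are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_fuse(v, h, W_v, W_h):
    """Residual multimodal fusion sketch: project both modalities,
    combine them multiplicatively (Hadamard product), and add a
    residual shortcut from the textual feature."""
    joint = np.tanh(v @ W_v) * np.tanh(h @ W_h)  # multiplicative interaction
    return h + joint                             # residual shortcut

d_v, d_h = 8, 8
v = rng.standard_normal(d_v)            # attended visual feature (hypothetical)
h = rng.standard_normal(d_h)            # decoder hidden state (hypothetical)
W_v = 0.1 * rng.standard_normal((d_v, d_h))
W_h = 0.1 * rng.standard_normal((d_h, d_h))

fused = residual_fuse(v, h, W_v, W_h)
print(fused.shape)
```

The residual shortcut lets the fused representation fall back on the textual context when the multiplicative interaction contributes little, which is the intuition behind preferring residual fusion over plain addition or concatenation.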
Bi-Directional Co-Attention Network for Image Captioning