Bi-Directional Co-Attention Network for Image Captioning

Published: 12 November 2021

Abstract

Image captioning, which aims to automatically describe an image in natural language, is regarded as a fundamental challenge in computer vision. In recent years, significant advances have been made in image captioning by improving the attention mechanism. However, most existing methods build attention on a single type of visual feature, such as patch features or object features, which limits the accuracy of the generated captions. In this article, we propose a Bi-Directional Co-Attention Network (BCAN) that combines multiple visual features to provide information from different aspects. Different features are associated with predicting different words, and a priori relations exist between these features. Based on this, we further propose a bottom-up and top-down bi-directional co-attention mechanism to extract discriminative attention information. Furthermore, most existing methods lack an effective multimodal integration strategy, generally combining features by simple addition or concatenation. To solve this problem, we adopt the Multivariate Residual Module (MRM) to integrate multimodal attention features. We further propose a Vertical MRM to integrate features of the same category and a Horizontal MRM to combine features of different categories, which balances the contributions of the bottom-up and top-down co-attention. In contrast to existing methods, the BCAN obtains complementary information from multiple visual features via the bi-directional co-attention strategy and integrates multimodal information via the improved multivariate residual strategy. We conduct a series of experiments on two benchmark datasets (MSCOCO and Flickr30k), and the results indicate that the proposed BCAN achieves superior performance.
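The abstract describes two mechanisms: bi-directional co-attention over multiple kinds of visual features, and multivariate residual fusion of the attended results. The PyTorch sketch below illustrates one plausible reading of these ideas; the feature shapes, the scaled dot-product attention form, the Hadamard-plus-residual fusion, and the direction labels are assumptions made for illustration, not the paper's actual BCAN or MRM implementation.

```python
# Minimal sketch of the two ideas in the abstract: (i) co-attention in two
# directions between two kinds of visual features, and (ii) a multivariate
# residual style fusion of the attended results. Shapes, layer choices, and
# the attention form are illustrative assumptions, not the paper's BCAN/MRM.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttention(nn.Module):
    """One direction of co-attention: `query` features attend over `context`."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query: (B, Nq, D), context: (B, Nc, D)
        q, k, v = self.q_proj(query), self.k_proj(context), self.v_proj(context)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (B, Nq, Nc)
        return F.softmax(scores, dim=-1) @ v  # context summarized per query


class MultivariateResidualFusion(nn.Module):
    """MRM-style fusion: a multiplicative (Hadamard) interaction between two
    feature vectors plus residual shortcuts (a common formulation; the
    paper's exact MRM may differ)."""

    def __init__(self, dim: int):
        super().__init__()
        self.f_a = nn.Linear(dim, dim)
        self.f_b = nn.Linear(dim, dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        joint = torch.tanh(self.f_a(a)) * torch.tanh(self.f_b(b))
        return a + b + joint  # residual combination of both modalities


# Dummy patch (grid) and object (detector) features; the bottom-up/top-down
# direction labels are illustrative only.
B, D = 2, 512
patches = torch.randn(B, 49, D)  # e.g., CNN grid features
objects = torch.randn(B, 36, D)  # e.g., Faster R-CNN region features
bottom_up = CoAttention(D)(objects, patches)  # objects attend to patches
top_down = CoAttention(D)(patches, objects)   # patches attend to objects
fused = MultivariateResidualFusion(D)(bottom_up.mean(1), top_down.mean(1))
print(fused.shape)  # torch.Size([2, 512])
```

In this reading, the paper's Vertical MRM would apply such a fusion step to features of the same category and the Horizontal MRM to features of different categories, so that the bottom-up and top-down attention paths contribute in a balanced way; the single fusion step shown is only the basic building block.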

• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 4
  November 2021, 529 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3492437

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 12 November 2021
        • Accepted: 1 April 2021
        • Revised: 1 March 2021
        • Received: 1 July 2020
Published in TOMM Volume 17, Issue 4


        Qualifiers

        • research-article
        • Refereed
