research-article

Image Captioning with a Joint Attention Mechanism by Visual Concept Samples

Published: 5 July 2020

Abstract

The attention mechanism has been established as an effective method for generating caption words in image captioning: it explores one attended subregion of an image to predict a related caption word. However, even though the attention mechanism can offer accurate subregions during training, the learned captioner may still make wrong predictions, especially for visual concept words, which are the most important parts for understanding an image. To tackle this problem, in this article we propose the Visual Concept Enhanced Captioner, which employs a joint attention mechanism with visual concept samples to strengthen the prediction of visual concepts in image captioning. Unlike traditional attention approaches that adopt one LSTM to explore one attended subregion at a time, the Visual Concept Enhanced Captioner introduces multiple virtual LSTMs in parallel to simultaneously receive multiple subregions from visual concept samples. The model then updates its parameters by jointly exploring these subregions according to a composite loss function. Technically, this joint learning helps to identify the common characteristics of a visual concept and thus enhances prediction accuracy for visual concepts. Moreover, by integrating diverse visual concept samples from different domains, our model can be extended to bridge visual bias in cross-domain learning for image captioning, which saves the cost of labeling captions. Extensive experiments have been conducted on two image datasets (MSCOCO and Flickr30K), and superior results are reported in comparison with state-of-the-art approaches. Notably, our approach significantly increases BLEU-1 and F1 scores, which demonstrates an accuracy improvement for visual concepts in image captioning.
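To make the joint-learning idea concrete, the following NumPy sketch illustrates a composite loss over multiple parallel branches that share one set of parameters: each branch receives a different subregion feature from a visual concept sample, and the shared parameters are updated by the averaged loss. This is a minimal, hypothetical rendering for illustration only; the shapes, the linear scorer standing in for the virtual LSTMs, and the uniform loss weighting are assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the vocabulary scores.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def composite_loss_and_grad(W, subregions, target):
    """Cross-entropy averaged over K parallel branches that share W.

    W          : (vocab, feat) shared parameters (stand-in for the LSTM)
    subregions : list of (feat,) feature vectors, one per concept sample
    target     : index of the ground-truth visual concept word
    """
    K = len(subregions)
    loss, grad = 0.0, np.zeros_like(W)
    for x in subregions:
        p = softmax(W @ x)
        loss += -np.log(p[target])
        # d(-log p[target]) / dW = (p - onehot(target)) x^T
        p[target] -= 1.0
        grad += np.outer(p, x)
    return loss / K, grad / K

rng = np.random.default_rng(0)
vocab, feat, K = 10, 8, 3
W = rng.normal(scale=0.1, size=(vocab, feat))
subregions = [rng.normal(size=feat) for _ in range(K)]
target = 4  # hypothetical index of the concept word

init_loss, _ = composite_loss_and_grad(W, subregions, target)
# A few shared-parameter updates driven by the composite loss.
for _ in range(100):
    loss, grad = composite_loss_and_grad(W, subregions, target)
    W -= 0.2 * grad
print(f"composite loss: {init_loss:.3f} -> {loss:.3f}")
```

Because all branches backpropagate into the same parameters, the update is pulled toward features that the subregions of one concept have in common, which is the intuition behind the improved concept prediction.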


Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 3 (August 2020), 364 pages
ISSN: 1551-6857 · EISSN: 1551-6865 · DOI: 10.1145/3409646
Copyright © 2020 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Publication History: Received 1 November 2019 · Revised 1 April 2020 · Accepted 1 April 2020 · Online AM 7 May 2020 · Published 5 July 2020