
Adaptive Attention-based High-level Semantic Introduction for Image Caption

Published: 17 December 2020

Abstract

There have been several attempts to integrate a spatial visual attention mechanism into an image captioning model and to introduce semantic concepts as guidance for caption generation. High-level semantic information conveys the abstractness and generality of an image and is beneficial to model performance. However, this high-level information is usually a static representation that does not account for the salient elements. In this article, a semantic attention mechanism is therefore applied to the high-level information in place of the conventional static representation, since static high-level semantics that are not focused on the salient elements amount to redundant information for caption generation. Additionally, the generation of visual words and non-visual words is separated: an adaptive attention mechanism switches the guidance of caption generation between new fused information (a fusion of image features and high-level semantics) and a language model. Visual words can thus be generated from the image features and high-level semantic information, while non-visual words are predicted by the language model. The semantic attention, the adaptive attention, and the previously generated words are fused into a dedicated attention module at the input and output of a long short-term memory (LSTM) network, so that a caption can be generated as a concise sentence that accurately captures the rich content of the image. Experimental results show that the proposed model achieves promising scores on the evaluation metrics and produces logical and rich descriptions.
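To make the switching mechanism described above concrete, the following PyTorch sketch shows one plausible wiring of semantic attention over high-level concept vectors combined with a visual-sentinel-style adaptive gate (in the spirit of Lu et al.'s "Knowing when to look"). It is a minimal illustration only: the class name, dimensions, and the simple additive fusion of visual and semantic contexts are assumptions, not the paper's published implementation.

```python
# Minimal sketch of adaptive attention with semantic guidance (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSemanticAttention(nn.Module):
    def __init__(self, feat_dim, sem_dim, hid_dim, att_dim):
        super().__init__()
        # Projections for spatial image features, high-level semantic
        # concepts, the LSTM hidden state, and the visual sentinel.
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.sem_proj = nn.Linear(sem_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.sent_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, sems, h, sentinel):
        # feats: (B, R, feat_dim) spatial regions; sems: (B, K, sem_dim)
        # concept embeddings; h, sentinel: (B, hid_dim). The sentinel would
        # come from the LSTM, e.g. s_t = sigmoid(W_x x_t + W_h h_{t-1}) * tanh(c_t).
        q = self.hid_proj(h).unsqueeze(1)                       # (B, 1, att_dim)

        # Semantic attention: weight concept vectors by relevance to the
        # current hidden state, suppressing non-salient (redundant) semantics.
        sems_p = self.sem_proj(sems)
        sem_e = self.score(torch.tanh(sems_p + q)).squeeze(-1)  # (B, K)
        sem_ctx = (F.softmax(sem_e, dim=1).unsqueeze(-1) * sems_p).sum(1)

        # Spatial attention scores over image regions.
        feats_p = self.feat_proj(feats)
        vis_e = self.score(torch.tanh(feats_p + q)).squeeze(-1)  # (B, R)

        # The sentinel competes with the regions for attention mass; its
        # share beta decides how far to fall back on the language model
        # (non-visual words) instead of visual/semantic evidence.
        sent_p = self.sent_proj(sentinel)
        sent_e = self.score(torch.tanh(sent_p + q.squeeze(1)))   # (B, 1)
        alpha = F.softmax(torch.cat([vis_e, sent_e], dim=1), dim=1)  # (B, R+1)
        vis_ctx = (alpha[:, :-1].unsqueeze(-1) * feats_p).sum(1)
        beta = alpha[:, -1:]

        # Adaptive mixture of visual evidence and sentinel, then fusion with
        # the semantic context (assumed here as a simple sum).
        return vis_ctx + beta * sent_p + sem_ctx

# Usage: B=2 images, R=49 regions, K=10 concepts.
att = AdaptiveSemanticAttention(feat_dim=2048, sem_dim=300, hid_dim=512, att_dim=256)
ctx = att(torch.randn(2, 49, 2048), torch.randn(2, 10, 300),
          torch.randn(2, 512), torch.randn(2, 512))
print(ctx.shape)  # torch.Size([2, 256])
```

The fused context would then be fed to the LSTM alongside the previously generated word, matching the abstract's description of an attention module at the LSTM's input and output.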



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 4
November 2020, 372 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3444749
Copyright © 2020 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 December 2019
• Revised: 1 April 2020
• Accepted: 1 July 2020
• Published: 17 December 2020

