Abstract
There have been several attempts to integrate a spatial visual attention mechanism into image caption models and to introduce semantic concepts as guidance for caption generation. High-level semantic information captures the abstract, general content of an image and is therefore beneficial to model performance. However, this high-level information is usually encoded as a static representation that does not take the salient elements into account; without attention, the non-salient portion of the high-level semantics amounts to redundant information for caption generation. In this article, a semantic attention mechanism is therefore applied to the high-level information in place of the conventional static representation. Additionally, the generation of visual words and non-visual words can be separated: an adaptive attention mechanism switches the guidance of caption generation between the new fused information (image features combined with high-level semantics) and a language model. Visual words are thus generated from the image features and high-level semantic information, while non-visual words are predicted by the language model. The semantic attention, the adaptive attention, and the previously generated words are fused into a dedicated attention module for the input and output of a long short-term memory network. The resulting captions are concise sentences that accurately capture the rich content of an image. Experimental results show that the proposed model performs promisingly on the evaluation metrics and produces logical, rich descriptions.
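As a rough illustration only, the following PyTorch sketch shows one way the adaptive switching described above could be realized: an attention distribution over fused region features and concept embeddings yields a visual-semantic context, and a scalar sentinel gate decides how much the next word should rely on that context versus the language model (in the spirit of Lu et al., "Knowing when to look," 2017). All names, dimensions, and the module structure here are assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AdaptiveSemanticAttention(nn.Module):
    """Illustrative sketch (assumed names/shapes): blends a fused
    visual-semantic context with a 'visual sentinel' from the language
    model, so visual words can draw on image/concept features while
    non-visual words fall back on the LSTM state."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.v_proj = nn.Linear(feat_dim, hidden_dim)    # spatial region features
        self.a_proj = nn.Linear(feat_dim, hidden_dim)    # high-level concept embeddings
        self.h_proj = nn.Linear(hidden_dim, hidden_dim)  # LSTM hidden state
        self.att = nn.Linear(hidden_dim, 1)
        self.sentinel_gate = nn.Linear(2 * hidden_dim, 1)

    def forward(self, regions, concepts, h, sentinel):
        # regions:  (B, R, feat_dim)  CNN spatial features
        # concepts: (B, C, feat_dim)  semantic concept embeddings
        # h:        (B, hidden_dim)   current LSTM hidden state
        # sentinel: (B, hidden_dim)   language-model fallback vector
        cand = torch.cat([self.v_proj(regions), self.a_proj(concepts)], dim=1)
        scores = self.att(torch.tanh(cand + self.h_proj(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)            # (B, R + C, 1)
        fused = (alpha * cand).sum(dim=1)               # fused visual-semantic context
        # Scalar gate beta in [0, 1]: beta -> 1 leans on the language model
        # (non-visual words), beta -> 0 leans on the fused image/semantic context.
        beta = torch.sigmoid(self.sentinel_gate(torch.cat([sentinel, h], dim=-1)))
        return beta * sentinel + (1 - beta) * fused

# Example shapes: batch of 4, 49 regions, 10 concepts, 2048-d features, 512-d LSTM.
att = AdaptiveSemanticAttention(feat_dim=2048, hidden_dim=512)
ctx = att(torch.randn(4, 49, 2048), torch.randn(4, 10, 2048),
          torch.randn(4, 512), torch.randn(4, 512))    # -> (4, 512)
```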