
Image Captioning via Semantic Guidance Attention and Consensus Selection Strategy

Published: 10 October 2018

Abstract

Recently, a series of works have incorporated spatial attention mechanisms into image captioning, achieving remarkable improvements in the quality of generated captions. However, the traditional spatial attention mechanism relies on latent and delayed semantic representations to decide which regions deserve more attention, resulting in inaccurate semantic guidance and the introduction of redundant information. To optimize the spatial attention mechanism, we propose the Semantic Guidance Attention (SGA) mechanism in this article. Specifically, SGA uses semantic word representations to provide intuitive semantic guidance that focuses accurately on semantically related regions, and it updates the attention information in a timely manner, which eases the generation of fluent sentences. Meanwhile, the beam search algorithm is widely used to predict words during sequence generation. Because it scores sentences by word probabilities alone, it tends to produce generic sentences and discard more distinctive captions. To overcome this limitation, we design the Consensus Selection (CS) strategy, which chooses the most descriptive and informative caption by the semantic similarity of captions rather than the probabilities of words: the consensus caption is the candidate with the highest cumulative semantic similarity to the reference captions. Our proposed model (SGA-CS) is validated on Flickr30k and MSCOCO, where it outperforms state-of-the-art approaches. To the best of our knowledge, SGA-CS is the first attempt to jointly produce semantic attention guidance and select descriptive captions for image captioning, and it achieves among the best performance of any cross-entropy training method.
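
To make the SGA idea concrete, here is a minimal sketch of an attention module whose query is the embedding of the word just generated (an explicit semantic representation) rather than the latent, delayed LSTM hidden state the abstract criticizes. All class names, dimensions, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of semantic-guidance attention: spatial attention over CNN
# region features, guided by the embedding of the previously generated word.
# Names and dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGuidanceAttention(nn.Module):
    def __init__(self, feat_dim, embed_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)   # project region features
        self.word_proj = nn.Linear(embed_dim, attn_dim)  # project the guiding word embedding
        self.score = nn.Linear(attn_dim, 1)              # scalar attention score per region

    def forward(self, regions, word_embed):
        # regions: (batch, num_regions, feat_dim) spatial CNN features
        # word_embed: (batch, embed_dim) embedding of the word just generated
        guide = self.word_proj(word_embed).unsqueeze(1)                    # (batch, 1, attn_dim)
        scores = self.score(torch.tanh(self.feat_proj(regions) + guide))  # (batch, R, 1)
        alpha = F.softmax(scores, dim=1)                                   # attention weights
        context = (alpha * regions).sum(dim=1)                             # (batch, feat_dim)
        return context, alpha.squeeze(-1)
```

In a full captioner, `context` would be fed to the language model at each decoding step; because the guidance comes directly from a word embedding, the attended regions track the current semantics rather than a delayed hidden state.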
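The CS strategy can likewise be sketched as a reranking step over beam-search candidates: rather than returning the highest-probability sentence, pick the one with the largest cumulative semantic similarity to a reference set. The bag-of-words cosine below is a simple stand-in for the paper's semantic similarity measure, which may differ; the function names and the use of the candidates as each other's references are assumptions for illustration.

```python
# Minimal sketch of consensus selection: rerank beam candidates by cumulative
# similarity instead of word probability. The similarity metric here is a
# bag-of-words cosine stand-in, not necessarily the paper's measure.
from collections import Counter
import math

def cosine_sim(a, b):
    """Bag-of-words cosine similarity between two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def consensus_select(candidates, references):
    """Return the candidate with the highest cumulative similarity to the references."""
    def score(cand):
        toks = cand.split()
        return sum(cosine_sim(toks, ref.split()) for ref in references)
    return max(candidates, key=score)

beams = ["a man riding a horse",
         "a person on an animal",
         "a man rides a brown horse"]
# At test time the beam candidates themselves can serve as the reference set.
print(consensus_select(beams, beams))
```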

