Constrained LSTM and Residual Attention for Image Captioning

Published: 5 July 2020

Abstract

Visual structure and syntactic structure are essential in images and texts, respectively. Visual structure depicts the entities in an image and their interactions, whereas syntactic structure in text reflects the part-of-speech constraints between adjacent words. Most existing methods either use a global visual representation to guide the language model or generate captions without considering the relationships between different entities or between adjacent words; their language models are therefore grounded in neither visual nor syntactic structure. To solve this problem, we propose a model that aligns the language model with the visual structure of the image and constrains it with a specific part-of-speech template. In addition, most methods exploit the latent relationship between words in a sentence and pre-extracted visual regions in an image, yet ignore the effect of unextracted regions on the predicted words. We therefore develop a residual attention mechanism that simultaneously attends to the pre-extracted visual objects and the unextracted regions of an image, capturing the precise regions corresponding to each predicted word by accounting for both. The effectiveness of the entire framework and of each proposed module is verified on two classical datasets, MSCOCO and Flickr30k. Our framework is on par with or better than state-of-the-art methods and achieves superior performance on the COCO captioning leaderboard.
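For readers who want a concrete picture of the two modules, the sketches below show one way they could be wired up. They are minimal PyTorch sketches under stated assumptions, not the authors' implementation: all module names, feature dimensions, and the residual combination rule are illustrative. The first block reads the abstract's residual attention as two additive-attention branches, one over detector-extracted object features and one over grid features that also cover the unextracted regions, combined by a residual sum.

```python
# A minimal sketch (not the authors' code) of a residual attention step:
# additive attention over pre-extracted object features, plus a residual
# branch over grid features that also cover regions the detector did not
# extract. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.obj_proj = nn.Linear(feat_dim, attn_dim)
        self.grid_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.obj_score = nn.Linear(attn_dim, 1)
        self.grid_score = nn.Linear(attn_dim, 1)

    def _attend(self, feats, proj, score, state):
        # feats: (B, N, feat_dim); state: (B, hidden_dim), the LSTM state.
        e = score(torch.tanh(proj(feats) + self.state_proj(state).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)                 # (B, N)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)  # (B, feat_dim)

    def forward(self, obj_feats, grid_feats, state):
        # Main branch: detector-extracted object regions.
        v_obj = self._attend(obj_feats, self.obj_proj, self.obj_score, state)
        # Residual branch: uniform grid features covering the whole image,
        # including regions the detector missed.
        v_grid = self._attend(grid_feats, self.grid_proj, self.grid_score, state)
        # Residual combination feeding the language model at this step.
        return v_obj + v_grid
```

The part-of-speech template constraint can similarly be pictured as a vocabulary mask applied at each decoding step. Again, this is a hypothetical sketch; it assumes, purely for illustration, that each vocabulary word carries a single POS tag id.

```python
# Hypothetical sketch of a POS-template constraint on decoding: at step t,
# words whose POS tag does not match the template entry for that step are
# masked out before the softmax.
import torch

def apply_pos_template(logits, word_pos_ids, template_pos_id):
    # logits: (B, V); word_pos_ids: (V,) POS tag id per vocabulary word;
    # template_pos_id: (B,) required POS tag at this decoding step.
    mask = word_pos_ids.unsqueeze(0) != template_pos_id.unsqueeze(1)  # (B, V)
    return logits.masked_fill(mask, float('-inf'))
```

In this reading, the residual grid branch lets attention recover content the object detector missed, while the POS mask keeps adjacent words consistent with the syntactic template.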

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 3 (August 2020), 364 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3409646
Copyright © 2020 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 5 July 2020
• Online AM: 7 May 2020
• Accepted: 1 March 2020
• Revised: 1 September 2019
• Received: 1 June 2019

Qualifiers

• research-article
• Research
• Refereed
