
Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention

Published: 25 April 2018

Abstract

Image captioning has recently gained considerable attention thanks to the impressive results achieved by deep captioning architectures, which combine Convolutional Neural Networks, to extract image representations, with Recurrent Neural Networks, to generate the corresponding captions. At the same time, a significant research effort has been dedicated to saliency prediction models, which predict where humans fixate when looking at an image. Even though saliency information could usefully condition a captioning architecture, by indicating which parts of the image are salient and which are not, the two lines of research have so far rarely been combined. In this work, we propose an image captioning approach in which a generative recurrent neural network can focus on different parts of the input image while generating the caption, exploiting the conditioning provided by a saliency prediction model on which parts of the image are salient and which are contextual. Through extensive quantitative and qualitative experiments on large-scale datasets, we show that our model outperforms captioning baselines with and without saliency, as well as state-of-the-art approaches that combine saliency and captioning.
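
To make the described mechanism concrete, below is a minimal sketch of what one saliency-conditioned attention step could look like: two additive (Bahdanau-style) attention passes are computed over the same grid of image features, one biased toward salient regions and one toward contextual ones, and a learned gate blends the two context vectors at each decoding step. All names, shapes, and the gating scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyContextAttention(nn.Module):
    """Additive attention computed twice over the same image features:
    once biased toward salient regions, once toward contextual ones,
    then blended with a learned gate.

    Illustrative sketch only; dimensions and gating are assumptions.
    """

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        self.gate = nn.Linear(hidden_dim, 1)  # salient vs. context trade-off

    def _attend(self, feats, weights, h):
        # feats:   (B, R, feat_dim) region features from a CNN grid
        # weights: (B, R) soft region weights in [0, 1]
        # h:       (B, hidden_dim) current hidden state of the caption LSTM
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.state_proj(h).unsqueeze(1)
        )).squeeze(-1)                                   # (B, R) raw scores
        # Bias the attention distribution toward the weighted regions.
        alpha = F.softmax(e + torch.log(weights + 1e-8), dim=1)
        return (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, feat_dim)

    def forward(self, feats, saliency, h):
        # saliency: (B, R) per-region saliency from an external predictor
        salient_ctx = self._attend(feats, saliency, h)
        context_ctx = self._attend(feats, 1.0 - saliency, h)
        beta = torch.sigmoid(self.gate(h))               # (B, 1)
        return beta * salient_ctx + (1.0 - beta) * context_ctx


# Example: a 7x7 CNN grid (R = 49) with 512-d features and a 512-d LSTM state.
attn = SaliencyContextAttention(feat_dim=512, hidden_dim=512, attn_dim=256)
ctx = attn(torch.randn(2, 49, 512), torch.rand(2, 49), torch.randn(2, 512))
print(ctx.shape)  # torch.Size([2, 512])
```

In a decoder of this kind, the blended context vector would typically be concatenated with the previous word embedding and fed to the LSTM at each time step, as in standard soft-attention captioning models such as Show, Attend and Tell.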

• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 14, Issue 2 (May 2018), 208 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3210458

        Copyright © 2018 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 25 April 2018
        • Accepted: 1 December 2017
        • Received: 1 November 2017
Published in TOMM Volume 14, Issue 2


        Qualifiers

        • research-article
        • Research
        • Refereed
