Abstract
Image captioning has recently gained considerable attention thanks to the impressive results of deep captioning architectures, which combine Convolutional Neural Networks to extract image representations with Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to saliency prediction models, which predict human eye fixations. Although saliency information could usefully condition an image captioning architecture, by indicating what is salient and what is not, research has struggled to combine these two techniques. In this work, we propose an image captioning approach in which a generative recurrent neural network focuses on different parts of the input image during caption generation, exploiting the conditioning given by a saliency prediction model on which parts of the image are salient and which are contextual. Through extensive quantitative and qualitative experiments on large-scale datasets, we show that our model outperforms captioning baselines with and without saliency, as well as state-of-the-art approaches that combine saliency and captioning.
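The core idea of splitting visual attention into a salient and a contextual path can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the additive attention form, the weight names `W` and `w_a`, the per-region saliency values, and the normalization choices are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def saliency_context_attention(feats, sal, h, W, w_a):
    """One attention step: score each spatial region against the
    decoder state, then split the attention distribution into a
    salient part and a contextual part using the saliency map."""
    # additive attention scores, one per spatial region
    scores = np.tanh(feats @ W + h) @ w_a        # shape (N,)
    alpha = softmax(scores)                      # attention over regions
    # weight the distribution by saliency and by its complement
    alpha_sal = alpha * sal
    alpha_ctx = alpha * (1.0 - sal)
    alpha_sal /= alpha_sal.sum()                 # renormalize each part
    alpha_ctx /= alpha_ctx.sum()
    z_sal = alpha_sal @ feats                    # salient context vector
    z_ctx = alpha_ctx @ feats                    # contextual context vector
    return np.concatenate([z_sal, z_ctx])

rng = np.random.default_rng(0)
N, D = 49, 8                       # e.g. a 7x7 feature grid, 8-dim features
feats = rng.normal(size=(N, D))    # CNN region features (toy values)
sal = rng.uniform(size=N)          # predicted saliency per region, in [0, 1]
h = rng.normal(size=D)             # current decoder hidden state
W = rng.normal(size=(D, D))        # hypothetical attention weights
w_a = rng.normal(size=D)

z = saliency_context_attention(feats, sal, h, W, w_a)
print(z.shape)                     # (16,) — salient and contextual vectors
```

The design choice this illustrates is that a single attention distribution, masked by the saliency map and its complement, yields two separate image summaries that the caption decoder can weigh differently at each generated word.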
Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention