
Image Captioning with Affective Guiding and Selective Attention

Published: 24 July 2018

Abstract

Image captioning is an increasingly important problem at the intersection of artificial intelligence, computer vision, and natural language processing. Recent work has shown that machines can generate meaningful and accurate sentences for images. However, most existing methods ignore the latent emotional information in an image. In this article, we propose a novel image captioning model with Affective Guiding and a Selective Attention Mechanism, named AG-SAM. Our method aims to bridge the affective gap between image captioning and the emotional response an image elicits. First, we introduce into AG-SAM affective components that capture higher-level concepts encoded in images, so that the language model can be adapted to generate more passionate and emotive sentences. In addition, a selective gate acting on the attention mechanism controls how much visual information AG-SAM uses at each decoding step. Experimental results show that our model outperforms most existing methods, clearly reflecting an association between images and emotional components that existing works usually ignore.
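The abstract does not give AG-SAM's equations, so the following is only a minimal sketch of the selective-gate idea it describes: a scalar gate, computed from the decoder's hidden state, scales the attended visual context before it feeds the language model. All names and dimensions here (SelectiveAttention, feat_dim, attn_dim, and so on) are hypothetical, and the attention form used is the common additive (Bahdanau-style) variant, not necessarily the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveAttention(nn.Module):
    """Hypothetical sketch of soft attention with a selective gate.

    A gate beta in [0, 1], computed from the decoder hidden state h,
    scales the attended visual vector, letting the language model
    decide how much visual information to use at each time step.
    """

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)    # project region features
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)   # project decoder state
        self.attn_score = nn.Linear(attn_dim, 1)          # scalar score per region
        self.gate = nn.Linear(hidden_dim, 1)              # selective gate from h

    def forward(self, feats: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # feats: (batch, regions, feat_dim); h: (batch, hidden_dim)
        scores = self.attn_score(torch.tanh(
            self.feat_proj(feats) + self.hid_proj(h).unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)          # attention weights over regions
        context = (alpha * feats).sum(dim=1)      # attended visual vector (B, feat_dim)
        beta = torch.sigmoid(self.gate(h))        # selective gate in [0, 1], (B, 1)
        return beta * context                     # gated visual context

# Toy usage with made-up sizes (e.g., 49 regions of 2048-d CNN features).
attn = SelectiveAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
v = attn(torch.randn(4, 49, 2048), torch.randn(4, 512))
print(v.shape)  # torch.Size([4, 2048])
```

Intuitively, when the gate saturates near zero the decoder relies mostly on its language prior (useful for function words), and near one it draws heavily on the image; this trade-off is what the abstract attributes to the selective gate.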

