Abstract
Image captioning is an increasingly important problem at the intersection of artificial intelligence, computer vision, and natural language processing. Recent work has shown that machines can generate meaningful and accurate sentences for images. However, most existing methods ignore the latent emotional information in an image. In this article, we propose a novel image captioning model with Affective Guiding and a Selective Attention Mechanism, named AG-SAM. Our method aims to bridge the affective gap between image captioning and the emotional response elicited by an image. First, we introduce affective components into AG-SAM that capture higher-level concepts encoded in images, so that our language model can be adapted to generate sentences that are more passionate and emotive. In addition, a selective gate acting on the attention mechanism controls how much visual information AG-SAM uses. Experimental results show that our model outperforms most existing methods and clearly reflects an association between images and emotional components that is usually ignored in existing work.
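The paper does not give its exact formulation here, but the idea of a selective gate on visual attention can be sketched as follows: a soft attention module pools regional image features into a context vector, and a scalar gate computed from the decoder state scales that context before it enters the language model. The sketch below is a minimal, hypothetical PyTorch illustration; the class name, layer sizes, and the gating form (a sigmoid scalar) are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a selective gate on visual attention,
# assuming an LSTM decoder state of size hidden_dim and K regional CNN features
# of size feat_dim. All names here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)    # project region features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                 # attention energies
        self.gate = nn.Linear(hidden_dim, 1)                # scalar selective gate

    def forward(self, feats, hidden):
        # feats: (B, K, feat_dim) region features; hidden: (B, hidden_dim) decoder state
        energy = self.score(torch.tanh(self.feat_proj(feats) +
                                       self.hidden_proj(hidden).unsqueeze(1)))  # (B, K, 1)
        alpha = F.softmax(energy, dim=1)         # attention weights over the K regions
        context = (alpha * feats).sum(dim=1)     # (B, feat_dim) attended visual context
        beta = torch.sigmoid(self.gate(hidden))  # (B, 1) gate in [0, 1]: how much visual
        return beta * context, alpha, beta       # information the decoder takes at this step
```

In this reading, `beta` close to 0 lets the decoder rely on its language prior (e.g., for function words), while `beta` close to 1 passes the attended visual context through largely unchanged.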