Abstract
Visual emotion classification predicts people's emotional reactions to given visual content. Psychological studies show that human emotions are affected by visual stimuli at various levels, from low to high, including contrast, color, texture, scene, object, and association. Traditional approaches treated the different levels of stimuli as independent components and failed to fuse them effectively. This article proposes a hierarchical convolutional neural network (CNN)-recurrent neural network (RNN) approach that predicts emotion from the fused stimuli by exploiting the dependency among different-level features. First, we introduce a dual CNN to extract different levels of visual stimuli, with two related loss functions designed to learn the stimulus representations under a multi-task learning structure. Further, to model the dependency between the low- and high-level stimuli, a stacked bi-directional RNN fuses the features learned by the dual CNN. Comparison experiments on one large-scale and three small-scale datasets show that the proposed approach brings significant improvement, and ablation experiments demonstrate the effectiveness of the different modules of our model.
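To make the fusion idea concrete, the following is a minimal numpy sketch of the second stage: per-level feature vectors (standing in for the dual CNN's outputs) are treated as a sequence ordered from low- to high-level stimuli, a simple bi-directional RNN reads the sequence in both directions, and the concatenated final hidden states feed an emotion classifier. All dimensions, the four-level ordering, and the plain tanh RNN cell are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(seq, W_x, W_h, h0):
    """Plain tanh RNN over a sequence of feature vectors; returns all hidden states."""
    h, outs = h0, []
    for x in seq:
        h = np.tanh(W_x @ x + W_h @ h)
        outs.append(h)
    return outs

# Hypothetical setup: 4 stimulus levels (e.g., color, texture, scene, object),
# each summarized as a 64-d feature vector from the dual CNN.
levels, feat_dim, hid_dim, n_classes = 4, 64, 32, 8
features = [rng.standard_normal(feat_dim) for _ in range(levels)]

W_x = rng.standard_normal((hid_dim, feat_dim)) * 0.1
W_h = rng.standard_normal((hid_dim, hid_dim)) * 0.1
h0 = np.zeros(hid_dim)

# Bi-directional pass: read the low-to-high ordering and its reverse,
# then concatenate the two final hidden states as the fused representation.
fwd = rnn_pass(features, W_x, W_h, h0)
bwd = rnn_pass(features[::-1], W_x, W_h, h0)
fused = np.concatenate([fwd[-1], bwd[-1]])  # shape (2 * hid_dim,)

# Linear emotion classifier with a softmax over the fused representation.
W_out = rng.standard_normal((n_classes, 2 * hid_dim)) * 0.1
logits = W_out @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (8,)
```

Reading the levels as a sequence is what lets the recurrent stage model dependencies between adjacent levels rather than fusing them with a single flat concatenation.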
A Hierarchical CNN-RNN Approach for Visual Emotion Classification