A Hierarchical CNN-RNN Approach for Visual Emotion Classification

Published: 07 December 2019
Abstract

Visual emotion classification aims to predict people's emotional reactions to given visual content. Psychological studies show that human emotions are affected by visual stimuli at various levels, from low to high, including contrast, color, texture, scene, object, and association. Traditional approaches treated different levels of stimuli as independent components and failed to fuse them effectively. This article proposes a hierarchical convolutional neural network (CNN)-recurrent neural network (RNN) approach that predicts emotion from the fused stimuli by exploiting the dependency among features at different levels. First, we introduce a dual CNN to extract different levels of visual stimuli, with two related loss functions designed to learn the stimulus representations under a multi-task learning structure. Then, to model the dependency between low- and high-level stimuli, a stacked bi-directional RNN fuses the features learned by the dual CNN. Comparison experiments on one large-scale and three small-scale datasets show that the proposed approach brings significant improvement, and ablation experiments demonstrate the effectiveness of each module of our model.
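The pipeline described in the abstract, multi-level features from a dual CNN fed as an ordered sequence into a stacked bi-directional RNN, can be sketched as follows. This is a minimal NumPy illustration and not the authors' implementation: the tanh RNN cell, the layer sizes, the mean-pooled fusion, and the 8-class head (e.g., the Mikels emotion categories) are all assumptions for illustration.

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, b):
    """Simple tanh RNN over a sequence of feature vectors; returns all hidden states."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        out.append(h)
    return out

def bidirectional_layer(xs, params_f, params_b):
    """One bi-directional layer: concatenate forward and backward states per step."""
    fwd = rnn_pass(xs, *params_f)
    bwd = rnn_pass(xs[::-1], *params_b)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

def make_params(rng, in_dim, hid_dim):
    """Randomly initialized (Wx, Wh, b) for one RNN direction."""
    return (rng.standard_normal((hid_dim, in_dim)) * 0.1,
            rng.standard_normal((hid_dim, hid_dim)) * 0.1,
            np.zeros(hid_dim))

rng = np.random.default_rng(0)
feat_dim, hid = 64, 32

# Stand-ins for the dual CNN's outputs: one feature vector per stimulus level,
# ordered from low level (contrast, color, texture) to high level (object, scene).
levels = [rng.standard_normal(feat_dim) for _ in range(5)]

# Two stacked bi-directional layers model the dependency among the levels.
layer1 = bidirectional_layer(levels,
                             make_params(rng, feat_dim, hid),
                             make_params(rng, feat_dim, hid))
layer2 = bidirectional_layer(layer1,
                             make_params(rng, 2 * hid, hid),
                             make_params(rng, 2 * hid, hid))

# Fuse the per-level states into one representation, then classify.
fused = np.mean(layer2, axis=0)                   # shape (2 * hid,)
W_cls = rng.standard_normal((8, 2 * hid)) * 0.1   # hypothetical 8-way emotion head
pred = int(np.argmax(W_cls @ fused))
print(fused.shape, pred)
```

In the actual model the parameters would be trained end-to-end together with the dual CNN's two loss functions; here they are random, so only the data flow (sequence of stimulus levels → stacked bi-RNN → fused prediction) is meaningful.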


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 3s
  Special Issue on Face Analysis for Applications and Special Issue on Affective Computing for Large-Scale Heterogeneous Multimedia Data
  November 2019, 304 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3368027

  Copyright © 2019 ACM

  Publisher: Association for Computing Machinery, New York, NY, United States

  Publication History

  • Received: 1 February 2019
  • Revised: 1 August 2019
  • Accepted: 1 August 2019
  • Published: 7 December 2019

          Qualifiers

          • research-article
          • Research
          • Refereed
