Abstract
Body language is one of the most common ways humans express emotion. In this article, we make the first attempt to generate an action video with a specific emotion from a single person image. The goal of the emotion-based action generation (EBAG) task is to generate action videos expressing a specified type of emotion, given a single reference image containing a full human body. We divide the task into two parts and propose a two-stage framework. In the first stage, we propose an emotion-based pose sequence generation approach (EPOSE-GAN) that translates the emotion into a pose sequence. In the second stage, we generate the target video frames from three inputs, namely the source pose and the target pose as motion information and the source image as the appearance reference, using a conditional GAN model with an online training strategy. Our framework produces the pose sequence and transforms the appearance independently, which highlights the fundamental role that high-level pose features play in generating action videos with a specific emotion. The proposed method has been evaluated on the "Soul Dancer" dataset, which we built for action emotion analysis and generation. The experimental results demonstrate that our framework can effectively solve the emotion-based action generation task. However, a gap in appearance details between the generated action videos and real-world videos still exists, which indicates that emotion-based action generation has great research potential.
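The two-stage pipeline described in the abstract can be sketched in code. This is a minimal structural sketch only: the function names (`generate_pose_sequence`, `render_frame`, `generate_action_video`) are hypothetical placeholders, and the bodies are stubs standing in for the trained EPOSE-GAN generator (stage 1) and the conditional GAN with online training (stage 2).

```python
import numpy as np

# Stage 1 (hypothetical stub): map an emotion label to a pose sequence.
# A real implementation would run the trained EPOSE-GAN generator here.
def generate_pose_sequence(emotion, num_frames=16, num_joints=18, seed=0):
    rng = np.random.default_rng(seed)
    # One (num_joints, 2) 2D pose per frame.
    return rng.standard_normal((num_frames, num_joints, 2))

# Stage 2 (hypothetical stub): render one target frame conditioned on the
# source image (appearance) and the source/target poses (motion).
# A real implementation would run the conditional GAN with online training.
def render_frame(source_image, source_pose, target_pose):
    return np.zeros_like(source_image)

def generate_action_video(source_image, source_pose, emotion):
    poses = generate_pose_sequence(emotion)
    return [render_frame(source_image, source_pose, p) for p in poses]

source_image = np.zeros((256, 256, 3))
source_pose = np.zeros((18, 2))
video = generate_action_video(source_image, source_pose, "happy")
print(len(video), video[0].shape)
```

The key design point the sketch reflects is the decoupling stated in the abstract: emotion influences only the pose sequence, while appearance transfer is handled independently per frame.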
Soul Dancer: Emotion-Based Human Action Generation