Abstract
Multi-modal emotion reasoning in videos (MERV) has recently attracted increasing attention due to its potential applications in human-computer interaction. The task requires not only recognizing utterance-level emotions for conspicuous speakers, but also perceiving the emotions of non-speakers in videos. Existing methods focus on modeling multi-modal, multi-level contexts to capture emotion-relevant clues from the complex scenarios in videos. However, context information alone is far from sufficient to infer the emotion labels of non-speakers, owing to the large gap between the scenario situation and the emotion labels. Inspired by the observation that humans solve complex problems by drawing on experience and knowledge, we propose SK-MER, a Scenario-relevant Knowledge-enhanced Multi-modal Emotion Reasoning framework for the MERV task, which leverages external knowledge to enhance video scenario understanding and emotion reasoning. Specifically, we use scenario concepts extracted from videos to build knowledge subgraphs from external knowledge bases. The subgraphs are then used to obtain scenario-relevant knowledge representations through dynamic knowledge graph attention. Next, we incorporate these knowledge representations into context modeling to enhance emotion reasoning with external scenario-relevant knowledge. In addition, we propose a counterfactual knowledge representation learning approach that yields more effective scenario-relevant knowledge representations. Extensive experiments on the MEmoR dataset show that the proposed SK-MER framework achieves new state-of-the-art results.
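To make the pipeline concrete, the sketch below (in PyTorch) illustrates the dynamic knowledge graph attention step: given a pooled multi-modal scenario feature and the node embeddings of a retrieved knowledge subgraph, scenario-conditioned attention weights pool the nodes into a single scenario-relevant knowledge representation. This is a minimal illustrative sketch, not the paper's implementation; the module name, dimensions, scaled dot-product scoring, and weighted-sum pooling are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicKnowledgeGraphAttention(nn.Module):
    """Pool a retrieved knowledge subgraph into one vector, with
    attention weights conditioned on the multi-modal scenario context.
    (Hypothetical sketch; not the authors' released code.)"""

    def __init__(self, d_context: int, d_concept: int, d_model: int):
        super().__init__()
        self.query_proj = nn.Linear(d_context, d_model)  # scenario context -> query
        self.key_proj = nn.Linear(d_concept, d_model)    # subgraph nodes -> keys
        self.value_proj = nn.Linear(d_concept, d_model)  # subgraph nodes -> values

    def forward(self, context, concepts, mask):
        # context:  (batch, d_context)     pooled video/audio/text scenario feature
        # concepts: (batch, n, d_concept)  node embeddings of the knowledge subgraph
        # mask:     (batch, n) bool        True for real (non-padded) nodes
        q = self.query_proj(context).unsqueeze(1)        # (batch, 1, d_model)
        k = self.key_proj(concepts)                      # (batch, n, d_model)
        v = self.value_proj(concepts)                    # (batch, n, d_model)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5     # (batch, n)
        scores = scores.masked_fill(~mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)                 # scenario-conditioned weights
        return (attn.unsqueeze(-1) * v).sum(dim=1)       # (batch, d_model)

# Toy usage: 4 clips, each with up to 20 retrieved concept nodes.
layer = DynamicKnowledgeGraphAttention(d_context=512, d_concept=300, d_model=256)
ctx = torch.randn(4, 512)                    # fused multi-modal context feature
nodes = torch.randn(4, 20, 300)              # e.g. concept embeddings from a KB
mask = torch.ones(4, 20, dtype=torch.bool)
knowledge = layer(ctx, nodes, mask)          # (4, 256) knowledge representation
```

A vector produced this way could then be fused with the multi-modal context representations before emotion classification, as the abstract describes.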