research-article

Counterfactual Scenario-relevant Knowledge-enriched Multi-modal Emotion Reasoning

Published: 07 June 2023

Abstract

Multi-modal emotion reasoning in videos (MERV) has recently attracted increasing attention due to its potential applications in human-computer interaction. The task requires not only recognizing utterance-level emotions of conspicuous speakers, but also perceiving the emotions of non-speakers in videos. Existing methods focus on modeling multi-modal, multi-level contexts to capture emotion-relevant clues from the complex scenarios in videos. However, context information alone is far from sufficient to infer the emotion labels of non-speakers, owing to the large gap between scenario situations and emotion labels. Inspired by the observation that humans solve complex problems by leveraging experience and knowledge, we propose SK-MER, a Scenario-relevant Knowledge-enhanced Multi-modal Emotion Reasoning framework for the MERV task, which leverages external knowledge to enhance video scenario understanding and emotion reasoning. Specifically, we use scenario concepts extracted from videos to build knowledge subgraphs from external knowledge bases. The knowledge subgraphs are then used to obtain scenario-relevant knowledge representations through dynamic knowledge graph attention. Next, we incorporate these knowledge representations into context modeling to enhance emotion reasoning with external scenario-relevant knowledge. In addition, we propose a counterfactual knowledge representation learning approach to obtain more effective scenario-relevant knowledge representations. Extensive experiments on the MEmoR dataset show that the proposed SK-MER framework achieves new state-of-the-art results.
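Since the abstract only sketches the pipeline, the following is a minimal PyTorch-style sketch of how its two central ideas could be realized: scenario-conditioned attention over a retrieved knowledge subgraph, and an InfoNCE-style counterfactual contrastive objective. Everything here (the class and function names `ScenarioKnowledgeAttention` and `counterfactual_contrastive_loss`, the dot-product attention form, and the exact loss) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScenarioKnowledgeAttention(nn.Module):
    """Hypothetical sketch of dynamic knowledge graph attention:
    aggregate concept-node embeddings from a retrieved knowledge
    subgraph, weighted by their relevance to the current scenario."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # projects the multi-modal scenario context
        self.key = nn.Linear(dim, dim)    # projects each subgraph concept node

    def forward(self, scenario_ctx: torch.Tensor, concept_nodes: torch.Tensor) -> torch.Tensor:
        # scenario_ctx:  (batch, dim)     fused multi-modal context vector
        # concept_nodes: (batch, n, dim)  embeddings of n subgraph nodes
        q = self.query(scenario_ctx).unsqueeze(1)         # (batch, 1, dim)
        k = self.key(concept_nodes)                       # (batch, n, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5      # (batch, n) scaled dot products
        attn = F.softmax(scores, dim=-1)                  # per-node relevance weights
        # Scenario-relevant knowledge representation: weighted node average.
        return (attn.unsqueeze(-1) * concept_nodes).sum(1)  # (batch, dim)


def counterfactual_contrastive_loss(factual: torch.Tensor,
                                    counterfactual: torch.Tensor,
                                    context: torch.Tensor,
                                    tau: float = 0.1) -> torch.Tensor:
    """Hypothetical InfoNCE-style objective: pull the factual knowledge
    representation toward its scenario context and push away a
    counterfactual one (e.g., built from a perturbed subgraph)."""
    f = F.normalize(factual, dim=-1)
    cf = F.normalize(counterfactual, dim=-1)
    c = F.normalize(context, dim=-1)
    pos = (f * c).sum(-1) / tau    # similarity of factual knowledge to the context
    neg = (cf * c).sum(-1) / tau   # similarity of the counterfactual variant
    logits = torch.stack([pos, neg], dim=-1)                        # (batch, 2)
    labels = torch.zeros(f.size(0), dtype=torch.long, device=f.device)  # positive at index 0
    return F.cross_entropy(logits, labels)


# Example with hypothetical shapes:
# attn = ScenarioKnowledgeAttention(dim=256)
# k_rep = attn(ctx, nodes)  # ctx: (B, 256), nodes: (B, N, 256) -> (B, 256)
```

In this reading, the counterfactual branch acts as a hard negative for the knowledge representation, so the model is rewarded only when the retrieved knowledge is genuinely tied to the observed scenario rather than generically plausible.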


• Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 5s
October 2023, 280 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3599694
• Editor: Abdulmotaleb El Saddik


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 7 June 2023
• Online AM: 10 February 2023
• Accepted: 22 January 2023
• Received: 5 September 2022
Published in TOMM Volume 19, Issue 5s
