A Bipartite Graph is All We Need for Enhancing Emotional Reasoning with Commonsense Knowledge

The context-aware emotional reasoning ability of AI systems, especially in conversations, is of vital importance in applications such as online opinion mining from social media and empathetic dialogue systems. Due to the implicit nature of conveying emotions in many scenarios, commonsense knowledge is widely utilized to enrich utterance semantics and enhance conversation modeling. However, most previous knowledge infusion methods perform empirical knowledge filtering and design highly customized architectures for knowledge interaction with the utterances, which can discard useful knowledge aspects and limit their generalizability to different knowledge sources. Based on these observations, we propose a Bipartite Heterogeneous Graph (BHG) method for enhancing emotional reasoning with commonsense knowledge. In BHG, the extracted context-aware utterance representations and knowledge representations are modeled as heterogeneous nodes. Two more knowledge aggregation node types are proposed to perform automatic knowledge filtering and interaction. BHG-based knowledge infusion can be directly generalized to multi-type and multi-grained knowledge sources. In addition, we propose a Multi-dimensional Heterogeneous Graph Transformer (MHGT) to perform graph reasoning, which can retain unchanged feature spaces and unequal dimensions for heterogeneous node types during inference to prevent unnecessary loss of information. Experiments show that BHG-based methods significantly outperform state-of-the-art knowledge infusion methods and show generalized knowledge infusion ability with higher efficiency. Further analysis proves that previous empirical knowledge filtering methods are not guaranteed to provide the most useful knowledge information. Our code is available at: https://github.com/SteveKGYang/BHG.


INTRODUCTION
Understanding human emotions is at the core of affective computing. In natural language processing, recent years have witnessed growing research interest in machines' context-aware emotional reasoning ability, especially in conversations, due to its vital importance in scenarios such as empathetic dialogue systems [17] and online opinion mining from social media [4]. This goal is mostly specified as recognizing the emotion [23] or the emotion cause [21] of certain utterances within a conversation.
There are two key challenges for enhancing conversational emotional reasoning. Firstly, the emotion of the target speaker is influenced by both their own mental state and other participants' behaviors. Current methods mainly build conversation models [27,28,38] based on Pre-trained Language Models (PLMs) [15,41] to tackle these dependencies. Secondly, emotions are often conveyed implicitly with metaphor, sarcasm, and underlying common sense. To mine the related information, a mainstream solution infuses commonsense knowledge to provide emotional clues and help model the inter-utterance relations, which mostly follows a three-step paradigm. We provide an illustration in Figure 1 and summarize this paradigm as follows: Firstly, knowledge extraction obtains commonsense knowledge items with the target conversation as queries. This process is closely dependent on the granularity of the queries and the characteristics of the knowledge source. As shown in the example, with the utterance-level query "Yeah, jogging with Sally!" and the generative knowledge source COMET [2], we can extract sentence-level social commonsense knowledge such as "PersonX is seen as active" and "As a result, others feel excited". Secondly, knowledge filtering selects the most relevant knowledge aspects from the extracted knowledge items based on task-specific priors. The most widely adopted methods include rule-based [6,44,47] and distantly supervised [13,37] filtering. Finally, knowledge interaction introduces the filtered knowledge to the conversation model via customized architectures to enhance its emotional reasoning ability. For example, a typical architecture [12,13] assigns knowledge aspects to the edges of the conversation graphs and utilizes the corresponding knowledge features as edge representations to model the inter-utterance dependencies. In this paper, we dive into the knowledge filtering and knowledge interaction steps and raise the following research questions (RQ):
• RQ1: Most knowledge filtering methods select knowledge aspects empirically without further evaluations. What knowledge aspects are most effective in enhancing conversational emotional reasoning?
• RQ2: Current knowledge interaction methods are highly coupled with both the knowledge sources and conversation models. Are these customized architectures necessary?
In addressing these research questions, we propose a simple yet effective Bipartite Heterogeneous Graph (BHG) method for enhancing emotional reasoning with commonsense knowledge. Firstly, we extract context-aware utterance representations via a PLM-based conversation model and relevant commonsense knowledge from three multi-type and multi-grained knowledge sources. Considering the complementary nature [40] between utterance semantics and related common sense, we model the extracted utterance and knowledge representations as heterogeneous nodes. In addition, we introduce a forward and a backward knowledge aggregation node type to perform automatic knowledge filtering and knowledge interaction, because for each target utterance, effective knowledge aspects for modeling inter-utterance relations are usually different in the past context and future context along the dialogue flow, as shown in previous works [12,13,44]. Then a bipartite graph is built on these four heterogeneous node types, where messages from the utterance and knowledge are passed to the knowledge aggregation nodes for semantic-aware knowledge filtering, and the aggregated knowledge messages interact with the utterance nodes to enrich their semantics and model inter-utterance dependencies. The BHG is decoupled from the conversation models and knowledge sources and enables a unified model architecture for multi-type and multi-grained knowledge infusion. Its simple bipartite structure also facilitates the graph reasoning process.
In a BHG, heterogeneous nodes usually possess different feature spaces with unequal dimensions as conversation models/knowledge sources change. During the graph reasoning process, existing heterogeneous graph neural networks [8,26,32] mostly project all types of nodes into a unified feature space to facilitate the interaction between neighbors. However, the projections disrupt the original feature spaces and can lead to unnecessary loss of information for high-dimensional node types. Based on the heterogeneous graph Transformer [8], we propose a Multi-dimensional Heterogeneous Graph Transformer (MHGT) for graph reasoning, which utilizes a multi-dimensional edge-dependent matrix to enable direct attention calculation and message passing between heterogeneous nodes with different feature dimensions. MHGT allows all node types to retain the original feature spaces and potentially useful information for the knowledge filtering and knowledge interaction processes during inference.
We evaluate the effectiveness of our proposed BHG and MHGT methods on five datasets across two conversational emotional reasoning tasks: Emotion Recognition in Conversations (ERC) and Causal Emotion Entailment (CEE). The experimental results show that the BHG-based methods outperform previous state-of-the-art knowledge infusion models, and show generalized knowledge infusion ability on multi-type and multi-grained knowledge sources with higher efficiency than previous customized methods. We also analyze the effectiveness of different knowledge aspects, and the results show that previous empirical knowledge filtering methods can introduce less useful knowledge aspects and discard knowledge aspects that benefit the emotional reasoning process.
In summary, this paper makes the following contributions: (1) we propose a bipartite heterogeneous graph-based method to enhance emotional reasoning with commonsense knowledge, which enables a unified framework for multi-type and multi-grained knowledge infusion; (2) we propose a multi-dimensional heterogeneous graph Transformer for graph reasoning, which allows unchanged feature spaces and unequal dimensions for heterogeneous node representations during inference; (3) the BHG-based methods significantly outperform state-of-the-art knowledge infusion methods and show generalized knowledge infusion ability with higher efficiency.

METHODOLOGY
We introduce the utterance and knowledge feature extraction process in Sec. 2.1. Then the BHG construction process and the multi-dimensional HGT methods are introduced in Sec. 2.2 and 2.3. Finally, the examined tasks and prediction process are described in Sec. 2.4.

Feature Extraction
In this section, we extract the utterance features via a PLM-based conversation model. We also extract multi-type (generative and extractive knowledge) and multi-grained (utterance-level and phrase-level) commonsense knowledge to enhance the emotional reasoning process. These processes are illustrated in Fig. 2 (a).

Conversation Model.
We decouple conversation modeling from knowledge infusion by utilizing existing conversation models to extract context-aware utterance representations.

The conversation model encodes each utterance $u_i$ together with its dialogue context as an input sequence $\hat{u}_i$ and outputs the token-level representations $\hat{h}_i \in \mathbb{R}^{N_i \times d_h}$, where $N_i$ is the token number of $\hat{u}_i$ and $d_h$ denotes the dimension of the hidden states. We further obtain the utterance-level context-aware representation $h_i$ for each utterance $u_i$ within the dialogue flow via mean-pooling:

$$h_i = \mathrm{MeanPooling}\big(\hat{h}_i[s_i : e_i]\big)$$

where $s_i$ and $e_i$ denote the start and end positions of $u_i$ in $\hat{u}_i$.
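As a minimal illustration of the mean-pooling step, the utterance-level representation is simply the average of the token-level representations inside the utterance span. The function and variable names below are hypothetical, not from the paper's code:

```python
import numpy as np

def mean_pool_utterance(token_reprs: np.ndarray, start: int, end: int) -> np.ndarray:
    """Mean-pool the token-level representations token_reprs[start:end]
    (shape [N_i, d_h]) into a single utterance-level vector of shape [d_h]."""
    return token_reprs[start:end].mean(axis=0)

# Toy example: 5 tokens with hidden size 4; the utterance spans tokens 1..3.
token_reprs = np.arange(20, dtype=float).reshape(5, 4)
h_i = mean_pool_utterance(token_reprs, 1, 4)
# h_i is the element-wise mean of rows 1, 2, and 3: [8., 9., 10., 11.]
```

In a real setting, `token_reprs` would be the PLM's last hidden states for the concatenated context window, and `start`/`end` the target utterance's token offsets.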
2.1.2 Knowledge Extraction. In previous works, two commonsense knowledge graphs have been proven most effective in enhancing the emotional reasoning process: the social commonsense knowledge graph ATOMIC [25] and the taxonomic/lexical knowledge graph ConceptNet [29]. Therefore, we separately obtain relevant knowledge from three knowledge sources expanded from ATOMIC and ConceptNet. With the development of automatic knowledge graph construction, generative knowledge sources receive increasing interest due to their flexibility and convenience in knowledge extraction. We extract the ATOMIC knowledge via two COMET models: the first one is COMET 2019 [2], which pre-trains a GPT [24] model on the ATOMIC knowledge graph to generate utterance-level social commonsense knowledge, with each knowledge representation having dimension $d_{2019}$. The second knowledge source COMET 2020 [9] further extends the generative pre-training to an expanded ATOMIC knowledge graph and part of ConceptNet based on a larger BART [11] model. COMET 2020 follows a similar knowledge extraction process as COMET 2019 and each knowledge representation has dimension $d_{2020}$.
On the other hand, we obtain phrase-level ConceptNet knowledge in an extractive manner. Specifically, we tokenize each utterance $u_i$ and concatenate the tokens into n-gram phrases. For each phrase, we extract all its immediate neighbors in the English sub-graph of ConceptNet. Each neighbor assertion is a ⟨source, relation, target, weight⟩ quadruple where source and target denote the query phrase and the neighbor concept, relation denotes the corresponding relation type, and weight denotes a confidence score assigned to the assertion. For example, we can extract the assertion ⟨think, HasPrerequisite, use brains, 2.375⟩ for the input phrase think. There are 28 different relation types in the extracted quadruples. We further remove all assertions with confidence scores less than 2.0 for denoising. To facilitate the extraction of the knowledge representations, we manually design an interpretation for each relation type and convert all assertions into natural language by concatenating the concepts to the relation interpretation. For example, ⟨think, HasPrerequisite, use brains⟩ is converted to "think has the prerequisite of use brains". We present the interpretations of the five most common relations in Table 1. As most of these knowledge items have simple semantics, we use a RoBERTa-Base encoder to extract the features for the converted natural language knowledge, and each knowledge representation has dimension $d_{CN}$. Overall, for each utterance $u_i$ and a knowledge source, we obtain a set of knowledge representations $\{k_{i,1}, k_{i,2}, \ldots, k_{i,M_i}\}$, where the dimension of each $k_{i,m}$ depends on the knowledge source, and $M_i$ denotes the number of extracted knowledge items for $u_i$.
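The confidence-based denoising and relation verbalization described above can be sketched as follows. The helper name and the two-entry interpretation table are illustrative stand-ins for the paper's full 28-relation mapping:

```python
# Illustrative subset of the hand-written relation interpretations (Table 1
# in the paper defines one per relation type; these two are examples).
INTERPRETATIONS = {
    "HasPrerequisite": "{} has the prerequisite of {}",
    "RelatedTo": "{} is related to {}",
}

def verbalize_assertions(assertions, min_weight=2.0):
    """assertions: list of (source, relation, target, weight) quadruples.
    Drops low-confidence assertions (denoising) and converts the rest
    into natural-language sentences via the relation interpretations."""
    out = []
    for src, rel, tgt, weight in assertions:
        if weight < min_weight or rel not in INTERPRETATIONS:
            continue
        out.append(INTERPRETATIONS[rel].format(src, tgt))
    return out

sentences = verbalize_assertions([
    ("think", "HasPrerequisite", "use brains", 2.375),
    ("think", "RelatedTo", "brain", 1.0),  # dropped: confidence < 2.0
])
# sentences == ["think has the prerequisite of use brains"]
```

The resulting sentences would then be encoded with RoBERTa-Base to obtain the knowledge representations.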

Bipartite Heterogeneous Graph
We propose a directed bipartite heterogeneous graph (BHG) to complement utterance representations with extracted commonsense knowledge representations. Formally, the BHG is denoted as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{N}, \mathcal{R})$, where each node $v \in \mathcal{V}$ and each edge $e \in \mathcal{E}$. $\mathcal{N}$ and $\mathcal{R}$ denote the sets of node and relation types, and each node or edge is projected to its type via a mapping function $\tau(v) : \mathcal{V} \rightarrow \mathcal{N}$ or $\phi(e) : \mathcal{E} \rightarrow \mathcal{R}$, respectively. Specifically, the node and edge types are defined as follows:
2.2.1 Node Types. Firstly, each utterance within a conversation is modeled as a node in the BHG. Its representation obtained from the conversation model is used as the node feature. The utterance node type is denoted as $h$. The extracted knowledge representations are modeled as another type of node: $k$. In addition, we introduce extra knowledge aggregation node types to perform automatic knowledge filtering and knowledge interaction. As previous works have shown that effective knowledge aspects can be different for performing knowledge interactions with the past context and future context [12,13,44], we introduce a forward and a backward aggregation node type, whose nodes are separately denoted as $f_i \in \mathbb{R}^{d_f}$ and $b_i \in \mathbb{R}^{d_b}$, where $d_f$ and $d_b$ are their representation dimensions. Overall, we define a set with four types of nodes: $\mathcal{N} = \{h, k, f, b\}$.

Relation Types.
For each edge $e = \langle v_1, v_2 \rangle$ from the source node $v_1$ to the target node $v_2$, its relation type is defined as $\phi(e) : \langle \tau(v_1), \tau(v_2) \rangle$. Specifically, we introduce six types of relations to perform knowledge filtering and knowledge interaction. Firstly, the forward/backward knowledge information relations $r_{kf} : \langle k, f \rangle$ and $r_{kb} : \langle k, b \rangle$ are proposed to separately provide the extracted knowledge information to the forward and backward knowledge aggregation nodes. Secondly, the knowledge filtering process naturally requires considering the corresponding utterance information. Therefore, we utilize the utterance information relations $r_{hf} : \langle h, f \rangle$ and $r_{hb} : \langle h, b \rangle$ to introduce utterance semantics to the knowledge aggregation nodes. In addition, commonsense knowledge has been proven useful not only for enriching the semantics of each utterance but also for modeling inter-utterance dependencies [35,44,45]. Therefore, we further incorporate the filtered commonsense knowledge into the utterance nodes by designing a forward infusion relation $r_{fh} : \langle f, h \rangle$ and a backward infusion relation $r_{bh} : \langle b, h \rangle$. Overall, the relation set contains six types of relations: $\mathcal{R} = \{r_{kf}, r_{kb}, r_{hf}, r_{hb}, r_{fh}, r_{bh}\}$.

BHG Construction.
For each utterance node $h_i$, we create a forward aggregation node $f_i$ and a backward aggregation node $b_i$. Then we build a BHG for each conversation by considering the following criteria for each relation type:
• $\forall j \neq i$ and $\forall m$, $\langle k_{j,m}, f_i \rangle \notin \mathcal{E}$ and $\langle k_{j,m}, b_i \rangle \notin \mathcal{E}$. Each knowledge aggregation node only receives and filters the extracted knowledge from its corresponding utterance.
• $\forall j \neq i$, $\langle h_j, f_i \rangle \notin \mathcal{E}$ and $\langle h_j, b_i \rangle \notin \mathcal{E}$. Each knowledge aggregation node only considers the information from its corresponding utterance when performing knowledge filtering.
• $\forall i \leq j \leq i + w_f$, $\langle f_i, h_j \rangle \in \mathcal{E}$, where $w_f$ is the pre-defined forward knowledge infusion window size. Each forward aggregated knowledge is only used to enrich the semantics of its own utterance and model the inter-utterance relations with future utterances.
• $\forall i - w_b \leq j \leq i$, $\langle b_i, h_j \rangle \in \mathcal{E}$, where $w_b$ is the pre-defined backward knowledge infusion window size. Each backward aggregated knowledge is only used to enrich the semantics of its own utterance and model the inter-utterance relations with previous utterances.
Based on these criteria, the BHG building process is described in Algorithm 1, and an example is illustrated in Figure 2(b).
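A minimal sketch of the edge-construction criteria above; the `build_bhg` helper and the tuple-based node ids are hypothetical (real nodes carry the representations described in Sec. 2.2.1):

```python
def build_bhg(num_utts, knowledge, w_f=5, w_b=5):
    """Build the BHG edge list for one conversation.
    knowledge[i] is the list of knowledge-node ids extracted for utterance i;
    nodes are identified as ("h", i), ("k", m), ("f", i), ("b", i)."""
    edges = []  # (source, target, relation)
    for i in range(num_utts):
        f_i, b_i = ("f", i), ("b", i)
        # Knowledge information relations: only utterance i's own knowledge.
        for m in knowledge[i]:
            edges.append((("k", m), f_i, "r_kf"))
            edges.append((("k", m), b_i, "r_kb"))
        # Utterance information relations: only the corresponding utterance.
        edges.append((("h", i), f_i, "r_hf"))
        edges.append((("h", i), b_i, "r_hb"))
        # Forward infusion: own utterance and future ones within window w_f.
        for j in range(i, min(i + w_f, num_utts - 1) + 1):
            edges.append((f_i, ("h", j), "r_fh"))
        # Backward infusion: own utterance and past ones within window w_b.
        for j in range(max(i - w_b, 0), i + 1):
            edges.append((b_i, ("h", j), "r_bh"))
    return edges
```

With `w_f = 0` and `w_b = 10`, this reduces to the "past context only" configuration used in the experiments.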

Multi-dimensional HGT
In heterogeneous graphs, different types of nodes/relations do not share feature spaces, which requires node- and relation-dependent architectures for GNN-based graph modeling. The Heterogeneous Graph Transformer (HGT) [8] separately initializes and optimizes a set of Transformer [31] parameters for each homogeneous sub-graph. In addition, HGT unifies the dimensions of all node types before aggregation via linear transformation due to the complicated structures of most large-scale heterogeneous graphs. In our case, the utterance and knowledge representations often have unequal dimensions, $d_h \neq d_k$, as conversation models and knowledge sources are substituted. Considering the simple structure of BHG, we believe unifying dimensions can lead to unnecessary loss of information for high-dimensional node types. Therefore, we propose a Multi-dimensional HGT (MHGT) to model the BHG, which allows the dimensions of heterogeneous nodes to remain unchanged during the attention calculation and message-passing processes.
Based on the structure of vanilla HGT, MHGT improves the heterogeneous dot-product attention process to enable interactions between nodes with unequal feature dimensions:

$$\text{ATT-head}^i(v_s, e, v_t) = \frac{K^i(v_s)\, W^{ATT}_{\phi(e)}\, Q^i(v_t)^{\top}}{\sqrt{d_{\tau(v_t)}}}, \quad \text{Attention}(v_s, e, v_t) = \underset{\forall v_s \in N(v_t)}{\text{Softmax}}\Big( \big\Vert_{i \in [1, H]}\, \text{ATT-head}^i(v_s, e, v_t) \Big)$$

where $K^i$ denotes the $i$-th head of an equal-dimensional linear projection from $v_s$ as the key and $Q^i$ that from $v_t$ as the query, $\Vert$ denotes concatenation, and Softmax denotes the softmax operation. Different from the vanilla HGT, MHGT proposes a multi-dimensional edge-dependent matrix $W^{ATT}_{\phi(e)} \in \mathbb{R}^{d_{\tau(v_s)} \times d_{\tau(v_t)}}$ to project nodes of $\tau(v_s)$ to the representation space of $\tau(v_t)$, which enables attention calculations between heterogeneous node features and allows situations where $d_{\tau(v_s)} \neq d_{\tau(v_t)}$. We perform the message-passing process for $v_s$ with the target node $v_t$ as follows:

$$\text{MSG-head}^i(v_s, e, v_t) = V^i(v_s)\, W^{MSG}_{\phi(e)}, \quad \text{Message}(v_s, e, v_t) = \big\Vert_{i \in [1, H]}\, \text{MSG-head}^i(v_s, e, v_t)$$

where $V^i$ is another equal-dimensional linear projection from $v_s$ as the value. Similarly, MHGT uses a multi-dimensional edge-dependent matrix $W^{MSG}_{\phi(e)} \in \mathbb{R}^{d_{\tau(v_s)} \times d_{\tau(v_t)}}$ to align the neighbor node $v_s$ to the feature space of the target node $v_t$. Same as in vanilla dot-product attention, the message is aggregated using the calculated attention weights:

$$\tilde{v}_t = \bigoplus_{\forall v_s \in N(v_t)} \text{Attention}(v_s, e, v_t) \cdot \text{Message}(v_s, e, v_t)$$

where $\oplus$ denotes the element-wise sum operation. Finally, $\tilde{v}_t$ is used to update $v^{l}_t$ in the following manner:

$$v^{l+1}_t = A_{\tau(v_t)}\big(\sigma(\tilde{v}_t)\big) + v^{l}_t$$

where $\sigma$ denotes the GELU activation function and $A_{\tau(v_t)}$ denotes an equal-dimensional linear projection for node type $\tau(v_t)$. We stack $L$ MHGT layers to allow interactions between non-adjacent nodes. For utterance $u_i$, we obtain the $L$-th layer output from MHGT, $h^{L}_i$, as its final knowledge-enhanced representation.
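The core of MHGT, attention and message passing across unequal dimensions through an edge-dependent projection, can be sketched for a single head as follows. This is a simplified illustration with hypothetical names, omitting multi-head concatenation, the residual update, and the learned key/query/value projections:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mhgt_attention_step(src_feats, tgt_feat, W_att, W_msg):
    """One-head sketch of MHGT attention and message passing.
    src_feats: [n, d_s] neighbor (source) features; tgt_feat: [d_t] target
    feature. The edge-dependent matrices W_att, W_msg (both [d_s, d_t])
    project source features into the target's feature space, so d_s may
    differ from d_t."""
    d_t = tgt_feat.shape[0]
    # Dot-product attention in the target's space, scaled by sqrt(d_t).
    scores = (src_feats @ W_att @ tgt_feat) / np.sqrt(d_t)   # [n]
    attn = softmax(scores)                                   # [n]
    messages = src_feats @ W_msg                             # [n, d_t]
    # Attention-weighted sum of the aligned messages.
    return attn @ messages                                   # [d_t]

rng = np.random.default_rng(0)
out = mhgt_attention_step(rng.standard_normal((3, 8)),  # d_s = 8
                          rng.standard_normal(4),       # d_t = 4
                          rng.standard_normal((8, 4)),
                          rng.standard_normal((8, 4)))
```

Note that the aggregated output lives in the target node's space (dimension 4 here) even though the neighbors have dimension 8, which is the property MHGT relies on to avoid unifying dimensions.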

Prediction and Training
We examine the BHG-based knowledge infusion method on two complex emotional reasoning tasks: Emotion Recognition in Conversations (ERC) and Causal Emotion Entailment (CEE).

Emotion Recognition in Conversations.
ERC aims to identify each utterance $u_i$'s emotion within a dialogue $\mathcal{D}$ from a pre-defined emotion category set $\mathcal{Y}$, which is modeled as a text classification task on each utterance [23]. Specifically, we utilize a feed-forward neural network to project the knowledge-enhanced utterance representations to the classification space:

$$\hat{y}_i = \text{Softmax}\big(W_E\, h^{L}_i + b_E\big)$$

where $W_E \in \mathbb{R}^{d_h \times |\mathcal{Y}|}$ and $b_E \in \mathbb{R}^{|\mathcal{Y}|}$ are learnable parameters. We optimize the standard cross-entropy loss to train the ERC model:

$$\mathcal{L}_{ERC} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{|\mathcal{Y}|} y_{i,j} \log \hat{y}_{i,j}$$

where $\hat{y}_{i,j}$ and $y_{i,j}$ denote the $j$-th element of $\hat{y}_i$ and of the one-hot emotion label $y_i$ of utterance $u_i$, and $B$ denotes the batch size.
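A minimal numerical sketch of the classification head and loss above; the names are illustrative, not the authors' implementation:

```python
import numpy as np

def erc_probs(h, W, b):
    """Project a knowledge-enhanced utterance representation h (d_h,)
    to the emotion classification space (|Y|,) and apply softmax."""
    z = h @ W + b
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(probs, label):
    """Cross-entropy for a single example with integer gold label."""
    return -np.log(probs[label])

# Toy example: d_h = 2, |Y| = 3; the bias favors the third emotion class.
W = np.zeros((2, 3))
b = np.array([0.0, 0.0, 1.0])
probs = erc_probs(np.array([1.0, 2.0]), W, b)
```

Averaging `cross_entropy` over a batch gives the $\mathcal{L}_{ERC}$ objective.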

Causal Emotion Entailment.
CEE aims to identify the causes behind the non-neutral emotion of target utterances and locate their positions in the conversational history. Given a dialogue $\mathcal{D}$ and the emotion label $y_i$ of each utterance, CEE is modeled as a binary classification task to predict whether the candidate utterance $u_i$ contains the emotion cause for the target utterance $u_t$, where $1 \leq i \leq t$. During inference, we concatenate the knowledge-enhanced utterance representations of $u_t$ and $u_i$ to calculate the logits:

$$p_i = \sigma\big(W_C\, [h^{L}_t \,\Vert\, h^{L}_i] + b_C\big)$$

where $W_C \in \mathbb{R}^{2 d_h \times 1}$ and $b_C \in \mathbb{R}^{1}$ are learnable parameters, respectively. A BCE loss is used to incorporate the supervision signals:

$$\mathcal{L}_{CEE} = -\frac{1}{B} \sum_{i=1}^{B} \big( y_i \log p_i + (1 - y_i) \log (1 - p_i) \big)$$

where $y_i \in \{0, 1\}$ is the binary label. We further introduce the emotion labels to enhance the learning of the CEE model using the same prediction and training paradigm as in Eqn. 9 and Eqn. 10, and jointly optimize the two tasks in a multi-task learning manner:

$$\mathcal{L} = \mathcal{L}_{CEE} + \lambda\, \mathcal{L}_{ERC}$$

where $\lambda$ denotes a hyper-parameter controlling the weight of the ERC loss.
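The pairwise scoring and joint objective can be sketched as follows; the names are illustrative and the paper's implementation details may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cee_prob(h_target, h_cand, W, b):
    """Concatenate the target and candidate utterance representations
    (each of size d_h) and score the pair; W has shape (2*d_h,)."""
    pair = np.concatenate([h_target, h_cand])
    return sigmoid(pair @ W + b)

def bce(p, y):
    """Binary cross-entropy for one candidate-target pair."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def joint_loss(l_cee, l_erc, lam=0.8):
    """Multi-task objective: CEE loss plus lambda-weighted ERC loss."""
    return l_cee + lam * l_erc

# Toy example with d_h = 2 and zero weights: the pair score is sigmoid(0) = 0.5.
p = cee_prob(np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.zeros(4), 0.0)
```

The `lam=0.8` default mirrors the $\lambda$ setting reported in the implementation details (Sec. 3.3).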

EXPERIMENTS

Datasets
We test our method on four Emotion Recognition in Conversations (ERC) datasets and one Causal Emotion Entailment (CEE) dataset. For all datasets, we only utilize the text modality in our experiments. IEMOCAP [3]: A two-party multi-modal ERC dataset derived from scenarios performed by pairs of actors. The pre-defined emotion category set consists of: neutral, sad, anger, happy, frustrated, excited.
MELD [20]: A multi-party multi-modal ERC dataset collected from the scripts of the American TV show Friends. The pre-defined emotions are neutral, sad, anger, disgust, fear, happy, surprise.
DailyDialog [14]: An ERC dataset compiled from human-written daily conversations with only two parties involved and no speaker information. The pre-defined emotion labels are neutral, happy, surprise, sad, anger, disgust, fear.
EmoryNLP [42]: Another ERC dataset collected from the TV show Friends. It is annotated with the following emotion categories: neutral, sad, mad, scared, powerful, peaceful, joyful.
RECCON [22]: A CEE dataset collected from the scripts of DailyDialog with both utterance-level emotion labels and binary emotion cause labels, with the same emotion category set as DailyDialog.

Baseline Models
We compare our method with strong baseline models and categorize them into four groups according to their characteristics: PLM-based methods. For the ERC task, we select the following two methods: RoBERTa-Large [15] used the PLM RoBERTa-Large to directly model the conversation, and the utterance representations were used for fine-tuning. DialogXL [27] improved XLNet [41] with enhanced memory and a dialog-aware self-attention mechanism to capture long historical context and dependencies between multiple parties. For the CEE task, we select: RoBERTa-Base/Large [22] concatenated the conversation as input to the PLM RoBERTa; CEE was then modeled as a binary classification problem for each utterance pair.
Graph-based Methods. For ERC, RGAT [10] improved the relation modeling of conversations and added relational position encodings as sequential information. DAG-ERC [28] built a directed acyclic graph on the dialogue and used a graph neural network to aggregate the information. For CEE, we select two methods: ECPE-2D [5] represented the emotion-cause pairs as 2D representations and utilized the Transformer to model them. RankCP [33] ranked the clause pairs and performed end-to-end extraction with inter-clause modeling.
Knowledge-based Methods. For ERC, five methods are selected: KI-Net [35] infused both commonsense and sentiment lexicon knowledge and proposed a self-matching module to enhance the knowledge interaction. COSMIC [6] used an RNN to model the dialogue history and extracted utterance-level commonsense knowledge to model the speakers' mental states. TODKAT [47] modeled topic information via PLMs and explicitly infused event-centered knowledge. SKAIG [12] extracted psychological commonsense knowledge and infused it to enhance the edge representations of the conversation graph. CauAIN [45] used emotion cause knowledge to guide the traceback process of context modeling. For CEE, KAG [36] utilized entity-related commonsense knowledge to model the semantic dependencies between the candidates and emotions. AKM [30] combined knowledge via an adapted knowledge model in a multi-task learning manner. KEC [13] utilized a directed acyclic graph incorporating social commonsense knowledge to improve causal reasoning ability. KBCIN [44] proposed a knowledge-bridged causal interaction network to capture the context dependencies of conversations and perform emotion cause reasoning.
Zero-shot Method with ChatGPT. We also include the zero-shot evaluation results of the latest large language model ChatGPT on all datasets, provided by Yang et al. [37].

Implementation and Evaluation Settings
We conduct all experiments using a single Nvidia Tesla A100 GPU with 80GB of memory. We initialize the pre-trained weights of RoBERTa and use the tokenization tools provided by Huggingface [34]. We leverage the AdamW optimizer [16] to train the model. The batch size of experiments on all datasets is set to 16 except in DailyDialog, where it is 24. We use linear warm-up learning rate scheduling with a warm-up ratio of 20% and a peak learning rate of 1e-5. We set a dropout rate of 0.1 and an L2-regularisation rate of 0.01 to avoid over-fitting. We set $d_h$ = 768, the number of MHGT layers $L$ = 3, and the ERC loss weight $\lambda$ = 0.8. For each knowledge source with dimension $d_k$, we set $d_f$ = $d_b$ = $d_k$ and randomly initialize the forward/backward knowledge aggregation node representations. During knowledge infusion, we set $w_f$ = $w_b$ = 5 when both past and future contexts are provided, and $w_f$ = 0, $w_b$ = 10 when only the past context is available.
For ERC, we select the Weighted F1 score as the evaluation metric for IEMOCAP, MELD, and EmoryNLP. Since "neutral" occupies most of DailyDialog, we utilize the Micro F1 score excluding the "neutral" utterances to reflect the performance on non-neutral emotions, as in previous works [12,28,47]. We also calculate Macro F1 scores on all classes for DailyDialog to evaluate the overall performance. For CEE, we report the F1 scores on positive and negative utterances and Macro F1 scores as the overall evaluation. All reported results are averages of five random runs.
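The DailyDialog metric, Micro F1 excluding "neutral", can be sketched as below. This follows the common convention of dropping neutral from both the prediction and gold counts, which is an assumption about the exact evaluation script:

```python
def micro_f1_excluding(preds, golds, excluded="neutral"):
    """Micro-F1 over all classes except `excluded`.
    tp counts correct non-neutral predictions; precision is over all
    non-neutral predictions, recall over all non-neutral gold labels."""
    tp = sum(p == g for p, g in zip(preds, golds)
             if g != excluded and p != excluded)
    pred_pos = sum(p != excluded for p in preds)
    gold_pos = sum(g != excluded for g in golds)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Example: 2 correct non-neutral predictions out of 3 predicted and 3 gold.
f1 = micro_f1_excluding(["happy", "neutral", "sad", "happy"],
                        ["happy", "sad", "sad", "neutral"])
```

Here precision and recall are both 2/3, so the score is 2/3.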

Overall Performance
The performance of our BHG-based methods and all baseline models on the five datasets is presented in Table 2. Firstly, the zero-shot prompting results show that ChatGPT still bears a huge gap with advanced conversation models and knowledge-based methods in performing emotional reasoning, possibly because these tasks are very subjective even to humans, showing the necessity of exploring few-shot prompting and knowledge infusion to further calibrate ChatGPT's understanding of these subjective emotion concepts [37]. These results also motivate continual research on supervised task-specific methods. Secondly, the knowledge-based methods significantly improve model performance on most datasets compared to the PLM-based and graph-based conversation models. These results empirically prove the effectiveness of commonsense knowledge infusion for emotional reasoning tasks. Thirdly, the BHG-based methods outperform all baseline models on all ERC and CEE datasets, including new state-of-the-art performance of 71.2% on IEMOCAP, 62.37% (Micro-F1) on DailyDialog, and 79.73% (Macro-F1) on RECCON. These outstanding performances quantify the advantages of BHG-based knowledge infusion over previous methods.

Knowledge-based Methods Comparisons.
In the comparison of knowledge-based methods, the unified BHG architecture shows generalized knowledge infusion ability by outperforming all previous highly customized knowledge infusion methods on all three tested knowledge sources. For the phrase-level extractive knowledge source ConceptNet, BHG outperforms KI-Net by at least 2% on IEMOCAP, MELD, and DailyDialog. It also has an impressive 6.81% improvement on RECCON compared to KAG. These results show that the BHG structure can effectively interact with multi-grained heterogeneous nodes to infuse knowledge. For the utterance-level generative knowledge source COMET 2019, BHG possesses a similar advantage over other knowledge-based methods on most datasets, especially on IEMOCAP, DailyDialog, and RECCON. A possible reason is that some useful knowledge is mistakenly discarded by these previous methods as they all perform empirical knowledge filtering. On the other hand, we provide all knowledge aspects to construct the BHG and perform knowledge filtering automatically, which enables the aggregation nodes to retain all useful knowledge. The knowledge source COMET 2020 is proven most effective in enhancing emotional reasoning as it outperforms all other knowledge sources on all five datasets. Compared to COMET 2019, COMET 2020 trains a larger language model on a bigger social commonsense knowledge graph. Therefore, COMET 2020 is expected to generate more reliable knowledge for each utterance. These results show the importance of high-quality commonsense knowledge sources.
4.1.3 BHG Variants. We further compare the BHG-based methods in experimental settings such as "w/o future context" and "w/o emotions" to test model performance in scenarios where future contexts and emotion labels are unavailable. For ERC, the BHG's performance drops only to a limited extent with past context alone on most datasets. These results show that future dialogue can provide useful clues for the emotional reasoning of the current utterance in most cases. For CEE, the BHG performances on all knowledge sources drop less than 0.5% without the emotional supervision signals, which is much less than the 1.7% decrease of the previous state-of-the-art model KBCIN. We believe the appropriately infused commonsense knowledge from BHG can make up for the missing emotional information introduced by the labels. Finally, we compare the performance of MHGT and the vanilla HGT in modeling the BHG. In the experiments for HGT, we unify the heterogeneous node dimensions by linearly projecting the high-dimensional representations to low-dimensional spaces. According to the results, HGT significantly underperforms MHGT on all datasets. For example, we observe an over 3% drop in Weighted-F1 for both COMET 2019 and ConceptNet on IEMOCAP. MHGT retains the original semantic space for utterance representations, which preserves useful information for performing semantic-aware knowledge filtering and emotional reasoning, while HGT projects the representations to low-dimensional spaces and causes unnecessary loss of information.

Ablation Studies for Knowledge Infusion
To further investigate the BHG's efficiency in knowledge infusion and the contributions of its components, we perform ablation studies on the BHG and other knowledge-based baseline methods, and the results are presented in Table 3. To ensure fair comparisons, we select the BHG and four knowledge infusion methods that are all tested with the COMET 2019 knowledge source on the four ERC datasets.

Knowledge Infusion Efficiency.
Firstly, we compare the performance drops between the models when the whole knowledge infusion architecture is removed, where more significant drops reflect higher efficiency of the method in leveraging the same knowledge source. According to the results, "BHG w/o know." achieves top-2 performance drops on all datasets, and the drops exceed 2% on three out of four datasets, while other knowledge infusion methods normally make substantial contributions to only one or two datasets. These results show that our unified BHG method can utilize the same knowledge source to enhance emotional reasoning on differently distributed data more efficiently than previous customized knowledge infusion methods.

BHG Modules.
We further investigate the backward/forward knowledge infusion architectures in BHG by removing these modules and comparing the performance. According to the results, "BHG w/o backward aggr." and "BHG w/o forward aggr." both perform worse than BHG on all datasets. These results not only reinforce the finding that infusing commonsense knowledge can enhance the modeling of inter-utterance relations, but further prove that a unified knowledge infusion module can be decoupled from the utterance modeling process and used with any conversation model in a plug-in manner. They also prove the necessity of splitting the knowledge filtering and interaction processes for previous and future contexts in the BHG architecture. In addition, "BHG w/o forward aggr." suffers from higher performance drops on all datasets than "BHG w/o backward aggr.", which shows that social commonsense knowledge is more useful in modeling inter-utterance relations for future contexts. These observations correspond with the widely recognized prior that the current utterance has more significant influences on future utterances along the dialogue flow [28].

Knowledge Filtering Analysis
To provide an intuitive view of the knowledge filtering process, we record the dot-product attention weights between the knowledge nodes and knowledge aggregation nodes in the last MHGT layer for COMET 2019 on the IEMOCAP test set and visualize their distributions in box plots. The results are presented in Figure 3. Following the BHG structure, we split the attention weights from forward and backward knowledge filtering, and higher weight distributions reflect more contributions from the corresponding knowledge aspect. Previous works empirically selected several of these knowledge aspects for infusion [6, 12, 45, 47]. Most of these knowledge aspects are also proven effective in the automatic knowledge filtering of BHG, while some aspects such as xReact and oReact are less attended in both forward and backward knowledge filtering. These results indicate that some widely used knowledge aspects may not provide as much useful information as expected. On the other hand, some previously ignored knowledge aspects are assigned higher attention scores in automatic knowledge filtering. For example, xNeed, a less-used knowledge aspect in previous works, receives considerable attention in backward knowledge filtering. We expect the above analysis to guide the adjustment of future empirical filtering strategies.
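The recording-and-summarization step can be sketched as follows: collect the per-utterance attention weights for each knowledge aspect and reduce them to the quartile statistics a box plot displays. The function name and the synthetic weights are hypothetical, for illustration only:

```python
import numpy as np

def filtering_weight_stats(attn_weights):
    """Summarize per-aspect knowledge-filtering attention weights.

    attn_weights maps a knowledge aspect (e.g. "xWant") to a 1-D array of
    dot-product attention weights collected from the last MHGT layer over
    all test utterances.  Returns box-plot statistics (quartiles) per
    aspect, so aspects can be ranked by their median attention."""
    stats = {}
    for aspect, w in attn_weights.items():
        q1, med, q3 = np.percentile(np.asarray(w, dtype=float), [25, 50, 75])
        stats[aspect] = {"q1": q1, "median": med, "q3": q3}
    return stats

# Toy synthetic weights (illustration only, not real attention data):
rng = np.random.default_rng(0)
toy = {
    "xWant": rng.uniform(0.25, 0.40, 100),   # a highly attended aspect
    "oReact": rng.uniform(0.00, 0.10, 100),  # a rarely attended aspect
}
stats = filtering_weight_stats(toy)
ranked = sorted(stats, key=lambda a: stats[a]["median"], reverse=True)
```

Feeding the raw per-aspect arrays to a standard box-plot routine then yields plots like Figure 3.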

RELATED WORK
We discuss relevant background on conversational emotion reasoning, including conversation models and knowledge-based methods.

Conversation Models
Emotion reasoning in a conversation naturally requires modeling interactions between dialogue participants, known as intra- and inter-speaker dependencies [27]. Early works regarded dialogues as temporal flows and utilized RNNs to model the dialogue history and the emotional dynamics of each dialogue participant [6, 18]. In other works, Transformer variants were leveraged to model long-range dependencies in conversations [43, 46]. Recent works mostly rely on PLMs such as RoBERTa [15] and XLNet [41] to obtain utterance-level or conversation-level features [1, 27]. Some other works [7, 10, 28] modeled utterances as nodes and carefully designed graph structures on the dialogue to enable efficient message passing among utterances during graph aggregation. Self-supervised architectures such as the Variational Autoencoder [19, 39] were also used for modeling conversations or discourse information.

Commonsense Knowledge-based Methods
Due to the implicitness of emotional expression in many scenarios, commonsense knowledge has been widely utilized to enhance emotional reasoning. One line of work [35, 43, 46] extracted phrase-level concepts from the large-scale knowledge graph ConceptNet [29] and concatenated them with the token-level utterance representations, with Transformer-based models used to perform utterance-knowledge interactions. More recent works leveraged utterance-level knowledge from the generative knowledge source COMET [2, 9]. Some methods directly infused the knowledge into utterance representations via neural network-based knowledge infusion modules [6, 47]. Other works utilized the knowledge as edge representations in dialogue graphs to model inter-utterance relations between utterance nodes [12, 13, 44] or trace emotion causal clues [45], obtaining outstanding performance in both emotion detection [12] and emotion cause detection tasks [13, 44]. In addition, pre-trained knowledge adapters were also used to infuse knowledge into PLM-based conversation models [38].

CONCLUSION AND FUTURE WORK
This paper proposes a Bipartite Heterogeneous Graph (BHG) for enhancing emotional reasoning with commonsense knowledge. We model the utterance representations and knowledge representations as heterogeneous nodes and design the BHG for commonsense knowledge infusion. In addition, we propose a Multi-dimensional Heterogeneous Graph Transformer (MHGT) to perform graph reasoning while retaining unchanged feature spaces for heterogeneous node types. Experiments show that BHG-based methods outperform state-of-the-art knowledge infusion methods on five datasets across two conversational emotional reasoning tasks. The BHG also shows generalized knowledge infusion ability with higher efficiency. Further analysis proves that previous empirical knowledge filtering methods are not guaranteed to provide the most useful knowledge information.
In future work, we will test our BHG-based method on more conversation models, such as graph-based models, to further examine its generalizability. We will also explore simultaneous knowledge infusion from multiple knowledge sources in a unified BHG framework, enabling the model to reason over several knowledge types to enhance its performance.

Figure 1: Illustration of the three-step paradigm of knowledge infusion. An example based on the knowledge source COMET is provided for each step.

Figure 3: Box plots of the knowledge filtering attention weights for nine knowledge aspects on the IEMOCAP test set. We use COMET 2019 as the knowledge source and MHGT as the BHG encoder. Orange lines denote the medians.

Figure 2: An overview of the utterance/knowledge feature extraction and BHG construction processes. In (b), the graph construction process for the i-th utterance is presented. w_k^f = 2 and w_k^b = 3 are examples of the forward and backward knowledge infusion window sizes. "know." and "uttr." denote "knowledge" and "utterance".

To obtain context-aware utterance representations, we utilize a PLM-based conversation model. For a conversation D = [u_1, u_2, ..., u_i, ..., u_N] with N utterances, u_i is the target utterance prepended with its speaker. The model concatenates both past and future contexts as the input û_i = ||_{j ∈ [1, N]} u_j, where || is the concatenation operation. Considering situations where only the dialogue history is available, we also test our methods with only past contexts: û_i = ||_{j ∈ [1, i]} u_j.
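As a concrete illustration of this input construction, the sketch below prepends each utterance with its speaker and concatenates either the full context or only the dialogue history. The function name and the separator format are our own assumptions, not the paper's verbatim implementation:

```python
def build_plm_input(utterances, speakers, target_idx, past_only=False):
    """Concatenate speaker-prepended utterances into one PLM input:
    the full conversation by default (past + future contexts), or only
    the utterances up to the target (dialogue history) when
    past_only=True."""
    end = target_idx + 1 if past_only else len(utterances)
    return " ".join(f"{s}: {u}" for s, u in zip(speakers[:end], utterances[:end]))

# Usage on a toy three-turn dialogue, targeting the second utterance:
utts = ["How was your day?", "Pretty rough, honestly.", "I'm sorry to hear that."]
spks = ["A", "B", "A"]
full_ctx = build_plm_input(utts, spks, target_idx=1)                   # past + future
hist_ctx = build_plm_input(utts, spks, target_idx=1, past_only=True)   # history only
```

The resulting string is then tokenized and encoded by the PLM to obtain the context-aware representation of the target utterance.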

Table 1: Interpretations of the selected knowledge aspects/relations. "X" denotes the target utterance speaker.

COMET Aspect  Interpretation
xIntent       Why does X cause the event?
xAttr         How would X be described?
xNeed         What does X need to do before the event?
xWant         What would X likely want to do after the event?
xEffect       What effects does the event have on X?
xReact        How does X feel after the event?
oWant         What would others likely want to do after the event?
oEffect       What effects does the event have on others?
oReact        How do others feel after the event?

model on ATOMIC. During knowledge extraction, each utterance u_i is constructed into the query (s_i || u_i || [r]), where r denotes a knowledge-aspect term representing an if-then relation type for the speaker's actions/mental states. COMET 2019 provides nine knowledge aspects from ATOMIC; their interpretations are listed in Table 1. The constructed query is then input to the COMET model, and the final-layer hidden representation of the decoder, k_i^r ∈ R^{d_2019}, is used as the utterance-level knowledge representation corresponding to u_i, where d_2019 denotes the dimension of k_i^r.
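The per-aspect query construction can be sketched as below. The exact string format (whitespace separators, bracketing of the relation token) follows COMET-style conventions and is an assumption, not the paper's verbatim implementation:

```python
# The nine ATOMIC if-then knowledge aspects provided by COMET 2019.
ASPECTS = ["xIntent", "xAttr", "xNeed", "xWant", "xEffect",
           "xReact", "oWant", "oEffect", "oReact"]

def build_comet_queries(speaker, utterance):
    """Build one query per knowledge aspect r: (speaker || utterance || [r]).
    Each query is fed to COMET, and the decoder's final-layer hidden state
    is taken as the utterance-level knowledge representation for aspect r."""
    return {r: f"{speaker} {utterance} [{r}]" for r in ASPECTS}

queries = build_comet_queries("Alice", "I finally passed the exam.")
```

Running all nine queries for one utterance yields the nine knowledge nodes attached to that utterance in the BHG.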

Table 2: Test results of our BHG-based methods and baseline models on the five datasets. "Know. Source" lists the commonsense knowledge source used by each knowledge-based method. In "w/o future context", only the dialogue history is introduced as context for each target utterance. The "w/o emotions" setting removes the utterance emotion information for CEE. Best values are highlighted in bold.

Table 3: Ablation studies for the BHG and other knowledge-based baseline models. "know." denotes the whole knowledge infusion architecture. "backward/forward aggr." denotes the backward/forward knowledge aggregation designs in BHG. We highlight top-2 performance drops in bold.
4.3.1 Overall Analysis. For forward knowledge filtering, the most attended knowledge aspect, xWant, has a median over 0.3. xIntent and xEffect are also frequently utilized, with medians over 0.2. These results show that forward knowledge interaction benefits more from knowledge aspects reflecting the potential effect of the target utterance on the target speaker. On the other hand, backward knowledge filtering attends to knowledge aspects reflecting the effect on both the target speaker and other participants. For example, xEffect and oWant are highly attended with medians over 0.3, and xNeed and oEffect also have medians over 0.2. Comparing the box plots for forward and backward knowledge filtering, the BHG exhibits very different patterns for selecting knowledge aspects. For example, xWant knowledge is considered useful in forward knowledge filtering but receives much less attention in backward knowledge filtering. These observations further show the necessity of designing separate forward and backward knowledge aggregation nodes in BHG. In addition, some knowledge aspects such as xAttr and oReact have low attention distributions in both forward and backward knowledge filtering, possibly because they provide less relevant information for emotional reasoning.