Think Rationally about What You See: Continuous Rationale Extraction for Relation Extraction

Relation extraction (RE) aims to extract potential relations according to the context of two entities; deriving rational contexts from sentences therefore plays an important role. Previous works either focus on leveraging entity information (e.g., entity types, entity verbalization) to infer relations while ignoring context-focused content, or use counterfactual thinking to remove the model's bias toward potential relations of entities, while the relation reasoning process is still hindered by irrelevant content. How to preserve relevant content and remove noisy segments from sentences is therefore a crucial task. In addition, the retained content needs to be fluent enough to maintain semantic coherence and interpretability. In this work, we propose a novel rationale extraction framework named RE2, which leverages two factors, continuity and sparsity, to obtain relevant and coherent rationales from sentences. To address the problem that gold rationales are not labeled, RE2 applies an optimizable binary mask to each token in the sentence and adjusts the rationales to be selected according to the relation label. Experiments on four datasets show that RE2 surpasses the baselines.


INTRODUCTION
Relation extraction (RE) is a crucial part of many information retrieval (IR) systems, extracting relations between entities in sentences. Structured triplets such as (Ryan, Yaz, per/per/alumni) (Figure 1) from heterogeneous sources could benefit multiple downstream applications like question answering [19,20] and natural language understanding [14,21,22]. To obtain structured triples, we need to exploit relevant and noise-free sentences from the entity context, so that the correct relation can be extracted between the two given entities. Entity Thinking methods such as Hu et al. [15,16,17] inject reserved special tokens <e> and </e> before and after each entity and focus on the contextualized features of the entity through these special tokens. However, the semantic information of entities cannot be specified in special tokens. Therefore, Zhou and Chen [37] and Lu et al. [24] respectively introduce entity types and entity verbalization to better reveal the contextual semantic representation of entities and infer the relations between them. Although Entity Thinking methods can better capture the contextual semantics of entities, they cannot automatically remove noisy and irrelevant content. Such noisy content tends to undermine correct relational inference. Taking Figure 1 as an example, an Entity Thinking method has no way of judging the relevance of content such as "they knew each other from school" and "me and my friend" for predicting the relation between the entities. The model may therefore be misled by the word "friend" and mispredict the relation as per/per/peer. To remove the potential impact of such content on relation extraction, Counterfactual Thinking methods [25,33] remove the model's bias against different words and entities. However, these methods do not explicitly remove noisy contextual content, so the model's prediction can still be misled toward per/per/siblings.
To remove irrelevant and noisy content from sentences, we propose rational thinking methods that extract relevant and noise-free rationales for the RE task. Although rational thinking methods have been verified on various downstream information retrieval tasks such as question answering [34], we still face two crucial challenges in leveraging them for the RE task: (1) gold rationales relevant to the relation label are not available in the sentences, so we cannot train a rationale extractor in a supervised manner; (2) extracted rationales should be continuous, which not only improves the interpretability of the rationales but also expresses coherent semantics for predicting the relation labels between entities.
To address these two challenges, we present two new factors, continuity and sparsity, to manage the coherence and quantity of the chosen rationale tokens. Imposing sparsity helps strike a balance between eliminating irrelevant material and preserving relevant content. Promoting continuity is advantageous for obtaining continuous rationales, leading to a more coherent semantic representation. Furthermore, we employ an adjustable binary mask for rationale selection and adapt the selected rationale tokens to the relation extraction task using the relation labels. As a result, the unavailability of gold rationales can be addressed through end-to-end training. Our primary contributions are: (1) a novel end-to-end training framework, RE2, which treats rationale extraction as an adjustable binary mask for the relation extraction task and retains relevant, noise-free rationales via the continuity and sparsity factors; (2) experiments on four commonly used datasets demonstrating that RE2 significantly improves over the best-reported baselines in both full-data and low-resource settings.

PROPOSED MODEL

2.1 Continuous Rationale Extractor
In the continuous rationale extractor module of the model, we mask the tokens that are irrelevant to the relation extraction task, and keep the continuous tokens to improve extraction performance.

2.1.1 Sentence Representation and Importance Matrix. We can obtain a semantic embedding of each token through its contextualized sentence representation. In practice, we adopt BERT [8] to encode the token representations as H ∈ R^{d×n}, where d is the dimension of the embedding and n is the number of tokens in the sentence. To select the tokens most relevant to the entities for the relation extraction task, we first calculate the importance matrix A = H^⊤(h_{e1} + h_{e2}), where h_{e1} and h_{e2} are the token embeddings of the two entities extracted from H. We denote A = (a_1, ..., a_n)^⊤; the importance score a_i represents the importance of the i-th token for the RE task.
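The extracted formula is partly garbled in the source, so the following NumPy sketch encodes one plausible reading of the importance computation: H holds one d-dimensional column per token, and each token is scored by the inner product of its embedding with the summed entity embeddings. The dimensions and entity positions are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 8, 6                      # embedding dimension, number of tokens (illustrative)
H = rng.normal(size=(d, n))      # contextual token embeddings, one column per token
e1, e2 = 1, 4                    # hypothetical positions of the two entity tokens

# Importance score a_i of the i-th token: inner product of its embedding
# with the summed entity embeddings h_e1 + h_e2 (one reading of A = H^T(h_e1 + h_e2)).
h_e1, h_e2 = H[:, e1], H[:, e2]
A = H.T @ (h_e1 + h_e2)          # shape (n,): one score per token
```

In this reading, tokens whose contextual embeddings align with the entity embeddings receive high scores and are preferred during selection.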
2.1.2 Factor Graph. We represent the token selection with a binary vector m = (m_1, m_2, ..., m_n)^⊤, where m_i ∈ {0, 1} indicates whether the i-th token is selected. In this way, we transform the structured prediction problem of rationale sequence generation into an assignment of values to multiple variables. To maintain semantic coherence, the token selection should be continuous; to emphasize the importance and relevance of tokens to the entities, the number of selected tokens should be limited through sparsity. Therefore, we introduce a factor graph F and decompose these requirements into multiple local factors for optimal token selection. More specifically, we adopt the pairwise factor CONTINUITY and the L-ary factor SPARSITY. In the following sections, we formulate these two factors and provide their score functions.

CONTINUITY (CON):
To improve the continuity of the tokens selected for the RE task, we adopt the CONTINUITY (CON) factor, which examines whether each pair of consecutive tokens is jointly selected. We adopt the factor CON(m_i, m_{i+1}; e_{i,i+1}) to represent the constraint on the continuous selection of the i-th and (i+1)-th tokens. As shown in Figure 2, if both tokens are selected, we encourage this continuous selection by adding the edge score e_{i,i+1} ≥ 0 to the score function. Formally, the score function for the CON factor can be denoted as: score_CON(m_i, m_{i+1}; e_{i,i+1}) = m_i m_{i+1} e_{i,i+1}. (1) As illustrated in Section 2.1.1, we adopt the importance matrix A to measure how relevant each token is to the entities, which is a critical metric for token selection. We add the importance scores of the selected i-th and (i+1)-th tokens and finalize the score function as: score_CON(m_i, m_{i+1}; e_{i,i+1}) = m_i m_{i+1} e_{i,i+1} + m_i a_i + m_{i+1} a_{i+1}. (2) We can impose continuity constraints on the token selection for the original sentence by combining the pairwise factors. Formally, the factor graph can be formulated as: F_CON = {CON(m_i, m_{i+1}; e_{i,i+1}) : 1 ≤ i < n}. (3)
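As a concrete illustration, the finalized CON score can be computed as below. This is a sketch of one plausible form implied by the description (an edge reward when adjacent tokens are both kept, plus the importance score of every kept token); the numeric values are invented.

```python
import numpy as np

def score_con(m, a, e):
    """Continuity score of a binary selection m: edge reward e[i] whenever
    tokens i and i+1 are both selected, plus the importance score a[i] of
    every selected token (one plausible form of the CON factor)."""
    m = np.asarray(m, dtype=float)
    pair = float(np.sum(m[:-1] * m[1:] * e))   # m_i * m_{i+1} * e_{i,i+1}
    unary = float(np.sum(m * a))               # m_i * a_i
    return pair + unary

a = np.array([0.5, 1.0, 0.2, -0.3])   # illustrative importance scores
e = np.array([0.4, 0.4, 0.4])         # edge scores e_{i,i+1} >= 0

# A contiguous selection collects the edge reward; a scattered one does not.
print(score_con([1, 1, 0, 0], a, e))  # 0.5 + 1.0 + 0.4 = 1.9
print(score_con([1, 0, 1, 0], a, e))  # 0.5 + 0.2 = 0.7
```

The gap between the two calls shows how the pairwise factor nudges the selection toward contiguous spans even when the unary scores are similar.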

SPARSITY (SPA):
To control the sparsity of the token selection for the RE task, we adopt the L-ary factor SPARSITY (SPA), which imposes a limit K on the maximum number of selected tokens. In practice, K can also be given as a proportion of all tokens in the sentence. The SPA factor is by definition a hard constraint, formulated with the following score function: score_SPA(m_1, ..., m_n; K) = 0 if ||m||_1 ≤ K, and −∞ otherwise. (4) Overall, to consider continuity and sparsity together, we obtain the factor graph F by instantiating the n binary variables and combining the CON and SPA factors: F = {CON(m_i, m_{i+1}; e_{i,i+1}) : 1 ≤ i < n} ∪ {SPA(m_1, ..., m_n; K)}. (5) To utilize both the continuity and sparsity constraints and find an optimal token selection, the score function of the factor graph F sums the local sub-problems: score_F(m; A) = Σ_{i=1}^{n−1} score_CON(m_i, m_{i+1}; e_{i,i+1}) + score_SPA(m_1, ..., m_n; K). (6) The hard constraint of F is inherited from the SPA factor, which specifies that the total number of selected tokens should not exceed K. The soft constraint is inherited from the CON factors, encouraging the selection of consecutive tokens that are relevant to the RE task. To find a solution that satisfies these constraints, we treat solving for the variables as a Maximum A Posteriori (MAP) inference problem that maximizes the score function score_F(m; A). We can represent this problem as maximization of the score function under the constraint ||m||_1 ≤ K: m* = argmax_{||m||_1 ≤ K} score_F(m; A). (7) Maximizing this score function is essentially a complex structured problem involving sub-problems with interrelated global agreement constraints, making an exact maximization algorithm difficult to obtain [26]. To solve this problem, we turn to marginal inference with a Lagrange multiplier.
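For intuition, the constrained MAP problem can be solved exhaustively when n is tiny. The sketch below only illustrates the objective (summed CON scores under the hard ||m||_1 ≤ K constraint); it is not the paper's actual inference algorithm, and the score values are invented.

```python
from itertools import product
import numpy as np

def map_inference(a, e, K):
    """Exhaustive MAP over binary masks m with ||m||_1 <= K (the hard SPA
    constraint), maximizing edge rewards plus unary importance scores.
    Feasible only for tiny n; a sketch of the objective, not a real solver."""
    n = len(a)
    best, best_m = -np.inf, None
    for m in product([0, 1], repeat=n):
        if sum(m) > K:
            continue                    # SPA: hard sparsity constraint
        m = np.array(m, dtype=float)
        score = np.sum(m[:-1] * m[1:] * e) + np.sum(m * a)
        if score > best:
            best, best_m = score, m
    return best_m, best

a = np.array([0.5, 1.0, 0.2, -0.3, 0.8])   # illustrative importance scores
e = np.full(4, 0.4)                        # uniform edge rewards
m_star, s = map_inference(a, e, K=3)
print(m_star)  # → [1. 1. 0. 0. 1.]  (continuity pulls tokens 0-1 together)
```

Note how the edge reward makes the contiguous pair (tokens 0 and 1) preferable to an equally sparse but scattered selection.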

2.1.3 Marginal Inference with Lagrange Multiplier. We can approximately solve the maximization problem through the Gibbs distribution: we construct p(m; A) ∝ exp(score_F(m; A)), sample m ∼ p(m; A), and obtain an approximately optimal solution. However, obtaining unbiased samples is challenging. To address this issue, we use Perturb-and-MAP [27], an approximate sampling strategy.
Another problem is that the score function in Eq. 6 is piecewise, making the Gibbs distribution p(m; A) ∝ exp(score_F(m; A)) discontinuous. As marginal inference in discontinuous Markov Random Fields is hard to solve, we reformulate the hard SPA constraint in Eq. 6 with a Lagrange multiplier, which expresses the hard constraint as a continuous function. Specifically, we use a Lagrange multiplier λ > 0 and add λ(K − ||m||_1) to the objective in Eq. 7, finalizing it as: m* = argmax_m [Σ_{i=1}^{n−1} score_CON(m_i, m_{i+1}; e_{i,i+1}) + λ(K − ||m||_1)], (8) where the Gibbs distribution is accordingly reformulated as p(m; A) ∝ exp(score(m; A) + λ(K − ||m||_1)). The reformulated Gibbs distribution is continuous, enabling us to find the optimal m that maximizes the score function and to obtain the rationales relevant to the relation extraction task.
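A minimal sketch of the Perturb-and-MAP idea combined with the Lagrange relaxation: i.i.d. Gumbel noise perturbs the unary scores, and the inner MAP is then taken over the relaxed (unconstrained) objective. The exhaustive inner maximization and the hand-picked λ are assumptions for illustration; real implementations use an efficient solver.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)

def relaxed_score(m, a, e, lam, K):
    # The hard SPA constraint is replaced by the Lagrange term lam*(K - ||m||_1):
    # each selected token now pays a soft penalty lam instead of selections
    # beyond K being forbidden outright.
    m = np.asarray(m, dtype=float)
    return (np.sum(m[:-1] * m[1:] * e) + np.sum(m * a)
            + lam * (K - np.sum(m)))

def perturb_and_map(a, e, lam, K):
    """One Perturb-and-MAP sample: perturb the unary scores with Gumbel noise,
    then take the (here: exhaustive) MAP of the relaxed objective."""
    n = len(a)
    g = rng.gumbel(size=n)                 # i.i.d. Gumbel perturbation
    return max((np.array(m) for m in product([0, 1], repeat=n)),
               key=lambda m: relaxed_score(m, a + g, e, lam, K))

a = np.array([0.5, 1.0, 0.2, -0.3, 0.8])   # illustrative importance scores
m = perturb_and_map(a, e=np.full(4, 0.4), lam=0.6, K=3)
print(m, "tokens kept:", int(m.sum()))
```

Repeated calls yield approximate samples from the (relaxed) Gibbs distribution, which is what makes end-to-end training of the mask possible.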

Relation Classifier
Finally, the classifier makes relation predictions conditioned on the selected rationales and the entities E: ŷ = pred(m ⊙ H ∥ E), where ⊙ and ∥ denote the element-wise product and concatenation, respectively, to obtain the relation label distribution. The relation classification loss is calculated as L = −Σ_{i=1}^{N} y_i log ŷ_i, where N is the number of training sentences in an epoch and y_i is the ground-truth tag vector of sentence s_i. This loss jointly trains the Continuous Rationale Extractor and Relation Classifier modules in an end-to-end manner.
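The classification loss above is standard cross-entropy over one-hot relation labels; a minimal NumPy sketch (the pred(·) classifier itself and the masked concatenation are omitted, and the logits are invented):

```python
import numpy as np

def relation_loss(logits, y):
    """Cross-entropy L = -sum_i y_i . log(softmax(logits_i)) over a batch of
    N sentences, with one-hot ground-truth vectors y_i."""
    z = logits - logits.max(axis=1, keepdims=True)         # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.sum(y * log_p)

logits = np.array([[2.0, 0.5, -1.0],     # classifier scores for 3 relations
                   [0.1, 0.2, 3.0]])
y = np.array([[1, 0, 0],                 # one-hot gold relation labels
              [0, 0, 1]], dtype=float)
print(round(relation_loss(logits, y), 3))  # → 0.351
```

Because the mask m enters the prediction through m ⊙ H, gradients of this loss flow back into the rationale extractor, which is what enables the end-to-end training described above.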

EXPERIMENTS AND ANALYSES

3.1 Experimental Setup and Baselines
Setup: We evaluate the model on four widely used RE datasets: SemEval [13], which contains 6,507/1,493/2,717 samples in the train/dev/test sets and 19 relation types; TACRED [36] and TACRED-Revisit [1], which contain 68,124/22,631/15,509 samples and 42 relation types; and Re-TACRED [32], which contains 58,465/19,584/13,418 samples and 42 relation types. Following prior work [33], we adopt micro F1 as the evaluation metric. Under the low-resource setting, we randomly sample 10%, 25%, and 50% of the training set as small-scale training sets and evaluate our model on the test set. We use the default BERT-Base tokenizer with a max length of 128 to preprocess the data. We set K to 60% of the tokens in a sentence. For the classifier, we set the layer dimensions to 768-384-labels. We use BertAdam [18] with a 3e-5 learning rate and a warmup proportion of 0.06 to optimize the loss, and set the batch size to 16. Baselines: We first introduce a SOTA model as the base model for the RE task and then adopt various baselines. We adopt SURE [24] as the base model and compare RE2 with the following baselines. Entity thinking baselines: (1) MTB [31], (2) Entity Mask [36], (3) Typed Entity Marker [37]. Counterfactual thinking baselines, which adopt causal inference to remove bias in RE tasks: (4) CFIE [25], (5) CORSAIR [29], (6) CORE [33]. Rationale thinking baselines, which predict sparse binary masks over input tokens: (7) HardKuma [2], (8) IB objective [28], (9) UNIREX [3]. Note that since an entity thinking method is also used in SURE, all entity thinking baselines replace the corresponding method in SURE.

Results and Analysis
Overall Performance. Table 1 shows the mean and standard deviation over 5 runs of training and testing on the four datasets. Using entity information verbalization (SURE [24]) achieves an average 0.6% F1 improvement across all datasets compared with the other entity thinking methods, so we adopt SURE as the base model. The counterfactual and rationale thinking baselines each bring about a 0.3% F1 improvement across all datasets. Our proposed rational thinking method addresses the two major challenges: (1) end-to-end training of both the rationale extractor and the relation classifier, and (2) extraction of continuous rationales. As a result, RE2 achieves a significant 0.9% F1 improvement across all RE datasets, including the low-resource RE settings. Compared with the previous SOTA, UNIREX, RE2 gains an additional 0.4% in F1. An interesting finding is that in low-resource settings (e.g., only 10% of the training set), RE2 achieves larger gains than in the full-data setting (1.1% vs. 0.9%), which shows that RE2 is robust when training data are limited: the low-noise, relevant rationales obtained by RE2 help the base model's F1 performance more markedly. Ablation Study. Table 1 generally concludes that all modules positively impact performance. Specifically, removing continuity leads to discontinuous rationales, affecting the coherence of the semantic representations and causing a 0.4% F1 drop.
Removing sparsity selects noisier rationales, resulting in a 0.5% F1 reduction. Interestingly, removing the added entities has minimal effect on F1 (0.1%). We find that 89% of rationales contain both entities and 97% contain at least one entity, indicating that adding entities provides little additional information. Effect of the Two Factors: As shown in Figure 3, we plot F1 scores and token selection rates for varying K values on SemEval.
As K rises, more tokens within sentences are chosen as rationales. Nonetheless, the F1 score of RE2 does not increase consistently with higher K, due to the incorporation of unrelated rationales. Optimal performance occurs at K = 60%, meaning that 60% of the tokens are selected as rationales on average. Eliminating the sparsity factor entirely causes the model's F1 score to decline from 87.2 to 86.6. The continuity constraint also benefits the model, as RE2 with continuity constraints consistently produces better outcomes.
Coherence Analysis of Rationales. RE2 utilizes the continuity factor to generate rationales that are more semantically coherent and express more fluent semantics. We analyze the coherence of the rationales through perplexity based on GPT-3 [30]. As shown in Table 2, RE2 obtains the lowest average perplexity, approaching that of the original sentences. Human Evaluation. We conduct human evaluations of the rationales with a 15-member annotation team, with 5 members involved in data validation. Annotators predict relation labels using the original sentences and the extracted rationales, then rate the information sufficiency (on a 1-5 scale) of both; higher scores signify greater sufficiency. To ensure consistency, we measure inter-annotator agreement and perform manual validation. Table 3 shows that annotators can provide more accurate relation labels even though the rationales have lower information sufficiency than the original sentences, suggesting that removing irrelevant details from sentences reduces noise and enhances relational prediction accuracy.

CONCLUSION AND FUTURE WORK
In this paper, we propose a novel rationale extraction framework, RE2, which adopts two factors, continuity and sparsity, to control the relevance of rationales to the RE task and improve their coherence. We introduce marginal inference with a Lagrange multiplier to solve the problem of maximizing the score function with the two factors. We can thus jointly train the rationale extraction and relation classification tasks in an end-to-end manner even though gold annotations for rationales are unavailable. Experiments on four datasets show the effectiveness of RE2. In the future, we plan to extend this research on relation extraction to the construction of knowledge graphs [6,7,9,35], the matching of knowledge graphs [10-12,23], and the acceleration of information retrieval [4,5].

Figure 1:
Figure 1: Different models "see" different content in sentences by thinking differently. Rational thinking predicts the correct relation label per/per/alumni between the two entities Ryan and Yaz by seeing the relevant and correct content.

Figure 2:
Figure 2: Architecture: the rationale extractor obtains the rationales from the input using a binary mask consisting of continuity and sparsity factors.

Table 1:
Average micro F1 results on four RE datasets. "re." means that we replace the entity information verbalization in SURE with the corresponding entity thinking baseline. We report the (standard deviation) of the results. Effect of the Two Factors (Continuity and Sparsity): K is the hyper-parameter that controls the sparsity of the token selection; continuity is imposed to improve contiguity. Ablation Study: We perform an ablation study on the test set to demonstrate the effectiveness of the model's modules. RE2 without Continuity and RE2 without Sparsity eliminate the continuity and sparsity factors, respectively, from the factor graph in rationale extraction. RE2 without Adding Entities removes the entities added in the relation classifier module, using only the rationales for relation classification.

Table 2:
Perplexity of the extracted rationales. "Original" means the original sentences. Lower perplexity is better.

Table 3:
Human evaluation (Micro F1 / Information Sufficiency) of the original sentences and extracted rationales.