Inducing Causal Structure for Abstractive Text Summarization

The mainstream of data-driven abstractive summarization models tends to explore correlations rather than causal relationships. Among such correlations there can be spurious ones, induced by language priors learned from the training corpus, which undermine the overall effectiveness of the learned model. To tackle this issue, we introduce a Structural Causal Model (SCM) to induce the underlying causal structure of the summarization data. We assume several latent causal factors and non-causal factors, representing the content and style of the document and summary. Theoretically, we prove that the latent factors in our SCM can be identified by fitting the observed training data under certain conditions. On this basis, we propose a Causality Inspired Sequence-to-Sequence model (CI-Seq2Seq) to learn causal representations that mimic the causal factors, guiding us to pursue causal information for summary generation. The key idea is to reformulate the Variational Auto-encoder (VAE) to fit the joint distribution of the document and summary variables from the training corpus. Experimental results on two widely used text summarization datasets demonstrate the advantages of our approach.


INTRODUCTION
Text summarization is an important task in natural language processing (NLP), which aims to produce a fluent and condensed summary for a document while preserving the key information [36,45]. Abstractive summarization is a mainstream approach to generate compact summaries from scratch [3,72]. Advances in deep learning have fueled research in applying neural sequence-to-sequence (Seq2Seq) networks to automatically extract effective features and generate summaries in an end-to-end manner [40,47,57].
Despite the promising performance, current data-driven summarization models possess an inherent issue. These efforts often exploit all types of correlations to fit the data well, overlooking the underlying data generating process (DGP) that reveals how the observed data is generated [1]. Such correlations are probably spurious due to biased statistical dependencies caused by confounders inherited from the training corpus. For instance, if the term "lion" frequently co-occurs with "Africa" in the training data, a model might erroneously generate a summary containing "Africa" even for a document describing the core information "lion pregnancy" with the side information "Africa". Such stereotyping, arising from spurious correlations, impacts the effectiveness of text summarization techniques and hinders practical applications.
Recently, the structural causal model (SCM) has attracted great interest from the research community for identifying the underlying DGP of observed data [2,39,49,50,58]. Learned causal models aid stable prediction and generalization by capturing causal relationships. In this work, we aim to devise an SCM describing the DGP in text summarization, with the goal of inducing the causal structure of the data, especially the causal relationships between documents and summaries. We would like the latent space to be separated into a content space and a style space. For the content space, we assume two kinds of latent factors, i.e., Core-Content (CC) factors and Side-Content (SC) factors, referring to the core content (main points) and side content (non-essential information) in the document, respectively. For the style space, we also assume two kinds of latent factors, i.e., Document-Style (DS) factors and Summary-Style (SS) factors, referring to the lengthy writing style of the document and the concise writing style of the summary, respectively. Among such latent factors, there can be confounders representing the statistical dependencies inherited from the training corpus.
Specifically, as shown in Figure 1, we assume that the CC and SS factors are summary-causal factors whose relationship with the summary remains invariant across the corpus, while the other factors are non-causal for the summary and only causally influence the document. Each document is generated from the summary-causal factor CC and the non-causal factors SC and DS. Besides, we incorporate core topics and side topics in the documents to guide the learning of the CC and SC factors. Theoretically, we prove that certain conditions ensure the identifiability of the causal factors, enabling the generation of summaries containing only causal information.
Based on the SCM, we propose a Causality Inspired Sequence-to-Sequence model (CI-Seq2Seq) for abstractive text summarization, which enforces the learned representations to mimic the latent factors. The key idea is to learn the causal generative mechanisms for the document and summary by adapting the Variational Autoencoder (VAE) [26] to supervised training. Specifically, we first partition each dataset into subsets through Latent Dirichlet Allocation (LDA) [4] and define the confounder information as the topical features of the subsets. Then, we utilize LDA and the Compression Rate (CR) [18] as the guidance to learn the content and style factors, respectively. During testing, we first infer the CC and DS factors based on the learned document-causal generative mechanisms, and then use the summary-causal generative mechanisms for controlled summary generation based on the given CR between the DS and SS factors.
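The LDA-based partitioning step above can be illustrated with a minimal sketch using scikit-learn; the function and parameter names are our own, not the paper's code:

```python
# Hypothetical sketch: partition a corpus into topical subsets with LDA,
# assigning each document the id of its most probable topic (the argmax
# of its LDA topic mixture). This topic id plays the role of the
# confounder value in the paper's setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def assign_topic_ids(documents, n_topics=16, seed=0):
    """Return one topic id (subset id) per document."""
    counts = CountVectorizer(stop_words="english").fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    doc_topic = lda.fit_transform(counts)   # shape: (n_docs, n_topics)
    return doc_topic.argmax(axis=1)         # subset id per document

docs = ["the lion hunts on the african savanna",
        "parliament passed the new budget bill",
        "the striker scored twice in the final"]
ids = assign_topic_ids(docs, n_topics=2)
print(len(ids))  # one id per document
```

In practice the topic id of each training document would then be used to look up the confounder embedding described in Section 4.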
To the best of our knowledge, this is the first work to combine causality and text summarization with a rigorous theoretical guarantee. Experimental results on two widely-used datasets, i.e., CNN/Daily Mail [19] and XSUM [43], demonstrate that CI-Seq2Seq achieves significant improvements over prevailing baselines in terms of prediction performance, generalizability and interpretability. The code is available at https://github.com/ict-bigdatalab/CI-Seq2Seq.
RELATED WORK

Causality in NLP. In NLP, causality-aware methods have mainly been studied in text classification [53,62], table-to-text generation [8] and language model pre-training [6], for debiasing [53], controlling [21] or style transfer [42]. For example, some works [6,8,21] applied causal intervention to eliminate the spurious correlations introduced by backdoor paths. Other works [9,21,42,53,68] utilized counterfactual reasoning to measure the causal effect by excluding the direct effect from its total effect, or to control textual attributes by assigning counterfactual values. Yet there have been few works that apply a causal perspective to text summarization [68], particularly in terms of causal representation learning.

Disentangled Representation Learning. Disentangled representation learning aims to map different aspects of data into distinct low-dimensional latent spaces. It has attracted considerable attention in machine learning [20] and NLP [11,42,73,81]. Beyond disentangling latent factors, we focus on characterizing the causal and non-causal factors for text summarization.

A CAUSAL VIEW ON SUMMARIZATION
Following the definition that two variables have a causal relationship, denoted as "cause → effect", if intervening on the cause may alter the effect, but not vice versa [50,52], we first define the causal relationships in text summarization, then formulate them using a structural causal model (SCM) [51], followed by an identifiability analysis to ensure that the latent factors in our SCM can be correctly separated and learned under certain conditions.

Causal Relationships
We introduce, step by step, how we characterize the causal relationships in text summarization.
(1) Assuming latent factors with causal relationships. It is likely that there exist correlations between a document x and its summary y. According to Reichenbach's common cause principle [52], correlations imply that there exist common causes that causally influence x and y. We assume latent factors Z as common causes, carrying mixed information of x and y. That is, Z → x and Z → y.
(2) Clarifying the causes for document and summary. To separate the mixed information in Z, we decompose Z in terms of content and style, i.e., a Core-Content (CC) factor z_cc and a Side-Content (SC) factor z_sc in the content space, as well as a Document-Style (DS) factor z_ds and a Summary-Style (SS) factor z_ss in the style space. For content, the summary y should preserve the core content while omitting the side content of the document x, i.e., z_cc → x, z_cc → y and z_sc → x. For style, considering the different styles of x and y, we assume z_ds is the style factor for x and z_ss for y, i.e., z_ds → x and z_ss → y.
(3) Capturing correlations among latent factors. Latent factors may mix through spurious correlations arising from biased statistical dependencies of the training corpus [41,53]. We use c to denote the confounder resulting in the spurious correlations, and orient four edges from c to the latent factors, i.e., c → z_cc, c → z_sc, c → z_ds, c → z_ss.

(4) Adding guidance to separate latent factors. Practically, we use weakly-supervised signals to guide latent factor learning. For the content factors, we introduce core topics t_c and side topics t_s for z_cc and z_sc respectively. That is, z_cc → t_c and z_sc → t_s. For the style factors, we define a functional relation between z_ds and z_ss to bridge them. Thus, we link z_ds − z_ss by an undirected edge.

Structural Causal Model
Based on the above analysis, we devise the SCM for text summarization (Figure 1). It describes the data generating process (DGP): latent factors generate the observations (document and summary) given the confounder. The nodes denote variables, and the edges denote relationships (directed: causal; undirected: non-causal).
We refer to p(x | z_cc, z_sc, z_ds) and p(y | z_cc, z_ss) as the causal generative mechanisms for the document and summary, respectively. They are assumed to be invariant to the prior p(z_cc, z_sc, z_ds, z_ss) according to the Independent Causal Mechanisms (ICM) principle [52,59], denoted by solid arrows in Figure 1, while the latent distributions given the confounder may vary across domains, denoted by dashed arrows. Besides, the topic distributions p(t_c | z_cc) and p(t_s | z_sc) denote the content guidance, and the functional relation between z_ds and z_ss denotes the style guidance. We formally present a comprehensive functional form for the DGP as outlined below.
We define Θ ≜ {f, λ} as the parameters to generate the observed variables, where f is the invertible function mapping latent factors to observed variables, and λ denotes the parameters generating the latent factors given the confounder c. The parent set of a variable is denoted as Pa(•).
where λ contains the sufficient statistics T and the natural parameters η, Q is the base measure, and Z is the normalization term. For p_f(x | z_cc, z_sc, z_ds), we constrain it by the Additive Noise Model (ANM) assumption [23], under which the DGP for x can be expressed as: x = f_x(z_cc, z_sc, z_ds) + ε, ε ∼ p_ε(ε).
(3) We rewrite Equation 1 using Equation 3, obtaining Equation 4. Similarly, we can obtain the results for y, t_c and t_s (Equations 5-7). In summary, using the DGP in our SCM, we can express the joint probability density functions as Equations 4-7.
Notice that the latent variables cannot be directly observed. Instead, we can only learn representations that mimic their distributions. This raises a crucial question: can we learn a representation for each latent factor without mixing in information from the others, while ensuring that the difference between the learned representations and the true ones remains within acceptable bounds of uncertainty? This is the question of the identifiability of the latent variables.
How to ensure identifiability, i.e., how to answer this question, is presented in the subsequent section.

Identifiability Analysis
As discussed in Section 3.2, we aim to learn representations for the latent factors while ensuring their identifiability. To achieve this, we begin by defining an equivalence relation denoted as ∼_A.

Definition 3.1 (∼_A Equivalence). Suppose Θ and Θ' are two sets of parameters for the SCM defined in Section 3.2. Θ and Θ' are called ∼_A equivalent if the following conditions are met: (i) the joint distributions of the observed variables induced by Θ and Θ' are identical (Equation 8); and (ii) for each observed variable u ∈ {x, y, t_c, t_s} and each latent factor z ∈ Pa(u), the sufficient statistics satisfy T'(z') = A T(z) + v (Equation 9), where A is an invertible permutation matrix and v is a vector.
The following Theorem 3.2 provides a sufficient condition which ensures that our model learns parameters Θ' that satisfy ∼_A equivalence with the true parameters Θ.

Theorem 3.2 (∼_A Identifiability). Consider the SCM described in Section 3.2. Suppose we have an adequate number of distinct values of c, denoted as n_c, that satisfy the variety assumption, i.e., the matrix L ≜ [η(c_1) − η(c_0), ..., η(c_{n_c}) − η(c_0)] has full column rank, where η(c) represents the vector parameter of the probability density function of the exponential family distribution. Then the learned parameters Θ' and the true parameters Θ are ∼_A equivalent.

Discussion. Theorem 3.2 ensures that the learned parameters are ∼_A equivalent to the true parameters, that is: (1) The joint distributions given by the learned and true parameters match (Equation 8). (2) The latent factors can be separated, as only one appears in Equation 9 at a time. (3) The difference between the learned latent factors and the true ones is limited to a permutation transformation with a linear shift applied to their sufficient statistics (Equation 9). Besides, Theorem 3.2 requires that the number n_c of different values of the confounder c is sufficiently large, which can be satisfied by a proper definition of the confounder. Proofs are provided in Section 5.
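The variety assumption is easy to check numerically. Below is a small sketch (our own illustration, not part of the paper) that tests whether a set of confounder-dependent parameter vectors η(c_k) yields a matrix L with full column rank:

```python
# Numeric check of the variety assumption in Theorem 3.2: the matrix
# L = [eta(c_1)-eta(c_0), ..., eta(c_K)-eta(c_0)], whose columns are the
# differences of the natural parameters, must have full column rank.
import numpy as np

def has_full_column_rank(etas):
    """etas: list of parameter vectors eta(c_0), eta(c_1), ..., eta(c_K)."""
    base = np.asarray(etas[0], dtype=float)
    # Stack each difference eta(c_k) - eta(c_0) as a column of L.
    L = np.stack([np.asarray(e, dtype=float) - base for e in etas[1:]], axis=1)
    return bool(np.linalg.matrix_rank(L) == L.shape[1])

# Confounder values whose parameter differences are linearly independent:
print(has_full_column_rank([[0, 0], [1, 0], [0, 1]]))  # True
# Collinear differences violate the assumption:
print(has_full_column_rank([[0, 0], [1, 1], [2, 2]]))  # False
```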

CAUSALITY INSPIRED SEQ2SEQ MODEL
Under the theoretical guarantees on modeling latent factors separately (Theorem 3.2), we propose the Causality Inspired Sequence-to-Sequence (CI-Seq2Seq) model to learn representations that mimic the latent factors by fitting the observed training data.
In the following, we first present our model architecture, a restructured Variational Auto-Encoder (VAE) [26], which learns latent representations from the input and produces samples resembling the original data. Then, we detail the learning strategy for the causal generative mechanisms p(x | z_cc, z_sc, z_ds) and p(y | z_cc, z_ss), followed by the controlled generation procedure using these learned mechanisms.

Model Architecture
As depicted in Figure 2, the proposed CI-Seq2Seq contains three main components: Confounder-aware Variational Encoder, Reconstruction Decoder, and Prediction Decoder.
The representations h_cc, h_sc and h_ds are then sampled from the learned distributions using the reparametrization trick [26], respectively.
• Computing h_ss with Style Guidance. Since the SS factors are only causally related to the summary y = {y_1, y_2, ..., y_m} of length m, it is not suitable to extract them directly from x as we do for the DS factors. Therefore, we introduce the compression rate (CR) [18] between the DS and SS factors as the style guidance. Specifically, CR helps bridge the DS and SS factors smoothly, indicating the information ratio between the target summary and the source document. Following previous work [71], we define CR as the ratio of text lengths between the summary and the document, i.e., r = m/n ∈ (0, 1). Based on r, we can obtain h_ss from h_ds.

• Reconstruction Decoder. This decoder reconstructs the document x according to p_θ(x | z_cc, z_sc, z_ds), where θ denotes the parameters of the reconstruction decoder. First, we apply a fully connected (FC) layer to combine h_cc, h_sc and h_ds into the composed information h_d. Then, we propose to replace the first token of the decoder input with h_d, since the first token matters much for the generation of the following tokens. Besides, the first token is only allowed to attend to itself, which could alleviate the vanishing latent factor problem to some extent [66].
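The paper does not spell out the exact bridging function from h_ds and r to h_ss, so the following PyTorch sketch is only a hypothetical stand-in: a small network conditioned on the compression rate r = m/n.

```python
import torch
import torch.nn as nn

# Hedged sketch (our own assumption, not the paper's architecture):
# derive the summary-style representation h_ss from the document-style
# representation h_ds and the scalar compression rate r in (0, 1).
class StyleBridge(nn.Module):
    def __init__(self, d_hidden=1024):
        super().__init__()
        self.fc = nn.Linear(d_hidden + 1, d_hidden)

    def forward(self, h_ds, r):
        # Broadcast the scalar compression rate to every batch element.
        r_col = torch.full((h_ds.size(0), 1), float(r))
        return torch.tanh(self.fc(torch.cat([h_ds, r_col], dim=-1)))

bridge = StyleBridge(d_hidden=8)   # toy size; the paper uses d_h = 1024
h_ds = torch.randn(2, 8)
h_ss = bridge(h_ds, r=0.1)
print(h_ss.shape)  # torch.Size([2, 8])
```

Varying r at inference time would then shift h_ss and, through the prediction decoder, the style of the generated summary.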
To further enhance the impact of h_d, we add it to all the output hidden states {o_i} (i = 1, ..., n) from the last Transformer layer in the reconstruction decoder. The vocabulary selection probability p_x for generating x is computed by projecting o_i + h_d through a softmax layer, where W_3 ∈ R^{d_h × d_v} and b_3 ∈ R^{d_v} are learnable.
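The projection described above, adding the composed representation to each decoder hidden state before the output softmax, can be sketched with toy dimensions (the linear layer's weight and bias play the roles of W_3 and b_3):

```python
import torch
import torch.nn as nn

# Sketch of the vocabulary selection probability: the composed latent
# representation h_d is added to every decoder output hidden state,
# then projected to the vocabulary and normalized with a softmax.
d_h, vocab = 8, 50                  # toy sizes; the paper uses d_h = 1024
proj = nn.Linear(d_h, vocab)        # weight/bias stand in for W_3, b_3
o = torch.randn(2, 5, d_h)          # decoder hidden states (batch, steps, d_h)
h_d = torch.randn(2, 1, d_h)        # composed latent representation, broadcast
p_vocab = torch.softmax(proj(o + h_d), dim=-1)
print(p_vocab.shape)                # torch.Size([2, 5, 50])
```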

Prediction Decoder. This decoder only allows the injection of the CC representation h_cc along with the SS representation h_ss for generating the summary y according to p_θ'(y | z_cc, z_ss), where θ' denotes the parameters of the prediction decoder.
First, similar to the reconstruction decoder, we obtain the composed representation h_s for summary prediction by combining h_cc and h_ss using an FC layer. Then, we replace the first token with h_s in the prediction decoder. Simultaneously, we add h_s to all the output hidden states {r_i} (i = 1, ..., m) from the last Transformer layer in the prediction decoder. The final vocabulary selection probability p_y for generating y is calculated in the same way as p_x.

Learning Strategy
To learn p(x | z_cc, z_sc, z_ds) and p(y | z_cc, z_ss) for invariant prediction, we reformulate the learning objective of the VAE in the supervised scenario to fit the training corpus. Specifically, we apply four learning objectives as follows.
• Reconstruction Loss is applied to train the reconstruction decoder to reconstruct the input document. (iii) Finally, we compute the Euclidean distance (i.e., L2 distance) between the pairs ⟨h_cc, h_tc⟩ and ⟨h_sc, h_ts⟩ as the content guidance loss. The total loss is a summation of the four losses, where λ_kl and λ_cg are used to control the strength of the regularization and the content guidance, respectively.
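The total objective can be sketched as a weighted sum of the four losses; the individual loss values below are placeholders, and λ_kl, λ_cg mirror the strength coefficients described above:

```python
import torch

# Hedged sketch of the total training objective: reconstruction,
# prediction, KL-regularization and content-guidance losses combined
# with the two strength coefficients. The loss values are placeholders.
def total_loss(l_rec, l_pred, l_kl, l_cg, lambda_kl=1.0, lambda_cg=1.0):
    return l_rec + l_pred + lambda_kl * l_kl + lambda_cg * l_cg

loss = total_loss(torch.tensor(2.0), torch.tensor(1.5),
                  torch.tensor(0.3), torch.tensor(0.2))
print(loss.item())  # 4.0
```

In training, each placeholder would be replaced by the corresponding differentiable loss term, and the summed scalar back-propagated through encoder and decoders jointly.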

PROOF
Proof. For Theorem 3.2, we will demonstrate that we can learn parameters Θ' that are ∼_A equivalent to the true parameters Θ, satisfying the two conditions in Definition 3.1: Equation 8 and Equation 9. The first condition requires correctly fitting the joint distribution of the observed variables, which is guaranteed by the universal approximation ability of neural networks. Therefore, our main task is to prove the validity of the second condition.
The proof is roughly divided into two steps: Denoising and Identifying. We will present the proof step by step.
Furthermore, notice that different observed variables share common latent factors. To capture this characteristic, we specifically target pairs of observed variables and apply the aforementioned denoising method to these pairs; this idea is inspired by LaCIM [60]. For the variable pairs (x, y), (x, t_c) and (x, t_s), we can obtain similar denoised results (Equations 33-35).

Identifying. This step aims to establish the validity of Equation 9, which asserts the identifiability of each latent factor. First, we present the process for separating these variables. Subsequently, we transform the resulting equations to derive Equation 9.
Considering that we have a sufficient number n_c of different values of c, we take the logarithm on both sides of Equations 29-35 and plug the different values of c (i.e., c_0, c_1, ..., c_{n_c}) into each equation. Subtracting the first equation (containing c_0) from the second (c_1) through the last (c_{n_c}), we obtain n_c different equations for each of Equations 29-35, indexed by k = 1, 2, ..., n_c (Equation 36). We then rewrite these equations in matrix form (Equation 37), denoted as Eq(·). Notice that Pa(x) = {z_cc, z_sc, z_ds}; we now outline the procedure for separating the latent factors in the parent set of x. By evaluating the expression Eq(u = x) + Eq(u = t_c) − Eq(u = (x, t_c)), we can separate the latent factor z_cc of the observed variable x. Using the same method, we can separate z_sc of x by evaluating Eq(u = x) + Eq(u = t_s) − Eq(u = (x, t_s)). Afterwards, the only remaining latent factor of x, z_ds, is naturally separated. The above results show that for the observed variable x, all of its latent factors can be separated while preserving their individuality, without mixing information. This conclusion also holds for the other observed variables, i.e., y, t_c, and t_s. The equation for each separated latent factor can be expressed as Equation 40, where u ∈ {x, y, t_c, t_s, (x, y), (x, t_c), (x, t_s)} and z ∈ Pa(u).
Based on Equation 40, we demonstrate the validity of Equation 9. Since the number n_c is large enough to ensure that the coefficient matrix in Equation 40 has full rank, we multiply its inverse on both sides of Equation 40, obtaining Equation 41. Notice that Equation 41 is already in the same form as Equation 9.
The remaining task is to prove that the matrix A  is an invertible permutation matrix, which can be achieved by directly applying Lemma 3, Theorem 2, and Theorem 3 from [25].□

EXPERIMENTAL SETTINGS

Datasets
We conduct experiments on two public text summarization datasets in English: (1) XSUM [43] and (2) CNN/Daily Mail (CNN/DM) [19].

Evaluation Methodology
• Automatic Evaluation: We adopt Rouge scores [30] to automatically evaluate the quality of the summaries generated by our model and the baselines. Specifically, we use Rouge-1 (R1), Rouge-2 (R2) and Rouge-L (RL) to measure the uni-gram, bi-gram and longest-common-subsequence similarities, respectively.

• Human Evaluation: We measure Informativeness, Faithfulness, and Fluency, referring to [13,22,27]. Each summary is rated on a 5-point Likert scale (higher is better), to measure whether the generated summary satisfies: (i) Informativeness, covering the core information (i.e., the most necessary pieces) of the source document and excluding side information that may mislead the understanding of the document's main idea; (ii) Faithfulness, containing only information present in the document, without introducing any made-up facts (i.e., hallucination [67]); (iii) Fluency, being natural and grammatically correct. Specifically, we ask three college students to score 200 samples randomly picked from the test sets of CNN/DM and XSUM (100 for each).
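As an illustration of what Rouge-1 measures, here is a minimal hand-rolled unigram F1; real evaluation should use the standard Rouge package, since this toy function ignores stemming and multiple references:

```python
from collections import Counter

# Toy unigram Rouge-1 F-measure: overlap of unigram counts between a
# reference summary and a candidate, combined into precision/recall/F1.
def rouge1_f(reference, candidate):
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f("the cat sat on the mat",
                     "the cat is on the mat"), 3))  # 0.833
```

Rouge-2 and Rouge-L follow the same precision/recall/F1 scheme over bigrams and the longest common subsequence, respectively.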

Baselines
We compare CI-Seq2Seq against several recently proposed baseline methods: (i) Unified VAE-PGN [10] leverages VAE to eliminate non-critical information at the sentence level for abstractive summarization. (ii) VHTM [15] jointly accomplishes summarization and topic inference via a variational encoder-decoder. (iii) T5 [54] is a pre-trained framework that converts all text-based language problems into a text-to-text format. (iv) BART [28] is a denoising autoencoder for pre-training Seq2Seq models. (v) GLM [12] is a General Language Model pre-trained with autoregressive blank infilling. (vi) PtLAAM [34] uses a length-aware attention mechanism to generate summaries with a desired length. (vii) PEGASUS [74] is a pre-trained model tailored for abstractive summarization, with Gap Sentences Generation (GSG) as its pre-training objective.

Implementation Details
The proposed CI-Seq2Seq can be adapted to other Seq2Seq PLMs.
Here, we choose BART-large and PEGASUS-large for initialization, denoted as CI-Seq2Seq(BART) and CI-Seq2Seq(PEGASUS), where the hidden size d_h is 1024, and the vocabulary size d_v is 50265 for CI-Seq2Seq(BART) and 96103 for CI-Seq2Seq(PEGASUS). BART is chosen for its outstanding performance as well as its lower computing cost compared to its peers [34], and PEGASUS is chosen for its state-of-the-art performance in summarization. The number of new parameters added to CI-Seq2Seq compared to the backbones is about 256M.
For hyper-parameters, we use grid search to automatically find the best setup based on the validation set. We select the number of subsets K_c as 16 from [8, 32], another coefficient as 5 from [1, 20], and h as 0.25 from [0.02, 0.3]. We choose the dimensions of the CC and DS representations as 128 from {128, 256}, the dimension of the SC representations as 256 from {256, 512}, and the dimension of the SS representations as 128 from {128, 256}. Note that the dimension of the SC representations is set larger than that of the CC representations, since the SC representations are likely to include more diverse information than the CC representations describing the core information. During training, we select λ_kl and λ_cg as 1 from [1e-3, 1]. The batch size is searched from {256, 512}, and the learning rate from [1e-5, 1e-4]. During testing, we select the best number of candidate points in the range [5, 20] and the best number of optimization steps in the range [20, 100]; the test-time coefficients are searched from [0.001, 0.5], the batch size from [1, 4], and the learning rate from [0.001, 0.5].
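The grid search itself can be sketched as enumerating the Cartesian product of candidate values; the values shown are illustrative subsets of the ranges above, and `train`/`validate` are hypothetical stand-ins:

```python
import itertools

# Hedged sketch of hyper-parameter grid search: enumerate all
# combinations of candidate values, then keep the configuration with
# the best validation score.
grid = {
    "batch_size": [256, 512],
    "learning_rate": [1e-5, 5e-5, 1e-4],
    "lambda_kl": [1e-3, 1.0],
}
configs = [dict(zip(grid, values))
           for values in itertools.product(*grid.values())]
print(len(configs))  # 2 * 3 * 2 = 12 candidate configurations
# best = max(configs, key=lambda cfg: validate(train(cfg)))  # dev-set pick
```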
The Adam optimizer is used at both stages. We train our model on one NVIDIA Tesla V100 32GB GPU for about 5k-10k steps for each dataset, which takes approximately six days. All experimental results are reported on the test set. Note that for the baseline methods, we reproduce and evaluate our backbone models (i.e., BART and PEGASUS) ourselves to provide a fair comparison, while we report the scores of the other baselines from their papers. For BART, the results we reproduce are almost consistent with those of the original paper [28]. For PEGASUS, the results on XSUM are almost consistent; however, there is a gap between our reproduced results and those reported in the original paper [74]. The difference may come from our restriction on the maximum sequence length, which is set to 512 for the source documents and 64 for the summaries.

EXPERIMENTAL RESULTS
We aim to answer four research questions: (RQ1) Does CI-Seq2Seq enhance prediction performance on in-domain datasets? (RQ2) Does CI-Seq2Seq enhance generalization ability on out-of-domain datasets? (RQ3) Is CI-Seq2Seq interpretable? (RQ4) How do latent factors and their constraints affect the performance of CI-Seq2Seq? For each question, we conduct experiments as follows.

In-domain Prediction Performance
To answer RQ1, we compare CI-Seq2Seq with various strong baselines on the test sets of CNN/DM and XSUM, where the models are trained on the training set of the same corpus.

7.1.1 Automatic Evaluation. We have the following observations from Table 1: (i) VAE-based neural summarization models (i.e., Unified VAE-PGN and VHTM) perform well by automatically learning text representations containing the critical information of documents. (ii) The improvements of PLMs (i.e., T5, BART and GLM) over previous methods demonstrate the utility of pre-training on massive corpora for downstream summarization tasks. (iii) By incorporating a length-aware attention mechanism, PtLAAM further enhances the performance of BART. (iv) PEGASUS outperforms all baselines on XSUM, showing the power of its tailored pre-training objective for summarization. On CNN/DM, PEGASUS performs less well than models initialized with BART under the same maximum sequence length constraint. The reason may lie in how well the pre-training objective matches the downstream datasets. Specifically, BART's denoising objective is to reconstruct the full text, while the GSG objective of PEGASUS is to reconstruct corrupted text. Consequently, BART, with its longer target text, can better handle the long summaries of CNN/DM than PEGASUS.
When we look at our CI-Seq2Seq model, we find that: (i) CI-Seq2Seq implemented with both BART and PEGASUS outperforms all the baselines on the two datasets. For example, CI-Seq2Seq(PEGASUS) performs 12.98% better than PEGASUS on XSUM in terms of RL. This indicates the insufficiency of modeling only statistical dependence and the effectiveness of modeling the causal relationships between observed documents and summaries. (ii) Between them, CI-Seq2Seq(PEGASUS) performs better on XSUM, while CI-Seq2Seq(BART) performs better on CNN/DM. Under the same fine-tuning setting, the likely explanation aligns with the one accounting for the performance difference between BART and PEGASUS.
7.1.2 Human Evaluation. As shown in Table 2, we observe that: (i) Informativeness: CI-Seq2Seq models implemented with both backbones perform better than the baselines. This indicates that introducing causality helps extract the core information into summaries while effectively reducing the interference of side information, consistent with our purpose of distinguishing the core content from the side content in the document and leveraging the causal part for summary generation. (ii) Faithfulness: CI-Seq2Seq models also outperform the baselines, indicating that our approach could alleviate hallucination by pursuing only the core information in the document, though the hallucination problem is not our focus and deserves further exploration. (iii) Fluency: CI-Seq2Seq models are comparable to the baselines, indicating that our models retain the ability to generate fluent text while removing non-essential information.

Out-of-domain Generalization Ability
To answer RQ2, we compare model performance on an unseen corpus under the zero-shot setting. That is, given a model trained on XSUM, we evaluate its performance on out-of-domain (OOD) test examples from CNN/DM, and vice versa. Specifically, we sample 2000 examples from each test set for evaluation.
As shown in Table 3, we observe that although all models struggle on OOD test examples, CI-Seq2Seq outperforms the baselines. For example, when training on XSUM and testing on CNN/DM, CI-Seq2Seq(PEGASUS) beats PEGASUS by 11.55% in terms of RL. These results demonstrate that capturing the invariant causal relationships can endow the summarization model with generalization ability.

Interpretability of Latent Factors
To answer RQ3, we analyze the roles of the content factors and style factors through a case study and visual analysis.

Content Factors Analysis. To understand the influence of the CC and SC factors, we compare the top-3 attended words in the document when generating each token of the summary and the document, based on the cross-attention weights of the Transformer. As shown in Table 4, summary generation guided by h_cc prefers tokens (Attended_y) conveying the core information of the document, e.g., "shale" and "safely", while document reconstruction guided by h_sc and h_ds attends to inessential words (Attended_x), e.g., "involves" and "fracking". Without h_sc, the generated summary only captures the core content "safe" and "strengthen regulations", omitting the side content "protect public health", which co-occurs frequently with "safe" in the corpus. This case indicates that the learned representations h_cc and h_sc can mimic the CC and SC factors to capture the core and side content in the document, respectively.

We also visually analyze the learned CC and SC representations. Specifically, we randomly sample 2000 test examples from XSUM and CNN/DM respectively, and then apply t-SNE [61] to visualize h_cc, h_sc and h_ds. As shown in Figure 3, we observe that: (i) The distributions of both h_cc and h_sc are smoother than that of h_ds, indicating that by splitting the mixed information into distinct parts, each part contains purer information. (ii) The distribution of h_cc exhibits higher uniformity, whereas h_sc is more scattered. This observation indicates that the SC factors capture diverse side information for document generation and thus are dispersive.

Table 4: An example (No. 8) from the XSUM test data, used to analyze the roles of the content factors (CC and SC) and style factors (DS and SS). We mark the core content in blue and the side content in red.

Document: ...The joint report from the Royal Society and Royal Academy of Engineering say the technique is safe if firms follow best practice and rules are enforced... "Our main conclusions are that the environmental risks of hydraulic fracturing for shale can be safely managed provided there is best practice observed and provided it's enforced through strong regulation,"...

Ground-truth summary: A gas extraction method which triggered two earth tremors near Blackpool last year should not cause earthquakes or contaminate water but rules governing it will need tightening, experts say.

BART: Shale gas extraction can be carried out safely in the UK, but stronger regulations are needed to protect public health, a report says.

CI-Seq2Seq: Shale gas extraction in the UK can be relatively safe, but the government should strengthen regulations, say scientists.
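The t-SNE visual analysis can be sketched as follows; random vectors stand in for the learned representations, and scikit-learn's TSNE is assumed:

```python
import numpy as np
from sklearn.manifold import TSNE

# Sketch of the visual analysis: project a learned representation
# matrix (e.g., h_cc for a set of test examples) into 2-D with t-SNE.
# Random vectors stand in for the actual learned representations;
# the paper projects 2000 examples of dimension 1024.
rng = np.random.default_rng(0)
h_cc = rng.normal(size=(50, 16))   # toy stand-in: 50 examples, dim 16
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(h_cc)
print(emb.shape)  # (50, 2)
```

Plotting `emb` for h_cc, h_sc and h_ds side by side would reproduce the kind of comparison shown in Figure 3.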

Impact of Latent Factors and Constraints
To answer RQ4, we perform ablations on XSUM to analyze how the latent factors are injected, as well as the necessity of the confounder information and the content/style guidance serving as constraints.
Impact of Latent Factors. We remove the addition of h_d / h_s and the replacement of the first token in the decoder, respectively. As shown in the middle of Table 5, both the addition and replacement operations contribute to the prediction performance, and the replacement of the starting token matters more.
Impact of Constraints. For the confounder, we set K_c = 1 to eliminate its information. For content, we remove the content guidance loss in Equation 20. For style, we sample h_ss in the same way as h_ds, without the additional bridge between them. As shown in the bottom of Table 5, removing the constraints on either the confounder or the content/style guidance hurts the prediction performance. This demonstrates the necessity of all the constraints, which is consistent with our theory that they are essential for identifying the latent factors.

CONCLUSION
In this paper, we presented a principled causal perspective on text summarization. Theoretically, we proved the identifiability of the causal and non-causal factors in our SCM to ensure that these latent factors can be separated. Inspired by the identifiability theory, we proposed CI-Seq2Seq to learn causal representations that mimic the causal factors for summary generation. We hope this paradigm can illuminate a promising technical direction for causality in NLP. One limitation of our method is a slightly higher computational cost than the original Seq2Seq architecture, due to the additional parameters and the optimization procedure during inference. To address this, we plan to reduce the dimension of the latent representations and explore other optimization tools. We also want to explore diverse ways to utilize confounder information and define causal factors, which can better showcase our model's strengths under the identifiability guarantees. Besides, we are interested in inducing causal structure into extractive summarization, and exploring controllability over more aspects.

Figure 1 :
Figure 1: The proposed SCM for text summarization. Solid and dashed circles denote observed and latent variables. The solid arrows pointing to x and y represent the invariant causal generative mechanisms p(x | z_cc, z_sc, z_ds) and p(y | z_cc, z_ss), while the dashed arrows pointing from u represent the latent distributions that vary given the confounder. The blue arrows pointing to z_cc and z_sc represent the content guidance for z_cc and z_sc, while the yellow dashed line between z_ds and z_ss represents their relation as the style guidance. See Section 3 for details.
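As a toy illustration of this data generating process, the SCM can be simulated by ancestral sampling with numeric stand-ins for text. This is a minimal sketch under illustrative assumptions: the dimensions, the Gaussian parameterization, and the way the confounder shifts the latent means are not the paper's actual model.

```python
import numpy as np

D = 4  # dimensionality of each latent factor (illustrative)

def sample_latents(u, rng):
    """Sample the four latent factors given a confounder value u.

    The confounder shifts the mean of each latent distribution,
    mimicking p(z | u) varying across topics/domains.
    """
    z_cc = rng.normal(loc=u, size=D)  # core content
    z_sc = rng.normal(loc=u, size=D)  # side content
    z_ds = rng.normal(loc=u, size=D)  # document style
    z_ss = rng.normal(loc=u, size=D)  # summary style
    return z_cc, z_sc, z_ds, z_ss

def generate(u, rng):
    """Ancestral sampling: x depends on (CC, SC, DS), y on (CC, SS)."""
    z_cc, z_sc, z_ds, z_ss = sample_latents(u, rng)
    x = np.concatenate([z_cc, z_sc, z_ds]) + rng.normal(scale=0.1, size=3 * D)
    y = np.concatenate([z_cc, z_ss]) + rng.normal(scale=0.1, size=2 * D)
    return x, y

rng = np.random.default_rng(0)
x, y = generate(u=1.0, rng=rng)
```

The key structural point the sketch encodes is that x and y share only the core-content factor z_cc, while style factors are specific to each side.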

Figure 2 :
Figure 2: The overall architecture of CI-Seq2Seq model.

Figure 3: The t-SNE plot of the latent representations learned by our model.

Style Factors Analysis. To understand the influence of the DS and SS factors, we vary α between them to explicitly control the generation. Specifically, we vary α from 0.1 to 0.7 to change h_ss. As shown in Table 4, the generated summary is concise when α = 0.1, including only the necessary information, e.g., "low" and "risks". When α = 0.7, the generated summary contains a more specific description, e.g., "the controversial technique". The summary becomes more detailed as α increases. Note that the goal of controlled generation here is not precise length control, but to control the style of the summary by utilizing α as a weakly-supervised signal. The results indicate that h_ss can mimic the SS factors to actively control the writing style of the summary.
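The excerpt does not spell out how α combines the two style representations. A minimal sketch under the assumption of a simple convex combination, where larger α pulls the summary style toward the more verbose document style (the function name and the interpretation of α are assumptions for illustration):

```python
import numpy as np

def mix_style(h_ds, h_ss, alpha):
    """Blend document-style and summary-style representations.

    Assumed convex combination: larger alpha moves the result toward
    the document style; alpha acts as the weakly-supervised control.
    """
    return alpha * h_ds + (1.0 - alpha) * h_ss

h_ds = np.ones(4)    # toy document-style representation
h_ss = np.zeros(4)   # toy summary-style representation
concise = mix_style(h_ds, h_ss, alpha=0.1)   # close to pure summary style
detailed = mix_style(h_ds, h_ss, alpha=0.7)  # pulled toward document style
```

Under this reading, sweeping α from 0.1 to 0.7 traces a path between the two style representations, matching the observed shift from concise to detailed summaries.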
The joint probability density of x and f(x) can be written as:

p_Θ(x, f(x) | u) = p_Θ(x, z_cc, z_sc, z_ds | u) = p_f(x | z_cc, z_sc, z_ds) · p_{T,λ}(z_cc, z_sc, z_ds | u). (1)

For p_{T,λ}(z_cc, z_sc, z_ds | u), we assume it follows an exponential family:

p_{T,λ}(z_cc, z_sc, z_ds | u) = p_{T_cc,λ_cc}(z_cc | u) · p_{T_sc,λ_sc}(z_sc | u) · p_{T_ds,λ_ds}(z_ds | u).

4.1.1 Confounder-aware Variational Encoder. This encoder targets to obtain representations h_cc, h_sc, h_ds and h_ss for the CC, SC, DS and SS factors from the input document x = {x_1, x_2, ..., x_n} of length n. Based on Theorem 3.2, the confounder u is essential in distinguishing the latent factors. It can be defined as the intrinsic properties of the training data, e.g., topic, style and domain. Here, we denote u as the topic extracted from documents. Then, the encoder maps x and u into h_cc, h_sc, h_ds and h_ss according to q_φ(z_cc, z_sc, z_ds | x, u) and the relation between z_ds and z_ss, where φ denotes the parameters of the confounder-aware variational encoder.
• Encoding Confounder Information h_u. To achieve different values of the confounder u, we denote u as the topical features. We first partition each summarization corpus into n_u subsets via LDA topic classification, where each document belongs to one subset. Specifically, each document obtains a topic distribution from LDA, and the topic id u with the highest probability is assigned to the document. Then, following the practice of word embedding [37], u is applied to look up a hidden vector h_u ∈ R^{d_u} from a trainable embedding matrix E_u ∈ R^{n_u × d_u}, i.e., h_u = E_u(u).
• Encoding Source Information h_x. The CC, SC and DS factors are probably influenced by the full information of the document. Therefore, we propose to model their distributions conditioned on the global semantic representation of x. Specifically, given an input document x, we first add a special token "[CLS]" in front of it, and then leverage the final hidden state of this token as its global representation h_x ∈ R^{d_h}. It is a flexible aggregate and comprehensive understanding of the entire sequence.
• Sampling h_cc, h_sc and h_ds to model the posterior distribution q(z_cc, z_sc, z_ds | x, u). Specifically, the true posterior p(z_cc, z_sc, z_ds | x, u) is approximated via the variational distribution q_φ(z_cc, z_sc, z_ds | x, u). We constrain the prior distributions p(z_cc, z_sc, z_ds) to be standard Gaussian distributions following [26, 29]. The Gaussian parameters, mean μ and variance σ², are projected from the concatenation of h_x and h_u.
4.1.2 Reconstruction Decoder. This decoder targets to utilize the representations h_cc, h_sc and h_ds of the CC, SC and DS factors to reconstruct the input document x according to p_θ(x | z_cc, z_sc, z_ds).
• Prediction Loss is applied to encourage the prediction decoder to generate the summary based on the summary-causal representations, i.e.,

L_pred = −E_{q_φ(z_cc, z_ss | x, u)}[log p_θ(y | z_cc, z_ss)]. (14)

• KL Loss is a regularizer based on the Kullback-Leibler (KL) divergence, applied to push the posterior q_φ(z_cc, z_sc, z_ds | x, u) to be close to the prior p(z_cc, z_sc, z_ds), which is constrained as standard Gaussian distributions, i.e.,

L_KL = D_KL[q_φ(z_cc, z_sc, z_ds | x, u) ∥ p(z_cc, z_sc, z_ds)]. (15)

• Content Guidance Loss is further applied to guide the optimization of the CC and SC factors, which is calculated in three steps. (i) We first extract the core topics t_c and side topics t_s in x according to the LDA topic distribution p(t | x) over n_u topics. Specifically, given a threshold th, a topic t_i (i ∈ {1, 2, ..., n_u}) belongs to the core topics of document x if p(t = t_i | x) > th, and to the side ones otherwise. To indicate the type of each topic, we introduce an n_u-dimensional binary indicator g, where "1" represents the core topics and "0" the side ones. (ii) We then transform this topic information into hidden representations h_tc, h_ts ∈ R^{d_h} based on another learnable embedding matrix E_t ∈ R^{n_u × d_h}. Similar to E_u, each row of E_t represents a topic embedding. Specifically, to achieve the aggregated hidden representation h_tc, which combines the information of all core topics, we obtain the core topic distribution p(t_c | x) based on the binary indicator g, i.e.,

p(t_c | x) = Norm(p(t | x) ⊙ g), (16)

where ⊙ denotes element-wise multiplication and Norm(·) denotes the normalization operation. After that, we linearly combine topic embeddings from E_t according to p(t_c | x) as below:

h_tc = p(t_c | x) E_t. (17)
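Equations 16 and 17 amount to masking the topic distribution, renormalizing it, and linearly combining rows of the topic embedding matrix. A minimal numeric sketch, where the 3-topic setup is illustrative and Norm(·) is assumed to renormalize the masked distribution to sum to one:

```python
import numpy as np

def core_topic_embedding(p_t, g, E_t):
    """Aggregate embeddings of core topics (Eqs. 16-17).

    p_t : (n_u,)      LDA topic distribution p(t|x)
    g   : (n_u,)      binary indicator, 1 for core topics, 0 for side topics
    E_t : (n_u, d_h)  learnable topic embedding matrix
    """
    masked = p_t * g                # keep only core-topic mass (element-wise)
    p_core = masked / masked.sum()  # Norm(.): renormalize (Eq. 16)
    return p_core @ E_t             # h_tc = p(t_c|x) E_t (Eq. 17)

p_t = np.array([0.5, 0.3, 0.2])
g = np.array([1.0, 0.0, 1.0])       # topics 0 and 2 exceed the threshold th
E_t = np.eye(3)                     # identity embeddings to expose the weights
h_tc = core_topic_embedding(p_t, g, E_t)
# -> [0.714..., 0.0, 0.285...]: the renormalized core-topic weights
```

With identity embeddings, the output directly exposes the renormalized core-topic weights; the side-topic embedding (row 1) contributes nothing, which is exactly the intended guidance signal.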
z*_cc, z*_sc, z*_ds = argmin_{z_cc, z_sc, z_ds} −log p_θ(x | z_cc, z_sc, z_ds) + λ_cc ∥z_cc∥²₂ + λ_sc ∥z_sc∥²₂ + λ_ds ∥z_ds∥²₂, (22)

where λ_cc, λ_sc and λ_ds control the learned z_cc, z_sc and z_ds in a reasonable scale. Specifically, we sample some candidate points from N(0, I) and select the optimal one in terms of Equation 22 as the initial point for further optimization. Finally, we employ the optimized factors to generate summaries with different styles by varying α. In this way, we can actively control the compression rate of the summary. That is, with different z*_ss values and the optimized z*_cc, we generate the summary y based on the learned p_θ(y | z*_cc, z*_ss).
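The candidate-initialization step described above can be sketched as follows. This is a toy stand-in: the quadratic objective replaces the reconstruction term of Equation 22, and the function names, candidate count, and scale are illustrative assumptions.

```python
import numpy as np

def pick_initial_point(objective, dim, n_candidates=16, scale=1.0, seed=0):
    """Sample candidates from N(0, scale^2 I) and keep the one that
    minimizes the objective, to serve as the starting point for
    further gradient-based optimization."""
    rng = np.random.default_rng(seed)
    candidates = rng.normal(scale=scale, size=(n_candidates, dim))
    scores = np.array([objective(z) for z in candidates])
    return candidates[int(np.argmin(scores))]

# Toy stand-in for Eq. 22: a quadratic "reconstruction" term
# plus the L2 regularizer on the latent code.
target = np.array([0.5, -0.5])
lam = 0.1
obj = lambda z: np.sum((z - target) ** 2) + lam * np.sum(z ** 2)

z0 = pick_initial_point(obj, dim=2)  # best of the sampled candidates
```

Starting gradient descent from the best of several random candidates is a cheap way to avoid poor local optima in the non-convex inference-time objective.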

Table 1 :
In-domain performance comparisons between our CI-Seq2Seq and the baselines on XSUM and CNN/DM datasets.Best results are marked in boldface.* indicates statistically significant improvements over baselines (p-value < 0.05).

Table 5 :
Ablations of injection ways of latent factors (middle) as well as their constraints (bottom) on the subset of XSUM.