CitationSum: Citation-aware Graph Contrastive Learning for Scientific Paper Summarization

Citation graphs can be helpful in generating high-quality summaries of scientific papers, where the references of a scientific paper and their correlations provide additional knowledge for contextualising its background and main contributions. Despite this promise, it remains challenging to incorporate citation graphs into summarization tasks, owing to the difficulty of accurately identifying and leveraging relevant content in the references of a source paper, and of capturing their correlations of different intensities. Existing methods either ignore references or indiscriminately use only their abstracts, failing to tackle these challenges. To fill this gap, we propose a novel citation-aware scientific paper summarization framework based on citation graphs, able to accurately locate and incorporate salient content from references, and to capture the varying relevance between source papers and their references. Specifically, we first build a domain-specific dataset, PubMedCite, with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. It is characterized by preserving the salient content extracted from the full texts of references, and the weighted correlations between this salient content and the source papers. Based on this dataset, we design a self-supervised citation-aware summarization framework (CitationSum) with graph contrastive learning, which improves summary generation by efficiently fusing the salient information in references with source paper contents under the guidance of their correlations. Experimental results show that our model outperforms state-of-the-art methods by efficiently leveraging the information of references and citation correlations.


INTRODUCTION
Digital scientific documents such as scientific papers are vital linked web resources. The rapid growth of scientific research [13] requires the development of methods to automatically summarise scientific papers. Unlike general language texts, scientific papers are characterized by domain-specific structures and domain-specific terms, which must be taken into account when generating succinct and conclusive summaries [1, 26, 38]. Furthermore, scientific documents are connected and relevant to the papers they cite. Source scientific papers, their references, and their citation correlations form an enormous citation graph. In a citation graph, the references of a scientific paper and their correlations provide extra knowledge such as the context of its research background, methods, and findings [3, 6, 26, 37]. Therefore, to understand and summarize the gist of a scientific paper, it is essential to incorporate information from its references and the relation structure of the citation graph, in addition to the paper's own contents.
Although the citation graph plays a promising role in improving the automatic summarization of scientific papers, little attention has been paid to incorporating it into existing pre-trained language model (PLM) based summarization methods [16, 19]. The only exception is CGSum [3], which leverages the references of source papers by constructing a citation graph to improve summary generation. Yet it only incorporates the abstracts of references, which can be uninformative or even meaningless for the source document. As shown in Table 1, the two abstracts of references have low semantic similarity with the gold summary of the source document (its abstract), showing that the abstracts of references can be uninformative for improving the summary generation of the source document.
A better alternative is to incorporate the full contents of references instead of their abstracts. The challenge is that bringing in the full text of references can introduce redundant information: the content of a reference may be irrelevant to the source paper except for a few key sentences, as shown in Table 2. Moreover, different parts of references, such as the introduction, related work, methods, and experiments, have varying levels of semantic similarity with the source paper. For example, Table 3 shows the mean semantic similarity between source documents and different parts of their references in the SSN dataset [3]. We observe that sentences selected from the full contents of references have the highest semantic similarity with source documents, while the abstracts of references have the lowest. Therefore, to efficiently leverage references and their correlation structure in the citation graph to improve the summarization of a source document, a remaining challenge is to identify and locate key information for the source paper in the full contents of its references, and to capture their correlations of different intensities. This can be laborious even for experienced researchers.

Table 1: An example of a source document and the abstracts of its references in the SSN dataset [3]. We calculate the ROUGE-1 score [17] as the semantic similarity between the gold summary of the source paper (its abstract) and the abstracts of its references.

Abstract of source paper: this paper summarizes the contents of a plenary talk at the pan african congress of mathematics held in rabat in july 2017. we provide a survey of recent results on spectral properties of schrödinger operators with singular interactions supported by manifolds of codimension one and of robin billiards with the focus on the geometrically induced discrete spectrum and its asymptotic expansions in term of the model parameters.

Reference 1: we determine accurate asymptotics for the low-lying eigenvalues of the robin laplacian when the robin parameter goes to $-\infty$. the two first terms in the expansion have been obtained by k. pankrashkin in the $2d$-case and by k. pankrashkin and n. popoff in higher dimensions. the asymptotics display the influence of the scalar curvature and the splitting between every two consecutive eigenvalues. (ROUGE-1: 0.1579)

Reference 2: we give a counterexample to the long standing conjecture that the ball maximises the first eigenvalue of the robin eigenvalue problem with negative parameter among domains of the same volume. furthermore, we show that the conjecture holds in two dimensions provided that the boundary parameter is small. this is the first known example within the class of isoperimetric spectral problems for the first eigenvalue of the laplacian where the ball is not an optimiser. (ROUGE-1: 0.1818)

Table 2: An example of a source document and the contents of its reference in the SSN dataset [3]. The most related contents in the reference are marked in blue.

Source paper: in this paper, the weak galerkin finite element method for second order elliptic problems employing polygonal or polyhedral meshes with arbitrary small edges or faces was analyzed. with the shape regular assumptions, optimal convergence order for $h^1$ and $l_2$ error estimates were obtained. also element based and edge based error estimates were proved.

Reference: weak galerkin (wg) refers to a finite element technique for partial differential equations in which differential operators are approximated by their weak forms as distributions. a weak galerkin method was introduced and analyzed for second order elliptic equations based on weak gradients. in this paper, we shall develop a new weak galerkin method for second order elliptic equations formulated as a system of two first order linear equations. our model problem seeks a flux function inlineform0 and a scalar function inlineform1 defined in an open bounded polygonal or polyhedral domain inlineform2 satisfying displayform0, and the following dirichlet boundary condition displayform0, where inlineform3 is a symmetric, uniformly positive definite matrix on the domain inlineform4. a weak formulation for (eqref3)-(eqref4) seeks inlineform5 and inlineform6, such that displayform0, here inlineform7 is the standard space of square integrable functions on inlineform8, inlineform9 is the divergence of vector-valued ...
To fill this gap, we propose a novel citation-aware scientific paper summarization framework based on citation graphs, able to accurately locate and incorporate salient content from references, and to capture the varying relevance between source papers and their references. We first develop a new dataset for the biomedical domain, PubMedCite, with about 192K biomedical scientific papers. We then propose a self-supervised citation-aware summarization framework (CitationSum) with graph contrastive learning, which integrates the salient information from references with the source paper contents under the guidance of their correlations to improve summary generation. We build a hierarchical heterogeneous graph containing nodes for the source paper, its references, and their tokens, based on a weighted citation graph. We design contrastive learning guided by the hierarchical graph to align representations of source documents with the key contents of their references from PLMs, at both the document and token level. This allows our method to incorporate useful information from references into source papers according to their semantic correlation, and to fuse information between source documents, references, and their tokens. We show that our graph contrastive learning can be viewed as an implicit reconstruction of the weighted citation graph and document contents, which is consistent with the human writing process for scientific papers. Our main contributions are as follows: (1) We propose a novel self-supervised summarization method, CitationSum, that uses graph contrastive learning to incorporate the key contents of references and the inherited semantic correlations between source documents and references, for scientific paper summarization. (2) We develop a domain-specific dataset, PubMedCite, containing 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships among them.
To the best of our knowledge, this is the first dataset in the biomedical domain for the task.
(3) Experimental results empirically demonstrate that our method can efficiently leverage references and capture semantic correlations between source papers and their references, leading to superior performance compared to previous advanced methods.

RELATED WORK

Scientific Paper Summarization
Online scientific documents are web-based text resources whose automatic summarization has attracted much attention [12, 27]. One direction to tackle this problem is citation-assisted summarization, which aims to highlight the main contributions of a paper based on citation sentences from the papers that cite it. The earliest attempts at citation-based summarisation [1, 6, 26] used sentence clustering and ranking methods such as LexRank [7] to select citation sentences from citing papers as the summary of the paper they reference. In addition to the cited text spans from citing papers, Yasunaga et al. [38] further utilized the abstract of the target paper and graph neural networks (GNNs) [33] to encode all input texts. Zerva et al. [39] investigated the advanced pre-trained encoder BERT [11] to identify and select citation text spans. However, these methods cannot handle a newly published paper that has not yet been cited by any paper. To address this gap, An et al. [3] recently proposed the citation graph-based summarization task, which considers both the contents of target papers and their corresponding references in the citation graph. However, they only utilized the abstracts of references in a shallow manner, which inspires us to fully leverage references and capture the correlations between source papers and their references via graph contrastive learning.

Text Summarization with Contrastive Learning
Recently, contrastive learning has been introduced to improve text summarization. Liu and Liu [20] proposed to use contrastive learning to optimize the quality of generated summaries according to the evaluation metric ROUGE. Cao and Wang [5] and Nan et al. [25] utilized contrastive learning to improve the factuality of abstractive summarization. Liu et al. [18] proposed a topic-aware contrastive loss to capture dialogue topic information for abstractive dialogue summarization. Wang et al. [32] designed a contrastive loss to align sentence representations across different languages for multilingual summarization. Hu et al. [10] used contrastive learning to improve the graph encoder for radiology findings summarization. Xie et al. [36] designed graph contrastive learning to capture better topic information for long document summarization. Different from them, we focus on the citation graph-based summarization task, which has rarely been studied.

CITATION-AWARE PUBMED DATASET
To support the evaluation and development of our method, we first introduce a new large-scale citation-aware PubMed dataset (PubMedCite) with 192,744 scientific paper nodes and 917,838 citation relationships in the biomedical domain, extracted from the PubMed Central Open Access Subset. To construct PubMedCite, we first downloaded the whole PubMed Open Access Subset (up to 17 Nov. 2021), then built the graph by adding papers to it through breadth-first traversal starting from a random document until the number of nodes reached a limit. During construction, we utilized the pubmed parser by Achakulvisut et al. [2] to extract the PMC id, PubMed id, title, abstract, and full article of each document. For the inductive setting, we used the same graph-building method to sample two different sub-graphs from the whole PubMedCite citation graph as the validation and test sets. We then removed the inter-graph edges among the three sets to ensure their independence. Although the dataset is not for commercial use, we are unable to release its document contents directly due to the license limitations (such as CC BY-SA and CC BY-ND) of some papers. We will release the built citation graph among all documents, and provide a code script for users to access and process the document contents themselves, according to the paper ids saved in the citation graph.
The statistics of the PubMedCite dataset and a comparison with the only existing dataset, SSN [3], are shown in Table 4. Compared with SSN: 1) the average length of gold summaries in our dataset is longer, while the average length of full articles is relatively shorter; 2) our dataset keeps the full contents of references to make better use of their information, while SSN only keeps their abstracts; 3) our dataset only includes papers from the biomedical domain, while SSN consists of papers from several different fields including mathematics, physics, and computer science, so our dataset can support the evaluation of domain-specific tasks in the research community. Moreover, biomedical scientific papers are laden with terminology and have complex syntactic structures [21, 31]. This makes our dataset a challenging benchmark for automatic summarization methods.

METHODS
We first define the task of scientific paper summarization with the citation graph. Given a corpus $C$, each document $d$ in the corpus is represented by a sequence of $n$ tokens $d = \{w_1, w_2, \cdots, w_n\}$. Its target summary is represented by a sequence of $m$ tokens $s = \{s_1, s_2, \cdots, s_m\}$, where $m \ll n$. The citation graph $G = \{V, E\}$ of the corpus preserves the citation relationships among all documents, where $V$ is the set of document nodes and $E$ is the set of edges. $G$ can be represented by the adjacency matrix $A$, where $A_{d,d'} = 1$ means there exists a citation link between documents $d$ and $d'$, and $A_{d,d'} = 0$ otherwise. For each source document $d$, the task aims to generate its target summary $s$ based on the source paper and the sub-citation graph $G_d$ containing only the source document and all its neighbours. It is generally considered a conditional sequence-to-sequence [29] learning problem modelling the generation process $p(s \mid d, G_d)$.
In this section, we introduce our proposed citation-aware scientific paper summarization framework (CitationSum), based on graph contrastive learning and the citation graph. CitationSum aims to fully leverage the useful information of references and the weighted citation correlations in the citation graph. Different from previous methods, which used only the abstracts of references [3], we first select the key information from the full contents of the references of each source document and build its weighted citation graph to capture the varying semantic correlation between the source paper and its references. To further enable deep information fusion between the source paper and its references, we build a hierarchical heterogeneous graph based on the weighted citation graph and the document contents, capturing the semantic correlations between the source paper, its references, and their tokens. As shown in Figure 1, we encode the source document and its references with a PLM-based encoder to yield token-level representations. We use a pooling strategy to further generate document representations for the source paper and its references from the token representations. Self-supervised graph contrastive learning is designed to align the representations of the document and its references at both the document and token level, through the hierarchical heterogeneous graph. This allows the model to integrate the salient information of references into the source paper according to their varying semantic similarity, improving summary generation for the source paper. Finally, the citation-aware token representations of the source paper are fed into the decoder to generate the summary.

Input Representation
Contents Selection. As shown in Table 3, sentences that are relevant to the source paper are found in the full contents of references. Unlike previous methods considering only abstracts [3], we first aim to identify the most useful contents of references to make better use of them. Similar to oracle summary selection, for each source document we use a greedy selection algorithm [19, 24] to extract the top sentences from the full contents of its references. This process maximises the ROUGE score against the target summary of the source document (generally its abstract). The heuristic iteratively selects one sentence to add to the summary until the ROUGE score cannot be improved by adding any further sentence. Note that for the test set we use the introduction of the source document instead of the target summary for content selection, since the target summary of the source document must be unseen during testing. As shown in Table 3, the abstract and introduction of the source paper have comparable semantic similarity with references, demonstrating the feasibility of extracting key contents using introductions. We also narrow down the sub-graph by only considering the most relevant neighbour nodes from the full citation graph $G_d$ for each document $d$, following the previous method [3]. To reduce the computational cost of processing all references and encoding the full graph, we sample the sub-graph with the top neighbour references from the full citation graph $G_d$ for each document $d$. Unlike An et al. [3], who extracted neighbours based on their hidden representations, we select the references that have the maximum semantic similarity (measured by the ROUGE score) with the source document from among the neighbour references.
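The greedy selection above can be sketched as follows. This is a minimal illustration, not the paper's code: `rouge1_f` is a simplified unigram-overlap stand-in for the real ROUGE-1 metric, and all names and the `max_sentences` cap are assumptions.

```python
from collections import Counter

def rouge1_f(candidate_tokens, reference_tokens):
    """Simplified unigram-overlap F1, standing in for full ROUGE-1."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def greedy_select(sentences, target_summary, max_sentences=5):
    """Iteratively add the sentence that most improves the score against the
    target; stop as soon as no remaining sentence improves it."""
    target = target_summary.split()
    selected, selected_tokens, best = [], [], 0.0
    while len(selected) < max_sentences:
        gains = [(rouge1_f(selected_tokens + s.split(), target), i)
                 for i, s in enumerate(sentences) if i not in selected]
        if not gains:
            break
        score, i = max(gains)
        if score <= best:  # no improvement possible: stop
            break
        best = score
        selected.append(i)
        selected_tokens += sentences[i].split()
    return [sentences[i] for i in sorted(selected)]
```

In the full method the same routine would run over every reference's sentences, with the source abstract (or, at test time, the introduction) as the target.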
Input Encoder. Given the source document $d$ as the sequence of tokens $\{w_1, w_2, \cdots, w_n\}$, and its references $d' \in N_d$ in the sub-citation graph $G_d$, we first convert each token into the sum of its token embedding, position embedding, and segmentation embedding. We then yield contextual representations of the tokens with the pre-trained language model. Finally, we aggregate the token representations with a pooling layer followed by a feed-forward network (FFN) to obtain the contextual representation of the source document. Under the same process, we yield the contextual representations of its references $h_{d'}$ ($d' \in N_d$), where $N_d$ is the set of neighbours of $d$ in the sub-citation graph. Note that the source paper and its references are input into the PLM individually.
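A minimal sketch of the pooling step described above, assuming mean pooling over token representations followed by a single feed-forward layer; the exact pooling strategy, activation, and layer shapes are assumptions for illustration.

```python
import numpy as np

def pool_document(token_reprs, ffn_weight, ffn_bias):
    """Mean-pool token representations (n_tokens, hidden) into one vector,
    then apply a feed-forward layer to yield the document representation.
    Mean pooling is an illustrative choice, not necessarily the paper's."""
    pooled = token_reprs.mean(axis=0)            # (hidden,)
    return np.tanh(pooled @ ffn_weight + ffn_bias)  # (out_dim,)
```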

Citation-aware Graph Contrastive Learning
We leverage information from references and the correlation structure in the citation graph to guide summary generation by proposing the self-supervised citation-aware graph contrastive learning framework.This enables a rich information fusion between source documents and their references.

Hierarchical Graph Construction.
For each source document, we first build a hierarchical heterogeneous graph with multi-granularity nodes.It consists of a two-level graph organized hierarchically: the weighted citation graph to capture the citation correlations between the source document and its references, and the document graph to model the correlation between documents and tokens.
Weighted Citation Graph Construction. For each source document $d$ and all its selected k-hop neighbour references in its sub-citation graph, we build weighted edges according to their semantic similarity. We also randomly select documents that have no citation relation with the source document as negative nodes in the graph. The weight of an edge between the source document and a neighbour is the mean of the ROUGE-1 and ROUGE-2 scores between the abstract of the source document and the key contents of the neighbour; the weight of an edge between two neighbours $i$ and $j$ is the mean of the ROUGE-1 and ROUGE-2 scores between their highlight contents. Edges with weights $A'_{i,j}$ less than a threshold $\tau$ are deleted to avoid introducing noise.
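The thresholded edge construction can be sketched as below; `similarity` stands in for the mean of ROUGE-1 and ROUGE-2 described above, and all names are illustrative.

```python
def build_weighted_edges(similarity, docs, source_id, neighbour_ids, tau=0.7):
    """Build weighted citation edges from a source document to its neighbours.
    `similarity(a, b)` stands in for the mean of ROUGE-1 and ROUGE-2; edges
    whose weight falls below the threshold tau are dropped to avoid noise."""
    edges = {}
    for j in neighbour_ids:
        w = similarity(docs[source_id], docs[j])
        if w >= tau:
            edges[(source_id, j)] = w
    return edges
```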
Bipartite Document Graph Construction. We build a bipartite graph for each document $d$ (both source documents and references), whose nodes are documents and tokens, and whose edges are occurrences of tokens in the document. We also randomly select tokens from other documents as negative token nodes in the graph. An edge connects a document node and a token node if and only if the token occurs in the document.
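The document-token occurrence edges can be sketched as follows (illustrative only; negative token nodes sampled from other documents are omitted here):

```python
def build_bipartite_edges(doc_tokens):
    """Document-token occurrence graph: connect each document node to every
    distinct token appearing in it. doc_tokens maps doc id -> token list."""
    return {(doc_id, tok)
            for doc_id, tokens in doc_tokens.items()
            for tok in set(tokens)}
```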

Graph Contrastive Learning.

Document Representation Alignment. Based on the weighted citation graph, we design a document representation alignment (DRA) loss, in which $\hat{A}' = I - D^{-1/2} A' D^{-1/2}$ is the normalized graph Laplacian of $A'$, $D$ is the degree matrix of $A'$, $N_D$ is the number of documents, and $h_i$, $h_j$ are document contextual representations from the PLM encoder. It encourages the representations of documents (including the source and its references) and their neighbours to be closer, while pulling apart the representations of documents that have no citation correlation. This enables explicit information fusion between source documents and the key contents of their references according to their semantic correlations in the citation graph, yielding citation-aware document representations for source documents and their neighbours.
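A numeric sketch of the alignment objective: symmetrically normalise the weighted citation adjacency and use it to weight an InfoNCE-style term over pairwise document similarities. This is a stand-in consistent with the description, not the paper's exact loss.

```python
import numpy as np

def dra_loss(H, A_weighted):
    """H: (n_docs, hidden) document representations.
    A_weighted: (n_docs, n_docs) weighted citation adjacency.
    Pull connected documents together under a softmax over dot products;
    unconnected pairs receive no positive weight and are implicitly pushed
    apart by the softmax normalisation."""
    deg = A_weighted.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    mask = deg > 0
    d_inv_sqrt[mask] = deg[mask] ** -0.5
    A_hat = d_inv_sqrt[:, None] * A_weighted * d_inv_sqrt[None, :]
    sims = H @ H.T                                   # pairwise dot products
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -(A_hat * log_softmax).sum() / max(A_hat.sum(), 1e-8)
```

As expected of an alignment loss, it is lower when cited pairs have similar representations than when they do not.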
Token Representation Alignment. To further propagate information between the token representations of source documents and their neighbours, we design document graph-guided contrastive learning to align token representations with the citation-aware document representations. We design a token representation alignment (TRA) loss based on the bipartite graph, in which $\hat{B} = I - D^{-1/2} B_d D^{-1/2}$ is the normalized graph Laplacian of the bipartite adjacency matrix $B_d$, and $h_d$, $h_w$ are the document and token representations from the PLM encoder. It pushes the citation-aware representation of a document and the representations of tokens closer if those tokens appear in the document, and pulls them apart otherwise.
The alignment makes the information of references be propagated into token representations of source documents, and also token representations of references be grounded by representations of source documents.

Graph Contrastive as Matrix Factorizing
In this section, we aim to provide a theoretical understanding of the information captured by the hierarchical graph contrastive learning process above. We find that the hierarchical graph contrastive learning based on the weighted citation graph and the document graph can be reformulated as implicit matrix factorization that reconstructs them.
For Equation 4, we derive its upper bound, where $N_T$ is the number of tokens (including positive and negative tokens). Since $-\hat{B}_{d,w}$ takes the same value for all positive tokens $w$, it can be moved out of the log function and collapsed into a constant, where $|d| = f_d$ and $|w|$ are the degrees of the document node $d$ and the token node $w$.
To minimize $L_{TRA}$, we can instead optimize its upper bound. Following previous methods [15, 23], the upper bound can be approximated with negative sampling, where $f_{d,w}$ is the appearance frequency of token $w$ in document $d$, $N_d^+$ is the set of positive tokens appearing in document $d$, $k$ is the number of sampled negative tokens, $\sigma(x) = \frac{1}{1+e^{-x}}$, and $N$ is the set of tokens in the corpus. To simplify the analysis, we assume the negative tokens are sampled from the empirical unigram distribution; the expectation term in Equation 6 can then be rewritten in terms of $f_w$, the corpus frequency of a positive token $w$, and $f_{w'}$, the corpus frequency of a negative token $w'$.
To optimize Equation 6, we let $x = h_d \cdot h_w$ and take the partial derivative with respect to $x$ after explicitly writing out the expectation term via Equation 7. Setting the partial derivative to zero yields the maximum point. We can see that the optimized $h_d$, $h_w$ reconstruct the shifted log appearance frequency of token $w$ in document $d$, regularized by the inverse document frequency (IDF) term $\frac{f_{d,w}}{f_w}$. For Equation 3, we derive its upper bound analogously, where $N_V$ is the number of nodes in the citation graph (including positive and negative nodes). According to Jensen's inequality [9], we further rewrite Equation 10 and approximate the upper bound with negative sampling. The result has a similar formulation to Equation 6, and thus yields a similar optimum, where $|N_d^+|$ is the number of neighbours of document $d$. We find that optimizing the $L_{DRA}$ loss implicitly factorizes the shifted log weighted citation matrix $A'$.
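Under the SGNS-as-matrix-factorization view that this derivation follows, the optimum described above can be written as follows. This is a hedged reconstruction (the extracted text drops the exact equation); $f_{d,w}$, $f_w$, and $k$ are as defined above, and $|N|$ denotes the total token count of the corpus.

```latex
% Reconstructed optimal point for the token-level objective:
% a shifted log frequency, regularized by the IDF-like ratio f_{d,w}/f_w.
h_d^{\top} h_w \;=\; \log\!\left(\frac{f_{d,w}\,|N|}{f_w}\right) \;-\; \log k
```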
The implicit reconstruction of the log citation graph $A'$ is similar to the graph auto-encoder [14], which learns representations of source documents efficiently by capturing the topological structure of the citation graph. Reconstructing the log appearance frequency of each token is similar to document topic modeling [4, 35, 37], and thus captures the global context semantics of documents. This analysis helps explain why and how the designed hierarchical graph contrastive learning improves the representations of source documents, from the perspective of reconstructing document contents and their correlation structure.

Decoder
We use a standard transformer-based [30] decoder, similar to previous methods [19, 28]. We feed the token representations of document $d$ and its references, along with the previously generated tokens $y_{<t}$, to get the current output, where $\hat{h}$ is the set of token representations of document $d$ and its references after graph contrastive learning, LN denotes layer normalization, and $h_t$ is the hidden representation of the current token in the decoder. The final training loss combines the negative conditional log-likelihood of the target tokens $y_t$ with the two contrastive losses, $L = L_{NLL} + \alpha L_{DRA} + \beta L_{TRA}$, where $\alpha$, $\beta$ are parameters controlling the effect of graph contrastive learning.
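The final objective described above can be sketched as a simple weighted sum; the function name is illustrative, and the default weights of 1 follow the implementation details below.

```python
def total_loss(nll, dra, tra, alpha=1.0, beta=1.0):
    """Final training objective: negative log-likelihood of the target
    tokens plus the two graph-contrastive alignment terms, weighted by
    alpha and beta."""
    return nll + alpha * dra + beta * tra
```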
Implementation Details. Our method is implemented in Python and PyTorch. We use the implementations of BERT, BART, and PubMedBERT from Huggingface. We run our experiments on multiple GPUs with 32G memory. Following previous methods [19, 34], we set different learning rates for the PLM-based encoder and the transformer-based decoder. We set the learning rate of the encoder to 2e-3 and that of the decoder to 2e-1, the dropout rate of the decoder to 0.4, the number of training steps to 200,000, the warm-up steps of the encoder to 20,000 and of the decoder to 10,000, the maximum token number of input documents to 1,240, and that of each reference of input documents to 100. The parameters $\alpha$, $\beta$ controlling the graph contrastive losses are set to 1. We save a checkpoint every 200 steps and select the best checkpoint according to the validation set. Due to memory limitations, we set the maximum number of neighbours of input documents to 16. For each document, we take tokens from its references as the negative tokens. During content selection on the test set, we use the introduction of the source document to select contents from its references. Neighbour references of the test set may come from the training data in the transductive setting, while the dataset is split into totally independent train/validation/test sets in the inductive setting.

Results Analysis
Main Results. We first show the ROUGE F1 scores of different methods on both datasets in Table 5. We investigate both BERT [11] and PubMedBERT [8] as the encoder in our method. Our method with the PubMedBERT-based encoder (CitationSum + PubMedBERT) presents the best performance among all baselines on both datasets when evaluating R-1 (ROUGE-1) and R-2 (ROUGE-2) for informativeness and R-L (ROUGE-L) for fluency. Compared with CGSum, which also incorporates citation graphs to enhance summary generation, it performs better in both the inductive and transductive settings. Different from CGSum, which uses the abstracts of references, our method incorporates high-quality information from the full contents of references under the guidance of their semantic similarity with the source papers. This proves the advantage of our method in making better and fuller use of references and the structural information of the citation graph. It is also demonstrated by the superior performance of our method with the BERT encoder compared with all BERT-based abstractive methods, including BERTSUMABS+, BERTSUMABS, and BERTSUMABS + Concat Nbr. Summ. This shows that it is essential to capture the salient information of references and the varying semantic correlations between scientific papers and their references. Moreover, although CGSum is based on an LSTM backbone, it outperforms pre-trained language model-based methods such as BERTSUMABS and CitationSum + BART. This indicates that the performance improvement of our method is not attributed to using a pre-trained language model, but to better leveraging the information of the citation graph.
Although BART has been shown to outperform BERT in general text summarization, we surprisingly find that CitationSum + BART has no advantage over CitationSum + BERT for domain-specific scientific papers. We also observe that CitationSum + BERT underperforms CitationSum + PubMedBERT on all datasets, and CGSUM on PubMedCite, due to the limited vocabulary of BERT. Since documents in PubMedCite contain many terminologies, the dataset is more challenging for BERT-based methods than SSN. During experiments, we find that with BERT's original tokenizer, many tokens in PubMedCite are unrecognized and mapped to [UNK], which causes substantial information loss and leads to many meaningless [UNK] tokens in the generated summaries. In contrast, on SSN our method with the BERT encoder outperforms CGSUM and produces no [UNK] tokens, and CitationSum + PubMedBERT yields only limited improvement over CitationSum + BERT. This confirms that PubMedCite, with its highly technical domain-specific papers, is more challenging than SSN. Moreover, both our models and CGSUM outperform all models that ignore the information of references, including BERT- and Transformer-based extractive and abstractive methods, as well as the traditional methods TextRank and PTGEN+COV. We also notice that simply appending content from reference papers (Concat Nbr. Summ) to baseline methods brings limited benefit or even degrades performance. For example, BERTSUMABS + Concat Nbr. Summ, which takes extra input tokens from the abstracts of references, slightly underperforms BERTSUMABS+. This may be because the abstracts of references cannot provide useful information or even introduce extra noise, as mentioned in Section 1. It indicates that although leveraging references is beneficial for better understanding scientific papers, it is vital to distinguish salient from non-salient information in references.
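The [UNK] issue above stems from WordPiece tokenization: a word that cannot be fully covered by subwords in the vocabulary is replaced wholesale by [UNK]. A toy illustration of the greedy longest-match-first algorithm (the two-entry vocabulary is invented for this sketch; BERT's real vocabulary has roughly 30K entries):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization (simplified)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no subword matches: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens
```

With a general-domain vocabulary, rare biomedical terms frequently fall through to the [UNK] branch, which is exactly the information loss observed on PubMedCite.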

Ablation Study
To further clarify the contribution of each component in our method, we run experiments on several ablations that respectively remove contrastive learning (W/O Contra), document representation alignment (W/O DRA), token representation alignment (W/O TRA), and the concatenation of token representations of references (W/O Concate). As shown in Table 6, our method suffers the least from removing concatenation, indicating the efficacy of our hierarchical graph contrastive learning in delivering salient information from references when encoding source documents. This is also demonstrated by the lowest ROUGE score of W/O Contra: without our design of contrastive learning, the model struggles to find useful signals in the concatenated reference content. Moreover, the drops caused by W/O DRA and W/O TRA respectively manifest the importance of inter-document and intra-document connections in our design. In the absence of either, information propagation in the heterogeneous graph is blocked, inhibiting the model's ability to fully understand the source paper.

Parameter Impacts
We further explore the influence of the threshold parameter that controls edge weights on performance. It is used to filter edges between the source document and its references in the citation graph according to their semantic similarity, as described in Section 4.2.1. From Table 7, we can see that our model yields the best performance when the threshold is 0.7. A value below 0.7 incorporates references that have low semantic similarity with the source document, and thus can introduce extra noise. In contrast, a value above 0.7 can filter out references that carry useful information.
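The thresholding step described above can be sketched as follows; the function name and edge format are illustrative, not the paper's actual implementation:

```python
def filter_citation_edges(edges, threshold=0.7):
    """Keep only (source, reference, similarity) edges whose semantic
    similarity meets the threshold; the similarity serves as the edge weight."""
    return [(src, ref, sim) for src, ref, sim in edges if sim >= threshold]
```

Lower thresholds admit weakly related references (noise), while higher ones discard useful ones, matching the trade-off observed in Table 7.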

Case Study
In Appendix A, Table 8, we show the summary generated by our method for an example document along with its reference. Tokens that are semantically correlated with the example document are marked in blue. The generated summary is coherent and highly semantically related to its gold summary. Compared with the abstract of the reference, the selected high-quality content of the reference provides more useful information, as indicated by a higher mean ROUGE score with the example document; it also contains a larger number of related tokens marked in blue. By fully exploiting the hierarchical connections among documents and tokens, our method effectively recognizes salient information in references and uses it to generate concise summaries containing many relevant tokens (marked in blue).

CONCLUSION
We propose a novel self-supervised framework for the summarization of scientific papers based on the citation graph and contrastive learning, making better use of references and citation correlations. We also introduce a novel biomedical domain-specific dataset, with which we expect to support evaluation and method development in the research community. Experimental results on two benchmark datasets show the effectiveness of our proposed method. Several limitations could be addressed in the future: 1) For the high-quality content selection from references, we only use the ROUGE score as the semantic similarity and evaluation metric. More advanced metrics such as BERTScore could be explored.
2) The PLM encoder in our model only encodes a limited number of input tokens, while the average document length is up to four thousand words. We expect to address this issue in the future. 3) We only utilize the document contents of source documents and their references. Scientific documents contain rich structural information, such as the title, introduction, and related work, that could be further incorporated.

Table 3 :
Mean ROUGE-1 F1 and ROUGE-2 F1 scores [17] between different contents of source papers and their references. Gold and Introduction denote the gold summary (abstract) and the introduction of the source paper. For each part of the references, we select the top-7 sentences with a greedy search algorithm to calculate the ROUGE score with the source paper.
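The greedy top-k sentence selection used for Table 3 can be sketched as below, using a simplified unigram-F1 score in place of full ROUGE (the exact scoring in the paper may differ):

```python
from collections import Counter

def greedy_select(sentences, source, k=7):
    """Greedily pick up to k sentences that maximize a unigram-overlap
    F1 score with the source text (a simplified stand-in for ROUGE)."""
    def f1(cand_tokens, ref_counts):
        cand = Counter(cand_tokens)
        overlap = sum((cand & ref_counts).values())
        if overlap == 0:
            return 0.0
        p = overlap / sum(cand.values())
        r = overlap / sum(ref_counts.values())
        return 2 * p * r / (p + r)

    ref_counts = Counter(source.lower().split())
    selected, best_score = [], 0.0
    remaining = list(sentences)
    while remaining and len(selected) < k:
        scored = [(f1(" ".join(selected + [s]).lower().split(), ref_counts), s)
                  for s in remaining]
        score, sent = max(scored)
        if score <= best_score:  # stop when no sentence improves the score
            break
        selected.append(sent)
        remaining.remove(sent)
        best_score = score
    return selected
```

The algorithm stops early when adding any remaining sentence would not improve the score, so fewer than k sentences may be returned.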

Table 4 :
The statistics of the datasets. "Sum words" and "Sum sent" denote the average number of words and sentences in the summaries.
We have both transductive and inductive settings in PubMedCite. For SSN, we use its original train/validation/test splits: 128,299/6,250/6,250 for the inductive setting and 128,400/6,123/6,276 for the transductive setting. In contrast, we keep the same train/validation/test split of 178,100/7,036/7,608 in PubMedCite for both settings. This allows a fairer comparison of the influence of the training setting on performance, since only the citation correlations differ between the inductive and transductive settings. Following previous work, we use the abstracts of the scientific documents as target summaries.

Table 5 :
ROUGE F1 results of different models on SSN and PubMedCite. The results of our model are averaged over 5 runs. † means significantly outperforming the best existing model (p < 0.05). Part of the results are from [3].

Table 7 :
ROUGE F1 results of our model under the inductive setting with different values of the threshold that controls the weights of edges in the citation graph.