Towards Summarizing Code Snippets Using Pre-Trained Transformers

When comprehending code, a helping hand may come from the natural language comments documenting it that, unfortunately, are not always there. To support developers in such a scenario, several techniques have been presented to automatically generate natural language summaries for a given piece of code. Most recent approaches exploit deep learning (DL) to automatically document classes or functions, while little effort has been devoted to more fine-grained documentation (e.g., documenting code snippets or even a single statement). Such a design choice is dictated by the availability of training data: For example, in the case of Java, it is easy to create datasets composed of <method, Javadoc> pairs that can be fed to DL models to teach them how to summarize a method. Such a comment-to-code linking is instead non-trivial when it comes to inner comments documenting a few statements. In this work, we take all the steps needed to train a DL model to document code snippets. First, we manually built a dataset featuring 6.6k comments that have been (i) classified based on their type (e.g., code summary, TODO), and (ii) linked to the code statements they document. Second, we used such a dataset to train a multi-task DL model, taking as input a comment and being able to (i) classify whether it represents a "code summary" or not and (ii) link it to the code statements it documents. Our model identifies code summaries with 84% accuracy and is able to link them to the documented lines of code with recall and precision higher than 80%. Third, we ran this model on 10k projects, identifying and linking code summaries to the documented code. This unlocked the possibility of building a large-scale dataset of documented code snippets that has then been used to train a new DL model able to document code snippets. A comparison with state-of-the-art baselines shows the superiority of the proposed approach.


INTRODUCTION
Empirical studies showed that code comprehension can take up to 70% of developers' time [42,68]. While code comments can support developers in such a process [12], their availability [52] and consistency with the documented code [13,14,35] cannot be taken for granted. A helping hand may come from tools proposed in the literature to automatically document code [3,5,16,19,22,26,30,39,43,48,50,53,66,67,70]. The most recent techniques (e.g., [19,22,30]) train deep learning (DL) models with the aim of learning how to summarize a given piece of code in natural language. This requires the building of a large-scale dataset composed of <code, description> pairs that can be used to feed the model with code instances, asking it to generate their description. These approaches are usually trained to work at function-level granularity. This means that, in the case of Java, methods are mined from open source projects and linked to the first sentence of their Javadoc, which is assumed to represent a plausible code summary.
Having such a granularity could be, however, suboptimal to support comprehension activities. Indeed, while the overall goal of a method might be clear to a developer, they may not understand a specific set of statements in it. Also, looking at the datasets used in the literature to train these models, we found that the methods' descriptions extracted from the Javadoc are usually very short. For example, the seminal dataset by LeClair and McMillan [32] features an average of 7.6 words (median=8.0) to summarize each Java method. While such short descriptions could provide a grasp of the overall goal of the method, it is unlikely that they can actually support a developer struggling to understand it.
For this reason, a few attempts have been made to automatically summarize code snippets rather than entire functions [3,24,48,53,60,66,67]. Most of them are based on information retrieval [3,48,66,67], meaning that, given a code snippet to document, the most similar snippet is identified in a previously built dataset and its comments are reused to summarize the given snippet. These approaches, while valuable, rely on manually crafted heuristics to automatically identify the "scope of an inner comment", i.e., the statements that a given comment documents. For example, one may assume that an //inline comment in Java documents all following statements until a blank line is found [8]. As we will show, such a heuristic fails in several cases. Other techniques [53,60] exploit pre-defined templates to document code snippets that, however, cannot generalize to all combinations of code statements one could find.
Given the limitations of previous work, Huang et al. [24] proposed an approach exploiting reinforcement learning to document code snippets. The first challenge they faced was the creation of a training dataset. Indeed, while it is relatively easy to collect <code, comment> pairs when working at function-level granularity, this is not the case for code snippets. For this reason, Huang et al. exploited an approach proposed by Chen et al. [8] to automatically detect the scope of code comments. The approach exploits a combination of heuristics and learning-based techniques to automatically identify, given a comment, the set of statements documented by it. Using this approach, Huang et al. [24] built a dataset of ∼124k <code, comment> pairs which has been used to train RL-BlockCom, a DL model combining reinforcement learning with a classic encoder-decoder model. RL-BlockCom is able, given a code snippet as input, to automatically document it, reaching a BLEU-4 [44] of 24.28. While being the first DL-based approach to support code snippets' summarization, RL-BlockCom suffers from some major limitations, mostly related to the way in which its training/test sets have been built exploiting the approach in [8]:
1. Simplified/unrealistic linking of code comment to the documented snippet [8]. This is due to some of the design choices made in the scope detection approach [8]. For example, the authors "regard the first out-of-scope statement as the demarcation point of the scope of the comment". This means that, according to their approach, it is not possible for a code comment to document non-contiguous statements. As we will show, our manual validation of 6,645 instances reveals 1,598 (∼27%) cases of code comments that document non-contiguous statements. These are all cases which cannot be successfully supported by the scope detection approach and, as a consequence, by RL-BlockCom.
2. Lack of filters to identify code summaries [8]. Chen et al. correctly observed that not all comments "describe" code statements. Thus, they use heuristics to remove commented out code, TODO comments, IDE-generated comments, and non-text comments containing dates or links. Despite these filters, using such an approach to create a training dataset for a snippet summarization approach such as RL-BlockCom means feeding it with comments which may not be an actual code summary of the documented snippet. For example, when manually looking at the previously mentioned 6,645 instances, we found 33% of them to just act as a logical split of source code (i.e., a "formatting" comment [46]) without providing additional information on the documented code (e.g., a comment //get messages put on top of a method call getMessages()). These comments are useless to train a code summarizer, but are not excluded from the RL-BlockCom training dataset.
3. The training dataset used in RL-BlockCom includes code summaries as short as two words [24]. These are unlikely to be code summaries useful to support program comprehension.
To address these limitations, in this work we take all steps needed to foster the research on snippets summarization, as depicted in Fig. 1. First (step 1 in Fig. 1), we manually built a dataset of 6,645 <code, comment> pairs, in which we classified each code comment as being a code summary or not and linked it to the documented Java statements. Such a dataset has been built by ensuring two evaluators for each analyzed comment, with a third one resolving conflicts when needed. The overall effort spent by the six involved authors accounts for 815 man-hours.
We use this dataset to fine-tune SALOON (step 2 in Fig. 1), a multi-task pre-trained Text-to-Text-Transfer-Transformer (T5) [47] model able to take as input an inner comment in a method and (i) classify whether it represents a valid code summary with an 83% accuracy; and (ii) link it to the relevant code snippets it documents with a recall/precision higher than 80%. We show that the performance of SALOON is significantly better than that of the comment-to-code linking approach by Chen et al. [8].
Finally (step 3 in Fig. 1), we run SALOON on 10k GitHub Java projects to automatically build a large-scale dataset of ∼554k <code, comment> pairs. The latter has been used to train and test STUNT, a DL-based approach taking as input a code snippet and automatically generating its code summary. We show that STUNT performs better than both IR-based baselines and the RL-based RL-BlockCom.
Despite this finding, our results also show that STUNT is not yet ready to be deployed to developers and point to more research being needed on the task of snippet summarization.
In summary, our contributions are: (i) the largest manually built dataset in the literature featuring classified and linked code comments; (ii) SALOON, a multi-task DL model able to achieve state-of-the-art performance in the tasks of comment classification and linking; and (iii) STUNT, a code snippet summarization model trained on a large-scale and more realistic dataset as compared to the one used in the literature [24]. The dataset and all code used to train and test the models in this paper are available in our replication package [7].

BUILDING A DATASET OF DOCUMENTED CODE SNIPPETS
We detail the process used to build a manually validated dataset featuring triplets <C, {CC}, DC>, where C represents a natural language comment documenting the code snippet DC (Documented Code) and {CC} represents the Comment Category (e.g., code summary, TODO comment), with more than one category possibly being assigned to the comment. We later use such a dataset to train and evaluate the model described in Section 3, taking as input a comment C and automatically (i) classifying it, thus being able to check whether C is a code summary (i.e., an actual description of the documented code) or another type of comment (e.g., TODOs), and (ii) linking C to the corresponding documented code DC.

Study Design
As a first step to build our dataset we needed to collect the set of code comments C_1, C_2, ..., C_n to manually analyze. To collect these comments, we used the web application by Dabic et al. [11] to query GitHub for all Java projects having at least 500 commits, 25 contributors, 10 stars, and not being forks. These filters aim at discarding personal/toy projects and reducing the chance of mining duplicated code. The focus on Java was dictated by the will of accommodating the expertise of the manual validators (i.e., the authors), all having extensive knowledge of the Java programming language. Despite the focus on Java, our methodology to build the dataset as well as to train the models described in the subsequent sections is general and can be reproduced for different languages.
We randomly cloned 100 of the 1,681 projects resulting from our search on GitHub, for a total of ∼768k Java files.
We parsed their code to identify comments within each method to manually analyze. We ignored Javadoc comments since they document entire methods rather than code snippets: We only considered single-line (starting with "//") and multi-line (starting with "/*") comments as subject of our manual analysis. Also, we did not extract comments from test methods (i.e., methods annotated with @Test) to increase the cohesiveness of our dataset and only focus on documentation related to production code. The manual analysis has been performed by the six authors (from now on, evaluators) through a web app we developed to support the process.
We targeted the labeling of valid comments (i.e., excluding those removed by the above-described procedure) within 1,500 Java files, with the idea of creating a dataset of ∼10k triplets (<C, {CC}, DC>). The web app assigned each Java file to two evaluators who independently labeled the comments in it. If the number of comments in a file was higher than 10, the web app randomly selected a number of comments to label going from 10 to n, where n was the actual number of valid comments in the file. Otherwise all comments in the file were labeled. We opted for this process to avoid an evaluator being stuck for too much time on a single file. Also, we did not consider comments belonging to methods longer than 1,024 tokens and made sure no duplicated methods were present in the final dataset (i.e., the same method might be present across different files/projects). The filter on the method length was driven by the final usage we envision for our dataset, namely training DL models, which usually work on inputs of limited size (≤512 tokens, or even less, see e.g., [19,36,37,54-56]). Thus, labeling instances longer than 1,024 tokens would have been a waste of resources.
The goal of the labeling was to first assign the comment C to one or more comment categories (CC). The starting set of categories to use was taken from the work by Pascarella et al. [46] and included: summary, rationale, deprecation, usage, exception, TODO, incomplete, commented code, formatter, and pointer. We do not describe these categories due to the lack of space, pointing the reader to [46] for a complete description. However, as concrete examples, summary represents the classic code description explaining what the code is about, formatter is a comment used by developers to better organize the code into logical sections, while pointer refers to comments linking external resources. We excluded from the original list by Pascarella et al. [46] the following categories: (i) directive and autogenerated since, as described by the authors, they both concern comments automatically generated by the IDE; and (ii) license and ownership, since this information is usually featured in Javadoc comments.
Finally, we merged the expand category into summary, since the former is defined by the authors as a code description providing more information than a usual summary. Such a distinction is irrelevant for our work. Besides the set of predefined categories, we also gave the possibility to evaluators to define new categories. If an evaluator defined a new category, it was immediately visible to all other evaluators. The following additional categories have been defined by us: orphan, indicating a code comment not linked to any line of code, and code example, indicating a comment describing e.g., how to invoke a specific method.
Once the category for a given comment under analysis was defined, the next step was the linking of the comment to the documented code (DC). The linking has been performed at line-level granularity. This means, for example, that for a comment C the evaluator could indicate lines 11, 12, and 17 as documented. Note that gaps are possible in DC (i.e., the documented code could be composed of non-contiguous lines). Our replication package [7] shows concrete examples of this scenario, which we omit here due to space limitations. Then, we started resolving the conflicts that arose from the manual analysis. Two types of conflicts are possible for each manually defined triplet <C, {CC}, DC>: The two evaluators could have (i) selected a different set {CC} when classifying the comment; and (ii) identified different sets of lines (DC) documented by the comment. Out of the 6,645 manually labeled comments, 1,395 (21%) resulted in a conflict: 1,144 were due to different comment categories selected by the evaluators; 47 to differences in the selected DC; 204 concerned both the categories and the DC. Conflicts were solved by a third evaluator not involved in the labeling of the conflicting instance.
Overall, we spent 815 man-hours on the labeling and conflict resolution, manually annotating 6,645 comments (with two evaluators for each of them) coming from 1,508 Java files and 85 software projects. We labeled a bit more than the target 1,500 files since multiple evaluators were working in parallel without noticing that we hit our target. The obtained dataset, publicly available in our replication package [7], is briefly described in the following. Table 1 summarizes the dataset obtained as output of our analysis. We excluded from the table the categories for which we did not find any instance (e.g., exception [46], likely to be more prevalent in Javadoc comments). Since a single comment can be associated with multiple categories (e.g., summary and rationale), the sum of the "#Instances" column does not add up to the total number of comments we manually classified (i.e., 6,645).

Dataset
Besides reporting the categories to which the comments in our dataset belong, Table 1 also shows descriptive statistics related to the number of statements documented by comments belonging to different categories. As expected, orphan and commented code comments are not linked to any code statement. More than 80% of TODO comments are also not linked to any statement, since in many cases TODOs are related to e.g., features that must be implemented. Similarly, the only two incomplete comments we found are both not linked to any code: These are partially written comments needing rework.
The most frequent category is, as expected, the summary one (3,841 instances), grouping comments summarizing one or more code statements (on average, 3.40 statements). Another popular category is "formatting", with 2,209 instances.
While one could expect no code linked to formatting comments, this is actually not the case since we used such a category also for comments not adding new information to the documented code but just acting as a logical split of the code (e.g., a comment //get messages put on top of a method call getMessages()).
Finally, comments explaining the rationale for implementation choices account for 983 instances. While we focus on the generation of code summaries, these instances often contain interesting information that is hard to automatically synthesize and could represent a seed for future research.
Interestingly, 1,598 of the comments in our dataset (∼27%) include "gaps" in the linked code. This means, for example, that a comment documents lines 11, 12, and 17 (but not lines 13-16); see [7] for concrete examples. Approaches to automatically link comment and code must therefore take such a scenario into account. Motivated by these insights, we fill this gap by creating a novel method for classifying and linking code comments, as described in Section 3.

AUTOMATIC CLASSIFICATION OF CODE COMMENTS AND LINKAGE TO DOCUMENTED CODE
We start by presenting SALOON (claSsification And Linking Of cOmmeNts), the approach we devised for the classification of code comments and their linking to the documented code (Section 3.1). Then, we discuss the design of the study we ran to assess its accuracy (Section 3.2) and the achieved results (Section 3.3). Once trained, SALOON can be run on hundreds of projects to build a large-scale dataset featuring classified and linked code comments. While we could just refer to SALOON as a "T5 model trained for comment classification and linking", we preferred to name it to simplify the reading when we introduce the other T5 model we train for the task of code summarization (Section 4).

Approach Description
SALOON is built on top of T5, a DL transformer-based model [47]. T5 has been presented by Raffel et al. [47] as a model that can be trained to support any Natural Language Processing (NLP) task that can be represented in a text-to-text format, meaning that both the input and the output of the model are text strings. Such a representation is well-suited for code-related tasks, as demonstrated by the recent literature (see e.g., [38,56,62]).
Raffel et al. [47] reported state-of-the-art results for several NLP benchmarks, especially when leveraging the "pretrain-then-finetune" paradigm: The model is first pre-trained on a large dataset with the goal of learning patterns about the underlying language of interest (e.g., Java). Then, it is fine-tuned to learn a specific task of interest (e.g., code summarization). The pre-training is performed using self-supervised pre-training objectives such as the masked language model.
The idea is to provide the model with input sentences (e.g., Java methods) in which a percentage of randomly selected tokens has been masked, with the model in charge of guessing them. This prepares the model's weights for the fine-tuning in which tailored datasets are used to teach the model the specific task to support (e.g., pairs of code and comments). The pre-training phase is particularly important when the dataset used for the fine-tuning is expensive to build (i.e., it requires manual validation) and, as a consequence, is limited in size. This is the case for our work, since our fine-tuning is performed on the dataset described in Section 2, in which comments have been categorized and linked to the relevant statements.
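As an illustration of the masking objective just described, the following minimal sketch hides a fraction of the tokens of an instance and builds the text the model is asked to produce. It is a simplification and not the authors' implementation: the function name `mask_tokens`, the whitespace tokenization, and the `<extra_id_k>` sentinel names are assumptions made for the example, and the span merging performed by the actual T5 objective is omitted.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens behind sentinel placeholders.

    Returns (masked_input, target): the target lists each sentinel followed
    by the token it hides, so the model must "guess" the masked tokens.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked, target = list(tokens), []
    for k, pos in enumerate(positions):
        sentinel = f"<extra_id_{k}>"  # hypothetical sentinel naming
        target += [sentinel, tokens[pos]]
        masked[pos] = sentinel
    return masked, target

# A Java method split on whitespace (a stand-in for the real tokenizer).
method = "public int add ( int a , int b ) { return a + b ; }".split()
masked, target = mask_tokens(method)
print(" ".join(masked))
print(" ".join(target))
```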
In SALOON, we exploit the T5 small architecture described by Raffel et al. [47]. Due to space constraints, we point the reader to the original paper for all architectural details. We describe how we built the pre-training and fine-tuning datasets for the tasks of comment classification and linking.
3.1.1 Pre-training Dataset. We start from the Java CodeSearchNet dataset [25], which features ∼1.6M Java methods, ∼499k of which include a Javadoc. Given the tasks we aim at supporting (i.e., automatic classification of code comments and linking to the code they document), there are two "target languages" we aim to expose to T5 during pre-training: Java code and technical natural language in the form of code comments. CodeSearchNet features both of them. We preprocess the dataset by discarding all instances having #tokens > 1,024. During pre-training we treat Java methods and Javadoc comments as separate instances (i.e., we ignore their association), thus removing Java methods and Javadoc comments longer than 1,024 tokens. Such a filter removed ∼32k instances (i.e., 31,702 methods and 178 Javadoc comments). Then, we excluded instances containing non-ASCII characters as well as Javadoc comments composed of fewer than 5 tokens (words), since they are unlikely to represent meaningful code descriptions (∼57k instances removed). After removing duplicates, we end up with 1,870,888 pre-training instances (1,501,013 Java methods and 369,875 Javadoc comments).
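The filtering steps of Section 3.1.1 can be sketched as follows. This is an approximation under stated assumptions rather than the pipeline used in the paper: the function names are hypothetical, tokens are counted by whitespace splitting (the actual tokenizer may differ), and duplicates are detected by exact string equality.

```python
def is_ascii(text):
    """True when the text contains only ASCII characters."""
    return all(ord(ch) < 128 for ch in text)

def filter_pretraining(methods, javadocs, max_tokens=1024, min_doc_words=5):
    """Apply the length, ASCII, short-Javadoc, and duplicate filters in turn."""
    kept, seen = [], set()
    for kind, items in (("method", methods), ("javadoc", javadocs)):
        for text in items:
            n_tokens = len(text.split())  # whitespace tokens: an approximation
            if n_tokens > max_tokens or not is_ascii(text):
                continue  # too long, or contains non-ASCII characters
            if kind == "javadoc" and n_tokens < min_doc_words:
                continue  # too short to be a meaningful description
            if text in seen:
                continue  # exact duplicate of an instance already kept
            seen.add(text)
            kept.append((kind, text))
    return kept

methods = ["int f() { return 1; }", "int f() { return 1; }"]
javadocs = ["Returns one.", "Returns the constant value one always."]
print(filter_pretraining(methods, javadocs))
```

Methods and Javadoc comments are kept as separate instances, mirroring the choice of ignoring their association during pre-training.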
3.1.2 Fine-tuning Dataset. Two fine-tuning datasets are needed to support the tasks we target (i.e., comment classification and linking). For comment classification, we built a dataset composed of triplets ⟨C_i, M_j, CC_i⟩, in which a specific inner comment C_i within a method M_j is linked to a category CC_i classifying it (e.g., code summary). For comment-to-code linking, we built a dataset featuring triplets ⟨C_i, M_j, DC⟩, in which DC reports the statements of M_j documented by C_i. Both datasets have been extracted from the manually built dataset of 6,645 classified and linked comments (Section 2).
Comment classification. Given the goal of our work (i.e., summarizing code snippets), we are interested in automatically identifying comments we classified as code summary while excluding all the others. Starting from the dataset in Table 1, we extracted 3,841 instances having the category "code summary" and 2,921 having the category "other". Basically, we target the training of a binary classifier taking as input a code comment in the context of the method it belongs to and guessing whether it is a code summary or not.
The specific input we provide to T5 is the code of the method with special tokens <comment></comment> surrounding the comment of interest (this is the representation of the comment in the context of its method), and we expect as output either "code summary" or "other" (i.e., the comment category).
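A minimal sketch of how such an input could be assembled is shown below. The helper name `mark_comment` is hypothetical, and the sketch assumes a single-line comment identified by its line index; how multi-line comments and whitespace are serialized in the actual dataset is not specified here.

```python
def mark_comment(method_lines, comment_idx):
    """Wrap the comment of interest in <comment></comment> tags."""
    out = []
    for i, line in enumerate(method_lines):
        if i == comment_idx:
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}<comment>{line.strip()}</comment>"
        out.append(line)
    return "\n".join(out)

method = [
    "void refresh() {",
    "    // get messages",
    "    getMessages();",
    "}",
]
model_input = mark_comment(method, comment_idx=1)
expected_output = "other"  # or "code summary", depending on the manual label
print(model_input)
```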
Differently from the pre-training dataset, we did not need to remove sequences longer than 1,024 tokens, since this has already been done in the first place during the building of the dataset described in Section 2. We randomly split the dataset into 80% training, 10% evaluation, and 10% test. The first row in Table 2 shows the number of instances in these three sets.
Code Linking. Concerning the task of linking comments to code snippets, our training instances are only those comments that we manually labelled as code summary. Indeed, we are interested in linking this specific type of comments to their code. Thus, we start from the 3,841 code summary instances to build the needed linking instances. The input representation is similar to the one previously discussed for the comment classification dataset (i.e., the method with special tags surrounding the inner comment of interest), with the only difference being a special tag <N> preceding each statement and reporting its line number in an incremental fashion.
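The numbered representation, together with a target expressed as a stream of <N> tags (e.g., <1><2><4>), can be sketched as follows. The helper names are hypothetical and several details are assumptions made for the example: blank lines and the comment itself are not numbered, and lines are joined with single spaces.

```python
import re

def number_statements(method_lines, comment_idx):
    """Prefix each non-blank, non-comment line with an incremental <N> tag."""
    out, n = [], 0
    for i, line in enumerate(method_lines):
        stripped = line.strip()
        if i == comment_idx:
            out.append(f"<comment>{stripped}</comment>")
        elif stripped:
            n += 1
            out.append(f"<{n}>{stripped}")
    return " ".join(out)

def parse_links(output):
    """Turn a target such as '<1><2><4>' into the list [1, 2, 4]."""
    return [int(n) for n in re.findall(r"<(\d+)>", output)]

method = [
    "void f() {",
    "  // load and print",
    "  List<String> m = load();",
    "  log(m);",
    "",
    "  print(m);",
    "}",
]
print(number_statements(method, comment_idx=1))
print(parse_links("<2><4>"))  # non-contiguous statements can be expressed
```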
As for the expected output (i.e., the documented code), it is represented as a stream of "<N>" tags representing the line numbers (i.e., statements) within the method that are linked to the comment (e.g., <1><2><4>). Such a representation allows marking non-contiguous statements documented by the comment. The code linking fine-tuning dataset is composed of 3,841 instances split into 80% training, 10% evaluation, and 10% test, as shown in the second row of Table 2. Note that to ensure a fair evaluation of the proposed approach, we split the dataset by taking into consideration the Java class from which these methods were originally extracted.
3.1.3 Training Procedure and Hyperparameters Tuning. We evaluated the performance of eight T5 models (four pre-trained and four non pre-trained) on the evaluation set of each task in terms of correct predictions, namely cases in which the generated output (i.e., the comment category or the documented statements) was identical to the expected output. We pre-train the T5 model from scratch (i.e., starting from random weights) rather than starting from already pre-trained models for code such as CodeT5 [63], which is based on the same architecture proposed by Raffel et al. [47] that we exploit in our investigation. Our decision is primarily motivated by the desire to have a model pre-trained on a single programming language (Java) as opposed to a multi-language model (as CodeT5).
We pre-train T5 for 300k steps using a 2x2 TPU topology (8 cores) from Google Colab with a batch size of 16. During pre-training, we randomly mask 15% of tokens in an instance (i.e., Java method or Javadoc comment), asking the model to guess the masked tokens. To avoid over-fitting, we monitored the loss function every 10k steps and stopped the training if such a value did not improve after 12 consecutive evaluations (i.e., after 120k steps, one epoch on our pre-training dataset). We use the canonical T5 small configuration [47] during pre-training. We also used the pre-training dataset to train a SentencePiece model (i.e., a tokenizer for neural text processing) with vocabulary size set to 32k word pieces.
We fine-tuned a pre-trained and a non pre-trained model experimenting with four different learning rate schedulers (thus leading to eight overall trained models).
The four schedulers are: Constant Learning Rate (C-LR), in which the learning rate is fixed during the whole training; Inverse Square Root Learning Rate (ISR-LR), in which the learning rate decays as the inverse square root of the training step; Slanted Triangular Learning Rate (ST-LR), in which the learning rate first linearly increases and then linearly decays to the starting value; and Polynomial Decay Learning Rate (PD-LR), in which the learning rate decays polynomially from an initial value to an ending value in the given decay steps. The parameters used for the learning rates are available in [7].
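For readers unfamiliar with these schedulers, the sketch below gives one common formulation of each as a function of the training step. All default values (e.g., `warmup_steps`, `peak_lr`, `cut_frac`, `ratio`, `power`) are placeholders chosen for illustration; the values actually used are those reported in the replication package [7], and the exact formulas adopted there may differ.

```python
def constant_lr(step, lr=1e-3):
    """C-LR: the same learning rate at every step."""
    return lr

def inverse_sqrt_lr(step, warmup_steps=10_000):
    """ISR-LR: constant during warm-up, then decays as 1/sqrt(step)."""
    return 1.0 / max(step, warmup_steps) ** 0.5

def slanted_triangular_lr(step, total_steps, peak_lr=1e-2, cut_frac=0.1, ratio=32):
    """ST-LR: linear increase up to the cut point, then linear decay
    back to the starting value (peak_lr / ratio)."""
    cut = int(total_steps * cut_frac)
    if step < cut:
        p = step / cut
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))
    return peak_lr * (1 + p * (ratio - 1)) / ratio

def polynomial_decay_lr(step, decay_steps, start_lr=1e-2, end_lr=1e-5, power=1.0):
    """PD-LR: polynomial decay from start_lr to end_lr over decay_steps."""
    step = min(step, decay_steps)
    return (start_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr

for step in (0, 7_500, 40_000, 75_000):
    print(step, inverse_sqrt_lr(step), slanted_triangular_lr(step, 75_000),
          polynomial_decay_lr(step, 75_000))
```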
We fine-tuned each of the eight models for a total of 75k steps on the fine-tuning training set of each task. We include in our replication package [7] a table showing the percentage of correct predictions (for the comment classification task), and the precision and recall (for the code linking task) achieved by each of the pre-trained and non pre-trained models on the evaluation sets.
Overall, the pre-trained models work substantially better, especially when it comes to the code linking task. In particular, in their respective best configuration, pre-trained models achieve (i) a 75% classification accuracy in the comment classification task as compared to the 58% of the non pre-trained models; and (ii) 85% precision and 89% recall in the code linking task, as compared to the 53% precision and 67% recall of the non pre-trained models. Such a result is expected considering that the fine-tuning training datasets are quite small due to the substantial manual effort required to build them (∼6.7k instances for comment classification and ∼3.8k for code linking). Having small fine-tuning datasets is the scenario in which pre-training is known to bring major benefits [49]. As for the learning rate, the best results are achieved with ISR-LR when pre-training and with PD-LR when not pre-training.
To obtain the final model to use in SALOON, we fine-tuned the best performing model (i.e., pre-trained with ISR-LR) using an early-stopping strategy in which we evaluated the model on the evaluation sets every 5k steps, stopping when no improvements were observed for 5 consecutive evaluations. We discuss the results achieved by SALOON as compared to other baselines in Section 3.3.
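The early-stopping rule can be sketched as follows. The function is a hypothetical helper that replays a list of evaluation scores (higher is better) rather than training a model; which checkpoint is finally retained is an assumption of the sketch, which returns the step of the best evaluation.

```python
def early_stopping_step(eval_scores, eval_every=5_000, patience=5):
    """Return (stop_step, best_step) for scores measured every eval_every steps.

    Training stops once `patience` consecutive evaluations fail to improve
    on the best score observed so far.
    """
    best_score, best_idx = float("-inf"), -1
    for idx, score in enumerate(eval_scores):
        if score > best_score:
            best_score, best_idx = score, idx
        elif idx - best_idx >= patience:
            return (idx + 1) * eval_every, (best_idx + 1) * eval_every
    return len(eval_scores) * eval_every, (best_idx + 1) * eval_every

# Made-up scores, for illustration only.
scores = [0.70, 0.74, 0.75, 0.75, 0.74, 0.73, 0.75, 0.74, 0.72]
print(early_stopping_step(scores))
```

With `eval_every=10_000` and `patience=12` the same helper mirrors the rule used to monitor the pre-training loss, provided the loss is negated so that higher values are better.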

Study Design
The goal of the study is to assess the accuracy of SALOON in the two tasks it has been trained for: comment classification and code linking. The context is represented by the test sets reported in Table 2, featuring 1,203 instances for the task of comment classification and 633 for the task of code linking.
Concerning the comment classification task, we do not compare SALOON against any baseline, since our goal (i.e., identifying only code summaries) is quite specific to our work. Instead, we compare the performance of SALOON against the three following baselines for the task of code linking (the implementation of all baselines is publicly available [7]).
Heuristic-1: blank line [8]. The first baseline is a straightforward heuristic assuming that a given //inline comment documents all following statements until a blank line is reached.
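A minimal sketch of this heuristic is given below; the function name is hypothetical and lines are identified by their 0-based index in the method.

```python
def blank_line_scope(method_lines, comment_idx):
    """Link the comment to every following line up to the first blank line."""
    linked = []
    for i in range(comment_idx + 1, len(method_lines)):
        if not method_lines[i].strip():
            break  # a blank line closes the scope of the comment
        linked.append(i)
    return linked

method = [
    "// read and parse the configuration",
    "a = load();",
    "b = parse(a);",
    "",
    "use(b);",
]
print(blank_line_scope(method, comment_idx=0))
```

By construction, the returned lines are always contiguous, so a comment documenting, e.g., lines 11, 12, and 17 can never be linked correctly.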
Heuristic-2: token-based string similarity [13]. The basic idea of this heuristic is that statements sharing terms with a code comment are more likely to be documented by it. We use the token-based string similarity by Fluri et al. [13] to compute the textual similarity between each comment in the test set and all statements in the method it belongs to. A statement is linked to the comment if its similarity with it is higher than or equal to a threshold t. The similarity is computed as the percentage of overlapping terms between the two strings (i.e., comment and statement), with the terms being extracted through space splitting. We experiment with different values for t, going from 0.1 (i.e., 10% of terms are shared between the two strings) to 0.9 at steps of 0.1.
Random forest [8]. The approach by Chen et al. [8] relies on the random forest machine learning algorithm to classify statements in a method as linked or not to a given comment. Unfortunately, the source code of such an approach is not available and, thus, we had to reimplement it following the description in the corresponding article. In a nutshell, the approach works as follows. The random forest uses three families of features to characterize a given statement and classify it as linked or not to a given comment. The first family comprises eight "code features", capturing characteristics of the statement, such as the statement type (e.g., if, for) and whether the statement shares method calls with the statements preceding and following it. The second family includes four "comment features", focusing on characteristics of the comment of interest, such as its length and the number of verbs/nouns it contains. Finally, the third family groups four "relationship features", representing the relationship between the comment and the statement (e.g., textual similarity). For a fair comparison, we train the random forest on the same training set used for SALOON.
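The token-based similarity heuristic (Heuristic-2) can be sketched as follows. The text does not fix the denominator of the overlap percentage, so the sketch assumes it is the number of distinct terms in the comment; the lowercasing and the function names are also assumptions, and the actual definition by Fluri et al. [13] may differ.

```python
def term_overlap(comment, statement):
    """Share of the comment's space-separated terms also found in the statement."""
    c_terms = set(comment.lower().split())
    s_terms = set(statement.lower().split())
    if not c_terms:
        return 0.0
    return len(c_terms & s_terms) / len(c_terms)

def similarity_links(comment, statements, threshold=0.5):
    """Indices of the statements whose similarity reaches the threshold."""
    return [i for i, s in enumerate(statements)
            if term_overlap(comment, s) >= threshold]

statements = ["messages = get messages list", "int x = 0", "list messages"]
for t in (0.1, 0.5, 0.9):
    print(t, similarity_links("get messages list", statements, threshold=t))
```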

Data Collection And Analysis.
Concerning the comment classification task, we run SALOON on the test set and report the accuracy of the model in classifying comments representing "code summaries". As for the code linking, we start by computing the percentage of correct predictions, namely cases in which all statements linked to a comment in the test set match the ones in the oracle. This means that a comment instance correctly linked to two out of the three statements it documents is considered wrong. We also compute the recall and precision of the techniques at statement level. The recall is computed as TP/(TP+FN), where TP represents the set of code-to-comment links correctly identified by a technique (i.e., a statement correctly linked to a comment) and FN represents the set of correct code-to-comment links in the oracle missed by the approach. The precision is instead computed as TP/(TP+FP), with FP representing the code-to-comment links wrongly reported by the approach (i.e., statements wrongly identified as linked to the comment). We also statistically compare the techniques assuming a 95% confidence level. We compare precision and recall using the Wilcoxon signed-rank test [65]. To control for multiple pairwise comparisons (e.g., SALOON's precision compared with that of the three baselines), we adjust p-values with Holm's correction [20].
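The three measures used for the code linking task can be sketched as follows for a single comment. The function name `linking_metrics` and the representation of links as sets of statement indices are assumptions of this example; computing recall and precision per comment, rather than over all links at once, is also an assumption.

```python
def linking_metrics(predicted, oracle):
    """Statement-level recall and precision, plus the exact-match flag."""
    predicted, oracle = set(predicted), set(oracle)
    tp = len(predicted & oracle)   # links correctly identified
    fp = len(predicted - oracle)   # links wrongly reported
    fn = len(oracle - predicted)   # correct links that were missed
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    correct_prediction = predicted == oracle
    return recall, precision, correct_prediction


# A comment documenting statements 4, 5 and 6, of which only two are found.
recall, precision, correct = linking_metrics(predicted=[4, 5], oracle=[4, 5, 6])
```

The example mirrors the case discussed above: two of three statements are linked, so precision is perfect and recall is 2/3, yet the instance does not count as a correct prediction.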
We estimate the magnitude of the differences using the Cliff's Delta (d), a non-parametric effect size measure [15]. We follow well-established guidelines to interpret the effect size: negligible for |d| < 0.10, small for 0.10 ≤ |d| < 0.33, medium for 0.33 ≤ |d| < 0.474, and large for |d| ≥ 0.474 [15]. As for the percentage of correct predictions, we pairwise compare them among the experimented techniques using McNemar's test [41], which is a proportion test suitable to pairwise compare dichotomous results of two different treatments. We complement McNemar's test with the Odds Ratio (OR) effect size. Also in this case we use Holm's correction procedure [20] to account for multiple comparisons.
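The effect size and its interpretation can be sketched as follows. This is an illustrative, quadratic-time computation of Cliff's Delta using the thresholds listed above; the function names and the sample values are assumptions of the example and do not come from the study.

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    lower = sum(1 for x in xs for y in ys if x < y)
    return (greater - lower) / (len(xs) * len(ys))


def magnitude(d):
    """Interpret |d| with the thresholds reported in the text."""
    d = abs(d)
    if d < 0.10:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


# Hypothetical per-comment precision values for two techniques.
technique_a = [0.9, 0.8, 1.0]
technique_b = [0.5, 0.8, 0.6]
d = cliffs_delta(technique_a, technique_b)
label = magnitude(d)
```

With these values eight of the nine pairs favor the first technique and one is a tie, so d = 8/9 and the effect is labeled as large.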

Results Discussion
As for the comment classification task, SALOON correctly classifies 78.05% (939/1,203) of instances. Out of the 633 code summary comments present in the test set, 536 (84%) have been correctly classified, while 97 have been mistakenly reported as other.
Concerning the 570 "other" comments, SALOON correctly predicted 403 (70%) of them, wrongly reporting 167 instances as code summary. This results in a recall of 0.85 and a precision of 0.76 when identifying a comment as a code summary. This means that by running our approach on the comments of a previously unseen software system, we can expect to identify 85% of the code summaries present in it, accompanied, however, by 24% of false positives (i.e., non code summary comments).
Concerning the code linking task, Table 3 reports the correct predictions (i.e., for a given comment in our test set all linked statements have been correctly identified), recall, and precision achieved by SALOON and the three baselines. Table 4 reports the results of the statistical tests. For the Cliff's Delta (d) we use N, S, M, and L to indicate its magnitude, from Negligible to Large.
Note that for the token-based string similarity baseline we report the results achieved with different values of the minimum similarity threshold required to link a code statement to a comment. While we also experimented with values going up to 0.9 [7], the resulting recall values were too close to 0 to consider these variants as reasonable baselines.
SALOON predicts all statements linked to a given comment in 58% of cases, against the 23% achieved by the best-performing baseline (ML-based). The blank-line technique achieves 20% of correct predictions.
The results of the statistical tests confirm the better performance ensured by SALOON in terms of correct predictions: McNemar's test always indicates significant differences, accompanied by ORs indicating that SALOON has between 15.80 and 70.80 times higher odds of providing a correct prediction as compared to the baselines.
Recall and precision values confirm the superiority of SALOON for the code linking task. In terms of recall, SALOON is able to correctly link 89% of statements in our dataset, achieving the best performance among all the experimented techniques. While the blank-line approach achieves a similar recall (87%), it pays a much higher price in terms of precision, with a 43% false positive rate as compared to the 14% of SALOON. Note that a high recall for this heuristic is expected, considering that it links all statements following a comment until a blank line is found. The ML-based technique can only predict half of the correct links (recall of 0.49) while achieving a precision of 0.58. According to our results, the token-based similarity heuristic does not represent a viable solution for the code linking task: the best results are achieved when using 0.1 as the threshold, for which the technique ensures a recall of 0.62 and a precision of 0.33. Differences in terms of recall and precision are always statistically significant (see Table 4). The effect size is in most cases medium or large, with the only exception of the recall test comparing SALOON (i.e., our T5 model) with the blank-line baseline, for which a negligible effect size is reported.
To summarize, SALOON is able to identify comments representing code summaries with a recall of 0.85 and a precision of 0.76. Also, it achieves state-of-the-art results in linking comments to the documented code, with a recall of 0.89 and a precision of 0.86. In Section 4 we explain how we exploit this model to build a large-scale dataset aimed at training a T5 fine-tuned for the task of code snippet summarization.
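As a worked check, the comment classification figures summarized above can be re-derived from the raw counts reported for the test set (536 and 97 for the code summaries, 403 and 167 for the "other" comments). The variable names below are chosen for the example only.

```python
# Raw counts reported for the comment classification task.
tp, fn = 536, 97    # code summaries classified correctly / as "other"
tn, fp = 403, 167   # "other" comments classified correctly / as code summary

accuracy = (tp + tn) / (tp + fn + tn + fp)   # 939 / 1,203
recall = tp / (tp + fn)                      # 536 / 633
precision = tp / (tp + fp)                   # 536 / 703
```

Rounded to two decimals, these expressions give the recall of 0.85 and the precision of 0.76 reported for the identification of code summaries.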

SNIPPETS SUMMARIZATION USING T5
We discuss how we trained a T5 model for the task of code snippet summarization (Section 4.1), the study we ran to evaluate it (Section 4.2), and the achieved results (Section 4.3). We refer to the snippet summarization approach as "STUNT" (SnippeT sUmmarizatioN using T5).

Approach Description
We rely on the same T5 architecture described in Section 3.1 and we reuse the same pre-trained model we built for the comment classification and code linking tasks. Indeed, as explained in Section 3.1.1, we pre-trained the model on a dataset composed of ∼1.5M Java methods and their inner comments and ∼370k Javadoc comments. Thus, T5 has been pre-trained to acquire knowledge about the two "target languages" relevant for the summarization task as well (i.e., Java code and the technical language used to summarize it). In the following, we detail the fine-tuning dataset and the training procedure.
4.1.1 Fine-tuning Dataset. We used the GHS tool by Dabic et al. [11] to query GitHub for all public non-forked Java projects with at least 50 commits, 5 contributors, and 10 stars. The idea behind these filters was to remove toy/personal projects while still obtaining a large set of projects to provide as input to SALOON, with the goal of identifying comments representing summaries and linking them to the relevant code. We cloned 10k of the 18.7k projects returned by our query and extracted their methods using srcML [10].
We excluded all methods longer than 512 tokens and removed all duplicates, obtaining a set of candidate methods. We also removed from this set any method duplicated in our pre-training dataset or in our manually labeled dataset (Section 2.2).
The removal of duplicates with respect to the pre-training dataset was needed since the set of candidate methods is our starting point to build the fine-tuning dataset for the snippet summarization task, from which we also extract the test set on which STUNT is evaluated. Thus, we ensure that STUNT is not evaluated on already seen instances. The removal of duplicates with respect to the manually labeled dataset is instead due to the fact that SALOON (i.e., our approach for comment classification and linking) has been trained on those instances, and we run it on the candidate methods to build the fine-tuning dataset for STUNT (i.e., for code summarization). Running SALOON on already seen instances would inflate its performance and would not provide a realistic picture of what can be achieved by training STUNT on a dataset automatically built using SALOON.
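The length filter and the two duplicate-removal steps can be sketched as follows. The tokenization (whitespace splitting) and the notion of duplicate (identical token sequence) are assumptions made to keep the example self-contained; the study may rely on a different tokenizer and duplicate definition. The function names are also hypothetical.

```python
def normalize(code):
    """Assumption: two methods are duplicates if their tokens are identical."""
    return " ".join(code.split())


def build_candidate_methods(methods, already_seen, max_tokens=512):
    """Drop long methods, internal duplicates, and methods seen elsewhere.

    `already_seen` stands for the methods of the pre-training and of the
    manually labeled datasets.
    """
    seen = {normalize(code) for code in already_seen}
    kept = []
    for code in methods:
        key = normalize(code)
        if len(key.split()) > max_tokens or key in seen:
            continue
        seen.add(key)
        kept.append(code)
    return kept


mined = [
    "int a() { return 1; }",
    "int a() {  return 1; }",   # duplicate up to whitespace
    "void b() { }",             # already in the labeled dataset
    "void c() { x(); }",
]
candidates = build_candidate_methods(mined, already_seen=["void b() { }"])
```

Keeping a single `seen` set handles both kinds of duplicates in one pass over the mined methods.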
From the remaining methods, we extracted all inner comments, filtering out those shorter than 5 words (unlikely to represent a meaningful code summary). As done in previous code summarization works [30], we lowercased and stemmed the comments (using the spaCy NLP library [2]). Then, for each comment C_i extracted from a method M_j we created an instance M_{j,i} in which M_j's code features the special tokens <comment></comment> surrounding the comment of interest (C_i). This means that if M_j features three inner comments, three M_{j,i} instances will be created, each having a different comment (C_i) "tagged". This format is the one expected by SALOON to automatically (i) classify C_i as code summary or other, and (ii) link C_i to the relevant code statements.
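The creation of one tagged instance per inner comment can be sketched as follows. This is a simplified illustration: it only considers `//` comments occupying a whole line, it skips lowercasing and stemming, and the exact serialization expected by SALOON is an assumption, as is the function name.

```python
def tag_comment_instances(method_lines, min_words=5):
    """Create one copy of the method per inner comment, tagging that comment.

    Assumption: only whole-line `//` comments are considered and comments
    shorter than `min_words` words are discarded.
    """
    instances = []
    for i, line in enumerate(method_lines):
        stripped = line.strip()
        if not stripped.startswith("//"):
            continue
        if len(stripped[2:].split()) < min_words:
            continue
        indent = line[: len(line) - len(line.lstrip())]
        tagged = list(method_lines)
        tagged[i] = f"{indent}<comment>{stripped}</comment>"
        instances.append("\n".join(tagged))
    return instances


method = [
    "int sum(int[] v) {",
    "  // accumulate all values of the array",
    "  int s = 0;",
    "  for (int x : v) s += x;",
    "  // done",
    "  return s;",
    "}",
]
instances = tag_comment_instances(method)
```

Only the first comment survives the length filter here, so a single instance is produced; the short `// done` comment stays in the method body untagged.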
The above-described process resulted in 2,210,602 M_{j,i} instances (i.e., a method M_j with one of its comments C_i tagged) that we provided as input to SALOON, which classified 907,660 of them as code summary. Among these, SALOON automatically linked code statements to the code summaries in ∼85% of cases (776,531). These instances are ⟨M_{j,S}, C_i⟩ pairs, where M_{j,S} represents the method M_j with the special tokens <start><end> surrounding the statements (S) documented by C_i. If multiple non-contiguous statements are documented, multiple <start><end> pairs are injected in M_j. These pairs are those needed to fine-tune STUNT for the task of snippet summarization: the input provided to the model is M_{j,S} (i.e., a snippet to document) and the expected output is the documentation C_i.
To avoid favoring the model during testing, we also removed all duplicates at snippet-level granularity. This means that if our dataset contains two different methods featuring the same S (i.e., the same code snippet to document), we only keep one of them. Also, since SALOON is an automated approach, it is expected to produce wrong instances (e.g., comments linked to wrong statements) which, in turn, will penalize the performance of STUNT. By manually inspecting a sample of the pairs in our dataset, we noticed that one clear case of wrong instances are those in which the model had very low confidence in identifying the documented statements, thus producing random symbols rather than the expected documented line numbers. We automatically removed those instances, obtaining a set of 554,748 pairs, split into 80% training (443,798), 10% evaluation (55,475), and 10% testing (55,475).
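The injection of the <start><end> tokens around the documented statements, including the non-contiguous case, can be sketched as follows. Placing each token on its own line and identifying statements by line index are assumptions of this example; the actual input format of STUNT may differ.

```python
def mark_documented_statements(method_lines, documented):
    """Wrap every maximal run of documented lines in <start> ... <end>."""
    documented = set(documented)
    marked = []
    for i, line in enumerate(method_lines):
        if i in documented and (i - 1) not in documented:
            marked.append("<start>")   # a new documented run begins here
        marked.append(line)
        if i in documented and (i + 1) not in documented:
            marked.append("<end>")     # the current run ends here
    return marked


method = ["a();", "b();", "c();", "d();"]
# The comment documents lines 0-1 and, separately, line 3.
marked = mark_documented_statements(method, documented=[0, 1, 3])
```

Since the documented lines form two separate runs, two <start><end> pairs are injected, one around the first two statements and one around the last.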

Training Procedure and Hyperparameters Tuning. As explained, we started from the already pre-trained T5 model. We then followed the same hyperparameters tuning discussed in Section 3.1.3, assessing the performance of four different learning rate schedulers on the evaluation set using the BLEU-4 score [44] as performance metric.
The BLEU-4 variant computes the BLEU score by considering the overlap of 4-grams between the generated text (i.e., the synthesized snippet summary) and the target text (i.e., the summary written by the original developers). This metric has been used by most of the previous work on code summarization (see e.g., [4, 19, 21-24, 27-29, 31, 57, 59, 61, 64, 69, 71]). Each of the four models has been trained for 100k steps before its evaluation. C-LR (i.e., constant learning rate) provided the best performance. Data about this evaluation are available in our replication package [7].
Once we identified the best T5 variant, we fine-tuned it for up to 500k steps, using an early-stopping strategy to tame over-fitting. To this aim, we monitored the BLEU-4 score achieved on the evaluation set every 5k steps, stopping the training when no improvements were observed after 5 consecutive evaluations.
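The early-stopping policy can be sketched as a plain loop. The actual training steps are omitted and the `evaluate` callback, which stands for the BLEU-4 computation on the evaluation set, is a hypothetical placeholder; the simulated scores are invented for the example.

```python
def train_with_early_stopping(evaluate, max_steps=500_000,
                              eval_every=5_000, patience=5):
    """Stop after `patience` consecutive evaluations without improvement."""
    best_score, best_step, stale = float("-inf"), 0, 0
    for step in range(eval_every, max_steps + 1, eval_every):
        # A real loop would run `eval_every` fine-tuning steps here.
        score = evaluate(step)
        if score > best_score:
            best_score, best_step, stale = score, step, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_step, best_score


evaluated_steps = []

def simulated_bleu(step):
    """Toy score curve peaking at 20k steps (illustration only)."""
    evaluated_steps.append(step)
    return -abs(step - 20_000)

best_step, best_score = train_with_early_stopping(simulated_bleu)
```

In the simulated run the score peaks at 20k steps, and training stops at 45k steps, after five evaluations without improvement.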

Study Design
The goal is to assess the accuracy of STUNT for snippet summarization. The context is represented by (i) the 55,475 ⟨M_{j,S}, C_i⟩ pairs (i.e., a method in which the documented statements are marked, paired with the comment documenting them) identified by SALOON as described in Section 4.1.1 and belonging to the test set, and (ii) the test set made publicly available by Huang et al. [24] when presenting RL-BlockCom, the state-of-the-art snippet summarization approach discussed in Section 1.
We assess the performance of STUNT against an information retrieval (IR)-based technique (i.e., IR-Jaccard) and RL-BlockCom. To explain the basic idea behind the IR-based baseline, let us recall that both our training and test sets are composed of ⟨M_{j,S}, C_i⟩ pairs. Given a pair in the test set, the baseline retrieves from the training set the pair whose snippet S is the most similar to the one in the test set pair. This means that the retrieved pair contains a documented snippet that is very similar to the one in the test set for which we have to generate a code summary. Once the most similar snippet in the training set has been identified, the IR-based technique reuses its description to document the instance in the test set. This baseline serves as a representative of works using IR to retrieve similar comments from a given dataset, including e.g., [67].
IR: Jaccard index [17]. IR-Jaccard identifies the most similar snippet using the Jaccard similarity index. The latter considers the overlap between two sets of unique elements, representing in our case the tokens composing the documented code (S) in the test instance and in each of the training instances. Indeed, we need to compare each instance in the test set to all those in the training set to find the most similar one. The similarity is computed as the percentage of overlapping tokens between the two sets (i.e., the size of their intersection divided by the size of their union).
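The retrieval step of IR-Jaccard can be sketched as follows. Whitespace tokenization, the tie-breaking rule (the first most similar training pair wins), and the toy training pairs are assumptions of this example.

```python
def jaccard(a, b):
    """Jaccard index between two collections of tokens."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ir_jaccard(test_snippet, training_pairs):
    """Reuse the summary of the most similar training snippet.

    `training_pairs` is a list of (snippet, summary) tuples.
    """
    test_tokens = test_snippet.split()
    best = max(training_pairs,
               key=lambda pair: jaccard(test_tokens, pair[0].split()))
    return best[1]


training_pairs = [
    ("int x = 0 ;", "initialize the counter to zero"),
    ("stream . close ( ) ;", "close the stream when done"),
]
summary = ir_jaccard("writer . close ( ) ;", training_pairs)
```

The test snippet shares 5 of 7 distinct tokens with the second training snippet and only 1 of 10 with the first one, so the second summary is reused. Each prediction requires a scan of the whole training set.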
An additional baseline for STUNT is RL-BlockCom by Huang et al. [24]. Despite the code being available, we did not manage to re-train their approach on our dataset. We contacted the authors asking for help without, however, receiving an answer. Thus, as an alternative form of comparison, we thought about training and testing STUNT on their dataset, which is publicly available, and then comparing the summaries generated by STUNT with those generated by RL-BlockCom. Unfortunately, the authors did not make the summaries generated by their approach publicly available. The only viable form of comparison we found was to (i) re-train STUNT on the training dataset made available by Huang et al. [24] and used to train RL-BlockCom; (ii) use this trained version of STUNT to generate predictions on the same test set on which RL-BlockCom has been evaluated; (iii) use the evaluation scripts made available by Huang et al. for the computation of the sentence-level BLEU score; and (iv) compare the achieved results with those reported in their paper. Indeed, not having access to the summaries generated by RL-BlockCom does not allow us to double-check the data reported in the original paper nor to compute additional metrics besides those used by the authors (BLEU). Note also that the training/test datasets shared by Huang et al. feature ⟨S, C_i⟩ pairs (i.e., only the snippet S and its comment C_i) as compared to our ⟨M_{j,S}, C_i⟩ pairs. This means that STUNT cannot exploit the contextual information of the method M_j containing the snippet when generating the predictions on their dataset.

Data Collection And Analysis.
To compare the performance of our model against the IR-based baseline, we exploit the three metrics explained in the following. Out of those, only BLEU has been used in the comparison with RL-BlockCom, for the reasons previously explained.
BLEU [44] assesses the quality of the automatically generated summaries by assigning a score between 0 and 1. In our case, 1 indicates that the automatically generated natural language summary is identical to the one originally written by the developer. Since in the test set we built there are no summaries shorter than 4 words, we use the BLEU-4 variant in the comparison with the IR-based baseline. When comparing with RL-BlockCom on their test set, we also compute BLEU-1, BLEU-2, and BLEU-3, as done by Huang et al. [24].
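To make the role of the 4-grams concrete, the following sketch computes a sentence-level BLEU score as the geometric mean of the modified n-gram precisions (n = 1..4) multiplied by the brevity penalty. It is a simplified, unsmoothed variant written for illustration (the score is 0 whenever one n-gram order has no match) and is not the evaluation script used in the study.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with a single reference (assumption)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: each n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


perfect = bleu("close the input stream", "close the input stream")
partial = bleu("close the input stream now", "close the input stream")
unrelated = bleu("open a file", "close the input stream")
```

The sketch also shows why summaries of at least 4 words matter for BLEU-4: a shorter text contains no 4-gram, so its unsmoothed score would always be 0.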
METEOR [6] is a metric based on the harmonic mean of unigram precision and recall (the recall is weighted higher than the precision). Compared to BLEU, METEOR uses stemming and synonym matching to better match the human perception of sentences with similar meanings. Values range from 0 to 1, with 1 being a perfect match.
ROUGE [34] is a set of metrics focusing on automatic summarization tasks. We use the ROUGE-LCS (Longest Common Subsequence) variant, which identifies the longest co-occurring in-sequence n-grams. ROUGE-LCS returns three values: the recall, computed as LCS(X,Y)/length(X); the precision, computed as LCS(X,Y)/length(Y); and the F-measure, computed as the harmonic mean of recall and precision, where X and Y represent two sequences of tokens.
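The three ROUGE-LCS values can be sketched as follows. Taking X as the reference summary and Y as the generated one, as well as whitespace tokenization, are assumptions of this example.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_lcs(reference, candidate):
    """Recall, precision, and F-measure based on the LCS."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_length(x, y)
    recall = lcs / len(x) if x else 0.0
    precision = lcs / len(y) if y else 0.0
    f_measure = 2 * recall * precision / (recall + precision) if lcs else 0.0
    return recall, precision, f_measure


recall, precision, f_measure = rouge_lcs(
    reference="close the input stream",
    candidate="close the stream now please",
)
```

Here the longest common subsequence is "close the stream" (3 tokens), giving a recall of 3/4, a precision of 3/5, and an F-measure of 2/3.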
We also statistically compare the different approaches assuming a 95% confidence level. Also in this case we use the Wilcoxon signed-rank test [65], adjusting p-values to account for multiple comparisons (Holm's correction procedure [20]), and the Cliff's Delta (d) as effect size measure [15]. The statistical comparison was not possible with RL-BlockCom since we only had access to the overall BLEU scores reported in the paper (i.e., the BLEU scores for each generated summary were not available).
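Holm's correction, used above to adjust the p-values of multiple comparisons, can be sketched in a few lines. The function name and the p-values in the example are hypothetical and only serve to illustrate the procedure.

```python
def holm_adjust(p_values):
    """Holm's step-down adjustment of a list of p-values.

    The i-th smallest p-value (i = 0, 1, ...) is multiplied by (m - i);
    a running maximum keeps the adjusted values monotone, and every
    adjusted value is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, value)
        adjusted[i] = running_max
    return adjusted


# Hypothetical p-values of three pairwise comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

With these inputs the smallest p-value is multiplied by 3 and the second smallest by 2; the largest one inherits the adjusted value of the previous rank, since the adjusted values cannot decrease.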

Results
Table 5 compares STUNT and RL-BlockCom, using the values reported in the paper by Huang et al. [24] as BLEU scores for RL-BlockCom. STUNT achieves better performance for all BLEU scores, outperforming the state-of-the-art approach by a large margin (e.g., +7 points of BLEU-4). A deeper comparison of the two techniques is not possible since the summaries generated by RL-BlockCom are not available.
Table 6 compares STUNT against IR-Jaccard on the large-scale dataset we built. According to all metrics used in our evaluation, the gap in performance between STUNT and the baseline (i.e., IR-Jaccard) is substantial, with at least +11 points in terms of BLEU-4, +12 in terms of ROUGE-LCS F-measure, and +16 in terms of METEOR score. As observed by Roy et al. [51], METEOR is "extremely reliable for differences greater than 2 points" in assessing code summarization quality as perceived by humans (i.e., also humans are likely to prefer STUNT's summaries over those generated by the baseline). The statistical analyses presented in Table 7 validate STUNT's superior performance compared to IR-Jaccard. Notably, we observe significant p-values and medium effect sizes for BLEU-4 and ROUGE-LCS (F-measure), while METEOR demonstrates a large effect size.
While the metrics we computed provide a fair comparison among the experimented techniques, they do not give a clear idea of the quality of the summaries generated by STUNT. To this aim, two of the authors manually inspected 384 randomly selected summaries generated by STUNT for which the generated text was different from the target summary (i.e., the one written by developers). These are cases that in a "binary quantitative evaluation" would be classified as wrong predictions. The authors independently classified each summary as meaningful or not meaningful, based on the ability of the summary to properly describe the documented snippet. In the labeling, the two involved authors achieved a Cohen's kappa [9] of 0.61, indicating a substantial agreement when measuring inter-rater reliability for categorical items.
information related to the "file context" of the method to summarize. They show that such contextual information helps to further boost performance.
Zhang [70] showed that, by combining IR and DL techniques, it is possible to boost the performance of function-level code summarization. Our work focuses on the related but different problem of snippet summarization that, as explained, poses different challenges, especially in the building of the training data.

CONCLUSIONS
We targeted the problem of code snippet summarization, presenting (i) a manually labeled dataset of ∼6.6k code comments classified in terms of information they provide (e.g., code summary) and linked to the code statements they document; (ii) SALOON, a T5 model trained on our manually built dataset to automatically classify and link inner comments in Java code; and (iii) STUNT, a T5 model trained on a large-scale dataset of documented code snippets automatically created by running SALOON on 10k Java projects.
We achieved promising results for both code linking and snippet summarization, pointing, however, to the need for more research in this field. Our dataset and our models, publicly released [7], represent a step in that direction.

Table 1: Dataset output of manual labeling

Table 3: T5 vs baselines on the code linking task

Table 4: Code linking task: SALOON vs baselines