Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization)

Large Language Models (LLM) are a new class of computation engines, "programmed" via prompt engineering. Researchers are still learning how best to "program" these LLMs to help developers. We start with the intuition that developers tend to consciously and unconsciously collect semantic facts from the code while working. Mostly these are shallow, simple facts arising from a quick read. For a function, such facts might include parameter and local variable names, return expressions, simple pre- and post-conditions, and basic control and data flow. One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them implicitly capable of doing this simple level of "code analysis" and extracting such information while processing code: but are they, really? If they aren't, could explicitly adding this information help? Our goal here is to investigate this question, using the code summarization task, and to evaluate whether automatically augmenting an LLM's prompt with explicit semantic facts actually helps. Prior work shows that LLM performance on code summarization benefits from embedding a few code & summary exemplars in the prompt, before the code to be summarized. While summarization performance has steadily progressed since the early days, there is still room for improvement: LLM performance on code summarization still lags its performance on natural-language tasks like translation and text summarization. We find that adding semantic facts to the code in the prompt actually does help! This approach improves performance in several different settings suggested by prior work, including for three different Large Language Models. In most cases, we see improvements, as measured by a range of commonly-used metrics; for the PHP language in the challenging CodeSearchNet dataset, this augmentation actually yields performance surpassing 30 BLEU.
In addition, we have also found that including semantic facts yields a substantial enhancement in LLMs' line completion performance.


INTRODUCTION
Large language models (LLMs) often outperform smaller, custom-trained models on tasks, especially when prompted with a "few-shot" set of exemplars. LLMs are pre-trained on a self-supervised (masking or de-noising) task, using vast amounts of data, and exhibit surprising emergent behaviour as training data and parameter counts are scaled up. They excel at many tasks with few-shot (or even zero-shot) learning: with just a few exemplar input-output pairs inserted first in the prompt, the models can generate very good outputs for a given input! Few-shot learning works so well with LLMs that it is unclear whether sufficient task-specific data can ever be gathered to train a customized model to rival their performance [3,12]. LLMs are ushering in a new era, where prompt engineering (carefully conditioning the input to an LLM to tailor its massive, but generic, capacity to specific tasks) will become a new style of programming, placing new demands on software engineers.
We propose Automatic Semantic Augmentation of Prompts (ASAP), a new method for constructing prompts for software engineering tasks. The ASAP method rests on an analogy: an effective prompt for an LLM, for a task, relates to the facts a developer thinks about when manually performing that task. In other words, we hypothesize that prompting an LLM with the syntactic and semantic facts a developer considers when manually performing a task will improve LLM performance on that task. To realise this hypothesis, ASAP augments prompts with semantic facts automatically extracted from the source code using semantic code analysis.
We illustrate the ASAP methodology first on code summarization. This task takes code, usually a function, and summarizes it using natural language; such summaries can support code understanding, to facilitate requirements traceability and maintenance.
ASAP uses few-shot prompting, because of its effectiveness. ASAP finds relevant shots using BM25, the current state of the art in finding few-shot exemplars that are "semantically close" to the target function [48] (in our case, the function-to-summarize), by querying the LLM's training data. When instantiating ASAP for the summarization task, we equipped it to extract the following semantic facts: the repository name, the fully qualified name of the target function, its signature, the AST tags of its identifiers, and its data flow graph (Section 3.4). These facts are presented to the LLM as separate, labelled fields. The model is then provided with the function-to-summarize, exemplars (along with facts extracted from each), and asked to emit a summary. We confirm our hypothesis that augmenting prompts with semantic facts can improve LLM performance on the code summarization task. We evaluated ASAP's benefits on the high-quality (carefully de-duplicated, multi-project) CodeSearchNet [32] dataset.
In summary, we find that in all cases, our approach of automatic semantic augmentation improves average performance on several commonly-used metrics. For almost all languages, the average improvement comfortably surpasses the 2-BLEU threshold noted by Roy et al. [57], below which BLEU results are unreliable predictors of human preference. For Go, gains are still significant, and just slightly less than 2; for PHP, we see an improvement of 4.6 BLEU, reaching a SOTA high-point of 32.73 on the well-curated, de-duplicated CodeSearchNet dataset.
Our principal contributions follow:
• The ASAP approach for software engineering tasks, using facts derived from code.
• We evaluate ASAP on the code summarization task on the code-davinci-002, text-davinci-003, and GPT-3.5-turbo models, against a few-shot prompting baseline built using vanilla BM25 (Section 4.1).
• We find that the ASAP approach statistically significantly improves LLM performance on the code summarization task. In almost all cases, we observe statistically significant improvements of almost, or in excess of, 2 BLEU; and, for PHP, we break 30 BLEU for the first time (to our knowledge) on this challenging dataset.
• We find that ASAP also leads to improved performance on the code-completion task.
All the data, evaluation scripts, and code needed to reproduce this work will be available at https://doi.org/10.5281/zenodo.7779196, and our results can be reproduced on any available language models. Our experiments suggest that ASAP works well with any language model powerful enough to leverage few-shot prompting.

BACKGROUND & MOTIVATION
Large Language Models (LLM) are a transformative technology: they are essentially a new kind of computation engine, requiring a new form of programming, called prompt engineering. We first contextualise ASAP, our contribution to prompt engineering. We then discuss code summarization as a sample problem to demonstrate ASAP's effectiveness.

Few-shot Learning in Software Engineering
LLMs are now widely used in Software Engineering for many different problems: code generation [14,34], testing [38,42], mutation generation [10], program repair [18,35,36,48], incident management [6], and even code summarization [3]. Clearly, tools built on top of pre-trained LLMs are advancing the state of the art. Beyond their raw performance at many tasks, two key factors govern the growing dominance of pre-trained LLMs, both centered on cost. First, training one's own large model, or even extensively fine-tuning a pre-trained LLM, requires expensive hardware. Second, generating a supervised dataset for many important software engineering tasks is difficult and time-consuming, often beyond the resources of all but the largest organizations.
In contrast to overall LLM trends, some smaller models specialized for code have gained popularity, e.g., Polycoder [67] or Codegen [49]. Despite these counterpoints, we focus on LLMs rather than small models because, while small models can be fine-tuned, they don't do very well at few-shotting, and thus are not helpful when only small amounts of data are available. The few-shot approach is key because it brings into reach many problems, like code summarization, for which collecting sufficient, high-quality, project- or domain-specific training data to train even small models from scratch is challenging.
With few-shot learning, the actual model parameters remain unchanged. Instead, we present a few problem instances along with solutions (i.e., problem-solution pairs as "the exemplars") to a model and ask it to complete the answer for the last instance ("the test input"), for which we do not provide a solution. Thus, with each exemplar e_i consisting of an ⟨input_i, output_i⟩ pair, and just a test input input_t (without the corresponding, desired output_t), the final prompt looks like: e_1 ∥ e_2 ∥ … ∥ e_n ∥ input_t. With this prompt, the LLM generates output_t, mimicking the input-output behavior illustrated by the exemplars in the prompt. In practice, this approach performs quite well.
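As an illustration, the assembly of such a prompt can be sketched as follows; the "Input:"/"Output:" field labels here are our own hypothetical choices, not a prescribed format:

```python
def build_few_shot_prompt(exemplars, test_input):
    """Concatenate <input, output> exemplar pairs, then the test input,
    whose output the LLM is asked to complete."""
    parts = []
    for inp, out in exemplars:
        parts.append(f"Input:\n{inp}\nOutput:\n{out}\n")
    # The prompt ends after "Output:", so the model's completion
    # becomes the answer for the test input.
    parts.append(f"Input:\n{test_input}\nOutput:\n")
    return "\n".join(parts)
```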
When it works, few-shotting allows us to automate even purely manual problems, since generating a few exemplar samples is relatively easy. In this paper, we experiment with the code-davinci-002 model. We discuss models in more detail in Section 3.2.

Prompting LLMs to Reason
Human reasoning involves using evidence, logical thinking, and arguments to make judgments or arrive at conclusions [31,51]. Natural language processing (NLP) researchers have developed approaches to reason about specific scenarios and improve performance. Approaches like "chain of thought" [66] and "step-by-step" [40] require generating intermediate results ("lemmas") and utilizing them in the task at hand. Such approaches appear to work on simpler problems, like school math problems, even without providing the models with "lemmas", because, for these problems, models are powerful enough to generate their own "lemmas"; in some cases just adding "let's think step by step" seems sufficient (Kojima et al. [40]).
We tried an enhanced version of the "step-by-step" prompt, with few-shots, on code summarization: to induce the model to generate steps, and finally a summary, we framed the problem as chain of thought, and included few-shot samples containing both intermediate steps ("lemmas") and final comments. We found that the model underperformed (getting about 20.25 BLEU), lower even than our vanilla BM25 baseline (24.97 BLEU). With a zero-shot, Kojima-style "step by step" prompt, the models perform even worse. The likely reason is that, on (usually challenging) code-related tasks, models need to be explicitly given intermediate "lemmas", derived from code, to be able to reason effectively; most software engineering tasks tend to be more complex and varied than school maths.
Fortunately, mature tools for code analysis are available. We can readily derive "lemmas", viz., analysis products, using code analysis tools, rather than expecting the models to (perhaps implicitly) derive them during on-task performance. We directly embed analysis products into the prompt we give the language model, and evaluate the benefits of such analysis products. The information we derive and add is based on our own intuitions about the kinds of "lemmas" that developers consciously or unconsciously consider as they seek to understand and summarize code.
We find that providing such information improves LLM performance. We remind the reader that most work involving large language models (LLMs) uses some form of prompt engineering to boost performance. In this paper, we show that the ASAP approach, which augments prompts with code analysis products, improves on previous prompting approaches.

Summarizing Code
Well-documented code is much easier to maintain; thus, experienced developers usually add, e.g., function summary headers. However, summary comments may become outdated as projects evolve [11,22]. Automated code summarization is thus a well-motivated task, which has attracted a great deal of attention; and considerable progress (albeit incremental, over many years) has been made. Initially, template-based approaches were popular [17,26,27,56,61]; however, creating a list of templates with good coverage is very challenging. Later, researchers focused on the retrieval-based (IR) approach [17,26,27,56], where existing code (with a summary) is retrieved based on similarity metrics. However, this promising approach only worked if a similar code-comment pair could be found in the available pool.
Meanwhile, the similarity of code summarization to Neural Machine Translation (NMT) (one can think of generating an English summary of code as producing a representation of "the same meaning in a different language") led to research that adapted NMT to code summarization. Numerous studies have been conducted in this area [1,30,33,41]. Some have combined previous approaches, such as template-based and retrieval-based approaches, using neural models [69], and have reported promising results. Such neural methods for NLP have vastly improved, due to the Transformer architectural style.
Until recently, pre-trained language models such as CodeBERT, CodeT5, and CodeT5+ performed best for code summarization.
(Figure 1 caption: a pool of samples is given to the BM25 engine, which matches the given input code against the pool and retrieves the best-matching samples, viz. 3 input+output pairs. These examples are processed by ASAP to produce a prompt including 3 exemplars. Each exemplar includes a function definition, the results of analyzing that definition, and its associated comment; the input code is finally appended, along with its analysis product. Exemplar details are in Figure 2. The final prompt is sent via API call to the GPT-3.x model; the returned output, e.g., a summary, is returned by GPT-3.x.)
However, Large Language Models (LLMs) now typically outperform smaller pre-trained models on many problems. Ahmed & Devanbu [3] report that LLMs can outperform pre-trained language models with a simple prompt consisting of just a few samples from the same project; this work illustrates the promise of careful construction of prompt structures (cf. "prompt engineering"). We present ASAP here as another general principle of prompt engineering. We emphasize, again, that progress in code summarization (and other applications of AI to SE, such as code patching, defect detection, testing, etc.) has been incremental, as in the field of NMT, where practical, usable translation systems took decades to emerge. Thus incremental advances are still needed, and helpful, and we contribute our work to this long-term enterprise.

DATASET & METHODOLOGY
We now discuss our dataset, models, and methodology.

Dataset
Our experiments use the widely used CodeSearchNet [32] dataset; CodeSearchNet was constructed by extracting the first paragraph of each function's prefix documentation, subject to some restrictions (e.g., length). It is a carefully de-duplicated, multi-project dataset, which allows (more demanding) cross-project testing. De-duplication is key: code duplication can deceptively inflate the performance metrics of machine learning models, when compared to de-duplicated datasets [7,46,59].
It is part of the CodeXGLUE [47] benchmark, which comprises 14 datasets for 10 software engineering tasks. Many models have been evaluated on this dataset. CodeSearchNet contains thousands of samples from six different programming languages (i.e., Java, Python, JavaScript, Ruby, Go, PHP). However, we did not use the entire test dataset, which would have been prohibitively expensive and slow using our models' API endpoints; instead, we selected 1000 samples uniformly at random from each language (please see the experimental power discussion in Section 7). Since the original dataset is cross-project and we sampled it uniformly, our subsample includes cross-project data. In addition, we subsetted this dataset for same-project few-shotting, following Ahmed and Devanbu [3]: we sort same-project data by creation date (using git blame). We then use the temporal order to make sure that only temporally earlier samples are used as the few-shot exemplars; this is realistic, since only older, already existing data is available for use. We will delve deeper into this same-project dataset in Section 4.3.
As mentioned earlier, we don't use any parameter-changing training on the model; we just insert a few exemplars, selected from the training subset, into the few-shot prompt. Table 1 summarizes the dataset.

The Models
In earlier work, transformer-based pre-trained language models offered significant gains, in both NLP and software engineering. Pre-trained language models can be divided into three categories: encoder-only, encoder-decoder, and decoder-only models. While encoder-decoder models initially showed success on many tasks, decoder-only LLMs are now more scalable and effective for numerous tasks.
Encoder-Decoder model. BERT is one of the earliest pre-trained language models [15]; it was pre-trained on two self-supervised tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Later, RoBERTa [45] was introduced, with some minor modifications to BERT. Using only MLM training, it outperforms BERT. CodeBERT [21] and GraphCodeBERT [25] introduced these ideas to Software Engineering. Although CodeBERT and GraphCodeBERT are encoder-only models, they can be applied to code summarization after fine-tuning, cascaded to a decoder trained during fine-tuning. Ahmed & Devanbu report that polyglot models, which are fine-tuned with multilingual data, outperform their monolingual counterparts [4]. They also report that identifiers play a critical role in code summarization tasks. PLBART [2], CodeT5 [64], and CodeT5+ [63] also include pre-trained decoders and are reported to work well for code summarization tasks. More recently, very large scale (decoder-only) auto-regressive LLMs (with 175B+ parameters) have been found to be successful at code summarization with few-shot learning, without any explicit training. In the next section, we briefly introduce the three OpenAI models we considered for our experiments.
Decoder-only model. In generative pre-training, the task is to auto-regressively predict the next token given the previous tokens, moving from earlier to later. This unidirectional auto-regressive training prevents the model from pooling information from future tokens. The newer generative models, such as GPT [52], GPT-2 [53] and GPT-3 [12], are also trained in this way, but they have more parameters and are trained on much larger datasets. Current large language models, such as GPT-3, have around (or more than) 175B parameters. These powerful models perform so well with few-shot prompting that interest in task-specific parameter adjustment via fine-tuning has waned.
Codex is a GPT-3 variant, intensively trained on code and natural language comments. The Codex family consists of two versions: Codex-Cushman, which is smaller, with 12B parameters, and Codex-Davinci, the largest, with 175B parameters. The Codex model is widely used for various tasks. Our experiments mostly target the Code-Davinci model, particularly Code-Davinci-002, which excels at translating natural language to code [14] and supports code completion as well as code insertion. Some newer variants, Text-Davinci-003 & GPT-3.5-turbo, are also available; unlike the Codex variants, these models understand and generate both natural language and code. Although optimized for chat, GPT-3.5-turbo also performs well on traditional completion tasks. Text-Davinci-003 is a completion model like Code-Davinci-002. We study how our prompt enhancement works using the Text-Davinci-003 & GPT-3.5-turbo models.

Retrieving Exemplars from Training Data
As noted earlier, few-shot learning works quite well when used with very large models. We prompt the model with a small number of ⟨problem, solution⟩ exemplars, and ask it to solve a new problem. However, carefully selecting exemplars for few-shot learning is helpful. Nashid et al. discovered that retrieval-based exemplar selection is helpful for problems such as assertion generation and program repair [48]. They compared several retrieval methods and found that BM25 works best; following their recommendation, we use the BM25 IR algorithm to select relevant few-shot exemplars from the training set. BM25 [55] is a frequency-based retrieval method which improves upon TF-IDF [54]. We noted a substantial improvement over using the same fixed exemplars for every test input, as detailed in Section 4.1.
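For reference, the standard Okapi BM25 scoring that underlies such retrieval can be sketched as follows; this is a minimal, self-contained illustration of the ranking idea, not the retrieval implementation used in the paper:

```python
import math
from collections import Counter

def bm25_rank(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Rank documents (lists of tokens) against a query with Okapi BM25.
    Returns document indices, best match first."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency for each term.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))

    def idf(t):
        return math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)

    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in set(query_tokens):
            if tf[t] == 0:
                continue
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf(t) * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return sorted(range(N), key=lambda i: -scores[i])
```

In our setting, the "documents" would be the tokenized function bodies in the exemplar pool, and the "query" the function-to-summarize.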

Automatic Semantic Prompt Augmentation
This section presents the three semantic facts we selected to enhance ASAP's prompts, and the ASAP pipeline (see Figure 2). The choice of these facts comes from applying our central hypothesis, viz., that augmenting prompts with what developers think about when working on a task helps, to the code summarization task. ASAP is not tied to any specific semantic facts or static analysis; it can easily incorporate others, as discussed later.
Repository Name & Path. Augmenting prompts with domain-specific information can improve LLM performance on various tasks. Prior work suggests that augmenting prompts with code from the same repository improves performance in code generation tasks [60]. We argue that basic repository-level meta-information, such as the repository name and the complete path to the function, provides additional context. For example, repository names like "tony19/logback−android", "apache/parquet−mr", and "ngageoint/geo−package−android" all connect a function to a specific domain (e.g., android, apache, geo-location), which can enhance the understanding of the target code to be summarized. Figure 2 (yellow part) presents an example of how we enhance the prompt with repository-level information. Similar to the repository name, the path to the function can also contribute to the model.
Tagged Identifiers. Prior work suggests that language models find more value in identifiers than in code structure when generating code summaries [4]. However, identifiers play different roles in code. Local variables, function names, parameters, global variables, etc., play different parts in the functioning of the method in which they occur; a developer reading the code is certainly aware of the roles of identifiers, simply by identifying their scope and use. Thus, augmenting prompts with the specific roles of identifiers could help the model better "understand" the function. We use tree-sitter to traverse the function's AST and gather identifiers, along with their roles. Figure 2 (blue part) presents a sample example showing how we enhanced the prompt of the function with tagged identifiers. Although the model has access to the token sequence of the code, and thus also all the identifiers, presenting them to the model in a tagged form might a) save the model some compute effort, and b) better condition the model's output.
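The paper uses tree-sitter for multi-language AST traversal; as a rough, Python-only illustration of identifier tagging, a version using the standard `ast` module might look like this (the role names are our own):

```python
import ast

def tag_identifiers(source: str):
    """Collect identifiers in Python source with coarse role tags.
    Illustrative only: the paper's tool uses tree-sitter and supports
    several languages; this sketch handles only Python."""
    tree = ast.parse(source)
    tags = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            tags.append(("function", node.name))
            tags.extend(("parameter", a.arg) for a in node.args.args)
        elif isinstance(node, ast.Name):
            # A Name being assigned (Store) is treated as a local variable;
            # a Name being read (Load) as a use.
            role = ("local_variable" if isinstance(node.ctx, ast.Store)
                    else "identifier_use")
            tags.append((role, node.id))
    return tags
```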
Data Flow Graph (DFG). Guo et al. introduced the GraphCodeBERT model, which uses data flow graphs (DFGs) instead of syntactic-level structures like abstract syntax trees (ASTs) in the pre-training stage [25]. GraphCodeBERT outperformed CodeBERT [21] on various software engineering (SE) tasks. We incorporate this DFG information into the few-shot exemplars; we conjecture that this provides the model a better semantic understanding of each exemplar, and of the target example. Figure 2 (orange part) presents a sample showing the Data Flow Graph (DFG) we used for our experiments. Each line contains an identifier with its index, and the indices of the identifiers to which that particular data flows. Unlike the repository information and tagged identifiers, the data flow graph can be very long, making it inconvenient to add the complete data flow to the prompt. In the case of long prompts, we keep only the first 30 lines of the DFG. Beyond simply listing identifiers, the DFG also provides a better understanding of the importance of each identifier in the function.
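In the same illustrative spirit (the paper follows GraphCodeBERT's DFG extraction), a much-simplified, Python-only sketch of "value flows from definition to use" edges over indexed identifier occurrences:

```python
import ast

def data_flow_edges(source: str):
    """Approximate def-to-use data-flow edges, in the spirit of
    GraphCodeBERT's DFG. Simplified sketch: ignores control flow,
    scoping, and attribute/subscript targets."""
    tree = ast.parse(source)
    names = [n for n in ast.walk(tree) if isinstance(n, ast.Name)]
    names.sort(key=lambda n: (n.lineno, n.col_offset))  # source order
    last_def = {}   # variable name -> index of its most recent definition
    edges = []      # (definition_index, use_index)
    for i, n in enumerate(names):
        if isinstance(n.ctx, ast.Store):
            last_def[n.id] = i
        elif n.id in last_def:
            edges.append((last_def[n.id], i))
    nodes = [(i, names[i].id) for i in range(len(names))]
    return nodes, edges
```

Each node pairs an occurrence index with its identifier, matching the "identifier with its index" rendering described above.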
Use Case & Completion Pipeline. ASAP has three components: an LLM, a pool of available exemplars (labeled input-output pairs, e.g., code with comments), and a static analysis tool for deriving facts from code (see Figures 1 and 2).
A configuration file specifies these components. Once configured, a developer invokes ASAP on a function body f (Figure 1), for which an output (e.g., a code summary) is desired. ASAP uses f as a BM25 query over its sample pool to get a result set of exemplar candidates ec_1, ec_2, …, where each ec_i is a pair of the form ⟨input_i, output_i⟩; in our context, input_i is a function definition and output_i is the function's header comment. BM25 chooses the ec_is that match best with the given f. ASAP then applies program analyses to both the input f and the several exemplar inputs input_i, yielding an analysis product ap_f and several ap_is.
Each exemplar e_i (Figure 2) is the triple ⟨input_i, ap_i, output_i⟩, where each triple illustrates, for the LLM, how the input source code input_i relates, via the analysis product ap_i, to the output output_i. The final prompt is then "e_1 ∥ e_2 ∥ e_3 ∥ f ∥ ap_f". ASAP queries an LLM with that prompt, and returns the completion (e.g., a natural language summary).
By default, ASAP is configured with analyses to extract repository info, tag identifiers, and construct DFGs. These analyses are independent, and their outputs are separately labeled in the prompt. For example, Figure 2 shows the output of the DFG analysis in ASAP's constructed prompt. The few-shot examples are augmented and inserted into the prompt: the code, repository info, tagged identifiers, the DFG, and the desired (gold) summary are all included in each few-shot exemplar. The target example includes just the code and its analysis products, and the LLM is prompted to produce the desired output.
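Putting the pieces together, the prompt construction can be sketched as below; the field labels and function names are illustrative, not the authors' exact format:

```python
def render_section(code, repo, tagged_ids, dfg_lines, summary=None, max_dfg=30):
    """Render one prompt section: code, labeled analysis products, and
    (for exemplars) the gold summary. Field labels are illustrative."""
    lines = ["Code:", code,
             f"Repository: {repo}",
             "Tagged identifiers: " + ", ".join(f"{r}:{n}" for r, n in tagged_ids),
             "Data flow:"] + dfg_lines[:max_dfg]   # truncate long DFGs
    # The target section ends at "Summary:", which the LLM completes.
    lines += ["Summary:", summary] if summary is not None else ["Summary:"]
    return "\n".join(lines)

def build_asap_prompt(exemplars, target):
    """exemplars: dicts including a gold 'summary'; target: dict without one."""
    sections = [render_section(**e) for e in exemplars]
    sections.append(render_section(**target))
    return "\n\n".join(sections)
```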
In prior work using "chain of thought" [66] or "step by step" [40] reasoning, no such information is given to the model; instead, the prompt simply helps it organize its reasoning about the sample into a sequence of instructions. Here, rather than having the model do its own reasoning, we shape its reasoning externally by using simple program analyses, since we can get very precise information from very efficient analysis tools. Each few-shot example includes source code, derived information, and a conclusion (the summary), thus providing exemplary "chains of thought" for the model to implicitly use when generating the desired target summary. Figure 1 presents the overall pipeline of our approach, which we apply to each sample. The BM25 engine matches input code against a sample pool, ASAP processes the resulting examples to create a prompt, and the final prompt is sent to the GPT-3.x model via API, yielding a summary as output.
Next, we describe how we evaluate this pipeline.

Metrics
BLEU [50] is the most widely-used similarity-based measure for code summarization [57] and commit log generation [16]. BLEU counts the fraction of n-grams (usually for n ∈ [1..4]) that occur in both generated candidates and one or more reference translations; the geometric mean of these fractions is the BLEU, usually normalized to the range 0-100. At sentence granularity, BLEU tends to overly penalize candidate translations when few (or none) of the longer n-grams co-occur, so "Sentence BLEU" has been criticized for correlating poorly with human judgment. Various smoothing techniques [13,23,44] have been used to reduce Sentence BLEU's sensitivity to sparse n-gram matches, and better align it with human quality assessment. We report data on two variants: BLEU-CN, which uses a kind of Laplacian smoothing [2,3,8,21,33,47,64], and BLEU-DC, which uses newer smoothing methods [29,65]. Other proposed metrics, such as BERTScore [28,70], BLEURT [58], and NUBIA [37], are computationally expensive, not widely used, and thus not readily comparable with prior work for benchmarking. Given all these options, metrics for code summarization and, independently, for commit-log generation [16] have been debated [24,28,57]. In this paper, we follow prior work and primarily use BLEU-CN; this facilitates the comparison of our results with prior work. The CodeXGLUE benchmark recommends BLEU-CN, and most newer models [3,21,64] use this metric. We, however, have not neglected other measures: besides BLEU-CN and BLEU-DC, we also report results using ROUGE-L [43] and METEOR [9].
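For concreteness, sentence-level BLEU with a simple +1 ("Laplacian-style") smoothing of the n-gram precisions can be sketched as follows; BLEU-CN's exact smoothing differs in its details:

```python
import math
from collections import Counter

def sentence_bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU (0-100) with +1 smoothing on the modified
    n-gram precisions. A sketch of the general computation, not the
    exact BLEU-CN implementation."""
    def ngrams(toks, n):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = sum(cand.values())
        precisions.append((overlap + 1) / (total + 1))  # +1 smoothing

    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return 100 * bp * geo_mean
```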
In all cases, ASAP achieves significant overall improvements: we observe gains greater than 2.0 BLEU for all programming languages except Go (Table 3). We contend that gains greater than 2.0 BLEU are important for two reasons. First, Roy et al. [57] provide arguments, grounded in a human-subject study, that for code summarization (our central task) a gain of 2.0 or more BLEU is more likely to correspond with human perception of improvement. Second, we argue that even smaller gains matter (especially if repeatable and statistically significant), since incremental progress on such tasks accumulates towards strong practical impact, as evidenced by decades-long work in natural language translation.
In addition to code summarization, we evaluated the ASAP approach on the code completion task. The standard metrics used for this task are exact match (did the completion match exactly?) and edit similarity (how close is the completion to the expected sequence?). Here, too, ASAP achieves significant overall improvements.
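These two completion metrics can be sketched as follows; we use difflib's similarity ratio here as a stand-in for the edit-distance-based similarity typically reported:

```python
import difflib

def exact_match(reference: str, completion: str) -> bool:
    """Did the generated completion match the reference exactly
    (modulo surrounding whitespace)?"""
    return reference.strip() == completion.strip()

def edit_similarity(reference: str, completion: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical.
    Shown with difflib's ratio; variants based on
    1 - Levenshtein/max_len are also common."""
    return difflib.SequenceMatcher(None, reference, completion).ratio()
```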

Experimental Setup & Evaluation Criteria
Our primary model is OpenAI's code-davinci-002. We use the beta version, via its web service API. To balance computational constraints, like rate limits, against our desire for robust estimates of performance, we chose to use 1000 samples per experimental treatment (one treatment for each language, each few-shot selection approach, with ASAP, without ASAP, etc.); please see Section 7 for the rationale.
Our experiments yielded statistically significant, interpretable results in most cases. Each 1000-sample trial still took 5 to 8 hours, varying (presumably) with OpenAI's load factors. We include waiting periods between attempts, following OpenAI's recommendations. To obtain well-defined answers from the model, we found it necessary to set the temperature to 0 for all our experiments. The model allows a window of approximately 4K tokens; this limits the number of few-shot samples. For our experiments, we used 3 shots. ASAP defaults to three shots because related work [3,12] has shown, and our own experiments with ASAP confirmed, that more shots did not significantly improve performance. However, for up to 2% of the randomly chosen samples in each experiment, we didn't get good results: either the prompt didn't fit into the model's window, or the model mysteriously generated an empty string. In cases where the prompt as constructed with 3 samples was too long, we automatically reduced the number of shots. When empty summaries were emitted, we resolved this by increasing the number of shots. This simple, repeatable, modest-overhead procedure can be incorporated into automated summarization tools.
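The adjustment procedure just described might be sketched as follows; `prompt_for` and `call_model` are hypothetical stand-ins for the prompt builder and the completion endpoint, and a character count stands in for the ~4K-token budget:

```python
def summarize_with_retries(prompt_for, call_model, max_shots=3, window=4000):
    """Sketch of the shot-adjustment procedure described above (not the
    authors' exact code). prompt_for(k) builds a k-shot prompt;
    call_model sends it to a completion endpoint (temperature 0)."""
    shots = max_shots
    # Prompt too long for the context window: drop shots until it fits.
    while shots > 1 and len(prompt_for(shots)) > window:
        shots -= 1
    summary = call_model(prompt_for(shots)).strip()
    # Empty output: retry once with one more shot.
    if not summary:
        summary = call_model(prompt_for(shots + 1)).strip()
    return summary or None
```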

RESULTS
We evaluate the benefits of ASAP-enhanced prompts for code summarization, in different settings and using various metrics. We find evidence of overall performance gain in studies on six languages. However, for other, more detailed analyses, we focused primarily on Java and Python, because of OpenAI API rate limits.

Encoder-decoders & Few-shot Learning
Our baseline results on CodeSearchNet [47], using IR-based few-shotting, come first. Prior work reports that IR methods can find better samples for few-shot prompting, for tasks such as program repair [48] and code generation [34]. In Table 2, we observe that this is also true for code summarization; we note improvements of 3.00 (15.10%) and 1.12 (5.42%) in BLEU-4 score for Java and Python, respectively, simply by using BM25 as a few-shot sample selection mechanism. Since BM25 was already used in prior work (albeit for other tasks) [48], we consider BM25-based few-shot learning for code summarization as just a baseline of this paper (not a contribution per se).

ASAP Prompt Enhancement
We now focus on the central result of our paper: the effect of ASAP prompt enhancement. Table 3 shows the component-wise and overall improvements achieved after combining all the prompting components, for all six programming languages. BLEU improvements range from 1.84 (8.12%) to 4.58 (16.27%). In most cases, we see improvements of over 2.0 BLEU, the threshold for human perception noted by Roy et al. [57].
We also noticed that all three components (i.e., repository information, data flow graph (DFG), and identifiers) help the model achieve better performance in all six languages, as we combined these components individually with BM25. However, for Ruby, the best-performing combination includes just the repository information. In most cases, the repository information helps a lot, relative to the other components.
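The identifier and data-flow facts are produced in the paper with tree-sitter, which works across languages; as a Python-only stand-in, the stdlib `ast` module can extract comparable facts. All function names below are ours, and the def-use analysis is deliberately simplistic.

```python
import ast

def tag_identifiers(source):
    """Tag identifiers by role (Function Name / Parameter / Identifier).
    The paper uses tree-sitter; this sketch uses stdlib `ast` instead."""
    tags = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            tags.append(("Function Name", node.name))
            tags += [("Parameter", a.arg) for a in node.args.args]
        elif isinstance(node, ast.Name):
            tags.append(("Identifier", node.id))
    seen, out = set(), []
    for t in tags:                       # de-duplicate, preserving order
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

def data_flow_edges(source):
    """Very simple def-use sketch: for each assignment, record edges
    from the names read on the right-hand side to the assigned name."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
            dst = node.targets[0].id
            srcs = [n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)]
            edges += [(s, dst) for s in srcs]
    return edges
```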
To ascertain improvement significance, we used the pairwise one-sided Wilcoxon signed-rank test, finding statistical significance in all cases for our final prompt when compared with vanilla BM25 few-shot learning, even after adjusting for false discovery risk.

Same Project Code Summarization
We now examine the benefits of ASAP in the context of some earlier work on few-shot selection. Prior work has shown that selecting few-shot samples from the same project substantially improves performance [3]. To see if our prompt enhancement idea further helps in project-specific code summarization, we evaluated our approach on the dataset from Ahmed and Devanbu [3]. Due to rate limits, we reduced the number of test samples to 100 for each of the four Java and four Python projects. Since we have too few samples for a per-project test, we combined all the samples to perform the statistical test. Note that our total sample size for the statistical test exceeds the number of required samples determined through the analysis mentioned in Section 7. When working with the same project, one must split data with care, to avoid leakage from future samples (where desired outputs may already exist) to past ones. Therefore, we sorted the samples in this dataset by creation date.
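The leakage-safe exemplar selection described above can be sketched as below; the dictionary field name `created` is hypothetical, and we assume ISO-format date strings (which compare correctly as strings).

```python
def eligible_exemplars(pool, target_created):
    """Same-project few-shotting: keep only samples created strictly before
    the target (so no 'future' summaries leak into the prompt), sorted by
    creation date. Dates are ISO 'YYYY-MM-DD' strings."""
    past = [s for s in pool if s["created"] < target_created]
    return sorted(past, key=lambda s: s["created"])
```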
After generating the dataset, we applied our approach to evaluate performance in the same-project setting. We also compared our results with a cross-project setup, where we retrieved samples from the complete cross-project training set, similar to the setting used in Section 4.2. Table 4 shows the results of project-based code summarization. Note that this is a project-specific scenario where plentiful data is not available; the training data for each project is very limited. We found that, for 4 projects, cross-project few-shot learning yielded the best performance, while, for the 4 others, same-project few-shot learning was most effective. We note that Ahmed & Devanbu did not use IR to select few-shot samples, and consistently achieved better results with same-project few-shot learning [3]. IR does find relevant examples in the large sample pools available for Java & Python, and we get good results. We analyzed 16 pairs of average BLEU scores from the 8 projects, considering both cross-project and same-project scenarios. Our prompt-enhanced few-shot learning outperformed vanilla BM25-retrieved few-shot learning in 14 cases (87.5%). This suggests that ASAP prompt enhancement is helpful across projects: ASAP statistically improves performance in both cross-project and same-project settings.

Is ASAP Model-agnostic?
Our results so far pertain to the code-davinci-002 model. We also fed ASAP-augmented prompts to two other models, text-davinci-003 and gpt-3.5-turbo (a chat model). Our findings are in Table 6: our prompt-enhanced few-shot learning approach improved the performance of the gpt-3.5-turbo model by 1.68% to 9.13% and the text-davinci-003 model by 13.08% to 18.69%, on 500 samples each from Java, Python, and PHP. Gpt-3.5-turbo does worse than the code-davinci-002 and text-davinci-003 models at code summarization. The turbo version is verbose and produces comments stylistically different from those written by developers, and also from the few-shot exemplars in the prompt. Careful prompt engineering might improve the turbo model and enable it to generate more natural, brief comments; this is left for future work. This underperformance by the chat model is consistent with the findings of Kocon et al. [39]. The text-davinci-003 model showed the largest performance increase (albeit still outdone by code-davinci-002). Note that text-davinci-003 is a completion model, like code-davinci-002. Our findings suggest that ASAP is more effective with completion models than with chat models. We also conducted pairwise one-sided Wilcoxon signed-rank tests; the statistical significance of our findings (except Java with gpt-3.5-turbo) suggests that ASAP will apply beyond just the original code-davinci-002 model.

ASAP for Completion
Our primary focus so far has been on code summarization, in a few-shot setting. Here, we explore whether ASAP works on another task: code completion, in a zero-shot setting where no example is shown to the model. We assessed the value of including semantic facts for the line completion task, where the model generates a line of code given the preceding lines. We uniformly and randomly collected 9292 Java and 6550 Python samples from the CodeSearchNet dataset to conduct our evaluation. We randomly selected a line for each sample and tasked the model with generating that line, given just the preceding lines. When applying ASAP, we prepend the repository information and other semantic facts (i.e., tagged identifiers, DFG) before the preceding lines. Importantly, when generating the tagged identifiers and DFG, we only used partial information from the preceding lines, to avoid information leakage from later lines to the target line.
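A leakage-safe zero-shot completion prompt might be assembled as below. The regex-based identifier extraction is a stand-in for the tree-sitter analysis, and the header format is our assumption; the key property is that the facts are computed only from the lines before the target.

```python
import re

def completion_prompt(lines, target_idx, repo):
    """Zero-shot line-completion prompt: semantic facts are derived ONLY
    from the lines preceding the target line, so nothing leaks from
    later lines into the prompt."""
    prefix = lines[:target_idx]
    idents = sorted(set(re.findall(r"[A-Za-z_]\w*", "\n".join(prefix))))
    header = [f"# Repository: {repo}",
              "# Identifiers: " + ", ".join(idents)]
    return "\n".join(header + prefix)
```

The model's next-line continuation of this prompt is compared against the held-out target line.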
We used two metrics, Exact Match (EM) and Edit Similarity (ES), in line with the CodeXGLUE benchmark, to measure the model's performance. We conducted a McNemar test for EM and a pairwise Wilcoxon signed-rank test for ES, similar to what we performed for code summarization. Table 5 summarizes our findings. We observe an overall 5.79% gain in Exact Match (EM) and a 5.11% gain in Edit Similarity (ES), highlighting the effectiveness of incorporating semantic facts. For Python, we find statistical significance only for the ES improvement, not for EM.
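The two metrics can be sketched as follows. Exact match with whitespace-stripping and a Levenshtein-ratio edit similarity are common definitions; the CodeXGLUE implementation may differ in normalization details.

```python
def edit_similarity(pred, ref):
    """Levenshtein-based edit similarity in [0, 1]: 1 - distance / max length."""
    m, n = len(pred), len(ref)
    if m == 0 and n == 0:
        return 1.0
    prev = list(range(n + 1))            # DP row for edit distance
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return 1 - prev[n] / max(m, n)

def exact_match(pred, ref):
    """EM after stripping leading/trailing whitespace."""
    return pred.strip() == ref.strip()
```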

Performance on Other Metrics
In addition to BLEU-CN, we measured performance with 3 other metrics: BLEU-DC, ROUGE-L, and METEOR. Our results, in Table 10, show average gains with ASAP on all three metrics. We conducted pairwise one-sided Wilcoxon signed-rank tests and found significant performance improvements with BLEU-DC and ROUGE-L for all the languages. However, we did not observe significant differences with METEOR for 4 out of 6 languages, though sample averages do improve with ASAP in all 6 comparisons. It is worth noting that we had only 1000 samples per language (due to cost), so it is not unexpected to see some cases where we did not observe significance. To evaluate the overall impact of ASAP, we combined the datasets from all languages for the code-davinci-002 model (6000 samples) and performed the same test; we then obtain statistical significance (p-value < 0.01) for all three metrics, suggesting that ASAP does provide value.
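For intuition about what these scores measure, a minimal smoothed sentence-level BLEU can be sketched as below. This is in the spirit of BLEU-CN (+1 smoothing on higher-order n-grams); BLEU-CN and BLEU-DC differ in their exact smoothing and tokenization, so this sketch is not a drop-in replacement for either.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(pred, ref, max_n=4):
    """Smoothed sentence-level BLEU on whitespace tokens, scaled to 0-100."""
    pred_t, ref_t = pred.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        p, r = ngrams(pred_t, n), ngrams(ref_t, n)
        overlap = sum((p & r).values())          # clipped n-gram matches
        total = max(sum(p.values()), 1)
        if n == 1:
            prec = overlap / total
            if prec == 0:
                return 0.0                       # no unigram overlap at all
        else:
            prec = (overlap + 1) / (total + 1)   # +1 smoothing
        log_prec += math.log(prec) / max_n
    bp = min(1.0, math.exp(1 - len(ref_t) / max(len(pred_t), 1)))
    return 100 * bp * math.exp(log_prec)
```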

DISCUSSION AND ABLATION STUDY
We now present an ablation study of ASAP's design and the particular semantic facts our instantiation of ASAP uses, before comparing ASAP's output to our vanilla BM25 baseline. The primary aim of an ablation study is to gauge the contribution of each aspect of a model to the final observed performance. In our study, we removed each semantic component of the enhanced prompt and observed performance. We found that the repository information component contributes most to the model's performance (Table 7), both for Java and Python. However, tagged identifiers and the DFG are also helpful, and the best results were obtained when we combined all three components in the prompt.
Two Illustrative Examples. When manually examining results, we observed that in several samples, the ASAP prompt contained information that was crucial for the summary. Table 8 shows two example results that illustrate this point. In the first example, the baseline model failed to generate the term "element-wise". However, our prompt-enhanced version captured this important concept, yielding a higher BLEU-4 score of 74.0, compared to the baseline score of 39.0. Similarly, in the second example, the baseline model did not recognize the function as a standalone process, leading to a low BLEU score of 10.0. However, our proposed approach did identify the function as a standalone process, resulting in a higher BLEU score of 33.0.

Does the Model Memorize the Path? Of the three semantic facts ASAP adds to a prompt, repository information impacts the model's performance most. This may be because code-davinci-002 memorized the specific file paths in our data during pre-training; when we provide the path to the function, perhaps the model just recalls memorized information? To investigate this question, we changed the path representation: we took the repository name and path, split the tokens at "/", and presented the model with a list of tokens. The main idea behind this approach is to diffuse the original representation, and present the model with something not encountered during pre-training. If the model isn't literally memorizing, its performance should not be impacted. We observed that the differences between both versions were very small: for Java, we gained 0.24 BLEU, but for Python, we lost 0.04 with tokenized paths. This suggests a low risk that the model has memorized the path to the function.
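The path-diffusion probe can be sketched as below; the function name and output format are ours, not the paper's exact representation.

```python
def diffuse_path(repo, path):
    """Split 'owner/repo' and the file path at '/' into a flat token list,
    so the model sees a representation unlikely to match any verbatim
    path string from pre-training."""
    tokens = repo.split("/") + [t for t in path.split("/") if t]
    return "Path tokens: " + ", ".join(tokens)
```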
Is the Identifier Tag Necessary? In this paper, we assign roles to the identifiers and tag them as Function Name, Parameter, Identifier, etc. in the prompt (see Figure 2). Does this explicit tagging actually help performance? To investigate this question, we compared the model's performance when provided with a plain, "tag-free" list of identifiers. We observed that tagged identifiers lead to better performance for both Java and Python than a simple tag-free list: BLEU increased by 0.41 and 1.22 for Java and Python, respectively, suggesting that explicit semantic information does indeed contribute to better model performance.

What's Better: More Shots or ASAP? Despite having billions of parameters, LLMs have limited prompt sizes. For example, code-davinci-002 and gpt-3.5-turbo allow prompt lengths of just 4K tokens, and ASAP augmentation does consume some of the available prompt-length budget. Thus we have two design options: 1) use fewer, ASAP-augmented samples in the prompt, or 2) use more few-shot samples sans augmentation. To investigate this, we also tried using 4 and 5 shots (instead of 3) for Java and Python with the code-davinci-002 model. However, Table 9 shows that more shots using BM25 do not necessarily lead to better performance. With more shots, there is a chance of introducing unrelated samples, which can hurt the model instead of helping it.
Only for Java did we observe better performance with both 4 and 5 shots compared to our baseline model; however, our proposed technique with just 3 shots still outperforms BM25 with 5 shots. It is worth noting that model context windows are growing rapidly, and the upcoming GPT-4 model will allow up to 32K tokens; therefore, the length limit might not be an issue in the near future. Even so, our study suggests that Automated Semantic Augmentation will still be a beneficial way to use the available prompt-length budget; moreover, it stands to reason that constructing more signal-rich, informative prompts will be beneficial regardless of length.
What's New in ASAP's Output? We add a pro forma analysis of a few hand-picked examples, to be consistent with peer-review-required community rituals; however, these analyses are highly anecdotal and must be interpreted cautiously. We manually examined several samples to discuss our results in greater detail; specifically, to answer three questions: 1) what new types of information ASAP presents to the LLM; 2) how ASAP's summaries differ from those created by existing techniques; and 3) what errors ASAP introduces. Table 11 presents some samples: for the first three, ASAP performed very well compared to our retrieval-based baselines, while for the second three, the baseline performed better than ASAP. While we discuss our findings in the context of the provided samples, our observations generalize to other samples.
The new types of information ASAP presents to the LLM: As discussed in the paper, our primary contribution involves augmenting retrieved samples (retrieved using BM25, as per Nashid et al. [48]) with semantic facts, resulting in improved performance compared to the base retrieval approach. We add semantic facts related to repository details, identifiers, and data flow graphs to both the retrieved samples and the input code. As anticipated, the added semantic facts transfer into, and enhance, the model output.
In the first sample, the baseline retrieval-only method fails to capture the term "gradient" entirely. However, by incorporating semantic facts, the model successfully recovers the term, because it is frequently found in both identifiers and repository names, influencing the model's output. In the second example, where the goal is to replace rather than simply return, the baseline fails to generate the term "replace", despite the clear indication in the function name ("replaceWithMappedTypeForPath"). The data flow between identifiers, provided in the semantic facts, may have helped the model recognize the replacement operation.
How ASAP's summaries differ from those created by existing techniques: Following the above discussion, we observed that ASAP generates more specific information: (1) it identifies "gradient" in sample 1; (2) it suggests "replace" in place of "return" in sample 2; and (3) it recommends "datarootext" in place of "dataroot" in sample 3. These differences were observed across multiple samples when comparing our baseline to ASAP; the ASAP approach consistently produces more specific information than the baseline.

Analyzing the errors that ASAP introduces:
The examined examples suggest that ASAP can become too specific, and thus fail to match the developer-written summary. ASAP gets over-specific in the last three examples, with "Andrew's monotone chain algorithm", "deployable unit", and "column vector". While these terms are not necessarily incorrect, BLEU-4 drops, because the developer-written summary was more generic.
We also observe quantitatively that ASAP induced positive changes in 44% of the samples; performance declined for 30% of the samples, and remained the same for the rest. Compared to our baseline (few-shot learning with BM25-retrieved samples), ASAP requires more tokens. The additional token cost per query (both in terms of monetary cost and performance overhead) is quite modest. On the other hand, we observe a substantial 12% overall improvement with ASAP using the Codex model.

RELATED WORK

Code Summarization
Deep learning models have advanced the state-of-the-art in SE tasks such as code summarization. The LSTM model for code summarization was first introduced by Iyer et al. [33]. Pre-trained transformer-based [62] models such as CodeBERT [21], PLBART [2], and CodeT5 [64] have been extensively used on the CodeXGLUE [47] code summarization dataset, resulting in significant improvements. However, there is a caveat to using pre-trained language models: although these models perform well, extensive fine-tuning is required, which can be data-hungry and time-consuming. Additionally, separate models had to be trained for different languages, increasing training costs. To reduce the number of models required, multilingual fine-tuning has been suggested, to maintain or improve performance while reducing the number of models to one [4]. However, this approach did not reduce the training time or the need for labeled data.
LLMs, or large language models, are much larger than these pre-trained models, and are trained on much bigger datasets with a simple training objective: auto-regressive next-token prediction [12]. These models perform surprisingly well on many tasks, even without fine-tuning; just prompting the model with different questions, while providing a few problem-solution exemplars, is sufficient. Few-shot learning has already been applied to code summarization, and has been found to be beneficial [3].

Other Datasets
There are several datasets available for code summarization, in addition to CodeXGLUE [47]. TL-CodeSum [30] is a relatively small dataset, with around 87K samples, but it includes duplicates, which may result in high performance estimates that do not generalize. Funcom [41] is a dedicated dataset with 2.1 million Java functions, but it also contains duplicates. We chose CodeXGLUE (derived from CodeSearchNet) because it is a diverse, multilingual dataset that presents a challenge for models. Even well-trained models like CodeBERT struggle on this benchmark; their performance is particularly poor on languages with fewer training samples.
There has been a lot of work on code summarization, ranging from template matching to few-shot learning. These models use different representations and sources of information to perform well at code summarization; comparing or discussing all of them is beyond the scope of this work. We note, however, that our numbers represent a new high point on the widely used CodeXGLUE benchmark for code summarization and code completion; we refer the reader to https://microsoft.github.io/CodeXGLUE/ for a quick look at the leaderboard. Our samples are smaller (N=1000), but the estimates, and estimated improvements, are statistically robust (see the sample-size discussion in Section 7).

LLMs in Software Engineering
Although LLMs are not yet as widely used for code summarization, they are extensively used for code generation [14, 49, 67] and program repair [5, 18, 35, 36]. Models like Codex aim to reduce the burden on developers by automatically generating code or completing lines. Several models, such as Polycoder [67] and Codegen [49], perform reasonably well, and thanks to few-shot learning and prompting, they can be applied to a wide set of problems. However, the code-davinci-002 model generally performs better than those models and allows us to fit our augmented prompts into a bigger window.
Jain et al. proposed supplementing LLM operation with subsequent processing steps based on program analysis and synthesis techniques, to improve performance in program snippet generation [34]. Bareiß et al. showed the effectiveness of few-shot learning on code mutation, test oracle generation from natural language documentation, and test case generation tasks [10]. CODAMOSA [42], an LLM-based approach, conducts search-based software testing until its coverage improvements stall, then asks the LLM to provide example test cases for functions that are not covered; by using these examples, CODAMOSA helps redirect search-based software testing to more useful areas of the search space. Jiang et al. evaluated the effectiveness of LLMs for the program repair problem [35].
Retrieving and appending a set of training samples has been found to be beneficial for multiple semantic parsing tasks in NLP, even without using LLMs [68]. One limitation of this approach is that performance can be constrained by the availability of similar examples. Nashid et al. used a similar approach and obtained improved performance in code repair and assertion generation with the help of LLMs [48]. However, none of the above works has attempted to automatically semantically augment the prompt. Note that it is still too early to comment on the full capabilities of these large language models. Our findings so far suggest that augmenting the exemplars in the prompt with semantic hints helps on the code summarization and code completion tasks; judging the value of ASAP on other tasks is left for future work.

THREATS & LIMITATIONS
A major concern when working with large language models is the potential for test data exposure during training. Sadly, one cannot directly check this, since the training dataset is not accessible. The model's lower performance with random few-shotting suggests that memorization is not a major factor: as we incorporate relevant information, the model's performance improves with the amount and quality of that information. Had the model already memorized the summaries, it could have scored much higher, even without the benefit of relevant exemplars and semantic augmentation.
Sample Size Analysis: We used the observed means and standard deviations to calculate (using G*Power [19, 20]) the required sample sizes, using commonly used values: an α of 0.01 (desired p-value) and a β of 0.20 (viz., a 20% chance of NOT discovering an effect, should one exist). For the test that we used (the Wilcoxon signed-rank test), we found that the needed sample size was always below the sample size we used for our primary studies, viz., 1000.
User Study: We did not conduct a user study for ASAP. Thus, the enhancements in metrics presented here may not necessarily translate into improved developer performance. This aspect is left to future work.
Finally: fine-tuning large LMs to use derived semantic facts may improve on our augmented prompting approach, but would be costly. We leave its consideration to future research.

CONCLUSION
In this paper, we explored the idea of Automatic Semantic Augmentation of Prompts, whereby we propose to enhance the few-shot samples in LLM prompts with tagged facts automatically derived by semantic analysis. This is based on the intuition that human developers often scan the code to implicitly extract such facts in the process of code comprehension, leading to a good summary. While it is conceivable that LLMs can implicitly infer such facts for themselves, we conjectured that adding these facts, in a formatted style, to the exemplars and the target within the prompt would help the LLM organize its "chain of thought" as it seeks to construct a summary. We evaluated this idea on the challenging, de-duplicated, well-curated CodeSearchNet dataset, on two tasks: code summarization and code completion. Our findings indicate that Automated Semantic Augmentation of Prompts is generally helpful; our estimates suggest it helps surpass the state-of-the-art.

Figure 1 :
Figure 1: Different steps of ASAP. (1) The input code and (2) a pool of samples are given to the BM25 engine, which matches the given input code against the pool and (3) retrieves the best-matching samples, viz. 3 input+output pairs. These examples are processed by ASAP to produce a prompt (4) including 3 exemplars. Each exemplar includes a function definition, the results of analyzing that definition, and its associated comment; the input code is finally appended, along with its analysis products. Exemplar details are in Figure 2. The final prompt is sent via API call (5) to the GPT-3.x model; the returned output, e.g., a summary (6), is returned by GPT-3.x.

Figure 2 :
Figure 2: Components of an ASAP exemplar. The source code and output comment are extracted from the retrieved pool sample. The repository info is derived from the source code using GitHub; the dataflow info and the identifiers tagged with labels are obtained from an analysis using tree-sitter.

Table 1 :
Number of training and test samples.

Table 2 :
Performance of encoder-decoder and few-shot models on Java and Python code summarization, measured using BLEU.

Table 3 :
Performance of prompt-enhanced comment generation with the code-davinci-002 model, measured using BLEU. p-values are calculated applying a one-sided pairwise Wilcoxon signed-rank test and are B-H corrected.

Table 4 :
Performance of prompt-enhanced comment generation with the code-davinci-002 model on same-project data (measured using BLEU); p-values are calculated applying a one-sided pairwise Wilcoxon signed-rank test after combining the data from all projects.

Table 5 :
Performance of ASAP-enhanced prompts with the code-davinci-002 model on the line completion task.

Table 6 :
Performance on code summarization, measured using BLEU. p-values are calculated applying a one-sided pairwise Wilcoxon signed-rank test and are B-H corrected.


Table 8 :
Selected examples illustrating the effectiveness of ASAP enhancement.

Table 10 :
The effectiveness of ASAP on popular code summarization metrics. p-values are calculated applying a one-sided pairwise Wilcoxon signed-rank test and are B-H corrected.

Table 11 :
Examples showing strengths and weaknesses of ASAP.