Out of Sight, Out of Mind: Better Automatic Vulnerability Repair by Broadening Input Ranges and Sources

The advances of deep learning (DL) have paved the way for automatic software vulnerability repair approaches, which effectively learn the mapping from the vulnerable code to the fixed code. Nevertheless, existing DL-based vulnerability repair methods face notable limitations: 1) they struggle to handle lengthy vulnerable code, 2) they treat code as natural language texts, neglecting its inherent structure, and 3) they do not tap into the valuable expert knowledge present in the expert system. To address this, we propose VulMaster, a Transformer-based neural network model that excels at generating vulnerability repairs by comprehensively understanding the entire vulnerable code, irrespective of its length. This model also integrates diverse information, encompassing vulnerable code structures and expert knowledge from the CWE system. We evaluated VulMaster on a real-world C/C++ vulnerability repair dataset comprising 1,754 projects with 5,800 vulnerable functions. The experimental results demonstrated that VulMaster exhibits substantial improvements compared to the learning-based state-of-the-art vulnerability repair approach. Specifically, VulMaster improves the EM, BLEU, and CodeBLEU scores from 10.2% to 20.0%, 21.3% to 29.3%, and 32.5% to 40.9%, respectively.


INTRODUCTION
As the software landscape expands rapidly [22], the number of software vulnerabilities has increased correspondingly [26].According to Common Vulnerabilities and Exposures [13], the yearly discovery of vulnerabilities in 2022 sets a new record with 26,448 software vulnerabilities reported, marking a notable 59% increase compared to 2021 [67].Recently, many automatic vulnerability prediction approaches [8,53,87] are proposed to identify the vulnerability.However, when a software vulnerability is detected, it becomes of utmost importance to promptly rectify and resolve it in order to minimize the potential risks of exploitation.However, addressing software vulnerabilities often requires specialized expertise, and the existing pool of experienced developers is insufficient to tackle the extensive number of vulnerabilities found in millions of software systems [31].Furthermore, the manual resolution of vulnerabilities is a time-consuming process; the GitHub 2020 security report finds that it takes 4.4 weeks to fix a vulnerability after its identification [17].As a result, there is an urgent demand for automated methods to repair vulnerabilities.
A number of program analysis-based vulnerability repair approaches [20,21,27,29,41,47,63,69,83] have been proposed.Although effective, program analysis-based approaches are often tailored for specific vulnerabilities or not applicable to certain types.In contrast, learning-based vulnerability repair approaches are not constrained to fixing specific vulnerability types.In this study, we aim to advance learning-based vulnerability repair approaches.Recently, several learning-based automatic vulnerability repair (AVR) approaches that take a vulnerable function with its CWE type (e.g., CWE-119) as input and generate the repaired function as as output have been proposed [9,19].For instance, VRepair [9] is pre-trained on a large bug-fixing corpus and then trained to repair vulnerabilities.On the other hand, VulRepair directly utilizes a pre-trained model, named CodeT5 [74].By harnessing this off-the-shelf pretrained model, VulRepair achieves state-of-the-art performance in learning-based vulnerability repair [19].Despite the promising outcomes, VRepair and VulRepair encounter three major challenges.
Understanding the entire vulnerable code: VRepair and Vul-Repair rely on the Transformer model, and its computational cost significantly increases with longer inputs.The attention matrix computations (i.e., the primary calculations) within the Transformer model scale quadratically with the length of the input sequence [68].As a result, Transformer-based models inevitably restrict the length of input code snippets, leading to truncation of longer parts.For example, VulRepair imposes a maximum limit of 512 code tokens.However, real-world vulnerable codes often exceed this limit: 44.9% of the vulnerable code in VulRepair's evaluation dataset [19] surpasses 512 tokens.Failing to provide and understand the entire vulnerable code to the model reduces the likelihood of accurately addressing the vulnerability.
Understanding the structures of the vulnerable code: Both VRepair and VulRepair treat code in a similar manner to natural language texts, disregarding the incorporation of structural information.Natural language possesses a loosely structured nature, enabling words to be arranged in different orders while maintaining grammatical correctness [35,52].In contrast, programming languages exhibit a higher level of structure.Neglecting code structures could reduce the probability of effectively resolving the vulnerability in both VRepair and VulRepair.Prior research has demonstrated the benefits of including code structures, such as the Abstract Syntax Tree (AST), in tasks like code clone detection [51,82], code search [56], and bug comprehension [49].Following these prior studies, we also include the AST as a part of the model's input to improve its grasp of the structural characteristics of code.
Leveraging the expert knowledge: Software vulnerabilities do not exist in isolation.The Common Weakness Enumeration (CWE) system [11] offers a comprehensive catalog of common software weakness types.By utilizing the CWE system, one may gain access to accurate descriptions, vulnerable code examples, and information on closely related vulnerabilities pertaining to specific vulnerability types.In order to tap into the expert knowledge provided by the CWE system, both VRepair [9] and VulRepair [19] incorporate CWE types into their input data.The CWE system, however, offers a wealth of expert knowledge that extends beyond CWE types.This includes CWE names and vulnerable code examples shared on the web pages.Our approach aims to better utilize this abundant knowledge within the CWE systems.
To tackle these challenges, we propose VulMaster, a novel automatic vulnerability repair approach that aims to process the entire vulnerable code, regardless of its length.It then generates fixes by integrating diverse information, including vulnerable code structures and expert knowledge from the CWE system.VulMaster consists of three major components.First, it follows the idea of the Fusion-in-Decoder (FiD) framework [30] to overcome the limitations associated with the input length of Transformer-based models.Second, to capture the structural aspects of the vulnerable code, VulMaster utilizes the AST as part of its input.Third, it extensively utilizes expert knowledge of the CWE system, including the vulnerability type name, typical vulnerable code examples, and extra information on other closely related vulnerabilities.
We evaluate VulMaster on a real-world C/C++ vulnerability dataset used in previous studies [5,9,15,19], which consists of 5,800 function-level unique vulnerability fixes collected from 1,754 large-scale open-source software projects.The experimental results illustrate that VulMaster enhances the EM, BLEU, and CodeBLEU scores from 10.2% to 20.0%, 21.3% to 29.3%, and 32.5% to 40.9%, respectively, compared to the state-of-the-art learning-based vulnerability repair solution.In summary, our contributions are as follows: • To the best of our knowledge, we are the first to introduce the FiD framework for the field of software engineering.More importantly, we initiate how it should be leveraged to effectively mitigate the input length limitations of Transformer-based models and integrated with other valuable aiding information.• We discover more diverse and valuable information such as vulnerable code structures and expert knowledge for CWEs, which can boost the effectiveness of repairing vulnerability.• We further identify and address a hidden label leakage issue in the datasets used in the prior work, which can contribute to a more precise evaluation of future studies.

PRELIMINARIES AND MOTIVATION 2.1 Motivating Example
In ❶ of Figure 1, we show an example to analyze how junior security engineers repair a vulnerable function of CWE-125 (e.g., the vulnerable code before fixing shown in ❷ of Figure 1) and explain our two motivations below.
(1) Understanding the vulnerable function is the first thing.
Instead of directly crafting an appropriate fix, the security developer may first undertake a meticulous examination of the vulnerable function to acquire a comprehensive understanding of the code and its weaknesses.While human developers can manage to grasp lengthy vulnerable functions, it poses a challenge for automatic tools based on Transformer models.For instance, as depicted in ❷ of Figure 1, the vulnerable function from [1] consists of over 800 code tokens before the vulnerable code lines, surpassing the input limit (512 tokens) of the learning-based SOTA VulRepair.Consequently, VulRepair failed to generate an accurate fix.
(  count, which is later used as an index in Line 14.However, the function does not check whether count is a negative value.As a result, a part of the ground-truth fix (in ❷ of Figure 1) could be inspired by the typical vulnerable example provided by the CWE website, i.e., checking whether the count is negative.

Background
Task Definition.In line with previous research [9,19], learningbased automatic vulnerability repair (AVR) is formulated as a sequenceto-sequence problem: (  ,  ) →   .Specifically, given a vulnerable code snippet   along with its associated CWE type   , a Deep Learning (DL) model generates the corresponding repaired code   .
Fusion-in-Decoder (FiD) Framework is originally proposed for addressing challenges in the open-domain question-answering task in natural language processing (NLP) [30].It is a sequence-tosequence task, where the input is the question with numerous relevant passages and the output is the answer to the question.Before the introduction of FiD, researchers [42] concatenated the question and relevant passages to create the input for the encoder.However, the encoder had a limitation on the maximum input length, which led to the truncation of the input data (i.e., the concatenated questions and passages) and resulted in a substantial information loss.
The FiD framework adopts a divide-and-conquer approach to address such length limitation [30].Specifically, each question and passage is individually encoded using an encoder, producing contextual embeddings.These contextual embeddings are subsequently concatenated to form a composite representation of the questions and all passages, which is then fed into a decoder to generate the desired output.By incorporating the FiD framework, Transformerbased models surpass the prescribed length limitation and gain the ability to effectively comprehend and encode numerous passages.

APPROACH
The framework of VulMaster is presented in Figure 2. VulMaster takes a vulnerable function and its CWE type as input and generates the corresponding fix.VulMaster involves three main parts, where the first two parts prepare the input data for the VulMaster model, and the third part is responsible for encoding the inputs and generating the fixed code.Part 3: FiD and Relevance Prediction.This part first encodes all the input data (i.e., the code token sequences, AST node sequences, and CWE knowledge) into the contextual embeddings in a divideand-conquer manner.Subsequently, the contextual embeddings are aggregated together.Then, the aggregated embedding is fed into the decoder to generate the corresponding repair   .Backbone model.VulMaster needs a pre-trained model as the backbone model and we employ CodeT5-base [74] for its great performance on sequence-to-sequence tasks [55].

Vulnerable Code Preprocessing
This part aims to perform preprocessing on the target vulnerable functions to extract code segments and AST node sequences.
Tokenization.To correctly leverage the CodeT5 model, we utilize the CodeT5 tokenizer [75] to tokenize each vulnerable function into a token sequence.
Vulnerable function partitioning.VulMaster divides lengthy input code into multiple segments, ensuring that each segment adheres to the length limit (i.e., 512 tokens in the case of CodeT5).This helps VulMaster to further process and understand each of the code segments.Specifically, the process starts with tokenization, where the entire function is transformed into a sequence of tokens.VulMaster then proceeds to partition this token sequence into segments, ensuring that each segment contains no more than 512 tokens.If a function is shorter than 512 tokens, it constitutes a single segment.
AST node sequences.Existing approaches [9,19] treat the input code as unstructured natural language text, disregarding the rich structure inherent in source code that can provide valuable semantic information.A notable structural representation of source code is the AST.Thus, we first adopt the tree-sitter parser to transform the input vulnerable function into the AST representation.However, pre-trained code models like CodeT5 are not designed to directly handle AST, as they are mainly optimized for sequential data rather than tree or graph structures.To address this limitation, we adopt a depth-first traversal method to convert the AST into an AST node sequence while preserving the structural information [28].To generate the AST node sequence, we start by adding the concatenation of the type and value of the root node to the empty list .The value refers to the actual token present in the source code, while the type represents the AST node's type.Next, we traverse the sub-trees of the root node in a depth-first order, adding the concatenation of values and types of the root nodes of each sub-tree into the list .This recursive process is repeated for each sub-tree until all nodes within the tree have been traversed.Finally, the AST node sequence  is divided into multiple segments, each adhering to the length limit of 512 tokens.

CWE knowledge Extraction
This part involves gathering rich information from the CWE website for a CWE type of target vulnerable function.
Offline CWE knowledge preparation.To provide the CWE knowledge for an incoming vulnerable function instantly, we first build a CWE knowledge dictionary in an offline manner.The CWE knowledge dictionary is a dictionary where the key of each item is the CWE type and the corresponding value is a list of CWE knowledge, i.e., the CWE name, the vulnerable examples from CWE websites, and the fixes for those examples generated by ChatGPT. Figure 3 (❶) showcases the information available on the website of CWE-362, including the CWE name, the vulnerable examples from CWE websites, and the analysis of the examples written by experts.Note that the CWE website usually only provides vulnerable examples without fixed examples.Although vulnerable code examples exist, they do not possess the essential knowledge on "how to fix vulnerabilities".The analyses of the vulnerable examples contain the "how to fix" knowledge, but they are in natural language texts rather than source code.Therefore, we seek suitable fixes for the vulnerable code examples to complement the "how to fix" knowledge.To obtain the fixed code examples, we adopt the recent advancements in generative AI models, i.e., ChatGPT, which is proficient in generating code given detailed guidance [57,73].
Figure 3 (❷) presents the overall process of building this offline CWE knowledge dictionary for VulMaster.Firstly, we begin by gathering all the CWE types that appeared in the experiment dataset [19].Secondly, for each CWE type, we access its corresponding website to extract valuable information, i.e., the CWE name, typical vulnerable examples, and expert analysis of these examples.Thirdly, for each vulnerable code example and its accompanying analysis, we leverage ChatGPT to generate potential fixes.Specifically, we crafted the following prompt to generate the fixes: "The code {} contains a vulnerability of type {}.The analysis of this vulnerable code is {}.Please generate the repaired code to address the vulnerability:" The {}, {}, and {} are respectively filled in the vulnerable code example, CWE name, and expert analyses on this vulnerable example listed on the CWE web page.We utilize the prompt for each vulnerable example with its name and expert analysis and query ChatGPT (i.e., the GPT-3.5-turbomodel with its default setting [57]) to obtain the fixed code.Given the ample and accurate expert analysis as guidance and the simplicity of those vulnerable examples from CWE, ChatGPT could generate a large number of correct fixes for those vulnerable examples from CWE web pages.We discuss such performance of ChatGPT for vulnerable code examples from CWE in Section 6.2.
However, different from the vulnerable code examples in CWE web pages, vulnerabilities in real-world software [19] often lack expert analyses, and the complexity of vulnerabilities in real-world software is significantly increased.Such factors make ChatGPT struggle to generate accurate fixes for real-world vulnerabilities, which is also discussed in Section 5.1.
CWE knowledge extraction for coming data.Given the CWE type of one target vulnerable function, VulMaster can easily access the CWE knowledge, including the CWE name, vulnerable code examples, and the generated fixes, by querying the CWE knowledge dictionary.In addition to extracting vulnerable examples and fixes for the target CWE type, VulMaster goes beyond this and includes examples and fixes from other closely related CWE types in the input.This process is represented as ❸ in Figure 3.The decision to include examples and fixes from closely related CWE types stems from one observation: CWE types are organized hierarchically, with parent, child, and sibling types sharing many similarities with a specific CWE type.For instance, CWE-125 represents "Out-ofbounds Read", while its children, such as "CWE-126: Buffer Overread" and "CWE-127: Buffer Under-read", are more specific cases of CWE-125.Thus, VulMaster can consider a broader range of relevant code examples and fixes by incorporating examples and fixes from parent, child, and sibling CWE types (i.e., the wide range) in the input, rather than solely relying on those from the input CWE type (i.e., the narrow range).

Fusion-in-Decoder and Relevance Prediction
Backbone model adaption for AST.Although CodeT5 is pretrained on a large number of source code snippets, it has not seen AST node sequences during its pre-training, resulting in a lack of familiarity with such structural representation.Prior studies [25,84,86] have suggested that the lack of pertinent software artifacts during pre-training may lead to suboptimal pre-trained models, as exemplified in pre-trained models for code changes [86], Android bytecode [65], and Stack Overflow posts [25].A straightforward improvement involves incorporating the corresponding data 4 during (additional) pre-training.Following this idea, we introduce an additional pre-training step to enhance the backbone model's understanding of AST structures, which requires a substantial and relevant code corpus.We choose to use the bug-fixing corpus provided by Chen et al. [9], consisting of over 500,000 pairs of buggy and fixed functions.The bug-fixing task shares similarities with vulnerability repair [9], making this corpus a valuable resource for us.Specifically, we further pre-train the CodeT5 model to generate the fixed version of buggy code.For half of the training samples, the inputs are AST node sequences of buggy code, while for the other half are the buggy code itself.Both the lengthy AST node sequences and source code are truncated to the first 512 tokens for simplicity.We also include the source code as a part of the inputs to ensure that the trained CodeT5 retains its ability to understand both the source code and AST semantics.In this model adaption phase, we employ the Adam optimizer [38] with a learning rate of 2e-5 and train CodeT5 for 5 epochs on the bug-fixing corpus.We denote the model after this step as adapted-CodeT5.
Contextual Embedding.As shown in Figure 2, VulMaster uses the encoder model to generate contextual embedding for the following input sources: 1) the complete vulnerable function   , 2) AST node sequences, 3) CWE type name, 4) examples and corresponding fixes of the CWE type   and related CWE types.Specifically, for one pair of vulnerable code example and its fix, we concatenate them into a single sequence and feed it to the encoder to get the contextual embedding.Notably, the encoder utilized is the encoder model of the adapted-CodeT5.
Relevance prediction.VulMaster is equipped with vulnerablefixed code pairs gathered from web pages related to the input CWE type   (i.e., the CWE type of the input vulnerable function in the evaluation dataset) and its associated parent/child/sibling CWE types.However, not all vulnerable-fixed code pairs from CWE web pages hold the same level of relevance when it comes to repairing the input vulnerable function in the evaluation dataset [5,15].Pairs associated with the input CWE type are deemed the most relevant, while those linked to the input CWE type's parent/child/sibling CWE types are considered less relevant.The objective of the relevance prediction module is to assist VulMaster in identifying and highlighting the most relevant vulnerable-fixed code pairs from CWE web pages, while also considering less relevant pairs as contextual information.To achieve this, we introduce explicit supervision by performing a binary classification task [70].
Then we explain the preparation of the input for the relevance prediction module.The input is constructed by the following steps: Firstly, we collect vulnerable code snippets from the CWE web pages of the input CWE type   and its parent/child/sibling CWE types.Secondly, we collect the respective fixes (generated by ChatGPT) for the vulnerable code snippets from the first step.Thirdly, we concatenate each vulnerable code snippet from the first step with its corresponding fix, creating a single sequence.The resulting concatenated sequences (representing vulnerable-fixed code pairs from CWE web pages) are the input to the relevance prediction module.For the outputs (labels), we label each vulnerable-fixed code pair from CWE web pages as "most related" (class 1) if it belongs to the input CWE type   .Otherwise, it is labeled as "less related" (class 0).
The k-th vulnerable-fixed code pair from CWE webpages is transformed into its embedding   by the encoder model of VulMaster.Then the contextual embedding   is fed to a binary classifier (i.e., MLP [24]) to predict whether this k-th pair is "most related" or "less related".The output of the binary classifier is denoted as   = Classifier(  ).To update the model, we minimize the Cross-Entropy loss: Here,   represents the ground truth relevance label of the -th vulnerable-fixed code pair from CWE webpages and   is the prediction score of the pair.By optimizing this loss, a model can learn to effectively distinguish between "most related" and "less related" vulnerable-fixed code pairs from CWE webpages.
Fusion-in-Decoder.VulMaster combines all contextual embeddings of the input components into a single concatenated embedding, denoted as   .The concatenation is performed as follows:   = [ 1 , ...,   ;  1 , ...  ; ;  1 , ...,   ] where ";" indicates the concatenation operation.  represents the -th segment of the input vulnerable function,   represents the -th segment of the AST node sequence of the input vulnerable function,  represents the CWE name and   represents the -th example and its fix from CWE.The concatenated contextual embedding   is then passed into the decoder model (i.e., the decoder of the adapted-CodeT5) to generate the fixed code.In general, the model is updated to minimize the loss:   = −  (  |  ,   ,    , --  ) This minimization aims to increase the probability of the model generating the correct fixed function (  ) using the provided diverse input components.
Multi-task Learning.Previous research has demonstrated that multi-task learning is beneficial in enhancing the performance of learning-based models in tasks such as code completion [45], code generation [72], and code understanding [71].In the case of VulMaster, the adoption of the multi-task learning framework also aims to improve its effectiveness.Specifically, VulMaster is simultaneously trained on two tasks: 1) the relevance prediction task which identifies the most relevant vulnerable-fixed code pairs from CWE web pages from a pool of relevant pairs (i.e.,   ); 2) the vulnerability repair task which generates the fix for the input vulnerable function from the evaluation datasets [5,15] (i.e.,   ).The total loss function is the sum of the losses from each task:  =   +   .

EXPERIMENTAL DESIGN 4.1 Dataset for Evaluation
We utilize the same real-world C/C++ vulnerability dataset that is used to evaluate the existing vulnerability repair approaches (i.e., VulRepair [19] and VRepair [9]).It consists of 8,482 pairs of vulnerable C/C++ functions and their corresponding fixes by merging two existing datasets: CVEFixes [5] and Big-Vul [15].Please note that VulMaster is evaluated using the identical dataset combination of CVEFixes and Big-Vul, which was previously employed by the learning-based state-of-the-art VulRepair [19].These pairs are collected from 1,754 open-source software projects spanning the period from 1999 to 2021.This dataset is divided into training (70%), testing (20%), and validation (10%) subsets by previous studies [19].The data statistics are shown in Table 1.Notably, we checked and confirmed that the vulnerable-fixed code pairs sourced from the CWE website (Section 3.2) have no duplicates with the vulnerability repair evaluation dataset [19] mentioned above.
Processing.To process the dataset, VulRepair [19] and VRepair [9] adopt the same steps, which involve the addition of special tokens.
Particularly, each vulnerable function in the dataset is marked using the special tokens <StartLoc> and <EndLoc>.The <StartLoc> token indicates the beginning of the vulnerable code lines, while the <EndLoc> token indicates the end.For the ground truth, the special tokens <ModStart> and <ModEnd> are inserted into each repaired function to signify the start and the end of the repaired code lines, respectively.These special tokens serve the purpose of guiding the model's attention toward the vulnerable code lines and the corresponding fixed code lines.In other words, <StartLoc> and <EndLoc> provide the perfect vulnerability localization to vulnerability repair approaches.
Label leakage issue.In this paper, the reported performance of the learning-based SOTA VulRepir is different from its original paper [19].This is caused by our leakage correction.The dataset used in VulMaster is merged from two existing datasets CVEFixes [5] and Big-Vul [15].The authors of VulRepair seem to assume that data samples from two datasets are different from each other.By checking the vulnerable and fixed code pairs from each dataset, we found that about 60% of the pairs are duplicates.As the authors of VulRepair [19] further randomly split the merged dataset into training/validation/test sets, it leads to the presence of duplicated samples across the training, validation, and test sets.The label leakage issue might lead to an overestimation in performance measurement and artificially inflated scores [4].To tackle this problem, we conducted a deduplication operation in three steps.Firstly, we removed any samples from the training set that were identical to any samples in the validation or test sets.Next, we eliminated any samples from the validation set that were identical to any samples in the test set.Finally, we eliminated 94 duplicate samples within the test set.The deduplication statistics are presented in Table 1.
To assess the impact of the label leakage issue, we first use the replication package [18] released by the authors of VulRepair to train VulRepair from scratch by using the original fine-tuning dataset.Experimental results indicate just a 0.2% difference between our replication and the reported performance in the VulRepair paper, validating the correct usage of their replication package.Next, we trained another VulRepair from scratch on the deduplicated version of the dataset, resulting in a decrease in VulRepair's performance from 44.2% to 10.2% in terms of Exact Match accuracy.This significant drop is expected because of the label leakage issue.

Baselines
We evaluate VulMaster against three groups of baseline models.
The first group consists of learning-based automatic vulnerability repair approaches, namely VulRepair [19] and VRepair [9].Both of these approaches take a vulnerable function concatenated with a CWE ID (e.g., CWE-119) as input and generate the fixed code as output.The second group comprises general-purpose pre-trained code models that are widely used in various code-related downstream tasks.Specifically, we include two encoder-based models (i.e., CodeBERT [16] and GraphCodeBERT [23]), two decoder-based models (i.e., PolyCoder-160M [80] and CodeGen-350M [54]), and two encode-decoder based models (i.e., CodeReviewer [44] and CodeT5-base [74]).Those models take the same input and output as VulRepair and VRepair.The third group consists of large language models (LLMs) for code, which have gained significant attention recently [3,36,79].For LLMs, we include the well-known ChatGPT model (i.e., gpt-3.5-turbo)[57] and the improved version of Chat-GPT model (i.e., gpt-4) [58] as baselines.We use a prompt similar to the prompt described in Section 3.2.We remove the sentence about expert analysis because of the lack of expert analyses in real-world vulnerabilities.Additionally, one sentence is added to explicitly tell LLMs about the meaning of special tokens (i.e., <StartLoc> and <End-Loc>).We also provide the CWE description and one vulnerable example from the CWE website in the prompt.The prompt used is: "A vulnerability of type {_} refers to {_}.One vulnerable example of this type is {_}.The code {} contains a vulnerability of type {_}.Note that <StartLoc> and <EndLoc> indicate the start and the end of vulnerable code lines.Please generate the repaired code to address the vulnerability:".

Experimental Setting
Implementation details.Following prior work [9,19], we also focus on fixing C/C++ vulnerabilities.Thus, we mainly collected C/C++ vulnerable examples from the CWE websites.But if there is no C/C++ vulnerable example for a CWE type, we will also collect one example of other languages like C# and Java.Please note that our approach is generic and language-agnostic and can be applied to any programming language.During training, we employ the Adam optimizer with a learning rate of 1 − 4. The weight decay rate is set to 0.01.The batch size is set to 64, and the training process consists of 20 epochs.After each epoch, we evaluate the model's performance on the validation set.The best checkpoint on the validation set is selected for the final testing.For the hyper-parameters of the FiD framework, the maximum length of an individual segment (e.g., a segment of a lengthy vulnerable function) is set to 512, aligning with the maximum length of CodeT5 [74].Additionally, we observe that about 90% of the samples in our dataset comprise fewer than 10 segments, each containing up to 512 tokens.These segments encompass various input components, including the vulnerable function, AST node sequences, CWE descriptions, and examples from relevant CWE types.Therefore, we set the maximum number of segments of FiD to 10.We will discuss the impact of the choice of the maximum number of segments in Section 5.3 (RQ3).
Evaluation metric.In accordance with previous studies [19], we employ the Exact Match (EM) metric as our evaluation measure.EM refers to the percentage of the generated code that has the same token sequence as the ground truth.We also utilize another widely used metric, i.e.BLEU-4 [59] score, which evaluates the token-level similarity between the generated code and the ground truth.In addition, we utilize the CodeBLEU [61] score, which is a specialized variant of the BLEU score tailored for source code by additionally taking code structure into account.We consider the Top 1 prediction when calculating metrics.6

EXPERIMENTAL RESULTS
Our work aims to answer three research questions (RQ).
• RQ1: How effective is VulMaster compared to baselines?
In RQ1, we compare our VulMaster to learning-based SOTA vulnerability repair approaches, widely used pre-trained code models, and the latest LLMs.• RQ2: How do the key designs of VulMaster influence the model performance?In RQ2, we conduct an ablation study to confirm the contributions of different modules.

• RQ3: What are the influences of different design choices?
In RQ3, we investigate how different design choices impact the effectiveness of our VulMaster.

RQ1. The Effectiveness of VulMaster
Setup.We evaluate both the baselines (described in Section 4.2) and our VulMaster model using the dataset under investigation (discussed in Section 4.1).The evaluation metrics, namely EM, BLEU, and CodeBLEU, are detailed in Section 4.3.Notably, higher scores for all these metrics indicate better performance.Moreover, in the development of VulMaster, we employed a bugfixing dataset from [9] and vulnerable-fixed code pairs obtained from the CWE systems.However, the baseline models had not previously been fine-tuned on the same bug-fixing corpus or CWE data, potentially resulting in an unfair comparison.To mitigate this potential bias, we conducted additional experiments with the baselines, aligning them with our approach.This involved a four-step process: Firstly, we merged the bug-fixing dataset [9] with the vulnerablefixed code pairs from the CWE system, creating a merged single dataset.Secondly, we conducted fine-tuning on this merged dataset for the baselines (with the exception of close-sourced GPT-3.5 and GPT-4).The objective of this fine-tuning process was to generate clean code snippets given the presence of buggy or vulnerable code.Thirdly, the baselines were further fine-tuned using the vulnerability repair dataset [5,15] to prepare them for the vulnerability repair task.Finally, the baselines generated predictions for the test set of the vulnerability repair dataset.The results achieved after enhancing the bug-fixing and CWE data in this manner are denoted by the label "+bug-fixing and CWE data".

Results. Table 2 presents the performance comparisons between
VulMaster and the baseline models, with the best results highlighted in bold.Our VulMaster achieves the best results among all baselines and outperforms the learning-based SOTA by a large margin.The experimental results illustrate that VulMaster enhances the EM, BLEU, and CodeBLEU scores from 10.2% to 20.0%, 21.3% to 29.3%, and 32.5% to 40.9%, respectively, compared to the state-of-the-art learning-based vulnerability repair approach Vul-Repair.We conducted the Wilcoxon signed-rank test [77] on the paired data between VulMaster and each of the baseline models.The resulting p-values for all the comparisons were found to be less than 0.001, indicating that the performance differences between VulMaster and the baseline models are statistically significant.
Furthermore, LLMs such as GPT-3.5 and GPT-4 struggle to effectively repair real-world software vulnerabilities possibly because they cannot update their parameters to fully utilize the training data.Sun et al. [66] also reported a similar phenomenon in code summarization, where GPT-3.5 performs notably worse than fine-tuned pre-trained CodeT5 in terms of BLEU scores.
Table 2 also highlights the improvements achieved by incorporating the bug-fixing corpus and CWE data into all baselines.Despite these improvements, VulMaster maintains a substantial lead over all baselines.Notably, VulMaster outperforms the best-performing baseline (the enhanced VulRepair) by a significant margin, with improvements of 19.0%, 21.1%, and 15.9% in terms of EM, BLEU, and CodeBLEU, respectively.One of the novel aspects of VulMaster is the usage of CWE data that was not leveraged in prior works.These experiment results demonstrate that the reason behind VulMaster's superior performance over the baselines is not the inclusion of the CWE data alone.Rather, the other unique designs of VulMaster, such as 1) the Fusion-in-Decoder module that handles lengthy vulnerable functions, and 2) the AST node sequence that provides vulnerable code structures to the model, are also beneficial.
Analyses.We further analyze the learning-based SOTA VulRepair (also the best-performing baseline) and our VulMaster.Specifically, we divided the whole test set into subgroups and observed how VulRepair and VulMaster perform in different subgroups.Those subgroups are split based on three different criteria: the lengths of the input functions, the frequencies of vulnerabilities, and the severity levels of vulnerabilities.In terms of the lengths of the input, two subgroups were created: 1) the long function group consisted of half of the test samples whose vulnerable function lengths are Figure 4 shows the EM performance for each subgroup.Vul-Master exhibits significant superiority over the learning-based SOTA in all investigated dimensions.For example, VulMaster could improve the EM scores from 8.2% to 17.5% for the top risk group.Table 3 presents the number of perfect predictions made by VulMaster for the top risk group.Although the correct fix ratio (17.5%) of VulMaster is not high enough, this improvement over the SOTA is significant and suggests that our method is a substantial step towards more practical automatic vulnerability approaches in the future.Additionally, we observe that VulRepair exhibits lower performance with infrequent vulnerabilities compared to frequent ones, consistent with findings from a recent study [85] that suggested learning-based approaches may struggle with infrequent data.In contrast, VulMaster achieves comparable performance across both frequent and infrequent vulnerabilities, demonstrating its great performance with less common data.

RQ2. The Impact of the Core of VulMaster
Setup.This RQ aims to investigate the contributions of the key designs of our approach.Specifically, we conduct two sets of ablation studies using different backbone models: the input component designs and the DL model designs.In each ablation study, we remove one component at a time to examine the individual contributions of key designs.ChatGPT.For DL model designs, we experiment with three variants: (4) VulMaster w/o relevance-classifier: disregarding the binary classification task used to identify relevant vulnerable examples, Results and Analyses.The experimental results are presented in Table 4, with the best results highlighted in bold and the largest performance drop in the underline.Our main findings are: 1) All key designs are essential to achieve the best performance.Based on the results shown in Table 4, we observe that the absence of each key design in VulMaster leads to a reduction in the Exact Match score.It demonstrates that each component plays an important role in VulMaster.Specifically, the complete function, structural information, and CWE knowledge bring 6%, 5%, and 5.5% improvements, respectively.For DL model designs, the auxiliary relevance prediction task, the FiD, and the model adaptation bring 2%, 16.5%, and 32% improvements, respectively.
2) The model adaptation is the most effective module.We find that incorporating model adaptation with the bug-fixing corpus contributes to a significant improvement in VulMaster.While Chen et al. [9] demonstrated the effectiveness of bug-fixing corpus in the vanilla Transformer model, we are the first to show its high efficacy in adapting pre-trained models like CodeT5.
Answer to RQ2: All designs are essential for the performance of VulMaster.Besides, our input and model designs are effective and improve the performance by at most 32% in EM.

RQ3. Influences of Design Choices
Setup.In this RQ, we experiment VulMaster with three main design choices: 1) the maximum number of segments, 2) the prompt used to generate fixes for vulnerable examples from the CWE system, and 3) the ChatGPT models used to generate the fixes for vulnerable examples.For the maximum number of segments (denoted as ), it is a key hyperparameter in FiD [30], which controls the improved range of inputs.For instance, if  = 10, then the maximum input range of VulMaster is broadened from 512 to 5,120 tokens.In this RQ, we conduct experiments by varying  from 1 to 20, with an increment of 5.This aims to validate our choice of  = 10.For prompts, they can influence the performance of LLMs, like Chat-GPT, to some extent [6,62].While the prompt used in Section 3.2 may not be the optimal choice, it is impractical to explore all potential prompts [62].To examine VulMaster's sensitivity to different 8 prompts, we evaluate three distinct prompts, as shown in Table 6.P0 is the prompt we used in Section 3.2, P1 is the rephrased version of P0, and P2 is a simplified version of P0.This experiment aims to demonstrate that VulMaster can deliver satisfactory performance even with varied prompts.For ChatGPT models, OpenAI continuously and episodically updates its ChatGPT models [57].To assess the performance of VulMaster to different versions of ChatGPT models, we experiment with three ChatGPT model versions: GPT-3.5-turbo,GPT-3.5-turbo-0301, and GPT-3.5-turbo-0613.Among them, GPT-3.5-turbo is the latest model.On the other hand, GPT-3.5-turbo-0301 and GPT-3.5-turbo-0613 are deprecated snapshots from March 1st and June 13th of 2023, respectively.
Results of Varying Maximum Number of Segments.Table 5 presents the experimental results.In previous experiments, we chose  = 10 as it adequately covered over 90% of the data samples.
From Table 5, we further validate that K=10 strikes a balance between effectiveness and efficiency.On one hand, when  is less than 10, VulMaster's effectiveness drops accordingly.For example, with  = 5, the validation and testing EM scores decrease by 9.9% and 1.0%, respectively.On the other hand, when  is larger than 10, there is no further improvement in VulMaster's test EM scores.This is likely because  = 10 already covers all input segments of over 90% of data samples, and increasing it further does not provide significant benefits.However, higher  leads to slower inference. = 10 reaches a balance between effectiveness and efficiency.
Results of Varying Prompts and ChatGPTs.The experimental results are shown in the bottom part of Table 6.For three different prompts and different ChatGPT, we observe that VulMaster performance varies from 19.8% to 20.3%, with at most 3% differences.This indicates that VulMaster can deliver satisfactory performance with varied prompts and ChatGPT models.
Answer to RQ3: By setting the maximum number of segments as 10, VulMaster achieves a balance between effectiveness and efficiency.Besides, VulMaster shows stable performance across multiple plausible prompts and Chat-GPT model versions, with at most 3% differences.

DISCUSSION 6.1 Case Study
Figure 5 presents two fixes generated by the SOTA VulRepair and VulMaster for a vulnerable function [37] of CWE-401 from the evaluation dataset.CWE-401 refers to situations where memory is allocated dynamically but not properly released after it is no longer needed, leading to memory leaks [12].The vulnerable function waits for the completion of a prior operation (Line 10), and if the operation times out (Line 12), it returns with an error code (Line 15).However, it fails to release the dynamically allocated buffer (skb) before the return statement, resulting in a CWE-401 vulnerability.VulMaster correctly generates the fix (Line 14) to free the memory with skb.Its understanding of the entire function enables it to recognize the error-handling mechanism already present in Line 19 (i.e., kfree_skb(skb);), helping it to choose the suitable function (i.e., kfree_skb) for freeing the memory.In contrast, the SOTA VulRepair makes a wrong prediction because the vulnerable function exceeds VulRepair's input limit of 512 code tokens before the vulnerability location (Line 14).As a result, VulRepair's prediction only adds an irrelevant code line (Line 7).

How Do Fixes of CWE Examples Help?
The ablation study (RQ2) revealed the effectiveness of the CWE knowledge.The potential reason behind the effectiveness lies in that the generated fixes do not only describe "what is vulnerable" like the CWE names and vulnerable examples, but also explicitly provide knowledge on "how to fix an identified vulnerability".This is crucial information for the vulnerability repair task.The validity of this explanation depends on the accuracy of the generated fixes.
To validate the explanation, the first and second authors manually assessed the correctness of the generated fixes for C/C++ vulnerable code examples.They carefully examined the vulnerable examples and the expert analyses from the CWE website and then reviewed the potential fixes generated by ChatGPT.Fixes that successfully mitigated the vulnerability were labeled as "correct," while those that did not were labeled as "wrong."In case of any  discrepancies in annotations, a separate meeting was conducted to resolve each discrepancy, resulting in the final labels (shared in our replication package [2]).The manual investigation revealed that ChatGPT achieved an accuracy of 76.1% (67/88) in generating fixes for C/C++ vulnerable examples from CWE.This high accuracy supports our explanation to some extent.Additionally, ChatGPT performs exceptionally well in generating fixes for vulnerable examples from CWE.However, it struggles to repair vulnerabilities from real-world projects (RQ1).The reason is that security experts have already provided ample analyses and descriptions of the vulnerable examples on the CWE website (as depicted in ❶ of Figure 3).In contrast, such detailed information is often unavailable when repairing vulnerabilities in real-world projects, making the task much more challenging.

Generalizability to Benchmark with Tests
Motivation.In our previous experiments, we utilized evaluation data from CVEFixes [5] and Big-Vul [15], encompassing a wide range of vulnerabilities collected from 1,754 C/C++ projects.However, this evaluation data solely included vulnerable code snippets and their associated fixes, lacking accompanying test cases.To ensure VulMaster's generalizability to benchmarks that include test cases, we conducted an additional evaluation on the Vul4J [7] benchmark.Vul4J [7] is a high-quality Java vulnerability repair benchmark and includes test cases.Our evaluation of Vul4J also demonstrates VulMaster's versatility in handling multiple programming languages, such as C/C++ [5,15] and Java [7].Setup.The Vul4J benchmark serves as a test dataset for evaluation and does not provide a training pipeline for learning-based approaches.To overcome this limitation, we adopted the methodology of a recent study by Wu et al. [78], which conducted experiments comparing multiple Large Language Models (LLMs) and Automatic Patch Repair (APR) models on the Vul4J benchmark.Wu et al. [78] conducted their evaluation by specifically selecting 35 single-hunk vulnerabilities from the Vul4J benchmark, as the APR models under investigation were primarily designed to address single-hunk bugs.Their comparative analysis involved three main categories: 1) the frozen LLMs, 2) fine-tuned LLMs, and 3) state-of-the-art APRs.The frozen LLMs directly performed inferences on the test set.For the fine-tuned LLMs, Wu et al. addressed the limited availability of Java vulnerabilities for fine-tuning by utilizing the general bug-fixing Java data shared in [32] as their fine-tuning dataset.We utilized the bug-fixing Java data [32] as the training data and followed Wu et al. to examine the correctness of the generated patches by using the available test cases and manual verification.Results.Considering that the recent work [78] utilized large model sizes for their baselines (up to 12 billion parameters), we introduced a larger variant of VulMaster in this discussion to align with the baselines.We use CodeT5p-large (0.8 billion parameters) as the backbone model in this experiment.The results, presented in Table 7, clearly demonstrate that VulMaster achieved the highest performance.Out of 35 vulnerabilities, VulMaster repaired 9, achieving a fix rate of 25.7%.

Explanations for Low Fix Rate
VulMaster achieves a relatively low fix rate, primarily due to the inherent complexity of certain vulnerabilities.To delve into the causes of these incorrect fixes, we conducted a manual analysis.We randomly sampled 100 vulnerable-fixed code pairs from the test set and examined the corresponding generated fixes.Out of the 100 sampled vulnerabilities, VulMaster produced 78 incorrect fixes.Our manual investigation of these 78 failures revealed major causes as follows: 1) Vulnerabilities requiring multiple-hunk code changes pose a significant challenge for VulMaster.This category accounted for 31 failures.Generating accurate multi-hunk patches remains a challenging task for learning-based models.Notably, many stateof-the-art APR techniques [33,34,81,88] do not support bugs that require multi-hunk code changes.2) For 13 failures, developers' correct patches incorporate members of structure/union variables that are not defined within the vulnerable functions, which means those members are unknown to VulMaster.Thus, it is unlikely for VulMaster to generate the correct fixes.3) For 8 failures, VulMaster fails to produce identical strings as the developers' correct patches.4) For 5 failures, VulMaster fails to generate complex conditions in the correct fixes.5) For 5 failures, VulMaster fails to use the userdefined APIs correctly.6) For 4 failures, VulMaster uses the wrong APIs.7) For 5 failures, VulMaster generates semantically equivalent but syntactically different patches.Since we use Exact Match as the evaluation metric, these patches are considered 'incorrect' fixes.Besides, 7 other failures cannot be classified.

Threats to Validity
Threats to internal validity pertain to potential errors and biases in our experiments.To address these threats, we take several precautions.Firstly, we strictly follow the settings and methodologies employed by baselines, using their official implementations to ensure consistency.We have reviewed and validated our code and data, which are publicly accessible for transparency and reproducibility.Threats to external validity relate to the generalizability of VulMaster.To mitigate this potential concern, we meticulously select the experimental dataset, metrics, and baselines.The vulnerability repair dataset comprises 1,754 open-source software projects and has been utilized in prior learning-based state-of-the-art approaches.For the metrics, we select the EM, BLEU, and CodeBLEU, which are widely used in generation-based SE studies, e.g., [43,50].Lastly, we follow prior works [9,19,78] and assume perfect fault localization.We acknowledged that obtaining perfect or near-perfect localization is challenging.We consider more effective fault localization to be beyond the scope of this paper.We consider a human-in-the-loop setting where an experienced software engineer sifts through the 10 output of a fault localization tool and only employs our approach once the location of the vulnerability is identified.We encourage future work to continue looking into better methods to localize faults, extending the popular line of work on fault localization.

RELATED WORK
Vulnerability Datasets.Previous research has introduced various datasets to facilitate the evaluation of vulnerability repair approaches.The ManyBugs [40] dataset consists of 185 defects in 9 open-source programs.Most of their defects are logical errors rather than security vulnerabilities.The VulnLoc [64] dataset contains 43 vulnerable programs from 10 projects that span 6 CWEs.The Vul4J [7] dataset includes reproducible vulnerabilities from 51 open-source Java projects, representing 25 different CWEs.CVE-Fixes [5] and Big-Vul [15] are two large vulnerability repair datasets without test cases, curated from 1,754 projects.The large number of projects in CVEFixes and Big-Vul datasets help ensure the diversity of vulnerabilities stored in them.
Learning-based Vulnerability Repair.A number of learningbased approaches for repairing vulnerabilities have been proposed by researchers, extending the line of work on learning-based program repair, e.g.[39].For example, Ma et al. [48] proposed Vurle, the first learning-based approach for vulnerability repair; it learns transformative edits and their contexts from examples of vulnerable code fragments and their repairs.Chi et al. [10] introduced SeqTrans, a Transformer-based machine translation model with copy mechanisms designed to fix Java vulnerabilities.Chen et al. [9] proposed VRepair, which initially pre-trains a vanilla Transformer model on a bug-fixing corpus and subsequently utilizes it to address C/C++ vulnerabilities.Fu et al. [19] proposed VulRepair, which leverages a BPE tokenizer and CodeT5 that is pre-trained on a large code corpus.Wu et al. [78] conducted an evaluation of nine large language models and four program repair models on the Vul4J dataset [7].In contrast, VulMaster is specifically designed to effectively utilize diverse input sources: 1) input vulnerable functions, 2) code structures, and 3) rich CWE expert knowledge, while leveraging the power of Large Language Models.Our goal is to advance the progress of learning-based vulnerability repair approaches.
Program Analysis-based Vulnerability Repair.Several program analysis-based approaches for repairing vulnerabilities have been proposed.CDRep [47], one of the earliest program analysisbased vulnerability repair approaches, employs a set of templates and data flow analysis to fix cryptographic-related vulnerabilities in Android apps, with a success rate of 90%.SenX [29] aims to repair vulnerabilities by leveraging vulnerability-specific and humanspecified safety properties.SAVER [27] and Memfix [41] are designed to address memory errors.FootPatch [69] generates patches that adhere to specific heap properties specified using separation logic.FootPatch is constrained to addressing only a limited set of common vulnerability types, such as memory leaks.VulnFix [83] utilizes counterexample-guided inductive inference for repairing vulnerabilities.However, VulnFix is not applicable to vulnerabilities that cannot be resolved by modifying or inserting conditions, or situations that necessitate the introduction of new program variables [83].Unlike CDRep, SenX, SAVER, Memfix, FootPatch, and VulnFix, VulMaster is not limited to specific vulnerability types.
Fix2Fit [20] uses fuzz testing to filter out patches (produced by the APR model) that cause crashes.Combining VulMaster and Fix2Fit has the potential to enhance overall performance.CPR [63] employs concolic execution, along with a user-provided specification, to generate new inputs that aid in identifying and filtering out overfitting patches among the generated ones.ExtractFix [21] uses dependency analysis and symbolic executions to fix vulnerabilities.In contrast to CPR and ExtractFix, VulMaster does not depend on heavy concolic and symbolic executions, making it able to generate patches more efficiently.
VulMaster differentiates itself from program analysis-based vulnerability repair approaches in several key aspects.Firstly, as a learning-based approach, VulMaster is not limited to a specific type of vulnerability, distinguishing it from various program analysisbased methods like SenX [29], SAVER [27], Memfix [41], Foot-Patch [69], and VulnFix [83].However, for vulnerability types where good examples are hard to find, and specialized algorithms, precise templates, or formal safety properties are available, program analysis-based solutions will perform better.Thus, the two lines of work (learning-based and program analysis-based vulnerability repair) are complementary.Secondly, VulMaster is adaptable to multiple programming languages, different from many program analysis-based vulnerability repair methods (such as SAVER [27], Fix2Fit [20], and ExtractFix [21]) focusing solely on C/C++.Thirdly, VulMaster can generate patches more efficiently (about 10-20 seconds for one vulnerability) than program analysis-based vulnerability repair methods relying on heavy symbolic and concolic executions (e.g., CPR [63] and ExtractFix [21]) or time-consuming dynamic invariant inference (e.g., VulnFix [83]).

CONCLUSION AND FUTURE WORK
We propose VulMaster, a vulnerable repair model that effectively mitigates the input length limitations of Transformer-based pretrained models, following FiD.By eliminating the length limit, Vul-Master can integrate diverse information, including vulnerable code structures and expert knowledge from the CWE system, to further enhance its repair capabilities.The experimental results demonstrate that VulMaster outperforms all baselines significantly.A replication package is provided at https://github.com/soarsmu/VulMaster_.
In the future, our interests will extend across multiple directions.As there is potential redundancy or noise in the inputs, we plan to design a refinement and denoising module to mitigate these issues.We also plan to investigate additional evaluation metrics, c.f., [14,60].Moreover, we are keen on employing larger pre-trained code models for this task in the future, utilizing parameter-efficient tuning techniques like [76].Additionally, we plan to develop trustworthy and synergistic AI4SE methodologies [46] to help software engineers better trust and synergize with VulMaster and other automated vulnerability identification and repair solutions.

- 1 Figure 1 :
Figure 1: A motivating example.❶ the process of how junior developers repair vulnerability; ❷ a vulnerable function and its fixes from the ImageMagick project; ❸ a vulnerable example and its fixes from the CWE website.

Figure 3 :
Figure 3: Details of CWE knowledge extraction.❶ the CWE name and one example from the CWE website; ❷ the process of generating fixes of typical vulnerable examples from CWE; ❷ the process to obtain vulnerable-fix code pairs given the target CWE.

Figure 4 :
Figure 4: Model performance in EM on different groups.

( 1 )
VulMaster w/o entire function: removing the extended part of lengthy functions and only including the first 512 tokens, similar to the prior approaches like VulRepair, (2) VulMaster w/o AST : excluding the AST traversals, and (3) VulMaster w/o CWE knowledge: eliminating the CWE name, vulnerable examples from CWE, and generated fixes for CWE vulnerable examples by VulMaster w/o FiD: removing the FiD, resulting in the simple utilization of the backbone model (truncating inputs at the 512 tokens).(6) VulMaster w/o model adaptation: leveraging the original CodeT5-base rather than the adapted-CodeT5 in Section 3.3.
CWE knowledge is helpful in inspiring the repair.To enhance the understanding of software vulnerabilities, the junior security developer may refer to the name/description and typical vulnerable examples of CWE-125 on the CWE website.From the example shown in ❸, one typical cause of CWE-125 is forgetting to ensure the array index is not a negative number.Upon identifying this pattern, the developer can revisit the target vulnerable function in ❷: Line 7 of the target function obtains a value for the variable 2Fusion-in-Decoder (FiD) & Relevance Prediction Part 1: Vulnerable Code Preprocessing.Given a vulnerable function   , this part preprocesses the function into a code token sequence and the AST node sequence by traversing the ASTs.Part 2: CWE knowledge Extraction.Given the CWE type   of the vulnerable function   , this part outputs the name, vulnerable code examples listed on CWE web pages, and the corresponding fixes for vulnerable code examples from CWE web pages.

Table 1 :
Statistics of the studied dataset

Table 2 :
Model performance of baselines and VulMaster

Table 3 :
Numbers of perfect predictions for the top-10 most dangerous CWEs ) the short function group, consisted of the remaining test samples.The threshold between long/short groups is 449 tokens.In terms of the frequencies of vulnerabilities, we have: 3) the frequent group, consisted of 50% of the test samples whose CWE types are more frequent, and 4) the infrequent group, consisted of the remaining half of test samples.For the severity levels of vulnerabilities, we have: 5) the top risky group consisted of test samples whose CWE types are one of the top 10 most dangerous CWE types, and 6) the less risky group consisted of the remaining test samples.

Table 5 :
Effectiveness and Efficiency when the maximum number of segments vary

Table 6 :
Different prompts and the performance of VulMaster when prompts and ChatGPT models vary