Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation

Source code vulnerability detection aims to identify inherent vulnerabilities to safeguard software systems from potential attacks. Many prior studies overlook diverse vulnerability characteristics, simplifying the problem into a binary (0-1) classification task, i.e., determining whether a function is vulnerable or not. This makes it challenging for a single deep-learning-based model to effectively learn the wide array of vulnerability characteristics. Furthermore, due to the challenges associated with collecting large-scale vulnerability data, these detectors often overfit limited training datasets, resulting in lower generalization performance. To address these challenges, we introduce a fine-grained vulnerability detector, namely FGVulDet. Unlike previous approaches, FGVulDet employs multiple classifiers to discern the characteristics of various vulnerability types and combines their outputs to identify the specific type of a vulnerability. Each classifier is designed to learn type-specific vulnerability semantics. Additionally, to address the scarcity of data for some vulnerability types and enhance data diversity for learning better vulnerability semantics, we propose a novel vulnerability-preserving data augmentation technique to increase the number of vulnerability samples. Taking inspiration from recent advancements in graph neural networks for learning program semantics, we incorporate a Gated Graph Neural Network (GGNN) and extend it to an edge-aware GGNN to capture edge-type information. FGVulDet is trained on a large-scale dataset from GitHub encompassing five different types of vulnerabilities. Extensive experiments against static-analysis-based and learning-based approaches demonstrate the effectiveness of FGVulDet.


Introduction
A software vulnerability is defined as a weakness in a software system that could be exploited by a threat source. With the increasing number of open-source libraries and the expanding size of software systems, the count of software vulnerabilities has been rising rapidly. Since these vulnerabilities can be exploited by malicious attackers, causing significant financial and social damage, vulnerability detection and patching have garnered widespread attention from academia and industry. For instance, the Common Vulnerabilities and Exposures (CVE) Program and the National Vulnerability Database (NVD) have been established to identify and patch vulnerabilities before they are exploited. So far, over 100,000 vulnerabilities have been indexed. However, in contrast to the quantity of open-source projects and the speed of software iteration, the number of exposed vulnerabilities is insufficient. In other words, there exists a large number of "silent" vulnerabilities that have not yet been exposed or exploited.
Static analysis for vulnerability detection aims to identify vulnerabilities in source code without executing it, typically requiring substantial manual effort from security experts to craft rules. This approach has limited generalization ability across diverse vulnerabilities. Dynamic techniques, such as fuzzing and symbolic execution, identify vulnerabilities by dynamically executing programs. Dynamic approaches demonstrate relatively high precision in vulnerability detection, but configuring the execution is complex, and the results may be incomplete since not every program path can be executed.
Due to the capability of deep learning-based techniques to automatically extract features, more research focuses on utilizing DL techniques for vulnerability detection [9,28,29,31,40,52,54]. In early DL-based vulnerability detection works, some works [40] employed convolutional neural networks (CNNs) to leverage their powerful convolution capabilities for learning vulnerability-related features. However, as programs, unlike images, are not of fixed length, they are not well-suited for CNNs. To avoid this problem, other works [9,29,31,54] treat programs as flat sequences and directly apply recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM) to learn vulnerability features. Yet, in vulnerability scenarios, certain types of vulnerabilities, such as buffer overflow, are related to data flow, which cannot be captured from the program text alone. To capture the data dependencies and control dependencies of programs, Li et al. [28] proposed a program slicing algorithm based on the program dependency graph (PDG) to slice related statements and feed them to bidirectional RNNs for learning. However, it still fundamentally treats programs as sequences.
How can the well-structured control and data dependencies in programs be learned? Devign [52] proposed an effective way by encoding programs into a code property graph (CPG) and utilizing this graph through a GGNN [27] for vulnerability detection, achieving state-of-the-art performance. Since then, a great number of works have used GNNs to learn program semantics for source code vulnerability detection [7,38,46]. However, most of these works combine various types of vulnerabilities to train a single classifier for vulnerability detection. Moreover, data augmentation has been shown to significantly improve performance on image data [18,20,21]. Recent works [23,34] propose to augment code with functionality-equivalent variants produced by transformations, for contrastive pre-training that learns code functionality for different downstream tasks. However, the defined transformations operate at the granularity of the whole function, and they cannot guarantee that the vulnerability-preserving property holds when transforming a function into variants. Hence, performing data augmentation while preserving the source code vulnerability is a challenge.
To address these challenges, in this paper, we propose FGVulDet, a fine-grained vulnerability detector. Specifically, we train multiple classifiers via the enhanced GGNN, one for each type of vulnerability, on a real vulnerability dataset collected from GitHub. Each model's prediction is then ensembled by voting to give the final prediction. Furthermore, we propose a novel vulnerability-preserving data augmentation to enrich the diversity of the data and improve prediction performance. On the GGNN side, we adopt it and further encode the edge-type information along with the node features during message passing, i.e., an edge-aware GGNN, to enhance vulnerability detection. We argue that the edge types, e.g., "Flow to" and "Control", represent different semantics of programs, and encoding them explicitly during the learning process facilitates learning more accurate code representations. An extensive evaluation conducted on five different types of vulnerabilities, comparing against static-analysis tools and deep-learning-based vulnerability detection approaches, confirms the superiority of our proposed approach. A further ablation study also reveals the effectiveness of each component of FGVulDet. Our contributions are as follows.
• We propose a novel vulnerability-preserving data augmentation technique to enrich the collected data and mitigate the scarcity of rare vulnerability types.
• We adopt an edge-aware GGNN that incorporates edge-type features with node features to improve the learning capacity of GGNN for vulnerability detection.

Code Property Graph
The code property graph (CPG), proposed by Yamaguchi et al. [50], combines several program representations, e.g., the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Program Dependency Graph (PDG), into a joint graph to represent a program. An illustrative example is shown in Figure 1. We can observe that the AST nodes and edges (denoted by black arrows in the graph) are the backbone of the CPG. Besides the AST, other semantic representations, i.e., control-flow and program-dependency information, are constructed on top of the AST to represent different semantics of the program. For example, the CFG represents the statement execution order of the program, and the "Flow To" edges (blue arrows) represent this order in the CPG. Furthermore, the PDG is also included in the CPG: the "Define/Use" (green arrow) and "Reach" (red arrow) edges define the data dependencies, and the "Control" (yellow arrow) edges define the control dependencies of a program.

Graph Neural Networks
Graph Neural Networks (GNNs) [25,27] have been widely employed to model non-Euclidean data structures such as social networks [19,25] and protein-protein interaction networks [36]. The primary objective of a GNN is to identify patterns in graph data, relying on information within the nodes and their interconnectedness. There exist various GNN variants; here we describe the broad category of message-passing neural networks [17]. Suppose the original data can be modelled by a multi-edged graph $G = (N, E)$, where $N = \{n_v\}$ is the node set and $E$ is a set of directed edges $n_u \xrightarrow{l} n_v$ with edge type $l$. Each node $n_v$ is endowed with a vector representation $h_v^t$ indexed over a timestep (hop) $t$. The node states are updated as

$$h_v^{t+1} = f_t\Big(h_v^t,\ \bigoplus_{(n_u \xrightarrow{l} n_v) \in E} m_l(h_u^t, l, h_v^t)\Big) \quad (1)$$

where $m_l(\cdot)$ is a function that computes the message based on the edge label $l$, $\oplus$ is an aggregation operator that summarises the messages from a node's neighbors, and $f_t$ is the update function that updates the state of node $n_v$. The initial state $h_v^0$ of each node comes from node-level information. Equation 1 updates all node states recursively a total of $T$ times, and at the end of this iteration each $h_v^T$ represents information about the node and how it fits within the context of the graph. The well-known Graph Convolution Network (GCN) [25] and Gated Graph Neural Network (GGNN) [27] also follow Equation 1, but their definitions of $f_t$ and $m_l(\cdot)$ differ. For example, GGNN, which has been widely used in modeling source code [1,32,33,52], employs a single GRU cell [8] for $f_t$, i.e., $f_t = \mathrm{GRU}(\cdot, \cdot)$, $\oplus$ is a summation operation, and $m_l(h_u^t, l, h_v^t) = W_l h_u^t$, where $W_l$ is a learned matrix. The difference in GCN is that $f_t$ is the ReLU function [37], so that $h_v^{t+1}$ can be expressed as

$$h_v^{t+1} = \mathrm{ReLU}\Big(\sum_{(n_u \xrightarrow{l} n_v) \in E} W_l h_u^t\Big)$$
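As an illustrative sketch (not the paper's implementation), the generic message-passing update of Equation 1 can be written in a few lines of NumPy. The edge-type names and matrices below are hypothetical toy values; a real GGNN would use a trained GRU cell as the update function instead of the GCN-style ReLU shown here.

```python
import numpy as np

def message_passing_step(h, edges, W_by_type, update):
    """One hop of generic message passing (a sketch of Equation 1).

    h:          dict node -> state vector h_v^t
    edges:      list of directed edges (u, edge_type, v)
    W_by_type:  dict edge_type -> matrix W_l, so the message is m_l = W_l @ h_u
    update:     f_t(h_v^t, aggregated message) -> h_v^{t+1}
    """
    agg = {v: np.zeros_like(s) for v, s in h.items()}
    for u, l, v in edges:                     # sum aggregation (the GGNN choice of ⊕)
        agg[v] += W_by_type[l] @ h[u]
    return {v: update(h[v], agg[v]) for v in h}

# GCN-style update: h^{t+1} = ReLU(aggregated message)
relu_update = lambda h_v, m: np.maximum(m, 0.0)

# Toy graph: node 0 sends a message to node 1 over a "flow_to" edge.
h0 = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}
W = {"flow_to": np.eye(2)}                    # hypothetical learned matrix
h1 = message_passing_step(h0, [(0, "flow_to", 1)], W, relu_update)
```

Stacking this step T times, with a GRU in place of `relu_update`, yields the GGNN recurrence described above.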

Approach

Overview
The framework of our approach is illustrated in Figure 2. The collected projects are diverse in functionality, spanning various domains such as operating systems, networking, and database applications (e.g., Linux Kernel, OpenSSL, QEMU).
To ensure the quality of data labelling, we follow three steps to collect vulnerability-related commits.
• Commit Filtering. We employ vulnerability-related keywords (shown in Table 1), which have been analyzed and summarized by a team of professional security researchers from a large number of commits. These keywords cover five Common Weakness Enumerations (CWE) defined in the National Vulnerability Database (NVD), with each vulnerability type having one or more associated keywords. Commits whose messages do not match any of the keywords in Table 1 are excluded, and the remaining commits are considered more likely to be related to vulnerabilities. For example, in Figure 3, the vulnerability-related commit is accurately captured by one of these keywords.
• Type Matching. Commits matched by keywords of multiple vulnerability types are excluded, as we cannot determine which vulnerability type they belong to. We retain commits matched by a single vulnerability type and use that type to label them.
• Commit Pruning. Some vulnerability-related commits modify multiple functions, and not all of these functions are related to the vulnerability. Since we cannot automatically identify which function is related to the vulnerability, we exclude commits that modify more than one function.
After the above three steps, we obtain a high-quality commit dataset with vulnerability-type labels.
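The three filtering steps above can be sketched as a small pipeline. The keyword table and commit fields below are illustrative assumptions, not the paper's exact Table 1 keywords.

```python
# Hypothetical keyword table; the paper's Table 1 has one or more keywords per CWE.
KEYWORDS = {
    "CWE-120": ["buffer overflow"],
    "CWE-672": ["use after free"],
}

def filter_commits(commits):
    """Keep commits matched by exactly one vulnerability type (Commit
    Filtering + Type Matching) that modify a single function (Commit Pruning)."""
    kept = []
    for c in commits:  # c: {"message": str, "changed_functions": [names]}
        msg = c["message"].lower()
        types = {t for t, kws in KEYWORDS.items() if any(k in msg for k in kws)}
        if len(types) != 1:          # zero keywords or ambiguous type: exclude
            continue
        if len(c["changed_functions"]) != 1:  # multi-function commit: exclude
            continue
        kept.append((c, types.pop()))
    return kept

commits = [
    {"message": "fix buffer overflow in parser", "changed_functions": ["parse"]},
    {"message": "fix buffer overflow and use after free", "changed_functions": ["f"]},
    {"message": "refactor build scripts", "changed_functions": ["g"]},
]
labeled = filter_commits(commits)  # only the first commit survives all three steps
```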

Vulnerable/Non-vulnerable Function Extraction.
Given a vulnerability-related commit as input, we can get its corresponding security patch $p$. We extract the vulnerable function $f_v$ and the patched function $f_p$ based on the changed statements (i.e., the added statements $S_a$ and the deleted statements $S_d$) from $p$. In this work, we take $f_v$ as a vulnerable function and $f_p$ as a non-vulnerable function. We thus obtain a tuple $(f_v, f_p, S_a, S_d)$, where $S_a$ and $S_d$ will be utilized for augmentation (see Section 3.3). An illustrative example of a security patch is shown in Figure 3, where the changed statements $S_d$ and $S_a$ are at lines 7-8 and lines 9-10, respectively. The vulnerable function $f_v$ is composed of lines 5-8 and lines 11-12 in Figure 3, and the patched function $f_p$ is composed of lines 5-6 and lines 9-12.
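As a simplified illustration of the extraction step, the added and deleted statements can be read off a unified-diff hunk: deleted lines belong to the vulnerable function and added lines to the patched one. The `memcpy` hunk below is a hypothetical example.

```python
def split_patch(patch_lines):
    """Extract deleted (vulnerable-side) and added (patched-side) statements
    from the body of a unified-diff hunk. A simplified sketch."""
    deleted, added = [], []
    for line in patch_lines:
        if line.startswith("---") or line.startswith("+++"):
            continue  # skip file headers
        if line.startswith("-"):
            deleted.append(line[1:].strip())   # part of the vulnerable function f_v
        elif line.startswith("+"):
            added.append(line[1:].strip())     # part of the patched function f_p
    return deleted, added

# Hypothetical hunk fixing an off-by-one buffer overflow.
hunk = [
    " if (len > 0) {",
    "-    memcpy(dst, src, len + 1);",
    "+    memcpy(dst, src, len);",
    " }",
]
s_d, s_a = split_patch(hunk)
```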

Vulnerability-preserving Data Augmentation
As we collect different types of vulnerability data, it is difficult to ensure that each type has a sufficient quantity for models to learn from; hence we propose a data augmentation technique to scale up the data collected in Section 3.2. Furthermore, the newly generated data must retain the vulnerability characteristics of the original data, i.e., it needs to be vulnerability-preserving. If the vulnerability is lost or compromised, the generated data becomes ineffective for model training. Hence, we propose a novel vulnerability-preserving data augmentation method to generate new data from the original dataset $D$. It primarily involves two steps. The first step (Section 3.3.1) is to slice all statements related to the vulnerability. The second step (Section 3.3.2) is to augment the original dataset $D$ by preserving the semantics of the vulnerability-related statements and modifying only the statements unrelated to the vulnerability.
3.3.1 Slicing vulnerability-related statements. Given a 4-tuple $(f_v, f_p, S_a, S_d)$ from Section 3.2.2, we slice the statements that are related to $S_a$ and $S_d$. To achieve this, we need the statement-dependency relationships within a function. We utilize the program dependency graph (PDG) to obtain the data dependency and control dependency of each statement in $f_v$ and $f_p$. The generated PDGs are denoted PDG$_{f_v}$ and PDG$_{f_p}$. Based on the PDG, we design Algorithm 1 to slice the vulnerability-related statements. Specifically, for each statement $s_a$/$s_d$ in $S_a$/$S_d$, a forward slicing procedure is performed on the PDG to obtain a list of related statements $S_f$ in the function $f_v$/$f_p$; this step finds the statements that depend on the current statement, i.e., those reachable starting from it. Then, based on the obtained $S_f$, a backward slicing procedure is conducted to extract all relevant statements that come before $S_f$, i.e., those pointing to the current statements. Finally, we combine both directions for the added statements $S_a$ and the deleted statements $S_d$ and obtain the vulnerability-related statements, denoted $S_{vul}$.
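The forward-then-backward slicing described above can be sketched as a transitive reachability search over the PDG. This is an illustrative re-implementation of the idea, not the paper's Algorithm 1; the three-statement PDG is hypothetical.

```python
from collections import deque

def slice_pdg(pdg, seeds, forward=True):
    """Transitive slice over a PDG given as adjacency: stmt -> dependent stmts.
    forward=True follows dependence edges out of the seed statements;
    forward=False walks the reversed edges (statements the seeds depend on)."""
    if not forward:  # build the reversed graph for backward slicing
        rev = {n: set() for n in pdg}
        for u, vs in pdg.items():
            for v in vs:
                rev.setdefault(v, set()).add(u)
        pdg = rev
    seen, queue = set(seeds), deque(seeds)
    while queue:
        for nxt in pdg.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Toy PDG: statement 1 defines a variable used by statement 2, which feeds 3.
pdg = {1: {2}, 2: {3}, 3: set()}
fwd = slice_pdg(pdg, {2}, forward=True)           # forward slice of seed 2
s_vul = fwd | slice_pdg(pdg, fwd, forward=False)  # add the backward slice
```

Here the forward slice of statement 2 is {2, 3}, and the backward pass pulls in statement 1, so all three statements land in the vulnerability-related set.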

Augmenting code by operators.
Once all vulnerability-related statements $S_{vul}$ are obtained, we can augment the data on the vulnerability-unrelated statements of the vulnerable function $f_v$, i.e., $\{s \mid s \in f_v \setminus S_{vul}\}$, where $s$ is a vulnerability-unrelated statement and $\setminus$ is the set-difference operation between $f_v$ and $S_{vul}$. Preserving all vulnerability-related statements in the vulnerable function retains its vulnerability; the proof is provided in Section 6.2. We define five types of mutation operations for vulnerable-data augmentation, as shown in Table 2. Specifically, the operation rn renames the used variables, replacing all occurrences of these variables with other names. The operation ai adds an if condition that is logically true before an assignment statement. For example, suppose an assignment statement int a = b; is vulnerability-unrelated; it can be transformed to if (True) then int a = b; after performing the ai operation. The operation del randomly deletes statements that are not related to the vulnerability-related statements, the operation add renames the variable names in an assignment statement and adds it back to the original function $f_v$, and the operation ro reorders consecutive assignment statements in the original function. With these different types of mutation operators, we can greatly increase the amount of original data and enrich its diversity.
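As a toy illustration of one operator, the rn mutation can be sketched as a whole-word variable rename applied outside the vulnerability-related statements. This is a simplification of the paper's operator: it assumes the renamed variables do not occur inside the protected statements, and the statement lists below are hypothetical.

```python
import re

def rn_mutation(func_stmts, rename_map, protected_stmts):
    """Sketch of the `rn` operator: rename variables in every statement
    except the vulnerability-related ones, so S_vul is kept verbatim.
    Assumes the variables in rename_map do not appear in protected_stmts."""
    out = []
    for stmt in func_stmts:
        if stmt in protected_stmts:          # vulnerability-related: keep as-is
            out.append(stmt)
            continue
        for old, new in rename_map.items():  # whole-word rename elsewhere
            stmt = re.sub(rf"\b{re.escape(old)}\b", new, stmt)
        out.append(stmt)
    return out

src = ["int count = n;", "buf[i] = src[i];", "return count;"]
mutated = rn_mutation(src, {"count": "c0"},
                      protected_stmts={"buf[i] = src[i];"})
```

The protected statement survives unchanged while every occurrence of `count` elsewhere becomes `c0`, so the mutated function remains semantically equivalent around the preserved slice.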

Edge-aware GGNN
While GGNN has found extensive application in modeling source code [1,15,33,35,52], it is noteworthy that its message passing is based solely on the node representations, i.e., $h_v^t$; the edge information is overlooked. We believe that the different types of edges in the Code Property Graph (CPG), such as "Flow to" and "Control", signify different program semantics and play a crucial role in vulnerability detection. Building on this insight, we propose an edge-aware GGNN to leverage edge information effectively for vulnerability detection.

Graph Initialization.
For both vulnerable and non-vulnerable functions, we utilize Joern [50] to obtain the Code Property Graph (CPG). Formally, a raw function $f$ can be expressed as a multi-edged graph $(V, E)$, where $V$ is the set of nodes and $(u, v) \in E$ denotes the edge connecting node $u$ and node $v$. Each node possesses a node sequence, parsed by Joern from the original function. We tokenize the node sequence by spaces and punctuation. Additionally, for compound words (tokens constructed by concatenating multiple words according to camelCase or snake_case conventions), we split them into multiple tokens.
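The tokenization described above can be sketched as two passes of regular expressions: one to split on whitespace and punctuation, and one to break compound identifiers by camelCase and snake_case conventions. The regexes and the sample node sequence are illustrative assumptions, not the paper's exact implementation.

```python
import re

def tokenize_node_sequence(seq):
    """Tokenise a node sequence: split on spaces/punctuation, then split
    compound identifiers by camelCase and snake_case conventions."""
    tokens = []
    # First pass: identifiers, numbers, and single punctuation characters.
    for tok in re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[^\sA-Za-z0-9_]", seq):
        # Second pass: snake_case -> parts, then camelCase -> sub-parts.
        parts = [p for chunk in tok.split("_") if chunk
                 for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk)]
        tokens.extend(parts or [tok])  # punctuation falls through unchanged
    return tokens

toks = tokenize_node_sequence("copyUserBuf(dst_ptr, len)")
# -> ['copy', 'User', 'Buf', '(', 'dst', 'ptr', ',', 'len', ')']
```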
We represent each token in the node sequence and each edge type connected with nodes using learned embedding matrices $W_n$ and $W_e$, respectively. Subsequently, the nodes and edges of the Code Property Graph (CPG) can be encoded as

$$x_v = \frac{1}{n}\sum_{i=1}^{n} W_n[t_i], \qquad e_{(u,v)} = W_e[l_{(u,v)}]$$

where $n$ denotes the number of tokens $t_i$ in the node $v$ and $l_{(u,v)}$ is the type of the edge $(u, v)$. Hence, given the code property graph $(V, E)$, we have $X \in \mathbb{R}^{|V| \times d}$, which denotes the initial node embedding matrix of the CPG, where $|V|$ is the total number of nodes in the CPG and $d$ is the dimension of the node embedding.
3.4.2 Edge-aware Message Passing. For every node $v$ at each computation iteration $t$ in the graph, we employ an aggregation function to calculate the aggregated vector $a_{N(v)}^t$. This is achieved by considering the set of neighboring node embeddings as well as the connected edge-type information computed from the previous hop. As the edge information is also taken into account in the message-passing process, we refer to this as edge-aware message passing:

$$a_{N(v)}^t = \sum_{u \in N(v)} \mathrm{ReLU}\big(W\,[h_u^{t-1}; e_{(u,v)}]\big)$$

where $N(v)$ is the set of neighboring nodes directly connected with $v$, $W \in \mathbb{R}^{(d + d') \times d}$, where $d$ and $d'$ are the dimensions of the node and edge embeddings, $[\cdot;\cdot]$ denotes concatenation, and ReLU is the rectified linear unit [37]. For each node $v$, $h_v^0$ is the initial node embedding of $v$, i.e., $x_v \in X$.
A Gated Recurrent Unit (GRU) [8] is then used to update the node embeddings by incorporating the aggregated information, i.e., $h_v^t = \mathrm{GRU}(h_v^{t-1}, a_{N(v)}^t)$.

After $T$ iterations of computation, we obtain the final node state $h_v^T$ for each node $v$. Subsequently, we apply max-pooling over all nodes $\{h_v^T \mid \forall v \in V\}$ to acquire the $d$-dimensional graph representation $h_G$.
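The edge-aware aggregation can be sketched in NumPy as follows. The matrix `W` and all embedding values are hand-picked stand-ins for trained parameters, and the GRU update is omitted; the sketch only shows how the edge-type embedding enters each message.

```python
import numpy as np

def edge_aware_aggregate(h, e, edges, W):
    """Edge-aware aggregation: for each edge (u, v), the message is
    ReLU(W^T [h_u ; e_(u,v)]), summed into node v's aggregate vector.
    h: (n, d) node states; e: dict (u, v) -> (d',) edge-type embedding;
    edges: directed (u, v) pairs; W: (d + d', d) matrix."""
    n, d = h.shape
    a = np.zeros((n, d))
    for u, v in edges:
        msg = np.concatenate([h[u], e[(u, v)]]) @ W  # edge type joins the message
        a[v] += np.maximum(msg, 0.0)                 # ReLU, then sum-aggregate
    return a

# Toy setup: two nodes (d = 2), one edge with a 1-dimensional type embedding.
h = np.array([[1.0, 2.0], [0.0, 0.0]])
e = {(0, 1): np.array([5.0])}
W = np.array([[1.0, 0.0],   # hypothetical (d + d') x d projection
              [0.0, 1.0],
              [0.0, 0.0]])
a = edge_aware_aggregate(h, e, [(0, 1)], W)
```

In the full model, each `a[v]` would then be fed into a GRU cell together with the previous state `h[v]` to produce the next-hop node embedding.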

Classification Layer.
After the message passing, we obtain the graph representation $h_G$ and use it for prediction. Specifically, a linear projection with a sigmoid activation function is used to make the final prediction:

$$y' = \sigma(h_G W')$$

where $y'$ is the logit produced by the sigmoid function and $W' \in \mathbb{R}^{d \times 1}$ is a learned matrix.

Training
In FGVulDet, for each vulnerability type $t$ (refer to Table 1), we train a binary classifier on the corresponding $D_t$, the training set containing vulnerable and non-vulnerable functions for that type. This yields a set of binary classifiers $C = \{C_t, \forall t \in \mathrm{Vul}\}$, where $\mathrm{Vul} = \{\text{CWE-404}, \text{CWE-835}, \text{CWE-120}, \text{CWE-672}, \text{CWE-362}\}$ is the list of vulnerability types, to detect whether a function is vulnerable or not. The loss function for each $C_t$ is the binary cross entropy

$$\mathcal{L} = -\big(y \log y' + (1 - y)\log(1 - y')\big)$$

where $y'$ is the logit (see Equation 6) and $y \in \{0, 1\}$ is the label, with 1 for vulnerable and 0 otherwise. In total, we have five classifiers, one per vulnerability type.
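The binary cross-entropy objective above can be written out for a single sample as follows; the clamping epsilon is a standard numerical-safety detail, not part of the paper's formulation.

```python
import math

def bce(y_true, y_prob, eps=1e-12):
    """Binary cross entropy for one sample: -(y log y' + (1 - y) log(1 - y'))."""
    y_prob = min(max(y_prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))
```

For a vulnerable sample (y = 1), a confident correct prediction such as y' = 0.9 incurs a smaller loss than an uncertain y' = 0.5, which costs exactly log 2.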

Testing
Since FGVulDet targets fine-grained vulnerability detection, we train a separate binary classifier $C_t$ for each vulnerability type and vote to give the final prediction for a test sample. Specifically, given a function $f_v$ (resp. $f_p$) from the test set, each classifier $C_t$ is employed for detection, $y'_t = C_t(f)$, and the predicted label $\hat{y}_t$ is obtained by thresholding $y'_t$, where 1 denotes vulnerable and 0 non-vulnerable. The final prediction is then determined through a majority vote over all classifiers.
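The voting rule can be sketched in a few lines; the 0.5 threshold below is a conventional default and an assumption, not a value stated by the paper.

```python
def ensemble_predict(classifier_probs, threshold=0.5):
    """Majority vote over the five per-type binary classifiers.
    classifier_probs: iterable of sigmoid outputs y'_t, one per CWE type."""
    votes = [1 if p >= threshold else 0 for p in classifier_probs]
    return 1 if sum(votes) > len(votes) / 2 else 0

label = ensemble_predict([0.9, 0.2, 0.7, 0.6, 0.1])  # 3 of 5 vote vulnerable -> 1
```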

Evaluation Setup
We aim to answer the following research questions:
• RQ1: What is the performance of FGVulDet compared with baselines in detecting vulnerable code?
• RQ2: Can each type of the defined mutation operations help augment the training dataset and improve detection accuracy?
• RQ3: What is the performance of our designed edge-aware GGNN compared with other GNN variants for vulnerability detection?

Dataset Details
The statistics of the five common vulnerability types in the collected dataset are presented in Table 3. We first collect a total of 92,525 commits across the five CWE types, then extract the vulnerable and patched functions from each commit as vulnerable and non-vulnerable functions. After that, we extract the code property graph (CPG) for each function and obtain 165,222 graphs in total, which is fewer than the number of raw functions due to the compilation errors of some functions with Joern [51]. We further conduct data preprocessing to remove functions whose number of graph nodes is greater than 800, and finally obtain a raw dataset with a total of 99,076 functions. We divide the raw dataset into train, validation, and test sets at a ratio of 6:2:2. Finally, we perform the five types of mutation operations (see Table 2) to augment the vulnerable functions in the train set, generating mutated functions for each mutation type in nearly equal quantity to the vulnerable functions. Note that our dataset is more challenging than Devign's [52]. In particular, the extracted non-vulnerable functions in Devign [52] come from non-vulnerable commits, whereas FGVulDet uses the fixed version of the code from the vulnerability-related commits as the non-vulnerable functions for the model to learn. Since these non-vulnerable functions are highly similar to the vulnerable functions, it is a more difficult dataset for DL-based approaches to learn vulnerability features that distinguish them.

Baselines
We evaluate FGVulDet by comparing it against several well-known vulnerability detection approaches.

Static-analysis-based approaches. VUDDY [24]. It abstracts each function and then generates fingerprints by hashing the normalized code. A target function is identified as vulnerable if its fingerprints match those of known vulnerable functions. MVP [49]. It is similar to VUDDY: it extracts vulnerability and patch signatures from a vulnerable function and its patched counterpart using a proposed program-slicing algorithm, then identifies a target function as vulnerable if it matches the vulnerability signature but does not match the patch signature.

Deep-learning-based approaches. VulDeePecker [29]. It extracts semantically connected statements associated with an argument of a library/API function call, forming code gadgets. Following program normalization, which standardizes user-defined variable and function names, a bidirectional LSTM is employed to determine the function's vulnerability. Multi-Head Attention [44]. It has been widely used for modelling sequences. In particular, we leveraged the documentation from harvardnlp [26] to construct a multi-head attention layer, setting the number of heads to 4 and the maximum sequence length to 150 for comparative analysis. Devign [52]. It is a typical work in vulnerability detection utilizing graph neural networks. Specifically, it combines varied semantics of a function into a unified graph structure to glean program semantics, and additionally employs a convolution module to capture features related to vulnerabilities.
CodeBERT [14].It is a pre-trained model rooted in the Transformer architecture for code modeling.Leveraging millions of code data, it undergoes pre-training and subsequent fine-tuning for downstream code-related tasks.We have reproduced its implementation using the default settings provided in the official code on our dataset.

Experimental Settings
We built the vocabulary from the common words with frequency ≥ 3 in the training set, amounting to 90,000 words. The dimensions of the word and edge embeddings were both set to 128. A dropout of 0.3 was applied after the word embedding layer. The number of hops was set to 4 for CWE-404 and CWE-672, 1 for CWE-835, 5 for CWE-120, and 2 for CWE-362 to achieve optimal performance. We employed the Adam optimizer with an initial learning rate of 0.001 and a batch size of 64 for training. All experiments were conducted on a DGX server with three NVIDIA Tesla V100 GPUs.

RQ1: Compared with Baselines
The experimental results are presented in Table 4. The first group of rows presents the results of the static-analysis-based approaches, and the second group presents the results of the DL-based approaches. The results for FGVulDet without and with data augmentation are provided in the corresponding FGVulDet rows.
When comparing FGVulDet with the static-analysis-based approaches, it is obvious that VUDDY achieves much higher precision scores. For instance, in the case of the vulnerability CWE-672 (Operation After Free), VUDDY achieves a precision score of 84.62, much higher than FGVulDet's 50.53. The higher precision indicates that VUDDY produces fewer false positives, which is reasonable as VUDDY relies on expert-crafted features to detect code vulnerabilities. These hand-crafted vulnerability features, curated by security experts, are highly reliable; therefore, if a detected sample exhibits similar features, there is a high probability that it is vulnerable code. However, VUDDY also has a much lower recall than FGVulDet, i.e., 1.25 vs. 92.26. This indicates that VUDDY produces more false negatives, as hand-crafted vulnerability features can only cover a limited number of vulnerability patterns, leading it to miss a substantial number of vulnerabilities compared to FGVulDet. In addition, we find that although MVP has lower precision scores than VUDDY, its recall scores are better, indicating that MVP covers more vulnerability patterns but with lower detection precision. Compared with these static-analysis-based approaches, FGVulDet achieves a much higher recall, which leads to a higher F1-score.

When comparing FGVulDet with the DL-based approaches, we find that the pre-trained model CodeBERT performs better than the other baselines in terms of F1. We speculate that the main reason is that CodeBERT uses extensive code-related data for pre-training and its model architecture is more powerful than those of the other baselines; hence, CodeBERT has a stronger learning capability. However, FGVulDet outperforms it in terms of recall and F1. Even without data augmentation, FGVulDet still performs better than CodeBERT in terms of F1 for the vulnerability types CWE-835, 120, 672, and 362 (see Table 4), which indicates the effectiveness of our proposed approach.
Answer to RQ1: Although some static-analysis-based approaches have higher precision scores than FGVulDet, they produce many false negatives. Overall, in terms of recall and F1, FGVulDet outperforms the current baselines, including static-analysis-based and DL-based approaches, by a significant margin.

RQ2: Effectiveness of Mutation Operations
In our work, we introduce five types of mutation operations to augment the data. We assess the effectiveness of each operation individually by conducting experiments using only one type of mutation operation at a time, while keeping the hyper-parameters consistent with the original model. The results of these experiments are presented in the final rows of Table 4, where each FGVulDet variant marked with a mutation operation denotes the specific operation being evaluated. The combined results of all five mutation operations are presented in the row labeled FGVulDet.
Through the analysis of the experimental results, it is evident that incorporating the five types of mutation operations significantly improves recall and F1 scores. For instance, in the case of vulnerability type CWE-404 (memory leaks), FGVulDet improves recall and F1 from 46.95/54.32 to 81.15/64.78, respectively. Notably, the rn operation stands out as the most effective at enhancing F1 across different vulnerability types. For the vulnerability CWE-672, fusing the other mutation types with rn even decreases F1. We conjecture that the rn operation is particularly efficient at improving the diversity of the training set compared to the other mutation operations, making the model more robust and powerful.
Additionally, different types of mutations exhibit inconsistent performance compared to the original model without mutations. For example, the mutation operations add, ai, and rn improve F1 for all types of vulnerabilities over the model without augmentation. However, the del operation has a negative impact except for CWE-404 (memory leak), while ro improves F1 for vulnerability types CWE-{404, 835, 672} but has a negative impact for CWE-{120, 362}. This may be attributed to the fact that the del and ro operations can introduce other types of vulnerabilities into the mutated functions, adding noise to the dataset and making it challenging for the model to make correct decisions. For more details on why del and ro can introduce new vulnerabilities, please refer to Section 6.2. Despite the negative impact of del and ro on specific vulnerability types, combining them with the other mutation operations in FGVulDet yields the best overall performance.
Answer to RQ2: A thorough analysis of the performance of different mutation operations leads us to the conclusion that vulnerability-preserving data augmentation is effective for further enhancement.

RQ3: Effectiveness of Edge-aware GGNN
FGVulDet proposes the edge-aware GGNN, which extends the standard GGNN by encoding edge-type information and using it in the message-passing process to learn vulnerability-related features. To illustrate the effectiveness of the proposed model, we also compare it with GNN variants such as the Gated Graph Neural Network (GGNN) and the Graph Convolution Network (GCN). The experimental results are shown in Table 4.
To make a fair comparison, we compare the results of the different GNN variants with FGVulDet without data augmentation. We find that, compared with GGNN, supplementing the edge-type information helps the model detect vulnerabilities. This is reasonable, as different types of vulnerabilities involve different aspects of the program, i.e., different edge types of the code property graph. For example, the vulnerability CWE-120 (Buffer Overflow) concerns the definition and usage of a variable, which can be captured in the program dependency graph via "Define/Use" edges. In contrast, the vulnerability CWE-672 (Operation After Free) is related to the execution order of statements, which is reflected in the control flow graph through "Flow to" edges. Thus, embedding the edge-type information explicitly and using it for learning reduces the difficulty of identifying the vulnerability and improves the model's capability. Furthermore, we also compare with GCN, which uses multiple graph convolutional layers to update node representations; FGVulDet without augmentation still performs better in terms of F1 score, except for CWE-835.
Answer to RQ3: The edge-aware GGNN, which encodes the edge-type information explicitly for learning, increases the model's capability and reduces the difficulty of detecting vulnerabilities, thus producing better performance than other GNN variants, i.e., GGNN and GCN.

Dataset Reliability
To ensure the reliability of our dataset, we established a systematic data collection pipeline. Initially, we crawled a substantial number of commits from 1614 C-language open-source projects on GitHub. Subsequently, we retained only the commits that modified a single function and whose messages were matched by exactly one vulnerability type (refer to Section 3.2). By employing this method, the functions extracted from the commits are more likely to be vulnerabilities with the correct vulnerability type. To validate the collected dataset, a team of professional security researchers randomly selected 400 commits from each of the 5 vulnerability types, totaling 2000 commits, and conducted a two-round cross-verification of the data labeling. The results revealed that 97.3% of the commits were correctly classified (CWE-404: 97.50%, CWE-835: 96.50%, CWE-120: 97.75%, CWE-672: 98.50%, and CWE-362: 96.25%). The high precision of our dataset indicates that vulnerability-related commits can be effectively identified and correctly classified using keywords on an unlabeled dataset.

Proof of Vulnerability-preserving
Here we discuss the proof that our data augmentation is vulnerability-preserving, i.e., that a function remains vulnerable after being modified. First, we introduce the following notation: S_v denotes the statements that trigger the vulnerability in a function, and S_d denotes the statements on which the statements in S_v depend. We make the following assumption: Assumption: If all statements in S_v and S_d are retained after the mutation operations, then the vulnerability still exists.
The assumption holds since the vulnerability is composed of S_v and S_d. Hence, as long as we prove that our proposed method retains all of S_v and S_d, it is vulnerability-preserving.
We understand that commits in GitHub primarily serve to record code changes between versions, and the commits extracted through our Data Collection process (Section 3.2) are identified as vulnerability-related through keyword filtering. The cross-validation results presented in Section 6.1 confirm the reliability of these commits with respect to vulnerabilities. Consequently, the changed statements S_c in a vulnerability commit are unequivocally associated with the vulnerability, i.e., S_c ⊆ S_v ∪ S_d. This suggests that if we apply the vulnerability-related slicing algorithm (Algorithm 1), iteratively searching for related statements via the PDG starting from each statement in S_c, then all of S_v ∪ S_d can be covered, i.e., S_v ∪ S_d ⊆ S, where S denotes the resulting set of vulnerability-related statements. Once we obtain S, we can proceed with data augmentation by mutating only the set difference between the statements of the raw vulnerable function and S. This augmentation process ensures the preservation of the vulnerability in the augmented functions.
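The iterative PDG search described above can be sketched as a worklist closure. This is a hedged sketch of the slicing idea, not the paper's Algorithm 1: statements and the PDG are simplified to integer ids and an adjacency map, and the function name is ours. Starting from the changed statements, it collects every statement reachable along program-dependence edges; everything outside the returned set is then eligible for mutation.

```python
from collections import deque

def vuln_slice(changed, pdg):
    """Return the closure of `changed` under PDG dependence edges.

    `pdg` maps a statement id to the ids it is dependence-connected to.
    """
    related = set(changed)
    worklist = deque(changed)
    while worklist:
        stmt = worklist.popleft()
        for dep in pdg.get(stmt, ()):
            if dep not in related:
                related.add(dep)
                worklist.append(dep)
    return related

# Toy PDG: statement 3 (changed) depends on 1, which depends on 0;
# statement 5 is unrelated to the vulnerability and may be mutated.
pdg = {3: [1], 1: [0], 5: []}
s = vuln_slice({3}, pdg)
print(sorted(s))  # [0, 1, 3]
```

Since the closure contains every statement dependence-reachable from S_c, retaining it retains all of S_v ∪ S_d under the assumption stated above.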
However, it is important to acknowledge that the defined mutation operations cannot guarantee the absence of new vulnerabilities. Consider the scenario where a statement like "usbDevs = NULL;" is unrelated to the original vulnerability and exists to set the pointer "usbDevs" to NULL, mitigating a "use after free". If we delete "usbDevs = NULL;" with the deletion operation, it could introduce the vulnerability CWE-672 (Operation After Free). Similarly, in the case of a variable shared among multiple threads, if its assignment statement is moved before the lock operation, the mutation might result in the vulnerability CWE-362 (race condition). Fortunately, we observe that three of the five proposed mutation operations do not introduce new vulnerabilities; only the remaining two have the potential to do so. Moreover, as indicated in Section 5.2, the combination of all five mutation operations in FGVulDet yields the highest performance. In summary, ensuring the absence of new vulnerabilities when modifying the semantics of the original function is a significant challenge, and addressing it remains future work.
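The deletion mutation ("del" in Table 2) together with the caveat above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the conservative pointer-nulling guard is our own addition reflecting the "usbDevs = NULL;" discussion, and all names are hypothetical. Only statements outside the vulnerability-related slice are deletion candidates, and a defensive NULL assignment is additionally protected.

```python
import re
import random

def del_mutation(statements, related, rng=random.Random(0)):
    """Delete one random statement that is neither vulnerability-related
    nor a pointer-nulling assignment like 'p = NULL;'."""
    nulling = re.compile(r"^\s*\w+\s*=\s*NULL\s*;\s*$")
    candidates = [i for i, s in enumerate(statements)
                  if i not in related and not nulling.match(s)]
    if not candidates:
        return statements  # nothing safe to delete
    drop = rng.choice(candidates)
    return [s for i, s in enumerate(statements) if i != drop]

stmts = ["len = read(fd, buf, n);",   # 0: vulnerability-related, kept
         "log_count++;",              # 1: unrelated, deletable
         "usbDevs = NULL;"]           # 2: unrelated but protected
out = del_mutation(stmts, related={0})
print(out)  # ['len = read(fd, buf, n);', 'usbDevs = NULL;']
```

Such syntactic guards reduce, but as noted above cannot eliminate, the risk of a mutation introducing a new vulnerability.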

Threats to Validity
The first potential threat concerns the limited scope of the collected dataset, which includes only five types of vulnerabilities in the C programming language. While this does not cover all possible vulnerabilities, the selected vulnerability types are common and capable of causing significant harm to software systems. Moreover, the proposed model is not restricted to this specific dataset and can be extended to detect other types of vulnerabilities. Another potential threat is the exclusion of functions whose graphs exceed 800 nodes. This practice is consistent with Devign [52] and is a pragmatic choice to make the experiments feasible, as GNNs require substantial GPU memory.

Related Work
Vulnerability Detection. Static analysis plays a crucial role in identifying flaws and errors in a codebase to prevent the introduction of vulnerabilities and security bugs. Tools like flawfinder [48] and CPPCheck [3] provide extensive static code analysis, helping discover bugs and vulnerabilities early. VUDDY utilizes a clone-based approach by matching the signatures of vulnerable functions against target program signatures. Another approach by Yang et al. [49] employs novel slicing techniques while incorporating vulnerability signatures. However, these analyses often require significant context and extensive expert effort. With the increasing popularity of deep learning, various works, such as VulDeePecker [29], VulSniper [12], and Devign [52], have utilized deep learning techniques to predict and detect vulnerabilities. Saikat et al. [5] systematically study existing deep-learning-based vulnerability detection approaches. Beyond function-level vulnerability detection, LineVD [22] leverages graph neural networks to locate buggy statements, and LineVul [16] employs a transformer to identify line-level vulnerabilities. These methods typically use an intermediate representation of programs, such as tokens or graphs, to facilitate the learning of meaningful program representations. In contrast, our work introduces a vulnerability-preserving data augmentation process to enhance vulnerability data for training, which offers a different perspective.
Data Augmentation. Data augmentation is a valuable technique for expanding datasets during model training in various domains such as computer vision (CV), natural language processing (NLP), and automated speech recognition (ASR). It is particularly useful in domains with limited data to prevent overfitting, as seen in medical image analysis [41]. Moreover, data augmentation has been employed to enhance model performance: studies [13,39] introduce new augmentation methods to achieve better performance in image classification tasks. For code representation, Jain et al. [23] and Liu et al. [34] explore data augmentation in pretraining, while Zhuo et al. [53] and Dong et al. [10] survey source code augmentation. Current strategies are semantic-preserving at the function granularity, which may not preserve the vulnerability when generating mutants of vulnerable code. In contrast, our work introduces vulnerability-preserving data augmentation: it augments the original dataset by preserving the semantics of vulnerability-related statements while modifying statements unrelated to the vulnerability. This ensures that the generated variants retain the vulnerability information, distinguishing it from other semantic-preserving strategies.

Conclusion
In this paper, we introduce a fine-grained vulnerability detector named FGVulDet, which employs multiple classifiers to learn the characteristics of various vulnerability types for source code vulnerability detection. To address the scarcity of data for some vulnerability types, we propose a novel vulnerability-preserving data augmentation technique. Furthermore, we extend GGNN to an edge-aware variant to capture edge-type information. Extensive experiments have confirmed the effectiveness of FGVulDet.

Acknowledgment
This research is supported by the National Research Foundation, Singapore, and the Cyber Security Agency under its National Cybersecurity R&D Programme (NCRP25-P04-TAICeN), the National Research Foundation, Singapore, and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-GC-2023-008), and NRF Investigatorship NRF-NRFI06-2020-0001. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of National Research Foundation, Singapore and Cyber Security Agency of Singapore.

•
We conduct extensive experiments on the real collected vulnerability data to illustrate the effectiveness of FGVulDet.

Figure 1. An example to illustrate Code Property Graph.

Problem Definition. Existing works [12,30,40,52] define source code vulnerability identification as a binary {0, 1} classification problem, i.e., labelling all vulnerable functions as 1 regardless of the vulnerability type, which is coarse-grained for vulnerability detection. Differently, in this work, we investigate a fine-grained vulnerability identification problem, i.e., for each type of vulnerability, our goal is to learn the corresponding prediction function. Specifically, given a dataset D = {D_1, D_2, ..., D_n}, where D_t is the sub-dataset in D for vulnerability type t ∈ T and T is the set of source code vulnerability types, we aim at learning the mapping f_t ∈ F over D_t to predict whether a function has the vulnerability of type t, where F = {f_1, f_2, ..., f_n} is the set of prediction functions for the different vulnerability types. Furthermore, D_t = {(c, y) | c ∈ C_t, y ∈ Y}, where C_t is a set of functions containing the vulnerable functions with vulnerability type t and the corresponding fixed functions, and Y = {0, 1} is the label set, with 1 for vulnerable and 0 for non-vulnerable.

Table 1 .
Keywords of Five Vulnerability Types.

Table 2 .
Mutation Operations for Data Augmentation.
del: Delete statements that are not related to the vulnerability.

Table 3 .
The Statistics of the Collected DataSet.

Table 4 .
The experimental results of different approaches for vulnerability detection.