DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295 more projects than all previous datasets combined. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. We show that increasing the volume of training data may not further improve the performance of deep learning models for vulnerability detection, but might be useful to improve the generalization ability to unseen projects. We also identify hopeful future research directions. We demonstrate that large language models (LLMs) are a promising research direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features in our experiments. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.


INTRODUCTION
Detecting software vulnerabilities is crucial to prevent cybercrimes and economic losses, but to date it remains a hard problem. Traditional static and dynamic vulnerability detection techniques suffer from shortcomings. Given the tremendous success of deep learning in image and natural language applications, it is natural to wonder if deep learning can enhance our ability to detect vulnerabilities [4,15,18,25,33]. However, as we show in this paper, we still need to overcome many challenges before deep learning can achieve great performance for vulnerable source code detection.
For deep learning to be successful, we need a large dataset of vulnerable source code. We release a new open vulnerability dataset for C/C++, DiverseVul. To curate the dataset, we crawl security issue websites, collect vulnerability reports, extract vulnerability-fixing commits for each vulnerability, clone the corresponding projects, and extract vulnerable and non-vulnerable source code from them. Our dataset contains 18,945 vulnerable functions and 330,492 non-vulnerable functions extracted from 7,514 commits, covering 150 CWEs. This is more than twice the size of the C/C++ data from the previous largest and most diverse dataset CVEFixes [2]. Our dataset is more diverse and covers almost 50% more projects than the combination of all previously published datasets. We publicly release the DiverseVul dataset to the community at https://github.com/wagner-group/diversevul.
Our new dataset has enabled us to study the state-of-the-art deep learning methods and gain new insights about promising research directions as well as the challenges for ML-based vulnerability detection. In particular, we study several questions. Does more training data help, or are models saturated? Does the model architecture make a big difference? Is it better to use the state-of-the-art model that relies on code-structure features, or better to use large language models? Is a larger LLM better than a smaller LLM? What are the most promising directions for further improving deep learning for vulnerability detection?
To study the effect of model architectures, we experiment with 11 different deep learning architectures from 4 representative model families: Graph Neural Networks (GNN) [13], RoBERTa [10,11,16], GPT-2 [17,23,30], and T5 [3,24,29]. Much work on deep learning

Figure 1: An overview of several of our results. When trained on only the CVEFixes dataset, ReVeal has comparable performance to large language models. If we have enough data (Previous + DiverseVul), large language models (e.g., NatGen) are superior to previous-generation models (e.g., ReVeal, a GNN model with code-structure features), but we need large datasets to see these benefits. LLMs are better able to take advantage of larger datasets than previous-generation models (blue bars vs gray bars). The best LLMs for this task, CodeT5 and NatGen, have been pre-trained with code-specific tasks.
Our experiments show that, when evaluating on a prior dataset CVEFixes [2], the model architecture has little effect and LLMs perform about the same as GNNs. In particular, on CVEFixes, the largest previously released dataset, the ReVeal model (a GNN) achieves an F1 score of 12.8, vs F1 scores of 8.5-16.3 for LLMs (see Figure 1). One might be tempted to conclude from this that the exact architecture has little effect. However, when evaluating on larger datasets, we can see that this conclusion is reversed: LLMs can perform significantly better than GNNs. In particular, when we combine all previously published datasets together with our DiverseVul, the best LLM achieves an F1 score of 47.2, vs 29.8 for ReVeal.
These experiments show that we need large datasets to reliably evaluate deep learning approaches to vulnerability detection, as the relative performance of different architectures shifts radically as we increase the amount of training data available: a 5× increase in the amount of training data (from CVEFixes to all datasets) improved the performance of our best model from 10.5 to 48.9 F1 score. They suggest that LLMs are better able to make use of large datasets than GNNs: larger datasets improve the performance of ReVeal only modestly, but improve the performance of LLMs significantly. However, our experiments suggest that the performance gain from gathering more data may have stagnated. By adding our dataset to the combination of previous datasets, we can improve the test performance on 7 models out of 11. However, for the 3 best-performing models, either we don't see improvement or the improvement is small (details in Section 4.2).
Unfortunately, the state-of-the-art deep learning techniques are still not ready for vulnerability detection. Our best model has 47.2% F1 score, 43.3% true positive rate, and 3.5% false positive rate. The false positive rate is still far too high for the model to be practically useful. A project might contain tens of thousands of functions, and this false positive rate corresponds to hundreds of false positives, which is more than most analysts are likely to be willing to wade through [1].
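To make the scale of this problem concrete, the following minimal sketch multiplies the reported false positive rate by a project size. The function name and the 10,000-function project are illustrative assumptions; only the 3.5% rate is taken from the text.

```python
def expected_false_positives(num_benign_functions: int, false_positive_rate: float) -> float:
    """Expected number of non-vulnerable functions that are flagged as vulnerable."""
    return num_benign_functions * false_positive_rate


# Hypothetical project with 10,000 non-vulnerable functions (assumed size),
# scanned by a model with the 3.5% false positive rate reported above.
alerts = expected_false_positives(10_000, 0.035)
print(f"Expected false alarms: {alerts:.0f}")
```

Under these assumptions the analyst would face a few hundred spurious alerts, which matches the order of magnitude stated above.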
Despite the challenges, Figure 1 suggests that large language models (LLMs) may be superior for deep learning based vulnerability detection. In previous papers, researchers believed that GNNs with code-structure features are promising for vulnerability detection [4,18,33], since they combine domain knowledge with deep learning. In contrast, our results show that large language models (RoBERTa, GPT-2, and T5 families) significantly outperform the state-of-the-art GNN, especially when training with more data. In particular, CodeT5 models (CodeT5 Small, CodeT5 Base, NatGen) are the best.
Contrary to the common belief that model size is the most important factor for LLMs to perform well, our results show that the most important factor may be how the LLM is trained. Pretraining on code understanding tasks appears to offer large improvements. For example, CodeT5 Small is pretrained to predict variable and function names, and it can achieve an average of 8 percentage points higher F1 score than models that are twice its size but were not pretrained on code. Surprisingly, we found that pretraining tasks that are effective for natural language do not help vulnerability detection much. Instead, it appears we need code-specific pretraining tasks. We think that developing better code-specific pretraining tasks is a promising research direction for improving deep learning based vulnerability detection.
Moreover, we identify an important generalization challenge for the deployment of deep learning based models. To deploy a model we need to detect vulnerabilities from new software projects that do not appear in the training set. We found that deep learning models perform very poorly in this setting. In particular, past work has split data into training and test sets by a random split of the vulnerabilities, without regard to which project each vulnerability appears in. However, in practice, we often want to run a vulnerability detection tool on a new project, so there won't be any vulnerabilities from that project in the training set. To evaluate the performance of deep learning in this setting, we set aside a held-out set of projects, which we call "unseen projects"; we train on vulnerabilities from the other projects ("seen projects"), and then test on vulnerabilities from unseen projects. The performance of all models on unseen projects decreases significantly, e.g., from an F1 score of 49% on seen projects to only 9.4% on unseen projects. The cause is unclear; perhaps the model is overfitting to patterns or coding idioms that are specific to the particular projects that appear in the training set. This generalization failure is likely to be a significant barrier to deploying deep learning vulnerability detection in practice. We hope future research will explore how to address this problem. We suggest a simple intervention, using class weights in the training loss, that takes a small step in this direction, but the gap remains very large and more work is needed.
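The difference between a random split and a project-level split can be sketched as follows. This is an illustration only: the sample schema (a dict with a "project" key), the function name, and the held-out fraction are assumptions, not the exact procedure used in our experiments.

```python
import random
from collections import defaultdict


def split_by_project(samples, unseen_fraction=0.1, seed=0):
    """Split samples so that no project contributes to both sides.

    `samples` is a list of dicts with a "project" key (assumed schema).
    Returns (seen_samples, unseen_samples).
    """
    by_project = defaultdict(list)
    for sample in samples:
        by_project[sample["project"]].append(sample)

    # Shuffle project names, not individual functions.
    projects = sorted(by_project)
    random.Random(seed).shuffle(projects)
    num_unseen = max(1, int(len(projects) * unseen_fraction))
    unseen_projects = projects[:num_unseen]
    seen_projects = projects[num_unseen:]

    seen = [s for p in seen_projects for s in by_project[p]]
    unseen = [s for p in unseen_projects for s in by_project[p]]
    return seen, unseen


# Toy data: 5 hypothetical projects with 4 functions each.
toy = [{"project": f"proj{i % 5}", "func": f"f{i}", "label": i % 2} for i in range(20)]
seen, unseen = split_by_project(toy, unseen_fraction=0.2)
```

Because whole projects are held out, any project-specific idiom the model memorizes during training cannot help it on the unseen side.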
Lastly, we quantify the label noise in our dataset as well as previous datasets. Label noise is a significant challenge for ML-based vulnerability detection research. To extract vulnerable functions from vulnerability-fixing commits, following the state-of-the-art approach (used by Devign [33], ReVeal [4], BigVul [9], CrossVul [19], CVEFixes [2]), we label functions that were changed by these commits as vulnerable. To understand the label accuracy of this labeling approach, we randomly sample 50 vulnerable functions from our dataset, and another 50 vulnerable functions from the union of three datasets that collect commits from NVD (BigVul, CrossVul, and CVEFixes). Then, we manually analyze the vulnerability and the labeled vulnerable functions. We find that the vulnerable function label in DiverseVul is 60% accurate, which is 24% more accurate than the union of CVEFixes, BigVul and CrossVul, but still contains many label errors. The main challenges are vulnerabilities that are spread across multiple functions and changes to non-vulnerable functions in vulnerability-fixing commits. We hope our work takes the first step towards understanding the label noise issue and highlights the need for deeper investigation of the impact of label noise.
We make the following contributions in this paper:

RELATED WORK
In this section, we analyze previous public vulnerable source code datasets for C/C++, their labeling methods, and how they are used by related works on deep learning for vulnerability detection.

Synthetic Datasets: SATE IV Juliet [22] and SARD [21] are common synthetic datasets used by previous papers [15,18,25]. SARD expands on the Juliet v1.0 test suite and contains test cases for multiple programming languages. The test cases are highly accurate, and contain a variety of CWEs. However, they are constructed in isolation using known vulnerable patterns, which are designed to evaluate static and dynamic analysis tools. They don't fully capture the complexities of vulnerabilities within real-world projects.
The VulDeePecker [15] dataset focuses on only two CWEs. They selected vulnerabilities from 19 projects according to CVE information from the National Vulnerability Database (NVD) [20], and also combined SARD [21] test cases from these two CWEs. Both VulDeePecker and SARD are semi-synthetic datasets.
Static Analyzer Labels: The Draper [25] dataset generated labels using alerts from three static analyzers: Clang, Cppcheck, and Flawfinder. Some categories of alerts were labeled as vulnerable, and some were mapped to non-vulnerable. The labeled dataset is at the function granularity. The quality of the labels is unknown, but the label accuracy of static analyzers tends to be low. D2A [32] used differential analysis on the static analyzer (Infer) output over six open-source repositories. Given thousands of version pairs for a GitHub repository, if the static analyzer generates an alert for the version before a git commit, but not after the commit, then D2A treats the commit as fixing a vulnerability. For the remaining alerts, D2A labels them as unrelated to vulnerabilities.
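The differential labeling rule described for D2A can be sketched in a few lines. This is a simplified illustration rather than D2A's implementation: representing an alert as a (file, bug type, function) tuple and the label strings are assumptions made here.

```python
def label_alerts(alerts_before: set, alerts_after: set) -> dict:
    """Label the alerts raised on the pre-commit version of the code.

    An alert that disappears after the commit is taken as evidence that the
    commit fixed it; an alert that persists is treated as unrelated.
    """
    labels = {}
    for alert in alerts_before:
        if alert in alerts_after:
            labels[alert] = "unrelated"
        else:
            labels[alert] = "vulnerable"
    return labels


# Hypothetical analyzer output for one version pair.
before = {("a.c", "NULL_DEREFERENCE", "parse"), ("b.c", "BUFFER_OVERRUN", "copy")}
after = {("b.c", "BUFFER_OVERRUN", "copy")}
labels = label_alerts(before, after)
```

The sketch makes the source of label noise visible: the labels are only as reliable as the analyzer's alerts and the way alerts are matched across the two versions.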
Manual Labeling: The Devign [33] dataset was labeled by three security researchers. They first used keywords to find commits that likely fixed vulnerabilities and commits unrelated to vulnerabilities from four repositories. Then, for the first category, three security researchers manually reviewed these commits and determined by majority vote which ones fix security vulnerabilities. Given labels for each commit, Devign extracts the changed function before the commit as the data sample, and labels it as vulnerable or non-vulnerable according to the label of the commit. The authors of Devign released data for two repositories, FFMPeg and Qemu. This dataset has high quality labels, but manual labeling was very expensive, costing around 600 man-hours.

Security Issues: Several prior datasets were generated by crawling security issues to identify vulnerability-fixing commits. The ReVeal [4] dataset was labeled using the patches to known security issues at Chromium security issues and Debian security tracker. ReVeal considers the changed functions before a security patch (commit) as vulnerable, after the patch as non-vulnerable, and all unchanged functions as non-vulnerable. In comparison, our dataset DiverseVul has 18K vulnerable functions, which is 11× the size of ReVeal (Table 3). BigVul [9], CrossVul [19] and CVEFixes [2] collect vulnerability-fixing commits from Common Vulnerabilities and Exposures (CVE) records in the NVD [20]. In particular, CVEFixes covers all published CVEs up to 27 August 2022. CVEFixes and CrossVul datasets cover multiple programming languages, and we use their C/C++ data in this paper. These three datasets cover a wide range of projects and CWEs. In comparison, our dataset contains more projects, more CWEs, and double the number of vulnerability-fixing commits.
A few other vulnerable source code datasets in C/C++ do not provide vulnerable functions, and therefore we did not include them in our experiments. For example, AOSP [5] collected commits fixing CVEs from the security bulletin of Android Open Source Project (AOSP), which contains patches to vulnerabilities in Android OS, the Linux kernel, and system on chip manufacturers. PatchDB [28] provides patch information, i.e., code diffs, but does not provide enough information to identify the project or git repository it came from and thus does not let us reconstruct the full code of the changed function.
Security issues are effective at identifying vulnerability-fixing commits, as they are based on manual analysis from developers. They are also representative of in-the-wild vulnerabilities in real-world projects. Therefore, we also collect our new dataset DiverseVul by crawling security issues. Compared to all previous datasets, DiverseVul is the most diverse one, covering the largest number of projects. In particular, DiverseVul has vulnerabilities from 295 new projects that have not been collected by any of the previous real-world datasets (Table 3).

DL for Vulnerable Source Code Detection: Previous papers have used LSTM [15], CNNs and RNNs [25], Bidirectional RNNs [14], and Graph Neural Networks [4,18,33] to detect vulnerable source code. A recent paper from Thapa et al. [27] shows that on the VulDeePecker [15] dataset spanning two CWEs, large language models outperform BiLSTM and BiGRU models. However, they did not compare against Graph Neural Networks (GNN). GNNs represent programs as graphs that contain useful domain knowledge for vulnerability detection. ReVeal [4] used features obtained from the code property graph [31], and VulChecker [18] proposed a new enriched program dependence graph. These papers used relatively small datasets such as ReVeal and Juliet. If we train the models with larger datasets, it is not clear whether GNN with code-structure features is still effective compared to large language models.

DATA COLLECTION
Our goal is to collect high-quality vulnerability-fixing commits from a diverse set of real-world projects. We focus on collecting data from security issues, since they reflect high-quality labels from a community of developers and security analysts. We start by identifying 29 security issue websites, and then narrow it down to the 2 websites with the most git system commits. From these websites, we crawl the issue title, body, and relevant git commit URLs. Since developers' discussions may reference both vulnerability-fixing commits and vulnerability-introducing commits, we use two heuristics to exclude vulnerability-introducing commits. First, we exclude all commit URLs mentioned in comments containing keywords "introduced" and "first included"; and second, we manually go over all commits that changed at least 10 functions and exclude ones that introduced vulnerability. We keep the remaining commits in our dataset.
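The first heuristic can be sketched as a simple keyword filter over issue comments. This is a hypothetical illustration, not our crawler: the URL pattern, the function name, and the choice to treat either keyword as sufficient are assumptions made here.

```python
import re

# Keywords quoted in the text; matching either one is an assumption of this sketch.
INTRODUCING_KEYWORDS = ("introduced", "first included")
# Assumed shape of a commit URL: ".../commit/<hex id>".
COMMIT_URL = re.compile(r"https?://\S+/commit/[0-9a-f]{7,40}")


def candidate_fix_commits(comments):
    """Return commit URLs from comments, dropping any URL that appears in a
    comment hinting at a vulnerability-introducing commit."""
    excluded = set()
    kept = []
    for text in comments:
        urls = COMMIT_URL.findall(text)
        if any(keyword in text.lower() for keyword in INTRODUCING_KEYWORDS):
            excluded.update(urls)
        else:
            kept.extend(urls)
    # Remove duplicates while preserving order, then apply the exclusions.
    return [url for url in dict.fromkeys(kept) if url not in excluded]


comments = [
    "Fixed in https://github.com/example/proj/commit/abc1234def",
    "The bug was introduced by https://github.com/example/proj/commit/1234567",
]
fix_urls = candidate_fix_commits(comments)
```

A filter like this only removes the obvious cases, which is why the second, manual heuristic is still needed for large commits.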
Next, we parse the git commit URLs to extract the projects and commit IDs. We clone the projects and extract the commits from these projects. We identify the C/C++ related code files in the commits. Then, we extract all functions that were changed in these commits, and also functions that did not change in the files. As in ReVeal [4], we label the before-commit version of a changed function to be vulnerable, and the after-commit version to be non-vulnerable. We label all unchanged functions in the related code files to be non-vulnerable. Like prior work, we deduplicate functions by their MD5 hashes, and we do not normalize the code before deduplication. We keep track of the set of unique MD5s when processing the functions. We process all the vulnerable functions before the non-vulnerable ones. If the MD5 of a function already exists in this set, we do not include the function again in the data. In total, we have collected 7,514 commits from 797 projects, which result in 18,945 vulnerable functions and 330,492 non-vulnerable functions, covering 150 CWEs.

For issue titles that mention the CVE number, we query the National Vulnerability Database API to obtain the CWE information for the issue and the corresponding commit. For issues with developer annotated vulnerability category, we manually map them to the top 25 most popular CWEs. About 85% of our data can be mapped to 150 CWE categories. We do not specifically address hierarchical CWEs. Depending on the query result from the NVD Database, a CVE number could be mapped to multiple CWEs.
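The labeling and deduplication steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the input format (pairs of before/after function bodies plus a list of unchanged functions) and the output field names are hypothetical, and function extraction from the repository is not shown.

```python
import hashlib


def build_dataset(changed_pairs, unchanged_functions):
    """Label functions and drop exact duplicates by MD5.

    changed_pairs: list of (before_code, after_code) for changed functions.
    unchanged_functions: list of code strings from the same files.
    """
    vulnerable = [before for before, _ in changed_pairs]
    non_vulnerable = [after for _, after in changed_pairs] + list(unchanged_functions)

    seen_md5 = set()
    dataset = []
    # Vulnerable functions are processed first, so when the same text appears
    # with both labels, the vulnerable copy is the one that is kept.
    for label, functions in ((1, vulnerable), (0, non_vulnerable)):
        for code in functions:
            # No normalization: the raw function text is hashed as-is.
            digest = hashlib.md5(code.encode("utf-8")).hexdigest()
            if digest in seen_md5:
                continue
            seen_md5.add(digest)
            dataset.append({"func": code, "target": label})
    return dataset


pairs = [("int f(){return a[i];}", "int f(){return i<n?a[i]:0;}")]
unchanged = ["int g(){return 1;}", "int f(){return a[i];}"]
data = build_dataset(pairs, unchanged)
```

In the toy input, the last unchanged function is textually identical to the vulnerable one and is therefore dropped. Because the code is not normalized, two functions that differ only in whitespace would both be kept.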

EXPERIMENTS
In this section, we study how our new dataset can improve the performance of deep learning based vulnerability detection. We study 11 model architectures from 4 model families. We also discuss insights learned from these experiments.

Model Architectures
We study 4 model families, where 3 families are transformer-based large language models (LLM). Within each LLM family, there are different variants of the model pretrained using different objectives. Table 2 summarizes the number of parameters for all model architectures.

Graph Neural Network. Within the Graph Neural Network (GNN) family, we choose to reproduce a representative previous work, ReVeal [4].
Given a function, the ReVeal model constructs a graph to represent the function, computes the embedding vector of the graph, and classifies the vector as vulnerable or non-vulnerable. Specifically, the graph representation for the function is a code property graph [31] (CPG). CPG combines Abstract Syntax Tree (AST), Control Flow Graph (CFG), Data Flow Graph (DFG), and Program Dependence Graph (PDG). Each node has the corresponding source code and type, and each edge has a type. The embedding of the graph is a sum of embeddings of the nodes in the graph. To learn the node embeddings, ReVeal uses Gated Graph Neural Networks (GGNN) [13] to recursively update the embeddings of the nodes. The initial embedding of a node is a concatenation of Word2Vec embedding of the code and the categorical type vector. Then, the GGNN training procedure uses the message passing mechanism to update each node embedding according to the node's neighbors in the graph. Finally, after training the GGNN, ReVeal adds two fully-connected layers and rebalances the training set to learn the final classifier. The total number of parameters of the ReVeal model is 1.28M.

Table 2: The number of parameters for different models. (Columns: Model Family, Model Architecture, # Parameters.)
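Two of the steps in the ReVeal description, the initial node embedding and the sum over node embeddings, are simple enough to sketch directly. The sketch deliberately omits the GGNN message passing and the classifier, so the sum is taken over initial rather than GGNN-updated embeddings; the node type list and the toy vectors standing in for Word2Vec output are assumptions.

```python
# Hypothetical subset of node types; the real code property graph has many more.
NODE_TYPES = ["Identifier", "CallExpression", "ReturnStatement"]


def initial_node_embedding(code_vector, node_type):
    """Concatenate the code embedding with a one-hot vector for the node type."""
    one_hot = [1.0 if t == node_type else 0.0 for t in NODE_TYPES]
    return list(code_vector) + one_hot


def graph_embedding(node_embeddings):
    """Graph-level embedding as the element-wise sum of the node embeddings."""
    return [sum(values) for values in zip(*node_embeddings)]


# Toy 2-dimensional vectors stand in for Word2Vec embeddings of the node code.
nodes = [
    initial_node_embedding([0.5, -1.0], "Identifier"),
    initial_node_embedding([1.5, 2.0], "ReturnStatement"),
]
graph_vector = graph_embedding(nodes)
```

In the full model, message passing would update each node vector from its neighbors before the sum is taken, and the summed vector would feed the fully-connected layers.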
4.1.2 RoBERTa Family. We select three model architectures from the RoBERTa family: RoBERTa [16], CodeBERT [10], and GraphCodeBERT [11]. All of them have 12 layers of Transformer encoders, 768 dimensional hidden states, 12 attention heads, and 125M model parameters in total. The common pretraining objective for this family is masked language modeling (MLM). The MLM pretraining process randomly masks a percentage of tokens within the input tokens, effectively removing them, and the training goal is to predict the missing tokens. RoBERTa [16] is an extension of BERT [8] that makes changes to important hyperparameters, including removing the pretraining objective of predicting the next sentence, as well as using larger mini-batches and learning rates during training. RoBERTa was pretrained on a union of five datasets: BookCorpus, English Wikipedia, CC-News, OpenWebText, and Stories.
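The masking step of MLM can be illustrated with a short sketch. It is a simplification: it always substitutes a mask token, whereas actual BERT-style pretraining also sometimes keeps or randomly replaces the selected tokens. The function name, the `<mask>` string, and the default rate of 0.15 are illustrative choices.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Randomly replace a fraction of tokens with a mask token.

    Returns the masked sequence and a dict {position: original token},
    which is what the model is trained to predict.
    """
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return masked, targets


tokens = "int main ( ) { return 0 ; }".split()
masked, targets = mask_tokens(tokens)
```

Filling the masked positions back in from `targets` recovers the original sequence, which is exactly the prediction task posed to the model.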
CodeBERT [10] pretrains the model using the CodeSearchNet [12] dataset containing 2.3M functions from six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). CodeBERT performs MLM pretraining and replaced token detection pretraining. During pretraining, each input is a pair of natural language description and source code, where the text describes the meaning of the code. The MLM pretraining in CodeBERT makes sure that tokens from both the natural language part and the source code part are masked out, and the replaced token detection corrupts both parts of the input as well. CodeBERT outperforms RoBERTa on two downstream tasks, natural language code search and code documentation generation.
GraphCodeBERT [11] also uses the CodeSearchNet [12] training datasets. In addition to having the natural language description and the source code parts of the input, GraphCodeBERT pretraining also constructs a third part of the input that captures the data flow between variables in the source code. In addition to MLM pretraining, GraphCodeBERT proposes two new pretraining objectives: edge prediction and node alignment. The edge prediction task maximizes the dot product between embeddings of two nodes if there is an edge, and the node alignment task maximizes the dot product between embeddings of the code token and variable token if the variable represents the code token. Over benchmark datasets, GraphCodeBERT outperforms CodeBERT and RoBERTa on code clone detection, code translation, and code refinement tasks.
Note that the training dataset of CodeBERT and GraphCodeBERT does not have programs written in C/C++.

4.1.3 GPT-2 Family. We select three model architectures from the GPT-2 family: GPT-2 Base [23], CodeGPT [17], and PolyCoder [30]. They have 12 layers of Transformer decoders, 768 dimensional hidden embeddings, and 12 attention heads. The sizes of the models are in Table 2, ranging from 117M to 160M. The common pretraining objective for this family is causal language modeling, i.e., next token prediction. How well a model is pretrained on the causal language modeling is measured by perplexity. A lower perplexity value indicates a better model.
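Perplexity is the exponential of the average negative log-likelihood that the model assigns to each token. The following sketch computes it from a list of per-token log-probabilities; the function name and the toy inputs are illustrative.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of each observed next token."""
    average_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(average_nll)


# A model that assigns probability 0.25 to every correct next token is as
# uncertain as a uniform choice among 4 options.
uniform_guess = perplexity([math.log(0.25)] * 4)
# A more confident model gets a lower perplexity.
confident = perplexity([math.log(0.9)] * 4)
```

This is why a lower value indicates a better model: it means the model placed more probability mass on the tokens that actually occurred.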
GPT-2 [23] was pretrained on an unreleased WebText dataset, which was collected by scraping web page links on Reddit.
CodeGPT [17] uses the same training objective and architecture as GPT-2, but different training data. The authors select Python and Java code from CodeSearchNet [12] as the training set, and release several variants of the pretrained CodeGPT models. In this paper, we use an adapted version of CodeGPT pretrained on Java code. The CodeGPT model was initialized from GPT-2 weights, and then pretrained using Java code from CodeSearchNet using the next token prediction task. Note that there are no C/C++ programs in the training set.
PolyCoder [30] uses the same model architecture and pretraining objective as GPT-2, but pretrains the model from scratch. The authors pretrained the model with data from GitHub containing both source code and natural language comments within the code files. They cloned a total of 147,095 projects, which are the most popular repositories, with at least 50 stars, of 12 popular programming languages. Their training data contains over 24K repositories in C/C++. The authors curate an evaluation dataset of code from unseen repositories. On the C programming language, PolyCoder achieves the lowest perplexity value, compared to GPT-Neo, GPT-J, and Codex.
T5 Family. T5 [24] pretrains the model using the masked language modeling objective. In particular, the T5 pretraining procedure randomly masks spans of tokens. The pretraining dataset is C4 (Colossal Clean Crawled Corpus). The authors curate the C4 dataset by processing the Common Crawl dataset to get hundreds of gigabytes of clean English text.
CodeT5 [29] uses the same underlying transformer architecture as T5. We consider two model sizes in our experiments: CodeT5 Base and CodeT5 Small. CodeT5 Small is the smallest LLM, with one third the model size of other T5 based models, and roughly half the model size of RoBERTa and GPT-2 family models. CodeT5 was pretrained on both CodeSearchNet data and additional C/C# projects from GitHub. In addition to the masked span prediction objective, CodeT5 utilizes the knowledge about whether a token is an identifier (a variable name or a function name) and designs two new pretraining tasks. The new pretraining tasks are masked identifier prediction (masking all identifiers) and identifier tagging (predicting whether a token is an identifier).
NatGen [3] proposes a new pretraining objective called "naturalizing" pretraining. The naturalizing pretraining is similar to a code editing process that takes some weird synthetic code and transforms it into developer-readable code. The authors generate unnatural code by semantic preserving code transformations including adding dead code, changing a while loop to a for loop without variable initialization, renaming variables, and inserting confusing code elements. Then, the pretraining objective asks the model to naturalize the code to the original developer-friendly form. The NatGen model starts the pretraining from the CodeT5 Base weights, and then continues the pretraining process using their new pretraining objective. Doing well on the naturalizing pretraining objective requires the model to understand the code well. Compared to CodeT5, NatGen improves the performance over various downstream tasks such as code translation, text to code generation, and bug repair.

Model Performance with More Data

4.2.1 Dataset Setup. Deep learning models perform well when they are trained on a lot of data. Therefore, we combine non-synthetic datasets with high-quality vulnerability labels from real-world projects, including Devign, ReVeal, BigVul, CrossVul, and CVEFixes. We then combine them with DiverseVul and remove duplicate samples to create the Previous + DiverseVul dataset, as shown in Table 3.
Table 3 presents the statistics for each of the previous five datasets, as well as our dataset, DiverseVul, and the merged datasets. Compared to all previous datasets, DiverseVul includes a larger number of projects, more CWEs, more vulnerable functions, and more vulnerability-fixing commits. Specifically, DiverseVul contains 18,945 vulnerable functions, of which 16,109 have CWE information, more than twice the number in any previous dataset. Having more data associated with CWE information will provide us with a more comprehensive understanding of model prediction results. The last two rows in Table 3 show the unique new data provided by DiverseVul in the merged datasets after deduplicating samples. Comparing Previous and Previous + DiverseVul datasets, we can see that DiverseVul contains 295 new projects that do not exist in any of the previous datasets. Moreover, DiverseVul provides 10,845 unique new vulnerable functions.
For our experiments, we randomly select 80% of the samples from the Previous + DiverseVul dataset as the training set, 10% as the validation set, and 10% as the test set. We also construct the Previous training and validation sets that only contain the previous five datasets, and training and validation sets that only contain CVEFixes data. This allows us to train models with different amounts of data and evaluate how much adding more data helps in improving the model's performance to predict the same test set from Previous + DiverseVul.
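The split described above can be sketched as a shuffle followed by slicing, with the smaller training sets obtained by filtering on the originating dataset. This is an illustration under assumptions: the "source" field, the function names, and the idea that the restricted sets are derived by filtering the merged split are hypothetical details not specified in the text.

```python
import random


def random_split(samples, train=0.8, valid=0.1, seed=0):
    """Shuffle and split into train / validation / test (test gets the rest)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


def restrict_to_sources(samples, allowed_sources):
    """Keep only samples that come from the given datasets (assumed 'source' key)."""
    return [s for s in samples if s["source"] in allowed_sources]


# Toy merged dataset: every third sample comes from a hypothetical CVEFixes source.
merged = [{"id": i, "source": "CVEFixes" if i % 3 == 0 else "DiverseVul"} for i in range(100)]
train_set, valid_set, test_set = random_split(merged)
cvefixes_train = restrict_to_sources(train_set, {"CVEFixes"})
```

Keeping a single shared test set while varying only the training and validation data is what makes the results for different training sets directly comparable.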

Results. For each model architecture in Table 2, we train three models, using CVEFixes, Previous, and Previous + DiverseVul training datasets. We train the ReVeal models from scratch, and we fine-tune the large language models (LLMs) for the vulnerability detection task from pretrained model weights. This gives us 33 models in total. The detailed training setups in our experiments can be found in Appendix A.
Table 4 shows the performance of the models over the same test set from Previous + DiverseVul. The following summarizes the results.
Result 1: When trained on all available data, large language models significantly outperform the state-of-the-art GNN-based ReVeal model. When trained on all available data (Previous + DiverseVul), LLMs perform significantly better than the ReVeal model: the ReVeal model achieves a 29.76 F1 score, while LLMs achieve F1 scores from 31.96 to 47.15. The best LLM performs significantly better than ReVeal on this large training set. Comparing ReVeal and LLMs is arguably unfair since ReVeal has 1-2 orders of magnitude fewer parameters than LLMs. We do not know whether a larger GNN could be competitive with LLMs. Unfortunately, even the best-performing model, NatGen, is not yet suitable for deployment in vulnerability detection, with a 3.47% false positive rate and a 47.15% F1 score. This false positive rate is still too high to be practical, and the F1 score is still low. Nevertheless, we believe that large language models hold promise for deep learning-based vulnerability detection.
Interestingly, LLMs require a large amount of training data to surpass ReVeal. When trained solely on CVEFixes data, a much smaller training set, there is no clear advantage of LLMs over the GNN-based ReVeal model, and ReVeal is even better than 6 LLMs (out of 10) in this setting.
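The F1 score and false positive rate quoted in Result 1 follow the usual confusion-matrix definitions, sketched below. The function name and the confusion counts in the example are made up for illustration and are not results from our experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute standard binary detection metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also the true positive rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1}


# Made-up counts for a heavily imbalanced test set (assumption for illustration).
metrics = detection_metrics(tp=40, fp=20, tn=930, fn=10)
```

Because non-vulnerable functions vastly outnumber vulnerable ones, even a false positive rate of a few percent produces enough false alarms to pull precision, and therefore F1, down.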
Result 2: Within the three base LLM models, T5 Base performs better than RoBERTa and GPT-2 Base for vulnerability detection. RoBERTa only uses encoders, GPT-2 only uses decoders, and T5 uses encoder-decoder Transformer layers. When trained on Previous + DiverseVul, T5 Base has a test F1 score that is 7.35% and 9.3% higher than RoBERTa and GPT-2 Base, respectively. Thus, an encoder-decoder architecture might have an advantage over a decoder-only or encoder-only architecture.
Result 3: Pretraining on code does not lead to significant improvements in vulnerability prediction, if we only use natural language pretraining tasks. The code models CodeBERT, GraphCodeBERT, CodeGPT, and PolyCoder are not significantly better than the corresponding text models RoBERTa and GPT-2 Base. Specifically, when trained on the Previous dataset, CodeBERT and GraphCodeBERT perform similarly to RoBERTa. When [...] PolyCoder have up to 2.3% higher F1 scores than GPT-2; but when trained on Previous + DiverseVul, PolyCoder performs worse than GPT-2. Our findings suggest that pretraining models on code using MLM or next token prediction techniques does not yield significant improvements in detecting C/C++ vulnerabilities. While CodeBERT, GraphCodeBERT, and CodeGPT were not pretrained on C/C++, PolyCoder was pretrained on C/C++ code for next token prediction, which still does not help in detecting C/C++ vulnerabilities.

Table footnotes. *: CVEFixes and CrossVul are multi-language datasets. We report numbers for C/C++ code. ▽: Devign authors released data from two repositories: FFMPeg+Qemu. (symbol lost): Chromium and Debian packages.
Result 4: Code-specific pretraining tasks on C/C++ make a big difference in improving vulnerability detection performance. The two CodeT5 models and the NatGen model have the best F1 scores. They are pretrained using code-specific pretraining tasks on C/C++. CodeT5 models use identifier-aware pretraining tasks: masked identifier prediction and identifier tagging. NatGen does additional code naturalizing pretraining on top of CodeT5, such as removing dead code and renaming variables. These pretraining tasks ask the model to learn basic code understanding, which significantly improves the fine-tuned model's performance on the vulnerability detection task. Note that GraphCodeBERT also does some code-specific pretraining, learning embeddings such that a pair of variables connected by data flow has a large dot product. However, since it was not trained on C/C++ data, it is unknown whether such a pretraining task is effective for vulnerability prediction.
Result 5: Code-specific pretraining tasks are more important than the model size. Among the best three models in Table 4 (CodeT5 Small, CodeT5 Base, NatGen), the CodeT5 Small model has only 60M parameters, half the size of the RoBERTa and GPT-2 models, and less than one third the size of the other T5 models. However, CodeT5 Small performs very similarly to the larger CodeT5 Base and NatGen models, and it performs better than all the other models. Contrary to the belief that larger models tend to produce better performance, our results show that the code-specific pretraining task is more important than the model size for vulnerability detection.
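Result 6: Performance gain from collecting more datasets may have saturated. Figure 2 visualizes how much training on DiverseVul + Previous data helps improve the vulnerability detection performance, compared to Previous data. Adding DiverseVul to the training set improves the F1 score for 7 models by 2.4% on average, compared to only training with the Previous dataset. However, it does not help the best performing CodeT5 models, and it only helps NatGen modestly. Even though we see a big improvement to model performance by training on the merged Previous datasets compared to only training on CVEFixes, collecting a different dataset may not further improve that.
Table 4: We evaluate the models on the same test set from Previous + DiverseVul. There isn't a big difference between model performance across different architectures if we only train on the CVEFixes dataset. However, if we train on larger datasets, large language models significantly outperform the GNN-based ReVeal model. Among them, CodeT5 Small, CodeT5 Base, and NatGen models have the highest F1 scores. We highlight the row with the highest F1 score in bold. Pretraining the model using code-specific pretraining tasks over C/C++ is very effective.
Dataset Setup. We want to measure the effect of data volume on model performance for vulnerability detection. We run the following experiment ten times. For each run, we randomly split Previous + DiverseVul into training, validation, and test sets. Then, we simulate the effect of different data volumes by subsampling the training and validation sets. Specifically, we randomly sample 10% to 90% of the training and validation data from the full training and validation data of Previous + DiverseVul. Then, we train the models, and evaluate them on the same original test set without subsampling.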

Results. We fine tune 100 CodeT5 Small models on different dataset setups from 10 experiment runs. Within each run, we evaluate the models on the same final test set from Previous + DiverseVul, and train 10 models using different percentages of the training and validation data. Figure 3 plots the average and 95% confidence interval of the test F1 score for each dataset setup.
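The aggregation over the 10 runs of one dataset setup can be sketched as follows. The text does not state how the 95% confidence interval is computed, so the Student's t interval used here is an assumption of this sketch, and the function name `mean_and_ci95` is hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom,
# i.e. for exactly 10 runs. Using a t-interval is an assumption here.
T_CRIT_DF9 = 2.262


def mean_and_ci95(f1_scores):
    """Return (mean, lower, upper) for the test F1 scores of 10 runs."""
    if len(f1_scores) != 10:
        raise ValueError("this sketch hard-codes the critical value for 10 runs")
    mean = statistics.mean(f1_scores)
    std_err = statistics.stdev(f1_scores) / math.sqrt(len(f1_scores))
    half_width = T_CRIT_DF9 * std_err
    return mean, mean - half_width, mean + half_width
```

Calling this once per training-data percentage gives the line (mean) and the shaded band (lower and upper bounds) of a plot such as Figure 3.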
Result 7: Increasing the volume of the training dataset from the same distribution helps vulnerability detection.
Our results show that training on a larger dataset from the same distribution can improve the test performance. Figure 3 shows an upward trend of better test F1 score as the volume of training data increases. If we know the test data distribution ahead of the model deployment time, collecting more training data from that distribution might further improve the performance on vulnerability detection.

Generalization
4.4.1 Dataset Setup. In a real-world deployment scenario, a vulnerability detection model needs to predict vulnerable source code in new developer projects that it has not been trained on. Therefore, we would like to test a model's performance on unseen projects. We randomly select 95 unique projects from the merged Previous dataset as the unseen projects test set, to evaluate all models in this experiment. Then, the remaining projects are treated as seen projects in both the training set and the validation set. For both the Previous and Previous + DiverseVul datasets, we randomly sample 90% of the seen projects as the training set, and the remaining 10% of the projects form the validation set. The training and validation sets of Previous + DiverseVul are supersets of those from Previous.
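A project-level split of this kind can be sketched as below. This is a simplified illustration, not the split script used for the experiments: the `"project"` key, the function name, and the fixed seed are hypothetical, and the sketch draws the unseen projects from whatever sample list it is given rather than specifically from the Previous dataset.

```python
import random


def split_by_project(samples, n_unseen, val_fraction=0.1, seed=0):
    """Split samples so that no project appears in more than one set.

    `samples` is a list of dicts that each carry a "project" key.
    """
    rng = random.Random(seed)
    projects = sorted({s["project"] for s in samples})
    rng.shuffle(projects)

    unseen = set(projects[:n_unseen])        # held-out test projects
    seen = projects[n_unseen:]
    n_val = round(len(seen) * val_fraction)  # about 10% of the seen projects
    val_projects = set(seen[:n_val])

    train_set, val_set, test_set = [], [], []
    for s in samples:
        if s["project"] in unseen:
            test_set.append(s)
        elif s["project"] in val_projects:
            val_set.append(s)
        else:
            train_set.append(s)
    return train_set, val_set, test_set
```

Splitting on the project identifier, rather than on individual functions, is what guarantees that every test function comes from a project the model never saw during training or validation.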

Results. We train ReVeal and fine tune each LLM on the seen projects training set from Previous and from Previous + DiverseVul, resulting in 22 models in total. We verify that these models are well trained: they achieve validation performance similar to their training performance. Table 5 shows the test performance of these models over unseen projects.
The F1 scores of all models on unseen projects are very low. The best models are CodeBERT, PolyCoder, CodeT5 Small, and CodeT5 Base trained on Previous + DiverseVul, and NatGen trained on Previous seen projects. Adding DiverseVul to the Previous training set helps improve the generalization performance for all models except NatGen. One recent, concurrent work [26] also observed a significant performance drop when testing on unseen projects. In our experiment, we have included hundreds more projects in the training set than [26], but we still observe poor generalization results.
Result 8: There is a significant challenge for deep learning models to generalize to unknown test projects on the vulnerability detection task. A popular use case of AI for Code is GitHub Copilot, where the AI model suggests ways to complete code to developers while they are writing code. If AI for deep learning detection is also a coding assistant, it needs to suggest potential
Table 6: Using class weights for cross entropy loss improves the generalization performance of models, when they are trained on seen projects and tested on unseen projects. Using class weights improves the unseen project test F1 score of CodeBERT from 11.94% to 14.74%, PolyCoder from 11.39% to 13.63%, and CodeT5 Small from 9.39% to 17.21%. Moreover, if the training and testing samples are drawn from the same distribution, using class weights also improves the test F1 score. We highlight the row with the highest F1 score in bold.

Weighting
In this section, we investigate whether three simple weighting schemes can potentially improve the model's generalization performance to unseen test projects.The weighting schemes are the following.
4.5.1 Project Balanced Batch Sampler. Our idea is to make the model perform equally well on different projects. Therefore, we propose a batch sampler that is equally likely to sample from any project in the training set. If a project is picked, it then randomly samples from all functions belonging to the project.
4.5.2 Weighted Soft F1 Loss. Since we care about F1 score as the final performance metric, we would like to explore whether a different loss function helps improve the generalization performance. We use normalized prediction probabilities (between 0 and 1) from the training samples to calculate true positives, true negatives, false positives, and false negatives as floating point numbers. Then, we use these to compute two F1 scores, for predicting the positive label (vulnerable function) and the negative label (nonvulnerable function) separately. The loss for the positive label is 1 - positive F1 score, and the loss for the negative label is 1 - negative F1 score. Finally, we give a higher weight to the first loss value, proportional to the ratio of nonvulnerable to vulnerable functions in the data. Then, we choose the corresponding loss value according to the ground truth class label as the final training loss.
4.5.3 Class Weights for Cross Entropy Loss. In this scheme, we still use cross entropy loss for training. We upweight the loss value for the positive class (vulnerable class), proportional to the ratio of nonvulnerable samples over vulnerable samples. We use the same loss value for the negative class.
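One possible reading of the weighted soft F1 loss is sketched below in plain Python. The description leaves some details open, so this sketch makes assumptions: the per-batch loss is taken as the mean of the per-sample loss values selected by the ground truth label, `pos_weight` stands for the ratio of nonvulnerable to vulnerable functions, and the small `eps` term is added only to avoid division by zero. In actual training the same arithmetic would be written with a differentiable tensor library.

```python
def weighted_soft_f1_loss(probs, labels, pos_weight, eps=1e-8):
    """Soft F1 loss over one batch.

    probs:  predicted probability of the vulnerable class, each in [0, 1]
    labels: ground truth labels, 1 = vulnerable, 0 = nonvulnerable
    """
    # Soft confusion-matrix entries computed from probabilities.
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    tn = sum((1 - p) * (1 - y) for p, y in zip(probs, labels))

    f1_pos = 2 * tp / (2 * tp + fp + fn + eps)  # F1 of the vulnerable label
    f1_neg = 2 * tn / (2 * tn + fn + fp + eps)  # F1 of the nonvulnerable label

    loss_pos = pos_weight * (1 - f1_pos)        # upweighted positive loss
    loss_neg = 1 - f1_neg

    # Pick the loss value matching each sample's ground truth label.
    per_sample = [loss_pos if y == 1 else loss_neg for y in labels]
    return sum(per_sample) / len(per_sample)
```

Because the soft counts are sums of probabilities rather than thresholded decisions, the loss changes smoothly with the model outputs, which is what makes an F1-style objective usable for gradient-based training.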
4.5.4 Results. We follow the same project split dataset setup described in Section 4.4. We fine tune CodeBERT, PolyCoder, and CodeT5 Small models over the seen projects training set from the Previous + DiverseVul dataset, and test them on the 95 unseen projects. For each model architecture, we use four schemes to fine tune four models: no weighting, project balanced batch sampler, weighted soft F1 loss, and class weights for cross entropy loss. In addition, we fine tune another four models for each architecture using these schemes over a different data split, the random data split described in Section 4.2.
Result 9: Using class weights for cross entropy loss can improve the model's generalization performance to unseen projects, but there is a lot of room for further improvements. Class weights also improve the model's performance if training / testing samples are drawn from the same distribution. Table 6 shows the evaluation results of models fine tuned with different schemes. For the seen / unseen projects experiment, using class weights increases the F1 score for all three model architectures. The project balanced batch sampler does not help with generalization. The weighted soft F1 loss helps CodeBERT and CodeT5 Small with generalization, but it hurts performance on seen projects. Overall, class weights is the best scheme, as it improves performance on both seen and unseen projects. CodeT5 Small trained with class weights has the best test F1 score (17.21%) on unseen projects.
Figure 4 shows the gap between the F1 score on seen projects vs unseen projects for two CodeT5 Small models, one fine tuned with no weighting scheme and one fine tuned with class weights for cross entropy loss. From the bars, we observe that using class weights reduces the gap between the F1 score on seen vs unseen projects, with a slight improvement to the F1 score on seen projects and a significant improvement on unseen projects. This means that using class weights improves the performance of the model over samples drawn from the same distribution as well as from a different distribution of new projects. However, there is still a large gap between 49.9% F1 on seen projects and 17.21% F1 on unseen projects. As a future research direction on the generalization problem, there is a lot of potential to further improve the model's performance over unknown projects.
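The class-weighting scheme evaluated above amounts to scaling the cross entropy term of vulnerable samples. A minimal plain-Python sketch follows; the function name is hypothetical, and averaging over the number of samples (rather than over the sum of weights, as some libraries do) is an assumption of this sketch.

```python
import math


def class_weighted_cross_entropy(probs, labels, pos_weight, eps=1e-12):
    """Binary cross entropy with an upweighted positive (vulnerable) class.

    probs:      predicted probability of the vulnerable class
    labels:     1 = vulnerable, 0 = nonvulnerable
    pos_weight: weight of the positive class, e.g. the ratio of
                nonvulnerable to vulnerable training samples
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to keep log() finite
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1 - p)  # negative class keeps weight 1
    return total / len(labels)


# Hypothetical batch with one vulnerable and three nonvulnerable functions.
labels = [1, 0, 0, 0]
pos_weight = labels.count(0) / labels.count(1)
loss = class_weighted_cross_entropy([0.2, 0.1, 0.3, 0.1], labels, pos_weight)
```

With this weighting, missing a vulnerable function costs as much as several mistakes on nonvulnerable ones, which counteracts the tendency of an imbalanced training set to push the model toward always predicting "nonvulnerable".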

Performance on CWEs
To understand the difficulty of learning different CWEs, we select 37 CWEs to examine the CodeT5 Base model's prediction performance when it is trained on Previous + DiverseVul. The 37 CWEs include the top-25 CWEs according to MITRE [6], and the 12 most common CWEs in DiverseVul outside the top 25. We select vulnerable functions belonging to these 37 CWEs and all nonvulnerable functions from the Previous + DiverseVul test set obtained from the random split in Section 4.2.
Result 10: Some CWEs are easier to learn than others regardless of the training data size. Table 7 shows the CodeT5 Base model's prediction performance across the 37 CWEs. We have highlighted the 10 most prevalent CWEs in the training set and the 10 highest True Positive Rate (TPR) numbers in bold. Note that all CWEs have the same False Positive Rate (FPR) since FPR is only related to nonvulnerable functions. We observe that having more samples for a particular CWE in the training set does not necessarily result in the model learning it better than CWEs with fewer training samples. Moreover, some CWEs with very few training samples are well-detected by the model. For example, CWE-502, CWE-79, and CWE-89, all of which account for less than 2% of the training data, have the highest TPRs. This suggests that some CWEs are easier to learn and do not require a large amount of training data, while others are more challenging to learn, even with more training samples. For instance, CWE-416 had 5.46% of the training samples, but its TPR was only 17.86%.
For some CWEs, we do not have enough test samples, resulting in extremely low TPR numbers. The "Test #" column shows the number of vulnerable functions belonging to that CWE in the test set. For CWEs with 0% TPR, most have fewer than 10 samples in the test set.
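The per-CWE evaluation can be sketched as follows, which also makes explicit why every CWE shares one FPR. The record format and function name are hypothetical, and the sketch assumes each vulnerable function carries a single CWE label.

```python
from collections import defaultdict


def per_cwe_tpr_and_fpr(records):
    """Compute the TPR of each CWE and the single shared FPR.

    records: iterable of (cwe, predicted_vulnerable) pairs, where cwe is a
    string such as "CWE-416" for a vulnerable function and None for a
    nonvulnerable function.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    false_pos = 0
    negatives = 0
    for cwe, predicted in records:
        if cwe is None:
            # Nonvulnerable functions have no CWE, so they only affect FPR.
            negatives += 1
            false_pos += int(predicted)
        else:
            totals[cwe] += 1
            hits[cwe] += int(predicted)
    tpr = {cwe: hits[cwe] / totals[cwe] for cwe in totals}
    fpr = false_pos / negatives if negatives else 0.0
    return tpr, fpr
```

A CWE with only a handful of test functions yields a TPR that moves in large steps (for example 0%, 50%, 100% with two samples), so such values should be read together with the test sample count.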

LABEL ERROR ANALYSIS
While our dataset is designed to be as accurate as possible, some functions may be labelled erroneously. To label vulnerable functions, we follow the methodology used in Devign [33], ReVeal [4], BigVul [9], CrossVul [19], and CVEFixes [2], which considers a function vulnerable if it was changed by a commit that is identified as fixing a vulnerability, based on security issue trackers. Although our labeling technique is state-of-the-art and can scale effectively, we cannot guarantee that every function changed by each such commit is vulnerable, so some labels may be inaccurate.
To quantify the amount of label noise as a result of this labeling methodology, we manually assess the accuracy of labels for the DiverseVul, CVEFixes, BigVul, and CrossVul datasets. Among previous datasets, we chose CVEFixes, BigVul, and CrossVul because they provide the commit ID that changed the vulnerable function, which allows us to verify whether a function is vulnerable in that specific version of the project.
We randomly sample 50 vulnerable functions from DiverseVul, and 50 vulnerable functions from the union of the previous three datasets (CVEFixes ∪ BigVul ∪ CrossVul). Then, we manually analyze whether each vulnerable function has the correct label or the wrong label. We inform this decision by examining the code of the function labelled vulnerable, both before and after the commit, the commit it was supposedly fixed in, the CVE description, and developer discussions in the security issue tracker. We confirm a function as correctly labelled vulnerable if the vulnerability exists in that function, and is not spread across multiple functions. We observed three categories of label errors: 1) the vulnerability is spread across multiple functions, 2) the function is not vulnerable, but changing the function is relevant to fixing the vulnerability (e.g., to adjust calling parameters), and 3) the function is not vulnerable and irrelevant to the vulnerability (e.g., a vulnerability-fixing commit changes the spaces in some nonvulnerable functions, or makes irrelevant functionality changes to nonvulnerable functions).
Table 8 shows our analysis results. The vulnerable function labels are 60% accurate in DiverseVul, which is 24 percentage points higher than in the previous three datasets (CVEFixes ∪ BigVul ∪ CrossVul). Within these three datasets, CVEFixes is the most accurate one, whereas BigVul has very low label accuracy, only 25%. We observe that many commits included in BigVul from the Chromium and Android projects are not relevant to fixing vulnerabilities at all. We also found that the percentage of irrelevant functions is surprisingly high, ranging from 17.4% to 50% in the four datasets. These functions are not related to the vulnerability, but since they were changed by the vulnerability-fixing commits, the automatic labeling process labels them as vulnerable.
Concurrent work also examined label noise and found significant label errors in the BigVul and Devign datasets [7]. Compared to their categorization, we have stricter criteria for labeling a function as vulnerable: we consider the caller of a vulnerable function as non-vulnerable; they considered it vulnerable. Also, if a function is only part of the vulnerability, and if the vulnerability cannot be recognized from the code of this function alone, we consider that a wrong label; they considered it correct. Taking into account the differences in categorization, our findings for BigVul (the only dataset common to their work and ours) are largely consistent with their findings.

LIMITATIONS
The label noise in our dataset and prior datasets may introduce errors into our measurement of the performance of all models on the test set. We hope that releasing our dataset will enable the community to explore methods to remediate the effects of label noise in the future.
In retrospect, the de-duplication procedure in our dataset and prior datasets could be improved. As part of the label noise analysis, we discovered that 4% of DiverseVul labels and 6% of (CVEFixes ∪ BigVul ∪ CrossVul) labels were erroneous because the commit made whitespace-only changes to some functions, and these were treated as security fixes during labelling. Therefore, normalizing the whitespace in all functions before de-duplication could slightly improve label accuracy, and might have other benefits.
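A whitespace normalization step of this kind could look like the sketch below. It is a rough illustration rather than the de-duplication code used to build the dataset: the function names are hypothetical, and removing all whitespace is a deliberately aggressive choice that also ignores whitespace inside string literals.

```python
import hashlib
import re


def whitespace_insensitive_hash(function_source):
    """Hash a function's source after removing all whitespace."""
    stripped = re.sub(r"\s+", "", function_source)
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


def is_whitespace_only_change(before, after):
    """True if a commit changed only the whitespace of a function."""
    return (before != after and
            whitespace_insensitive_hash(before) == whitespace_insensitive_hash(after))


before = "int f(int a){return a+1;}"
after = "int f(int a) {\n    return a + 1;\n}"
print(is_whitespace_only_change(before, after))
```

Functions flagged by such a check would not be labelled vulnerable merely because the vulnerability-fixing commit reformatted them.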
There is a risk of contamination, i.e., test data leaking into pretraining data, as LLMs are pre-trained on text and code, which could conceivably include blog articles or code patches related to security vulnerabilities included in our test set. Many of our models (CodeBERT, GraphCodeBERT, PolyCoder, CodeT5 Small, CodeT5 Base, NatGen) were only pre-trained on code, not on other text or code changes, so they could have been exposed to code in our test set but were unlikely to be exposed to a description of which code is vulnerable. This could potentially affect our results in ways that we cannot measure. Other models (RoBERTa, GPT-2 Base, CodeGPT, T5 Base) were pre-trained on text, and so could possibly have been exposed to blog articles that describe vulnerable source code. We suspect that this is very rare, but we cannot measure it, so we cannot rule out the possibility of test set contamination. The latter models (RoBERTa, GPT-2 Base, CodeGPT, T5 Base) performed relatively poorly in our experiment in any case.
There is also a risk that cloned code could cause test set contamination, if the cloned code was subsequently modified slightly (thus evading our de-duplication efforts).

CONCLUSION
This paper presents a new dataset, DiverseVul, for detecting software vulnerabilities using deep learning. The dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 nonvulnerable functions, extracted from 7,514 commits, which is more diverse and twice the size of the previous largest and most diverse dataset, CVEFixes. We use this new dataset to study the effectiveness of various deep learning architectures in detecting vulnerabilities. We have experimented with 11 different deep learning architectures from four model families: Graph Neural Networks (GNN), RoBERTa, GPT-2, and T5. The results suggest that the increased diversity and volume of training data examined in this paper is beneficial for vulnerability detection, especially for large language models, but it is unclear whether even larger datasets would help or not. Code-specific pretraining tasks appear to be a promising research direction for deep learning based vulnerability detection. Our results highlight a major challenge for future research: improving deep learning models so they generalize to unknown projects. We release the DiverseVul dataset to the community at https://github.com/wagner-group/diversevul.
rate results in a degenerate model that always predicts a function as nonvulnerable.

Figure 2: We visualize the performance of models that are trained on CVEFixes, Previous, and Previous + DiverseVul. Adding DiverseVul to the merged Previous dataset helps improve the test performance for 7 models out of 11. It does not help the CodeT5 models.

Figure 3: Deep learning for vulnerable source code detection benefits from more data collected from the same distribution as the test data. We fine-tune CodeT5 Small models on different amounts of vulnerable source code data and report the test F1 score. We run each dataset setup 10 times. The lines are the average, and the shaded region denotes the 95% confidence interval. This figure shows that a larger training set improves the F1 score on vulnerability detection on test data from the same distribution.

4.5.3 Class Weights for Cross Entropy Loss. In this scheme, we still use cross entropy loss for training. We upweight the loss value for

Figure 4: Using class weights in the training loss function improves the generalization performance over unseen projects for CodeT5 Small, and it slightly improves the performance on seen projects as well. The test F1 score on unseen projects is still quite low.

Table 1: Top 10 projects and CWEs in DiverseVul and the corresponding number of vulnerability-fixing commits.
Table 1 shows the top 10 projects and the top 10 CWEs in DiverseVul with the most number of vulnerability-fixing commits. Note that CWE-703 "Improper Check or Handling of Exceptional Conditions" is not on the list of MITRE top-25 CWEs.
1 snyk.io and bugzilla.redhat.com.
trained on the Previous + DiverseVul dataset, CodeBERT and GraphCodeBERT improve the F1 score by up to 2.8% compared to RoBERTa. On the other hand, when trained on Previous dataset, CodeGPT and
We aggregate previous five datasets by combining and deduplicating samples from Devign, ReVeal, BigVul, CrossVul, and CVEfixes.
visualizes how much training on DiverseVul + Previous data helps improve the vulnerability detection performance, compared to Previous data. Adding DiverseVul to the training set improves the F1 score for 7 models by 2.4% on average, compared to only training with the Previous dataset. However, it does not help the best performing CodeT5 models, and it only helps NatGen modestly. Even though we see a big improvement to model performance by training on the merged Previous datasets compared to only training on CVEFixes, collecting a different dataset may not further improve that.
Dataset Setup. We want to measure the effect of data volume on model performance for vulnerability detection. We run the following experiment ten times. For each run, we randomly split the Previous + DiverseVul into training, validation, and test sets. Then, we simulate the effect of different data volume by subsampling the training and validation sets. Specifically, we randomly sample 10% to 90% of the training and validation data from the full training and validation data of Previous + DiverseVul. Then, we train the
shows an upward trend of better test F1 score as the volume of training data increases. If we know the test data distribution ahead of the model

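Table 5: We randomly choose 95 projects as unseen projects for testing. The remaining projects are used for training. We train each model on seen projects and test them on unseen projects. We highlight the row with the highest F1 score in bold. Overall, the F1 scores show that these models have poor generalization performance on unseen projects. Adding DiverseVul to Previous training set helps improve the generalization performance for all models except NatGen.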

Table 7: We evaluate the prediction performance of the CodeT5 Base model across the top-25 CWEs and the 12 most popular CWEs in DiverseVul. We highlight the 10 highest training sample percentages and the 10 highest TPR numbers in bold. Having more training samples for a specific CWE does not necessarily improve the model's prediction performance, and some CWEs are harder to learn than others. Most CWEs with 0% TPR have under 10 samples in the test set.

Table 8: Label accuracy of four datasets, evaluated on a random sample of vulnerable functions.