Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection

Deep learning-based vulnerability detection has shown great performance and, in some studies, outperformed static analysis tools. However, the highest-performing approaches use token-based transformer models, which are not the most efficient way to capture the code semantics required for vulnerability detection. Classical program analysis techniques such as dataflow analysis can detect many types of bugs based on their root causes. In this paper, we propose to combine such root-cause-based vulnerability detection algorithms with deep learning, aiming to achieve more efficient and effective vulnerability detection. Specifically, we designed DeepDFA, a dataflow analysis-inspired graph learning framework and an embedding technique that enables graph learning to simulate dataflow computation. We show that DeepDFA is both performant and efficient. DeepDFA outperformed all non-transformer baselines. It was trained in 9 minutes, 75x faster than the highest-performing baseline model. When trained on only 50+ vulnerable examples and several hundred total examples, the model retained the same performance as when trained on 100% of the dataset. DeepDFA also generalized to real-world vulnerabilities in DbgBench; it detected 8.7 out of 17 vulnerabilities on average across folds and was able to distinguish between patched and buggy versions, while the highest-performing baseline models did not detect any vulnerabilities. By combining DeepDFA with a large language model, we surpassed the state-of-the-art vulnerability detection performance on the Big-Vul dataset with 96.46 F1 score, 97.82 precision, and 95.14 recall. Our replication package is located at https://doi.org/10.6084/m9.figshare.21225413.


INTRODUCTION
Software vulnerabilities cause great harm to people and corporations. Many Internet users have had their personal information breached because of security vulnerabilities, with common reports of breaches exposing millions of records [52]. The average data breach costs the target company $4.24 million, according to IBM's 2021 report [2]. The number of vulnerabilities is growing every year, as reported by the Common Vulnerabilities and Exposures (CVE) program from 2016-2021 [1]. Given the severity of this problem, we urgently need to develop effective and automatic vulnerability detection tools.
The rapid advance of AI technologies has motivated software companies to invest heavily in deep learning-based vulnerability detection tools [39,55]. These tools have outperformed traditional static analysis [11,20,38]. Recently, large language models (LLMs) have reported state-of-the-art results; LineVul [24], a recent model based on CodeBERT, reported an F1 score of 91 on a commonly used real-world vulnerability dataset [22].
However, LLMs require large amounts of training data and computational resources for training and inference (see §5.4), and a large volume of high-quality vulnerability detection data is hard to obtain. They can also fail to detect vulnerabilities beyond the training dataset (see §5.5); for example, the top-performing transformer models LineVul and UniXcoder were not able to detect any of the real-world vulnerabilities in DbgBench [9]. Furthermore, by using solely text tokens, these models may not effectively learn program semantics, such as program values along paths, propagation of taint values, and security-sensitive API calls along the control flow paths. The performance of these models can be further improved when we consider such information (see §5.3).
In this paper, we explore the idea of combining dataflow analysis (DFA) algorithms with deep learning to develop small, efficient, yet effective models for vulnerability detection. In prior literature [16,53], deep learning integrated with domain-specific knowledge and algorithms has reported improved performance and better generalization to unseen data, while using less data and computational resources.
Dataflow Analysis (DFA) computes the data usage patterns and relations in the control flow graph (CFG) of a program and reports a vulnerability based on its root cause, i.e., whether the values and data relations collected from the program indicate the occurrence of the vulnerable conditions. Graph learning (learning based on graph neural networks, GNNs) can aggregate and propagate information in the graph in a similar fashion to DFA. In this paper, we explore the analogy between DFA and the GNN message-passing mechanism and design an embedding technique that encodes dataflow information at each node of the CFG. Specifically, we leverage the efficient bit-vector representation of dataflow facts to encode the definitions and uses of the variables. Graph learning on such an embedding propagates and aggregates dataflow information and thus simulates the dataflow computation as done in DFA. Using this approach, we hope that the learned graph representation can better encode program semantic information, e.g., reaching definitions, which will be very useful for accurate vulnerability detection. Based on this rationale, we developed an abstract dataflow embedding that can map variable definitions of individual programs to a common space so that the model can compare and generalize data usage patterns (dataflow) related to vulnerabilities across programs. We selected a graph learning architecture whose aggregate and update functions worked most effectively for the dataflow propagation.
Our evaluation shows that DeepDFA is substantially faster than our baseline models in terms of both training and inference time. It took only 9 minutes to train, and inference on a CPU took 5.8 ms/example. This efficiency permits applications such as personalized training and inference in non-GPU environments. It is also efficient in its use of training data, achieving its best F1 score using only 50+ vulnerable examples and several hundred total examples (§5.4). This frugality allows applications within a single development team, where it may be impractical to collect thousands of vulnerable examples. Yet, DeepDFA still outperformed all non-transformer baselines (§5.3) and retained its performance on unseen projects better than all baseline models (§5.5). Additionally, when applied to a real-world benchmark of unseen projects, DbgBench [9], DeepDFA detected 8.7 out of 17 bugs (averaged over 3 runs) and correctly reported 3 out of 5 patched programs as non-vulnerable (§5.5). In comparison, the highest-performing baselines, LineVul [24] and UniXcoder [26], did not detect any vulnerabilities. We also show that DeepDFA's learned representation can be used with other models to further improve their performance. By combining UniXcoder with DeepDFA, we surpassed state-of-the-art performance with 96.46 F1 score, 97.82 precision, and 95.14 recall.
In summary, we made the following contributions in this paper: (1) We designed an abstract dataflow embedding to enable deep learning to generalize semantics/dataflow patterns of vulnerabilities across programs (§4.1); (2) We applied graph learning on the control flow graph (CFG) of the program and abstract dataflow embedding to simulate reaching definition dataflow analysis (§4.2); (3) We implemented DeepDFA and experimentally demonstrated that DeepDFA outperforms baselines in vulnerability detection for effectiveness, efficiency, and generalization over unseen projects (§5); (4) We provided rationale to help understand why DeepDFA performs well and is efficient (§3); and (5) We surpassed the state-of-the-art vulnerability detection performance by combining DeepDFA and UniXcoder (§5).

OVERVIEW
We propose DeepDFA, a deep learning framework guided by dataflow analysis algorithms, shown in Figure 1. Given the source code of a potentially vulnerable program (left), we convert it to a CFG and encode the nodes using an abstract dataflow embedding which we designed. The CFG specifies the execution order of statements, and is the data structure on which dataflow analysis operates.
In the middle of the figure, we show our approach to computing abstract dataflow embeddings. In dataflow analysis, definitions of variables, e.g., a=3, are program specific. To apply them in deep learning, we abstract these concrete definitions from different programs, and hypothesize that the usage patterns of the abstract definitions can be compared and summarized across programs during learning. To construct the abstract definitions, we used the properties of definitions that are important for vulnerability detection, based on domain knowledge from program analysis. Specifically, we considered the data types of the defined variable, and the API calls, constants, and operators used to define the variables. Inspired by the bit-vector representation used in dataflow analysis, we encode the abstract definitions in a compact and efficient fashion. We provide a more detailed design of this embedding in Section 4.1.
We used a bit-vector style of representing a set of abstract definitions. This numerical representation can be directly used as the initial node representation for graph learning. On the right of the figure, we apply graph learning, which aggregates the information from nodes like the "merge" operation performed in dataflow analysis, and also updates each node using the information at that node like the "update" operation performed in dataflow analysis. We provide more background on the analogy in Section 3. Finally, we use the learned graph representation to classify whether the function is vulnerable or not. By directly propagating dataflow information through graph learning, we hope to present to the classifier a representation of the program which encodes useful information directly related to vulnerability, achieving efficient and effective vulnerability detection. The advantage of deep learning is that the mapping from the encodings of programs to the decisions is learned from the data, whereas in dataflow analysis, we need to manually craft rules to map from the dataflow analysis results to vulnerability decisions.

RATIONALE
In this section, we provide the relevant background on dataflow analysis for vulnerability detection and on graph learning, which explains why our approach is efficient and effective. We then compare our approach with closely related work that also considers dataflow in deep learning, to clarify the novelty of our work.

Dataflow Analysis for Vulnerability Detection
Dataflow analysis (DFA) is a method for computing data usage patterns in a program. In addition to compiler optimization, dataflow analysis is an important method for vulnerability detection. One instance of dataflow analysis, called reaching definition analysis, reports which program points a particular variable definition can reach. A definition reaches a node when there is a path in the CFG that connects the definition and the node, and the variable is not redefined along the path. The reaching definition analysis can detect a null-pointer dereference vulnerability based on its root cause when it identifies that a definition of a NULL pointer reaches a dereference of the pointer. Similarly, it is a causal step in detecting many other vulnerabilities such as buffer overflows, integer overflows, uninitialized variables, double-free and use-after-free [12]. DFA uses two equations to propagate the dataflow information through the neighboring nodes in the CFG, namely the meet operator and the transfer function [4]. The meet operator aggregates the dataflow sets from a node's neighbors. The transfer function updates the dataflow set using the information available in the node. In the reaching definition analysis, the dataflow set is a set of definitions that reach a program point. A simple approach to performing a DFA is the Kildall method [33]. It iteratively propagates the dataflow information to the neighbors of each node in the CFG, one step at a time. The algorithm terminates when the dataflow information of all nodes stops changing, called a fixpoint. At termination, all nodes will incorporate the dataflow information from all other relevant nodes. When used for vulnerability detection, this information is compared to a user-specified vulnerability condition to determine whether a vulnerability has occurred in the program.

Analogy of Graph Learning and Dataflow Analysis
Graph learning starts with an initial node representation, and then it performs a fixed number of iterations of the message-passing algorithm [25] to propagate information through the graph. The initial node representation is generally a fixed-size continuous vector which represents the content of the node. At each iteration, each node aggregates information from its neighbors, and then updates its state to integrate the information. The two steps are done through the AGGREGATE and UPDATE functions, similar to the two dataflow equations of meet operator and transfer function. These functions can be simple numerical equations or neural networks. After iteration is done, all node representations are combined to produce a graph-level representation, which is passed to a classifier layer to make a prediction.
In Figure 2, we visualize the analogy between graph learning and dataflow analysis on a snippet of CFG. In the CFG, each node is a statement, and each edge indicates the order of execution between two statements. In Figure 2a, we show the two dataflow equations [4] that define a reaching definition dataflow analysis.
meet operator: $IN[v] = \bigcup_{p \in pred(v)} OUT[p]$ (1)
transfer function: $OUT[v] = GEN_v \cup (IN[v] - KILL_v)$ (2)
where $IN[v]$ and $OUT[v]$ are the sets of dataflow located at the beginning and end of a statement. $GEN_v$ and $KILL_v$ represent the dataflow generated (new definitions) and killed (overwritten definitions) in node $v$. Reaching definition is a may dataflow problem and thus the meet operator uses union to merge the dataflow information from a node's predecessors. Meanwhile, reaching definition is a forward dataflow problem, and thus we use $IN[v]$, $GEN_v$, and $KILL_v$ to compute the dataflow at the exit of the statement.
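To make the meet operator and transfer function concrete, the following minimal Python sketch iterates them to a fixpoint with a worklist, in the spirit of Kildall's method. The four-node CFG, the node names v1 to v4, and the definition labels d1 and d2 are hypothetical and are not the example from Figure 1; the IN, OUT, GEN, and KILL sets are modeled as Python sets.

```python
from collections import deque


def reaching_definitions(preds, gen, kill):
    """Iterate the meet and transfer equations until no OUT set changes."""
    nodes = list(preds)
    succs = {v: [] for v in nodes}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)

    in_sets = {v: set() for v in nodes}
    out_sets = {v: set() for v in nodes}  # every OUT set starts empty
    worklist = deque(nodes)
    while worklist:
        v = worklist.popleft()
        # Meet operator: union over predecessors (a "may" analysis).
        in_sets[v] = set().union(*(out_sets[p] for p in preds[v]))
        # Transfer function: OUT = GEN united with (IN minus KILL).
        new_out = gen[v] | (in_sets[v] - kill[v])
        if new_out != out_sets[v]:
            out_sets[v] = new_out
            pending = [s for s in succs[v] if s not in worklist]
            worklist.extend(pending)
    return in_sets, out_sets


# Hypothetical CFG:
#   v1: str = NULL            (definition d1)
#   v2: if (argc > 1)         (branch)
#   v3: str = malloc(...)     (definition d2, overwrites d1)
#   v4: use of *str           (dereference)
preds = {"v1": [], "v2": ["v1"], "v3": ["v2"], "v4": ["v2", "v3"]}
gen = {"v1": {"d1"}, "v2": set(), "v3": {"d2"}, "v4": set()}
kill = {"v1": {"d2"}, "v2": set(), "v3": {"d1"}, "v4": set()}

in_sets, out_sets = reaching_definitions(preds, gen, kill)
print(sorted(in_sets["v4"]))
```

In this sketch, d1 still reaches the use at v4 through the branch that skips v3, which is the kind of fact a user-specified null-pointer dereference condition would check.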
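In Figure 2b, we show an analogous behavior of graph learning.
aggregate: $a_v^{(t)} = \mathrm{AGGREGATE}(\{h_u^{(t-1)} : u \in N(v)\})$ (3)
update: $h_v^{(t)} = \mathrm{UPDATE}(h_v^{(t-1)}, a_v^{(t)})$ (4)
where $N(v)$ is the set of neighbors of node $v$, $a_v^{(t)}$ denotes the aggregated information from the neighboring nodes, and $h_v^{(t)}$ denotes the state of node $v$ after $t$ iterations of message-passing (analogous to $OUT[v]$). We set $t$ as a hyperparameter.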
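As an illustration of the aggregate and update steps, the sketch below runs a fixed number of message-passing rounds over a small graph. It is only a stand-in: in graph learning both functions are learned, whereas here they are fixed element-wise maxima chosen so that they behave like set union on 0/1 vectors. The chain graph, the function names, and the use of predecessors as the neighborhood are assumptions made for this example.

```python
def aggregate_max(messages, dim):
    """Stand-in AGGREGATE: element-wise max, i.e. set union on 0/1 vectors."""
    agg = [0.0] * dim
    for m in messages:
        agg = [max(x, y) for x, y in zip(agg, m)]
    return agg


def update_max(state, aggregated):
    """Stand-in UPDATE: keep the node's own facts and add incoming ones (no KILL)."""
    return [max(x, y) for x, y in zip(state, aggregated)]


def message_passing(preds, h0, t):
    """Run t rounds; each node reads its predecessors' states from the previous round."""
    dim = len(next(iter(h0.values())))
    h = {v: list(x) for v, x in h0.items()}
    for _ in range(t):
        a = {v: aggregate_max([h[u] for u in preds[v]], dim) for v in h}
        h = {v: update_max(h[v], a[v]) for v in h}
    return h


# Hypothetical chain v1 -> v2 -> v3 -> v4; only v1 starts with a non-zero state.
preds = {"v1": [], "v2": ["v1"], "v3": ["v2"], "v4": ["v3"]}
h0 = {"v1": [1.0, 0.0], "v2": [0.0, 0.0], "v3": [0.0, 0.0], "v4": [0.0, 0.0]}

h_after_2 = message_passing(preds, h0, t=2)
h_after_3 = message_passing(preds, h0, t=3)
print(h_after_2["v4"], h_after_3["v4"])
```

With two rounds the fact held by v1 has reached v3 but not v4; a third round is needed, which shows why the number of iterations bounds how far information can travel.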

The Novelty of Our Work
Previously, researchers have proposed to integrate dataflow information with deep learning for program analysis tasks. A category of approaches similar to Devign [56] used data dependency graphs as a part of the program representation on which deep learning is performed. However, Devign used word embeddings to encode statements into vector representations based on their unstructured text content. Such an encoding, even propagated through data dependency edges, cannot directly capture the dataflow patterns.
ProGraML [17] has developed graph learning on LLVM IR code and applied it for compiler optimization tasks. It is another work that pointed out the analogy between DFA and graph learning. Their solution is to modify CFGs by creating instruction nodes and data nodes separately. ProGraML adds control-flow edges between instruction nodes and data-flow edges between the data nodes. However, this work encoded nodes using an embedding which only represents LLVM IR operators and variable types. This approach is very coarse-grained in that many statements can have the same operators and variable types, but they will lead to different dataflow. Therefore, similar to Devign, the propagation of such an encoding even along dataflow edges does not directly capture dataflow patterns.
Our abstract dataflow embedding attempts to directly represent the variable definitions which are propagated in DFA and is modeled after the bit-vector representation used in DFA, which allows the network to learn the operations of the dataflow analysis algorithm. We also target a specific problem (reaching definitions, which was not targeted by ProGraML), for which the results of DFA are directly useful and pertinent to vulnerability detection (see §4.2).

APPROACH
Based on the analogous behaviors of DFA and GNN, we designed a node embedding that can represent the dataflow set at each node. We developed DeepDFA, a deep learning framework which conducts graph learning on the CFG of a program and propagates dataflow information for vulnerability detection.

Abstract Dataflow Embedding
In dataflow analysis, we use a bit vector to represent the dataflow set at each node. A bit vector is a sequence of bits, each 0 or 1, whose length is the size of the domain. A bit is set to 1 if its corresponding element is present in the set. In reaching definition analysis, the domain consists of all the definitions in the program, and a bit is set to 1 if the corresponding definition reaches the node. For example, in Figure 1, the program contains three definitions at nodes $v_1$, $v_3$, and $v_4$, so the reaching definition analysis uses a bit vector [0 0 0] to initialize each node at the beginning of the analysis. This bit vector represents $OUT[v]$ in the dataflow equations (see Section 3.2). It is updated at each step of propagation, and when the analysis terminates, the bit vectors for each node represent all possible definitions that can reach that node.
The bit-vector representation of reaching definition analysis efficiently encodes program semantic features related to vulnerability detection. The definitions of programs can be quickly obtained at the node via lightweight analysis locally at the statements. However, in graph learning, we cannot directly use the bit vector of definitions as the node embedding. This is because, in dataflow analysis, the definitions and the domain of definitions are both specific to a program. In other words, different programs have different variable definitions; the bit vectors of each program thus have different lengths and the elements (each definition) are not comparable either. In graph learning, by contrast, we want to extract dataflow patterns of vulnerabilities from all the programs in the training dataset. Thus, we need to have a "global" definition set that can be used to specify definitions for different programs, so that graph learning can compare them and generalize from them.
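The bit-vector view of a dataflow set can be sketched with plain Python integers, where union becomes bitwise OR and set difference becomes AND with a complement. The three definition masks and the helper names below are hypothetical and only illustrate the representation for a program with three definitions.

```python
N_DEFS = 3
D1, D2, D3 = 0b001, 0b010, 0b100  # one bit per definition in the program


def to_bit_list(mask, n=N_DEFS):
    """Render a mask as a list of bits; position i corresponds to definition i+1."""
    return [(mask >> i) & 1 for i in range(n)]


def meet(*out_masks):
    """Union of the predecessors' sets: bitwise OR."""
    merged = 0
    for m in out_masks:
        merged |= m
    return merged


def transfer(in_mask, gen_mask, kill_mask):
    """GEN united with (IN minus KILL), expressed on bits."""
    return gen_mask | (in_mask & ~kill_mask)


initial = 0                                            # the all-zero starting vector
out_a = transfer(initial, gen_mask=D1, kill_mask=D2)   # a node that defines d1
out_b = transfer(out_a, gen_mask=D2, kill_mask=D1)     # a node that redefines it as d2
merged = meet(out_a, out_b)                            # a join point seeing both paths

print(to_bit_list(initial), to_bit_list(out_b), to_bit_list(merged))
```

Because each set is a single machine word here, both operations take constant time for small domains, which is the efficiency the bit-vector representation is known for.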
To address this challenge, we map all the concrete definitions in the programs in a training dataset to abstract definitions by identifying important properties of the definitions. Following a list of attack surfaces identified by Moshtari et al. [41], we designed the following four properties that can encompass the attack surfaces of a vulnerability and used them to represent a definition:
(1) API call: the call to library or system functions used to define a variable, e.g. malloc and strlen.
(2) Data type: the data type of the variable being assigned, e.g. int, char* and float.
(3) Constant: the constant values assigned in the definition.
(4) Operator: the operators used to define the variable.
We analyze a large corpus of programs, e.g., the training set, and collect the top-$k$ most frequently used API calls, data types, constants and operators to construct a dictionary. $k$ is a hyperparameter of DeepDFA. We select only the top-$k$ keys because the representations of user-defined names of APIs and data types cannot be generalized across programs unless they are represented frequently in the dataset.
In Figure 3, we show an example of abstract dataflow embedding for the example definition $d_2$ in Figure 1: str = malloc(10 * argc). This definition used an API call, malloc, with the constant 10, operator *, and data type char*. In contrast to the 3-bit bit vector (the example in Figure 1 includes three variable definitions) that represents a concrete definition in dataflow analysis for this program, the abstract embedding is larger but has a fixed size, consisting of 5×4 elements for this example. Here, 4 is the number of properties we considered and 5 is the hyperparameter $k$ mentioned above, which defines the size of the pre-defined dictionary; each length-5 hot vector encodes the value of one property. Because the vector that encodes the abstract dataflow embedding has a fixed size, our embedding approach can scale to any program size in the dataset without impacting the model's efficiency. The vectors in different programs encode common properties of definitions, so the model can capture the dataflow patterns across programs.
Abstraction potentially brings in approximation. Using the abstract dataflow embedding, two different definitions may lead to the same encoding. The embedding is designed to be sparse enough that within a program, unique definitions are often represented by unique embedding keys, which allows the model to distinguish definitions within the same function, similar to the bit vector used in dataflow analysis.
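The dictionary construction and the fixed-size encoding described in this subsection can be sketched as follows. This is a simplified illustration, not the DeepDFA implementation: the tiny corpus, the function names, the choice to allow several values per property, and the treatment of values outside the top-k (their slot is left all zero) are assumptions.

```python
from collections import Counter

PROPERTIES = ("api", "datatype", "constant", "operator")


def build_dictionaries(definitions, k):
    """Keep the k most frequent values of each property across the corpus."""
    dicts = {}
    for prop in PROPERTIES:
        counts = Counter(val for d in definitions for val in d.get(prop, []))
        dicts[prop] = {val: i for i, (val, _) in enumerate(counts.most_common(k))}
    return dicts


def embed(definition, dicts, k):
    """Concatenate one length-k hot vector per property (4 * k entries in total)."""
    vec = []
    for prop in PROPERTIES:
        hot = [0] * k
        for val in definition.get(prop, []):
            idx = dicts[prop].get(val)
            if idx is not None:  # assumption: values outside the top-k are dropped
                hot[idx] = 1
        vec.extend(hot)
    return vec


# Hypothetical corpus of definitions, already reduced to their four properties.
corpus = [
    {"api": ["malloc"], "datatype": ["char*"], "constant": ["10"], "operator": ["*"]},
    {"api": [], "datatype": ["char*"], "constant": ["NULL"], "operator": []},
    {"api": ["strlen"], "datatype": ["int"], "constant": [], "operator": ["+"]},
]
K = 5
dicts = build_dictionaries(corpus, K)

# str = malloc(10 * argc): API malloc, type char*, constant 10, operator *.
vec = embed(corpus[0], dicts, K)

# A user-defined allocator that never appears in the corpus has no dictionary key.
custom = {"api": ["my_alloc"], "datatype": ["char*"], "constant": ["10"], "operator": ["*"]}
vec_custom = embed(custom, dicts, K)

print(len(vec), sum(vec), sum(vec_custom))
```

The vector length depends only on k and the number of properties, so programs of any size map into the same space; the price is that definitions sharing all four properties, or using names outside the dictionary, become indistinguishable.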

Using Graph Learning to Propagate Dataflow Information
Our goal in utilizing graph learning is to learn a node embedding that contains dataflow information. Without loss of generality, we use reaching definition as an instance of dataflow analysis for our explanations. Our approach takes the following steps. First, we construct the CFG for a program. Second, we perform static analysis to identify all the definitions in the CFG. We then initialize each node of the CFG using the abstract dataflow embedding, based on whether the node is a definition or not. The abstract dataflow embedding is computed from all the programs in the training dataset (see §4.1 for details).
Once the nodes are initialized, we apply the message-passing algorithm [25] from graph learning to propagate the dataflow information throughout the CFG, similar to Kildall's method [33]. The main differences are that (1) we propagate the abstract dataflow embeddings of the CFG nodes, and (2) instead of using the dataflow equations of transfer function and meet operator, we apply the AGGREGATE and UPDATE functions defined in Equations 3 and 4 (see §3.2). Although the analogy applies for all GNN architectures trained with message-passing, we implemented our approach using a Gated Graph Sequence Neural Network (GGNN) [35], where AGGREGATE is a Multi-Layer Perceptron (MLP) and UPDATE is a Gated Recurrent Unit (GRU); we will use this architecture as an example to compare the two algorithms.
When dataflow information arrives at the merge point of a branch in the CFG, graph learning applies the AGGREGATE function. Specifically, in GGNN, the MLP calculates a weighted sum of the representations of multiple neighboring predecessors, resulting in a single vector; this fulfills the same function as the meet operator. When dataflow information arrives at a new node, the UPDATE function in graph learning computes the next state by combining the information in the current node with the output of AGGREGATE from its predecessors. Specifically, in GGNN, the GRU selectively forgets portions of the previous state and integrates new information from the current node and from the neighboring states, similar to the set union/difference with GEN/KILL performed in the transfer function. Through applying AGGREGATE and UPDATE, the initial embedding will be updated with the dataflow information from the neighboring nodes, similar to the effect of dataflow analysis.
As Cummins et al. [17] noted, DFA iterates to a fixpoint and thus propagates information throughout the entire graph, while graph learning performs a fixed number of iterations $t$ and thus propagates only to neighbors within distance $t$. We set $t$ to the value which maximized validation-set performance.
Finally, we combine the learned abstract node embeddings to produce a graph-level representation using Global Attention Pooling [35], and pass it to a classifier to predict the function as vulnerable or non-vulnerable.
The AGGREGATE and UPDATE functions are learned from labeled data during training, rather than using a fixed formula as in dataflow analysis. By learning from data, we provide an alternative solution to the challenges that often block dataflow analysis such as tracking pointers and handling library calls. Importantly, we no longer need to explicitly specify vulnerability conditions, as required in static analysis. Through learning from training examples, the classifier can capture patterns of dataflow information that represent various types of vulnerabilities and also select the relevant dataflow information for vulnerability detection.
In Table 1, we step through a reaching definition analysis for the CFG example in Figure 1 to demonstrate how dataflow information propagates through the graph and how our approach uses dataflow information for vulnerability detection.
The row Iteration 0 shows the initialization of each node in the reaching definition analysis. At iteration 1, the DFA updates 3 [...]. After the DFA algorithm terminates, the final states of the nodes are used to detect vulnerabilities. The state of $v_4$ is [1 1 1], which indicates that both $d_1$ and $d_2$ may reach $v_4$ depending on the program values. Because the definition $d_1$: str = NULL can reach the dereference at $v_4$, we can conclude that this program has a null-pointer dereference vulnerability. Similarly, in graph learning, after a fixed number of iterations, all the node representations are combined using a graph readout operation to produce a graph-level representation, which is used to predict whether the function is vulnerable. Programs with null-pointer dereference bugs will have the same abstract definitions, characterized by the char* type and the constant NULL, reaching the pointer dereference statements. We believe that the dataflow information represented by DeepDFA will allow a relatively simple classifier to recognize this pattern in the training dataset.

EVALUATION
In the evaluation, we studied 3 research questions: (1) Is DeepDFA effective for finding vulnerabilities? (2) Is DeepDFA efficient, both in terms of training data and computational resources? (3) Can DeepDFA generalize to unseen projects?
We also performed ablations on DeepDFA to understand the effects of each feature on its performance.

Implementation
To explore whether DeepDFA can advance the state-of-the-art, we created two settings, DeepDFA and DeepDFA+LLM. We implemented DeepDFA using the GGNN architecture [35] and based on LineVD's implementation, using PyTorch and DGL. We used Joern to parse the CFGs because it does not require compilation; this allows our approach to be utilized out-of-the-box given only the source code, without extra configuration. To implement DeepDFA+LLM, during training and inference, we combine the graph embedding generated by DeepDFA's graph readout stage with the sentence embedding produced by the final self-attention layer of the LLM. The embeddings are concatenated and fed into a feed-forward classifier layer; both embeddings and the classifier are trained jointly. We believe that providing dataflow information can improve the LLM embedding, as the LLM is trained exclusively from text and it is hard to learn dataflow relations among all the dependencies of tokens.
To avoid data leakage, we extracted the initial abstract dataflow embedding from the training set only. We set the hyperparameters $k$ and $t$ based on the best validation performance in our experiments. When $k$ = 1000, the dictionary covered most (79.38%) of the definitions in the test dataset. That is, the dictionary still misses some APIs, constants, data types or operators that occur in the test dataset but are not frequent in the training dataset. To learn a more general representation and improve the coverage of the test dataset, in future work, we can train the abstract dataflow embedding using a very large dataset of code, e.g., using self-supervised learning without the need for vulnerability labels.
For reproducibility, Table 2 documents the hyperparameters we used for training DeepDFA.

Experimental setup
In the recent literature [13,24,36], vulnerability detection models are typically evaluated with the Devign [56] or Big-Vul [22] datasets, both of which contain real-world open-source C/C++ projects. In our evaluation, we used the Big-Vul dataset because (1) it is bigger than Devign, consisting of 188,636 functions with 10,900 (6%) vulnerable labels and 177,736 (94%) non-vulnerable labels, and (2) it reflects the imbalanced distribution of real-world code (Devign is a balanced dataset), with the minority of code being labeled as vulnerable [13]. To corroborate our results, we also evaluated the models on DbgBench [9], explained further in RQ3.
Scope: Currently, DeepDFA reports whether a function is vulnerable or not. We can further apply deep learning explanation tools to report line-level vulnerabilities, as done in Li et al.'s work [36]. We leave this evaluation for future work. We evaluated DeepDFA on C/C++ programs, as done by most deep learning-based vulnerability detection tools. However, we believe that DeepDFA can also be applied to other popular programming languages, such as Python and Java. This is because we extract the abstract dataflow embedding (API, datatype, literal, operator) from the training dataset, independent of the programming language.
RQ1: To evaluate the models' performance, we trained the models on the train/validation/test splits of 80/10/10% published by the LineVul paper [24]. To address class imbalance while training DeepDFA, we undersampled the majority class (non-vulnerable) following Japkowicz et al. [31]; our initial studies found that this improved our performance on the validation set. We kept the original ratio of vulnerable/non-vulnerable labels for the validation and test sets. We used Joern [54] to parse the code into its CFG representation. Joern could not parse some programs in the dataset (0.8%; see Appendix B for details), so we used the remaining data in our experiments. The performance of the baseline models on the remaining data was similar to their performance on the full dataset (see Appendices C & F for details).
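Undersampling the majority class for training, as described for RQ1, can be sketched as below. The 1:1 target ratio, the function name, and the toy labels are assumptions made for illustration; the text above does not state the ratio that was used.

```python
import random


def undersample(examples, labels, seed=0):
    """Keep all vulnerable examples and an equally sized random sample of the rest."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep = pos + rng.sample(neg, min(len(neg), len(pos)))  # assumed 1:1 ratio
    rng.shuffle(keep)
    return [examples[i] for i in keep], [labels[i] for i in keep]


# Toy training set with roughly the 6% / 94% imbalance reported for Big-Vul.
train_examples = [f"func_{i}" for i in range(100)]
train_labels = [1] * 6 + [0] * 94

balanced_examples, balanced_labels = undersample(train_examples, train_labels, seed=0)
print(len(balanced_labels), sum(balanced_labels))
```

Only the training split is resampled in this sketch; validation and test data would keep their original label ratio so that the reported metrics reflect the real distribution.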
We report the following performance metrics:
• Precision reports the portion of positive predictions which were correct: $P = \frac{TP}{TP + FP}$.
• Recall calculates the portion of positive examples which were recalled correctly: $R = \frac{TP}{TP + FN}$.
• F1 is the harmonic mean of Precision and Recall: $F1 = \frac{2 \cdot P \cdot R}{P + R}$.
We used F1 to decide the highest-performing model because it balances precision and recall, which are both important in an imbalanced dataset.
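For reference, the three metrics can be computed directly from binary labels and predictions, as in this small sketch. The function name and the toy labels are illustrative only; returning 0.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (vulnerable) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: 3 vulnerable functions, 5 non-vulnerable ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```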
Since the model performance can vary with different random seeds [48], we trained the models 3 times with different random seeds and reported the mean score and standard deviation for each metric. We used McNemar's statistical test, following best practices [18], to confirm that our improvement is statistically significant, using the implementation in statsmodels v0.14.0 [47].
RQ2: To evaluate the models' efficiency in terms of computational resources, we measured the runtime and memory usage. These are often contested resources in deep learning workloads [30]. We report metrics for both runtime and memory usage. We evaluated the runtime on an AMD Ryzen 5 1600 3.2 GHz processor with 48GB of RAM and an Nvidia 3090 GPU with 24GB of GPU memory.
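The paired significance test used for RQ1 can be illustrated with an exact, two-sided McNemar test written with the standard library only. This is a stand-in for explanation and is not the statsmodels routine used in the study; the function name and the toy per-example correctness flags are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired per-example correctness flags."""
    # Only discordant pairs matter: one model is right where the other is wrong.
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if o and not a)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, discordant pairs split like a fair coin.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data: 20 examples both models get right, 10 only model A gets right,
# and 2 only model B gets right.
correct_a = [True] * 20 + [True] * 10 + [False] * 2
correct_b = [True] * 20 + [False] * 10 + [True] * 2
p_value = mcnemar_exact(correct_a, correct_b)
print(round(p_value, 4))
```

The test compares two models on the same test examples, so it uses the pairing that a comparison of two aggregate scores would discard.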
To evaluate the models' efficiency in terms of training data, we trained the models on progressively smaller subsets which we randomly sampled from Big-Vul (100%, 10%, 1%, 0.5%, 0.1%, shown in the columns of Table 6). Each subset includes the smaller subsets (e.g. the 10% subset includes the 1% subset, and the 1% subset includes the 0.5% subset). We generated 3 versions of the subsets using different random seeds and reported the mean and standard deviation of the F1 score. The goal of this study is to discover the minimum amount of training data needed for these models to perform well on the test dataset.
RQ3: We prepared two experiments for this RQ. In the first experiment, we created a dataset from Big-Vul to evaluate how well the models generalize to unseen projects. This dataset consists of two settings, mixed-project and cross-project. To set up the mixed-project setting, we held out 10k randomly selected examples for the validation and test sets and used the rest for training, similar to the original method of partitioning the dataset. The training set and test set can and often do contain examples from the same project, though individual examples will not be duplicated between the two sets. To set up the cross-project setting, we held out 10k examples from randomly selected projects in Big-Vul for the validation and test sets, and used the rest of the projects for training. The projects in the test set are distinct from the projects in the training set. To mitigate the potential bias caused by the selection of projects, we repeated this process 5 times with different selections of the cross-project data and report the results of 5-fold cross validation.
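The two sampling schemes described above, nested training subsets and a cross-project holdout, can be sketched as follows. The function names, the toy dataset, and the project names are hypothetical, and the sketch simplifies the setup by producing a single held-out set rather than separate validation and test sets.

```python
import random


def nested_subsets(examples, fractions, seed):
    """Prefixes of one shuffled list, so every smaller subset is inside the larger ones."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return {f: shuffled[: max(1, round(f * len(shuffled)))] for f in fractions}


def cross_project_split(examples, holdout_size, seed):
    """Hold out whole projects until at least holdout_size examples are set aside."""
    rng = random.Random(seed)
    sizes = {}
    for project, _ in examples:
        sizes[project] = sizes.get(project, 0) + 1
    projects = sorted(sizes)
    rng.shuffle(projects)
    heldout, count = set(), 0
    for project in projects:
        if count >= holdout_size:
            break
        heldout.add(project)
        count += sizes[project]
    train = [e for e in examples if e[0] not in heldout]
    test = [e for e in examples if e[0] in heldout]
    return train, test


# Toy dataset: 1000 functions spread over 20 hypothetical projects.
data = [(f"proj{i % 20}", f"func{i}") for i in range(1000)]

subsets = nested_subsets(data, fractions=(1.0, 0.1, 0.01, 0.005, 0.001), seed=0)
train, test = cross_project_split(data, holdout_size=100, seed=0)
print({f: len(s) for f, s in subsets.items()}, len(train), len(test))
```

Splitting by project rather than by example is what keeps code from the same codebase out of both sides of the cross-project split.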
To further evaluate generalization to unseen projects, we applied DeepDFA on buggy and patched programs from DbgBench [9]. DbgBench consists of a set of real-world C programs with bugs, which were analyzed and fixed by professional software engineers. The DbgBench programs are distinct from the programs in the Big-Vul dataset. We labeled the buggy functions using the fault locations documented in DbgBench; these were labeled by the consensus of multiple developers and were manually checked for correctness, and thus are more reliable than Big-Vul's labeling process based on bug-fixing commits. We included all functions which had a bug location marked as "buggy" and their corresponding patched versions in our study. We excluded the bugs marked as "Functional" because these bugs cannot be detected without program-specific bug constraints. We only included the patched versions which were modified by the developers' fixes, taking the first correct developer patch which could be applied to the program. We included only one patch to reduce the effects of code duplication, which can unfairly bias test performance [5]. Since the models only view the function-level context, they will not produce a different prediction on the functions which were not modified by the patch. We also excluded 8 examples which could not be processed by Joern in order to fairly compare the models' performance scores. This resulted in a dataset of 22 programs: 17 buggy and 5 patched. We evaluated the checkpoints trained from 3 random seeds in Section 5.3 and report the mean performance scores.
Ablation study: We ran two ablation settings for each of the four abstract dataflow embedding features, resulting in eight settings: (1) using one feature at a time and (2) using three features at a time (leaving one out). In each setting, we trained DeepDFA on the Big-Vul [22] training dataset, and then evaluated on the Big-Vul test dataset and DbgBench [9].
Baselines: We compared against 7 non-transformer models (VulDeePecker, SySeVR, Draper, Devign, ReVeal, ReGVD, and IVDetect) and 4 large language models (CodeBERT, LineVul, UniXcoder, and CodeT5). These models were developed recently with diverse architectures, and they represent the state of the art of vulnerability detection models [48]. See Section 7 for an overview of the models and Appendix A in the supplementary materials for the details of our reproductions.
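The eight ablation settings described above (four single-feature runs plus four leave-one-out runs) can be enumerated as in the sketch below. The feature identifiers are hypothetical labels for the four abstract dataflow embedding features (API calls, operators, constants, and data types); they are not names taken from the replication package.

```python
# Hypothetical identifiers for the four abstract dataflow embedding features.
FEATURES = ["api_call", "operator", "constant", "datatype"]


def ablation_settings(features):
    """Return single-feature settings followed by leave-one-out settings."""
    single = [[f] for f in features]
    leave_one_out = [[g for g in features if g != f] for f in features]
    return single + leave_one_out


SETTINGS = ablation_settings(FEATURES)
for setting in SETTINGS:
    print("+".join(setting))
```

Each printed line names the feature subset for which one model would be trained and then evaluated.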

Effectiveness
Comparison with non-transformer models: In Table 3a, we show that DeepDFA performed much better than the baseline models on F1 score and recall. DeepDFA's score was 47.51 higher than the average F1 score computed over all the baselines. In addition, compared to the other 6 models, DeepDFA reported lower variances for all three metrics. This indicates that DeepDFA was more robust to random noise throughout training, and thus more likely to perform as expected after training.
The results show that our abstract dataflow embedding indeed encodes useful information for vulnerability detection, despite the fact that the node representation is small and the graph is simple. It is more effective than the property graphs (a combination of AST, CFG, and PDG) used in Devign and ReVeal. These baseline models represented nodes using unsupervised word embeddings [34,40,43], which do not have a direct relationship with vulnerabilities. In contrast, DeepDFA's node representation encodes the dataflow sets of reaching definitions, which are related to the root causes of vulnerabilities.
Comparison with transformer models: We compared DeepDFA with CodeBERT, LineVul, UniXcoder, and CodeT5, the state of the art among the transformer language models we evaluated (see Appendix C for the performances of all the baseline models). Table 3b shows that DeepDFA performed considerably better than CodeBERT and CodeT5 in F1 score and had the smallest variance (among all the models) between runs.
Although UniXcoder and LineVul performed better than DeepDFA in terms of F1 score, DeepDFA's embedding can be combined with UniXcoder and LineVul to further improve their performance. We achieved state-of-the-art performance on all three metrics by adding DeepDFA's embedding to UniXcoder, with an F1 score of 96.46 (1.35 improvement), a Precision score of 97.82 (0.86 improvement), and a Recall score of 95.14 (1.80 improvement). Adding DeepDFA's embedding improved CodeT5 considerably (by 35.78 F1 points) and improved LineVul by 3.17 F1 points. We used McNemar's significance test, as recommended by Dietterich et al. [18], to confirm that the differences in performance were statistically significant (p < 0.05; see Table 4).
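McNemar's test compares two classifiers on the same test examples using only the examples on which they disagree. The paper reports using the statsmodels implementation; the sketch below instead computes the exact (binomial) variant by hand with the standard library, purely for illustration. Whether the authors used the exact or the chi-square variant is not stated, and the function name and return shape are assumptions.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts examples that only model A got
    right and c counts examples that only model B got right.
    """
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Under the null hypothesis, each disagreement favors A with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

A difference would be called significant at the 0.05 level when the returned p-value is below 0.05.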
DeepDFA does not use any text-/token-level information such as variable and function names, yet it has achieved excellent performance. We believe that leveraging the domain-specific algorithm of reaching definition analysis to guide graph learning indeed plays an important role and that the embedding indeed encodes semantic features (e.g., data relations) that are important for vulnerability detection. The fact that DeepDFA can further improve the top-performing LLMs indicates that LLMs, which exclusively leverage text information, may not sufficiently learn the dataflow of code; DeepDFA thus provides complementary information for vulnerability detection. We further believe that the examples which DeepDFA predicted incorrectly could be attributed to the fact that reaching definition analysis cannot handle all types of vulnerabilities. Thus, by adding other dataflow analyses such as live variable analysis, DeepDFA could further improve its performance. We leave such an investigation to our future work.

Efficiency
Efficiency of computational resources: In Table 5, we present the runtime comparison of DeepDFA, LineVul, and UniXcoder. Here, we did not list other models because their performances are much worse (shown in Table 3), and they took hours to train (see Appendix D in the supplementary material), compared to DeepDFA, which finished training in 9 minutes (excluding data preprocessing time).
In Appendix E, we also listed the sizes of the models in terms of the number of parameters. Compared to UniXcoder, DeepDFA was 75x faster to train, 2x faster at inference on GPU, and 84x faster at inference on CPU. DeepDFA had the fewest parameters of all models, equal to 67% of the smallest model (ReVeal) and 0.3% of the highest-performing baseline model (UniXcoder). These results consistently indicate that DeepDFA excels in its efficiency compared to other models. This is possible because DeepDFA is based on the compact bit-vector representation of dataflow analysis, which captures the relevant semantic information in bits and thus is more efficient than tokenized strings. DeepDFA propagated information along only the domain-specific CFG edges, rather than associating every pair of tokens in an exhaustive fashion.
DeepDFA's short inference time, due to a low number of MAC operations, enables its use in non-GPU environments (which are common for software development) where large language models may not be easily deployed. DeepDFA's short training time enables techniques like per-project fine-tuning and hyperparameter tuning, which would be much more costly with the LLMs' training times of over 10 hours. Because of DeepDFA's small parameter count, it is well suited to resource-limited computing platforms such as mobile devices, where large models cannot be used [30].
Efficiency on training data: In Table 6, we report the performance of DeepDFA over reduced training dataset sizes, compared to the SOTA models, LineVul and UniXcoder. The columns "# data" and "# vul" list the number of training examples in each subset and, of these, the number of vulnerable examples. The results show that DeepDFA maintained a stable performance across small dataset sizes, even with only 0.1% of the training dataset, using only 151 training examples. In contrast, the performance of LineVul and UniXcoder steadily declined as the training dataset was reduced. For project-specific training in applications within a single development team, a model which can learn efficiently from a small dataset is useful.
We believe that our model's stable performance over the reduced dataset and good performance with very small training datasets demonstrate the advantage of small models and the effectiveness of domain-specific algorithms to guide model learning. DeepDFA is less prone to overfitting to datasets of limited size since it has fewer parameters than LineVul [8]. On the other hand, the transformer models require a large corpus of programs to learn the patterns among the unstructured token data.
RQ2 result: DeepDFA was considerably faster than the baselines; it took 9 minutes to train, 4.64 milliseconds for inference on GPU, and 5.8 milliseconds for inference on CPU. DeepDFA retained stable performance as the training dataset size was reduced. In a low-data scenario, DeepDFA outperformed LineVul and UniXcoder by 25.49 and 50.88 F1 points, respectively.

Generalization
Cross-project evaluation on Big-Vul: We compared the models' F1 scores on the cross-project (shown as Cross F1) and mixed-project (shown as Mixed F1) settings to evaluate the models' capabilities of generalizing over unseen projects. Table 7 presents the highest-performing baseline models, LineVul and UniXcoder, compared to DeepDFA (the results of the other baseline models are available in

Table 7: How do the models handle unseen projects? Note the performance drop (ΔF1) from the cross-project to mixed-project setting.

Applying to DbgBench: In Table 8, we report our experience of applying deep learning tools to real-world bug benchmarks. DeepDFA detected 8.7 out of 17 total bugs on average across 3 runs. DeepDFA also correctly predicted non-vulnerable for 3 out of 5 patched programs. On the other hand, neither of the competing LLMs, LineVul and UniXcoder, detected any bugs; in fact, both models reported all programs as non-vulnerable with high confidence. This implies that these models were heavily biased to predict all examples in DbgBench as non-vulnerable. With the addition of DeepDFA, the generalization of DeepDFA+LineVul and DeepDFA+UniXcoder greatly improved, yet they did not perform as well overall as DeepDFA alone. It should be noted that the bugs in DbgBench are very complex and took human experts hours to diagnose [9]. In the past, we have tried a variety of static analysis tools, such as Cppcheck and Polyspace, to detect bugs in DbgBench, but we have not detected any of these bugs.
We believe that DeepDFA generalizes better because it does not rely on spurious features that may exist at the token and text level, such as variable names and function names, as reported by previous research [13]. These spurious features are no longer correlated with vulnerabilities in unseen projects, as their input tokens will likely change. Our abstract dataflow embedding encodes the usage patterns of commonly used API calls, operators, constants, and data types. Such patterns can be extracted from unseen text and are directly related to the cause of the vulnerabilities, and thus might help DeepDFA generalize better over unseen projects.
RQ3 result: DeepDFA had the smallest drop in F1 score (ΔF1) when applied to vulnerabilities in projects that were not seen in the training datasets. DeepDFA was able to detect complex bugs in DbgBench and was able to distinguish the buggy and patched versions. The SOTA models, LineVul and UniXcoder, did not detect any bugs in DbgBench.

Ablation studies
Table 9 shows the model's performance on DbgBench. The model detected the most bugs when using all four features compared to other ablation settings. When using only one feature at a time, the model consistently missed 1-2 bugs which were detected by DeepDFA. When using three features at a time (leaving one out), the model still consistently failed to detect 1 bug which was detected by DeepDFA. Table 10 shows the model's performance on the Big-Vul test dataset. DeepDFA (integrating all four features) performed the best out of all configurations. When testing one feature at a time, datatype by itself performed better than the other 3 features alone. When we used the combined feature sets, the model performed better than using only one feature.

THREATS TO VALIDITY AND DISCUSSIONS
Threats: We evaluated performance primarily on the Big-Vul dataset because this dataset was supported by all the baseline models. Compared to the Devign dataset, Big-Vul is imbalanced and can better reflect a real-world vulnerability detection scenario. However, Big-Vul's data collection process based on bug-fixing commits can introduce label noise and selection bias and, as a result, the evaluation could fail to represent real-world performance. To address the selection bias, we studied settings which reflect more realistic scenarios with reduced training datasets and cross-project generalization; to address the label noise, we evaluated the models on additional real-world bugs collected in DbgBench, which were labeled by developers and manually checked.
The performances reported in RQ1, RQ2, and RQ3 will be affected by the random noise in the model training and, for RQ2, dataset selection. To mitigate this effect, we generated 3 versions of the subsets using different random seeds and reported the mean performance. The mixed-project and cross-project performance reported in RQ3 will be affected by the random selection of projects in the training/held-out datasets. To mitigate this effect, we performed 5-fold cross-validation and reported the mean performance.
Discussions: We believe our approach can be extended to bit-vector dataflow problems [29,45]. All problems in this category contain a finite set of dataflow facts and have the same form of transfer functions and meet operators (see Equations 1 and 2). For example, live variables and available expressions [45] are bit-vector problems that are important for vulnerability detection [12]. We believe a new dataflow analysis can be integrated by: (1) defining an abstract dataflow embedding which can capture the dataflow set of the analysis, (2) configuring the neural network used as the aggregate function in GGNN to better simulate the meet operator (based on whether it is a union or intersection operation), and (3) reversing the CFG edges for backward dataflow problems (as reaching definition is a forward dataflow problem).
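To make the shared structure of bit-vector problems concrete, the sketch below shows a classical worklist solver (not DeepDFA itself) in which the meet operator and the edge direction are the only things that change between analyses. It assumes the standard gen/kill transfer function, OUT[n] = GEN[n] ∪ (IN[n] − KILL[n]), with dataflow sets stored as Python integers used as bit-vectors; the function name, its parameters, and the empty boundary set are illustrative assumptions.

```python
from collections import deque
from functools import reduce


def solve_bitvector(nodes, edges, gen, kill, num_bits, meet="union", backward=False):
    """Worklist solver for gen/kill bit-vector dataflow problems.

    IN and OUT are relative to the analysis direction: for a backward
    problem the CFG edges are reversed first.
    """
    if backward:
        edges = [(dst, src) for src, dst in edges]
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)

    full = (1 << num_bits) - 1
    combine = (lambda a, b: a | b) if meet == "union" else (lambda a, b: a & b)
    in_sets = {n: 0 for n in nodes}
    # Optimistic start: empty for union, the full set for intersection.
    out_sets = {n: (0 if meet == "union" else full) for n in nodes}

    worklist = deque(nodes)
    while worklist:
        n = worklist.popleft()
        # Meet over predecessors; boundary nodes start from the empty set.
        in_sets[n] = reduce(combine, (out_sets[p] for p in preds[n])) if preds[n] else 0
        new_out = gen[n] | (in_sets[n] & ~kill[n])  # gen/kill transfer function
        if new_out != out_sets[n]:
            out_sets[n] = new_out
            worklist.extend(succs[n])  # successors must be recomputed
    return in_sets, out_sets
```

With `meet="union"` and forward edges this computes reaching definitions; with `backward=True`, uses as `gen`, and definitions as `kill`, the same loop computes live variables, which mirrors step (3) above.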

RELATED WORK
Many works have used GNNs for vulnerability detection [6,10,14,19,21,42,51]. In several recent approaches, Devign [56], ReVeal [13], IVDetect [36], and LineVD [28] used GNNs on program graph representations such as AST, CFG, and PDG, and annotated the nodes with unsupervised or pretrained word embeddings. The novelty of our work is a bit-vector-inspired abstract dataflow embedding based on the analogy between graph learning and DFA algorithms.
Transformer models such as CodeBERT [23], LineVul [24], and UniXcoder [26] used a token-based program representation pretrained on a large body of NL-PL pairs, and then fine-tuned for vulnerability detection. Using CFGs, our graph learning only propagates the information along semantically important edges instead of trying to learn the relations of each pair of tokens. Thus, our approach is substantially more efficient. Since we have used a semantic-based embedding, we show that we can improve the performance of token-based models. The most recent work, ContraFlow [15], learns embeddings of def-use paths (an output of dataflow analysis), then predicts vulnerabilities using a transformer model. Our work directly emulates dataflow analysis and does not require an expensive pretraining phase.
There were also models that used sequence and CNN architectures. VulDeePecker [38] used BiLSTM on slices considering data dependencies. SySeVR [37] used BiGRU on slices and added data dependencies. Draper [46] used CNN and Random Forest. However, none of these models integrates dataflow analysis in its algorithm.
Cummins et al. [17] formulated dataflow analyses as supervised learning tasks and applied them to device mapping and algorithm classification; we discuss the differences from our work in depth in Section 3.3. Other relevant work that explores dataflow analysis and deep learning includes: (1) VenkataKeerthy et al. [50] used the output of dataflow analysis, reaching definitions and live variables, to learn flow-aware embeddings; and (2) Bielik et al. [7] and Jeon et al. [32] learned static analysis formulas from a dataset based on a fixed language. None of these works aims to develop a model for vulnerability detection.

CONCLUSIONS AND FUTURE WORK
We propose DeepDFA, an efficient graph learning framework and embedding technique for vulnerability detection. Our abstract dataflow embedding leverages the idea of the bit-vector in dataflow analysis and integrates data usage patterns from semantic features: commonly used API calls, operations, constants, and data types that potentially capture the causes of the vulnerabilities. DeepDFA emulates the Kildall method of dataflow analysis using the analogous message-passing algorithm. Our experimental results show that DeepDFA is very efficient. It was trained in 9 minutes and used only 50 vulnerable examples to achieve its top performance. Yet, it still outperformed all non-transformer baselines and generalized the best among all the models. DeepDFA found bugs in real-world programs from DbgBench while neither of the highest-performing baselines, LineVul and UniXcoder, detected any bugs. Importantly, DeepDFA can be used to improve other models. By combining DeepDFA with the top-performing models, we surpassed the state-of-the-art performance for vulnerability detection.
In the future, we plan to incorporate other dataflow analyses, e.g., live variable analysis, that have been used for vulnerability detection [12]. We also plan to explore the application of explanation tools to precisely pinpoint the vulnerability location at specific lines in the code, and evaluate our framework on detecting vulnerabilities in other programming languages.

C ADDITIONAL EFFECTIVENESS RESULTS
Of the transformer models, we evaluated CodeBERT and LineVul in our experiment for Section 5.3. Table 11 reports the performances of the other models.

Figure 2: Analogy of information propagation in Dataflow Analysis and Graph Learning

• Training time: the wall-clock time to execute one training run with one validation run per epoch.
• Inference time: the average wall-clock time to predict for one example.
• MACs: the average number of Multiply-Accumulate operations to predict for one example; this measures the performance independently of the computing platform [30, 49].
• Parameter count: the number of trainable parameters in the neural network model.
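As a rough illustration of how the last three metrics can be obtained, the sketch below counts parameters and MACs for a hypothetical stack of fully connected layers and times a prediction function per example. It does not model DeepDFA's actual architecture or the tooling the authors used, which this section does not specify; the layer sizes and function names are assumptions.

```python
import time


def dense_stack_cost(layer_sizes):
    """Return (parameter count, MACs per example) for dense layers with biases."""
    params = macs = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        params += fan_in * fan_out + fan_out  # weights plus biases
        macs += fan_in * fan_out  # one multiply-accumulate per weight
    return params, macs


def mean_inference_seconds(predict, examples):
    """Average wall-clock time of `predict` over the given examples."""
    start = time.perf_counter()
    for ex in examples:
        predict(ex)
    return (time.perf_counter() - start) / len(examples)
```

Unlike the timing function, the MAC count depends only on the layer shapes, which is why it is independent of the computing platform.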

Table 3: DeepDFA outperformed the baselines and can be used to further improve the existing model performance. All scores are reported as Mean (Standard deviation). Note that the performance scores of VulDeePecker, SySeVR, Draper, and IVDetect were taken directly from the IVDetect paper [36], so we do not report the variance.
(b) Comparison with transformer models.

Table 4: Results of statistical tests for model comparison.

Table 5: DeepDFA's training/inference time was faster than the baselines. DeepDFA performed much better than all non-transformer baselines. When combined with transformer models, it achieved the highest SOTA score on all metrics.

Table 6: DeepDFA retained its performance on limited data.

Table 8: DeepDFA generalized to real-world bugs in DbgBench. Results are averaged over checkpoints from 3 random seeds. Buggy/Patched columns show the number of correct predictions on buggy/patched programs respectively.

Table 9: Ablation study evaluated on DbgBench

Table 10: Ablation study evaluated on the Big-Vul test dataset

Table 11: Initial trial run of performance on 100% of the Big-Vul dataset.