Combining Structured Static Code Information and Dynamic Symbolic Traces for Software Vulnerability Prediction

Deep learning (DL) has emerged as a viable means for identifying software bugs and vulnerabilities. The success of DL relies on having a suitable representation of the problem domain. However, existing DL-based solutions for learning program representations have limitations: they either cannot capture deep, precise program semantics or suffer from poor scalability. We present Concoction, the first DL system to learn program representations by combining static source code information and dynamic program execution traces. Concoction employs unsupervised active learning techniques to determine a subset of important paths on which to collect dynamic symbolic execution traces. By implementing a focused symbolic execution solution, Concoction brings together the benefits of static and dynamic code features while reducing the expensive symbolic execution overhead. We integrate Concoction with fuzzing techniques to detect function-level code vulnerabilities in C programs from 20 open-source projects. In 200 hours of automated concurrent test runs, Concoction successfully uncovered vulnerabilities in all tested projects, identifying 54 unique vulnerabilities and yielding 37 new, unique CVE IDs. Concoction also significantly outperforms 16 prior methods, providing higher accuracy and lower false-positive rates.


INTRODUCTION
Despite significant efforts to enhance software reliability, software vulnerabilities remain a primary concern in modern software development [40,58]. In recent years, deep learning (DL) techniques have emerged as a powerful method for constructing sophisticated tools and models to identify common software bugs and vulnerabilities [21,57,59,60,63,80,84]. By training a predictive model on large volumes of training data, DL models can learn the latent code patterns indicative of vulnerable programs. Once trained, these models can be applied to new, previously unseen programs to identify potentially buggy code [8,58].
The success of a supervised learning model depends heavily on having a suitable representation of the problem domain, one that encodes the essential information needed for the task at hand, such as vulnerability detection in our case [72]. In the context of DL-based code modeling, this requires constructing numerical vectors, or embeddings, that capture the important characteristics of the program source code or binary.
The vast majority of DL-based code modeling techniques rely on deep neural networks (DNNs) to learn program representations from static code, such as source code text [57,59,60], abstract syntax trees (ASTs) [54,75], program data and control flow graphs (PDCGs) [21,63,88], or a combination of these [80]. While static code information can capture all possible program execution paths, it can suffer from complex and ambiguous information caused by redundant statements, complex data structures, and extensive execution paths in the source code. As in classical compiler analysis, this can lead to over-conservative decisions, a high false-positive rate, and a low true-positive rate [15] for automatic bug detection.
More recent approaches, like LIGER [81], attempt to use dynamic execution traces to learn program representations. These approaches utilize execution statements seen during profiling to represent static program information and track changes in program variables to capture dynamic program behavior. By considering dynamic execution paths, symbolic traces provide precise information about dynamic program behavior and reduce false-positive rates in code analysis. While promising, prior approaches have two limitations. First, they rely solely on the code statements observed during execution for static code representation. This can suffer from poor coverage and overlooks the structured data flow and dependence information available in the static program graph; this comprehensive data and control flow information, which encompasses all possible execution paths, is crucial for vulnerability detection. Second, they employ random sampling for dynamic tracing, which makes it challenging to apply dynamic tracing methods to real-world software projects due to the expensive overhead of symbolic execution [18].
In this paper, we ask: "what if we could bring the best of static and dynamic code information together in a single DL framework for code vulnerability detection?". In response, we develop Concoction, a new DL system that combines static and dynamic code representations to detect software bugs and vulnerabilities at the source code level. Specifically, static code information, such as the PDCG (Program Data and Control-flow dependence Graph), offers a high-level view of all possible program behaviors and data flows, which can mitigate the coverage issue of symbolic execution. On the other hand, symbolic execution traces on a small set of carefully selected execution paths provide more precise information and deeper program semantics to disambiguate the static code information. By integrating these two types of information, we avoid the computational overhead of running symbolic execution on every possible execution path while still leveraging the deep program semantics provided by dynamic executions.
Concoction utilizes a Transformer-based DNN architecture [78] to leverage static and dynamic code information. The model comprises a representation component and a detection component. The representation component maps static and dynamic program information into a joint embedding (or feature vector) for program representation. The detection component takes the program representation as input and predicts whether the input code contains a vulnerability. One of the key features of Concoction is its ability to minimize the overhead of dynamic tracing. This is achieved by employing a path selection component during deployment to determine which execution paths from the PDCG should be chosen for collecting symbolic execution traces. These traces are then fed into the representation component to generate dynamic embeddings, which are combined with the static embeddings and passed to the detection model for vulnerability prediction.
To overcome the challenge of limited numbers of labeled code samples, Concoction combines supervised and unsupervised learning techniques. We first leverage unsupervised contrastive learning to pre-train the representation network. For this purpose, we employ the language masking method [76] and train the representation model on a dataset of 100K unlabeled C functions sourced from GitHub and open datasets. To generate additional training samples, we introduce a dropout-based contrastive learning component [34]. The contrastive loss function encourages the model to better understand code semantics by mapping similar samples close together and pushing apart samples with different semantics in the embedding space. Furthermore, we extend unsupervised learning to train the path selection component using the same unlabeled training dataset.
Once the representation component is trained, we remove the contrastive learning network and combine the representation layers with the detection component, creating an end-to-end model. This model takes joint embeddings of static and dynamic information as input and predicts whether the input code contains a bug. To train the end-to-end model, we use the learned weights of the representation model to initialize the corresponding layers and fine-tune the entire architecture on a dataset of 14K labeled code samples obtained from public datasets, including the Software Assurance Reference Dataset (SARD) [64] and Common Vulnerabilities and Exposures (CVE) [2].
We have implemented a working prototype of Concoction. Our implementation utilizes KLEE [17] to generate the symbolic execution traces. We demonstrate the benefits of Concoction by applying it to C programs to detect function-level vulnerabilities from source code. We further integrate Concoction with fuzzing techniques [12] to automatically generate bug-exposing test cases when a function is predicted to contain a code vulnerability, aiming to minimize the effort of manual examination.
We evaluate Concoction by applying it to 20 diverse, real-life open-source projects that are not present in our training dataset. We compare Concoction against 16 prior methods, including eleven state-of-the-art learning-based methods [21,23,33,39,43,57,60,63,80,81,88], two symbolic execution engines [16,17], two static analysis tools [1,4], and a fuzzing tool [31] for identifying security flaws. Experimental results show that Concoction consistently outperforms the 16 competing methods across evaluation settings by discovering more code vulnerabilities with a lower false-positive rate. In less than 200 hours of automated concurrent testing runs, Concoction has uncovered vulnerabilities in all tested projects and successfully identified 54 software vulnerabilities, with 37 new CVE IDs assigned.
This paper makes the following contributions:

MOTIVATION
Static code analysis techniques for bug detection can suffer from false positives (incorrectly flagging a bug). For example, Figure 1 shows a function b() that will never be invoked during execution because the static array a is initialized to zero by definition, so the condition at line 9 always evaluates to false. However, modern compilers such as GCC and LLVM cannot detect this dead code because pointer aliasing analysis is complex and imprecise. Similarly, DL-based bug detection approaches [21,39,57,59,60,63,80,88] that rely on static code features predict that this example contains a "division-by-zero" vulnerability at line 4, which is a false positive.
Avoiding the false positive (FP) in Figure 1 would require capturing the program semantics spanning variables, function calls, and structured data. An approach that uses only static information may not accurately trace the data flow across function calls and data structures. Likewise, DL-based solutions built on static code information, such as the AST or PDCG, suffer from the same issue.
Can we do better by combining static and dynamic information? This is the insight shared in this paper. For this example, by inlining the callee function b() into test(), we can infer from symbolic execution traces that b() will not be executed due to the values in array a. The static PDCG further reveals that a is invariant across all possible execution paths. Combining static and dynamic information, we can observe that this branch is never taken, making a "division-by-zero" error impossible.
A natural question is: "why not just rely on symbolic execution?". In an ideal world where computational resources and symbolic execution overhead were not an issue, a symbolic execution engine would be able to identify the vulnerability of this example through exhaustive exploration. However, this is often infeasible because exhaustively trying all possible execution paths is prohibitively expensive. This example highlights the need to leverage both static and dynamic program information for code vulnerability detection. Concoction is designed to offer this capability.

OUR APPROACH
Concoction is a DL framework for detecting software vulnerabilities in source code. In this work, we apply Concoction to identify bugs at the function level in C programs. Specifically, our work focuses on detecting bugs and security flaws defined in the CWE database [3]. In practice, Concoction can be integrated into an automated build system like Jenkins [5] to run the vulnerability detection process in the background when a new merge request is submitted. As these build systems often run overnight on a dedicated backend server, they do not affect standard development activities, and the overhead of Concoction should be acceptable to many developers.
The key technical contributions of Concoction include: (1) combining structured static code information and symbolic execution traces to learn program representations (Sec. 3.2.1 and 3.2.2), and (2) a learnable path selection component to reduce symbolic execution overhead (Sec. 3.4). Concoction builds upon prior foundations in enhanced ASTs (Sec. 3.2.1), Transformer-based neural architectures, and contrastive learning (Sec. 3.3).

Pre-processing.
Concoction uses an LLVM compiler plugin [50] to partition the project code into individual functions by inlining callee functions, relevant data structures, and global variables.

Prediction.
Concoction extracts two types of information for each target function: (1) the AST and PDCG from the static source code, and (2) the symbolic execution traces of selected execution paths, obtained using a symbolic execution engine [17]. It employs a path selection component (Sec. 3.4) to identify critical paths on which to collect dynamic symbolic execution traces. The static and dynamic information produces static and dynamic embeddings through dedicated representation networks, which are then concatenated to create a joint representation used by the detection model for prediction.

Test case generation.
When the model detects a potential vulnerability in the input function, it invokes a fuzzing engine to identify and expose the weakness by generating randomized test inputs for the function. As we only employ fuzzing for functions suspected to contain vulnerabilities, the fuzzing overhead is manageable, taking less than 12 hours for all fuzzed functions within a project.

Program representation.
Our representation component uses two Transformer-based networks to map the input source code and symbolic execution traces into numerical embedding vectors. A dense layer then concatenates the embeddings generated by the two networks into a joint output vector. We set the length of the static and dynamic embedding vectors to 100 dimensions each, leading to a joint embedding vector of 200 dimensions. As in prior work [37], using a larger dimension does not yield better performance in our setting but may increase the training overhead.

The Concoction Architecture
Vulnerability detection. The detection component is a multi-layer perceptron (MLP) network that takes the joint embedding as input to predict a vulnerability. It comprises a fully connected layer, a dropout layer with a rate of 0.1, and a sigmoid layer. Our current implementation only predicts whether a function may contain a vulnerability; it does not identify the type of vulnerability.
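To make the data flow concrete, here is a minimal sketch of the joint-embedding concatenation and the MLP prediction head described above. The toy weights, pure-Python arithmetic, and function names are illustrative assumptions rather than the actual implementation; dropout is omitted because it is disabled at inference time.

```python
import math
import random

random.seed(0)

EMB_DIM = 100  # per-modality embedding length used in the paper

def joint_embedding(static_emb, dynamic_emb):
    """Concatenate the static and dynamic embeddings into a 200-d joint vector."""
    return static_emb + dynamic_emb  # list concatenation

def mlp_detect(joint, weights, bias):
    """Single fully connected layer followed by a sigmoid, yielding the
    probability that the input function contains a vulnerability."""
    z = sum(w * x for w, x in zip(weights, joint)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Stand-ins for the outputs of the two representation networks.
static_emb = [random.uniform(-1, 1) for _ in range(EMB_DIM)]
dynamic_emb = [random.uniform(-1, 1) for _ in range(EMB_DIM)]
joint = joint_embedding(static_emb, dynamic_emb)

weights = [random.uniform(-0.1, 0.1) for _ in range(2 * EMB_DIM)]
prob = mlp_detect(joint, weights, 0.0)  # probability in (0, 1)
```

A real deployment would replace the random vectors with the outputs of the trained Transformer encoders and learn the weights by backpropagation.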

Extracting static code information.
We use a parser built upon the Language Server Protocol (LSP) [38] to rewrite variable names with a consistent naming scheme. This step normalizes syntactic variations across programs. Next, we construct an enhanced AST using the LSP, which contains the standard syntax nodes, i.e., non-terminals in the language grammar such as an AST node for an if statement or a function declaration, and syntax tokens, i.e., terminals such as identifier names and constant values. Following [9], we also

Extracting dynamic information.
We use KLEE [17] to obtain symbolic execution traces. To pre-train the representation model (Sec. 3.3), we generate execution traces using a non-uniform random search heuristic to explore different execution paths. During deployment, symbolic traces are generated only on the selected paths, instead of via random search, to minimize the overhead of symbolic execution (Sec. 3.4). We terminate symbolic execution after a configurable time limit (4 hours when collecting training data and 5 minutes when using the trained model). We then combine the different symbolic inputs and their corresponding reachable program paths into a sequence of execution traces to be fed into the dynamic embedding network (Figure 3).
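The two trace-collection modes above could be driven by KLEE invocations along these lines. The file names are placeholders, the NURS searcher is one of KLEE's built-in search heuristics, and the exact flag spellings (e.g., the `--max-time` value format) vary across KLEE versions, so this is a sketch rather than the project's actual harness.

```shell
# Compile the target function to LLVM bitcode for KLEE.
clang -emit-llvm -c -g -O0 target.c -o target.bc

# Training-time trace collection: non-uniform random search (NURS)
# with a 4-hour budget (14400 seconds).
klee --search=nurs:depth --max-time=14400 target.bc

# Deployment: a 5-minute budget per function on the selected paths.
klee --max-time=300 target.bc
```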

Contrastive Pre-training
We use a bidirectional Transformer network [39] to learn static and dynamic program embeddings. We pre-train the static and dynamic embedding models separately on the same unlabeled dataset. Our pre-training dataset contains 100K C code snippets collected from GitHub and SARD, shown in Table 1. After training, we use the output of the last hidden layer of each embedding network as the static or dynamic embedding vector. We employ contrastive learning to increase the dataset size and enhance the model's robustness.

Model inputs.
We pair the source code text with the flattened, enhanced AST sequence and feed them to the static embedding network. The AST sequence is generated by traversing the AST in a breadth-first manner. During training, the static embedding model predicts masked tokens from either the source code or the AST's data and control flow relations to generate contextual representations. Likewise, the dynamic embedding network takes symbolic execution traces and maps them to an embedding vector, capturing the temporal dependencies and runtime behavior of the program.
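The breadth-first flattening can be illustrated on a toy AST. The node labels and the dictionary encoding below are invented for illustration and are not Concoction's actual AST format.

```python
from collections import deque

def flatten_ast(root):
    """Flatten an AST into a token sequence via breadth-first traversal."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node["label"])
        queue.extend(node.get("children", []))
    return order

# Toy AST for: if (x > 0) return x;
ast = {"label": "IfStmt", "children": [
    {"label": "BinaryOp:>", "children": [
        {"label": "Ident:x"}, {"label": "Const:0"}]},
    {"label": "ReturnStmt", "children": [{"label": "Ident:x"}]},
]}

tokens = flatten_ast(ast)
# BFS visits the tree level by level:
# ['IfStmt', 'BinaryOp:>', 'ReturnStmt', 'Ident:x', 'Const:0', 'Ident:x']
```

The resulting token sequence is what a Transformer-style encoder would consume alongside the raw source text.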

Training methodology.
To enhance the model's robustness and generalization ability, we employ dropout-based contrastive learning [34]. Our approach includes dense, dropout, and pooling layers, as depicted in Figure 4. Specifically, we attach the contrastive learning component individually to the static and dynamic representation networks, train each combined network separately, and then remove the contrastive learning component. Contrastive learning increases the training dataset size by adding noise to the data. Concoction randomly disables neurons in the representation network, generating various dropout variants as shown in Figure 4. Specifically, it passes a code sample input x through the representation network twice with different dropout probabilities, resulting in two different embeddings that form a "positive pair" for x. Another sample is paired with x to create a "negative pair". The contrastive learning component is then used to distinguish positive samples from negatives in a training mini-batch and compute the loss. The training process minimizes the standard Noise Contrastive Estimation (NCE) loss function [47] to maximize the agreement between semantically similar pairs.
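A minimal sketch of the dropout-based contrastive idea: the same input is passed through a (toy) encoder twice with different dropout probabilities to form a positive pair, and an NCE-style loss scores the positive pair against negatives in the mini-batch. The stand-in encoder, temperature, and dimensions are assumptions for illustration, not Concoction's actual network.

```python
import math
import random

random.seed(1)

def encode_with_dropout(vec, p):
    """Toy 'encoder': randomly zero out elements, mimicking a dropout mask."""
    return [0.0 if random.random() < p else v for v in vec]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1e-9
    nb = math.sqrt(sum(x * x for x in b)) or 1e-9
    return dot / (na * nb)

def info_nce(anchor, positive, negatives, tau=0.07):
    """NCE-style loss: negative log-softmax similarity of the positive pair."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / tau) for s in sims]
    return -math.log(exps[0] / sum(exps))

sample = [random.uniform(-1, 1) for _ in range(32)]
view_a = encode_with_dropout(sample, 0.1)   # first pass, one dropout draw
view_b = encode_with_dropout(sample, 0.2)   # second pass, a different draw
negs = [[random.uniform(-1, 1) for _ in range(32)] for _ in range(7)]

loss = info_nce(view_a, view_b, negs)  # small: the two views agree
```

Minimizing this loss pulls the two dropout views of the same code sample together while pushing away the embeddings of unrelated samples.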

Training the end-to-end detection model.
After training the static and dynamic embedding networks, we attach them to a dense layer to create joint embeddings for the detection network. This forms the final end-to-end architecture shown in the right part of Figure 3. The joint representation component serves as the encoder, and we initialize its weights with those obtained during pre-training. We train the end-to-end network on labeled data samples, consisting of 13,768 C code snippets from the CVE and SARD datasets. Each sample has a two-dimensional one-hot label indicating whether it contains a vulnerability. Vulnerable code samples are collected from open-source projects using the assigned CVE or SARD IDs, whereas benign samples are obtained from the patched versions of the same projects. For each training sample, we generate an enhanced AST and randomly sampled symbolic traces (Sec. 3.2.1 and 3.2.2). We use the pre-trained representation component to generate the joint embedding as the program representation, which becomes the detection model's input. Our end-to-end model is trained to optimize the cross-entropy classification loss.
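The one-hot labeling and cross-entropy objective described above can be illustrated as follows; the probability values are toy numbers, not model outputs.

```python
import math

def cross_entropy(one_hot, probs):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(y * math.log(max(p, 1e-12)) for y, p in zip(one_hot, probs))

vulnerable = [1.0, 0.0]   # two-dimensional one-hot: [vulnerable, benign]
good_pred  = [0.9, 0.1]   # confident, correct prediction -> low loss
bad_pred   = [0.2, 0.8]   # confident, wrong prediction   -> high loss

low_loss = cross_entropy(vulnerable, good_pred)
high_loss = cross_entropy(vulnerable, bad_pred)
```

Training drives the detection head toward the low-loss case on the labeled CVE and SARD samples.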

Path Selection for Symbolic Execution
After training the end-to-end model, we use the path selection component to choose significant paths for symbolic execution during deployment. This differs from the execution traces collected during training, where we use a random sampling scheme to increase the training data size, as symbolic execution overhead is less of an issue for offline model training.

Overview of path selection.
Figure 5 shows the workflow of choosing the most important paths from all possible execution paths of the target function. Like ContraFlow [23], we use unsupervised active learning to identify the most representative paths for encoding programs [51], so that the number of paths used for code embedding is reduced while important program semantics are well preserved. Unlike ContraFlow, our goal is to select paths on which to collect symbolic execution traces. As we will show in Sec. 5.2 and 5.3, our approach outperforms ContraFlow.
Our goal. Given n execution paths P = [ph_1, ..., ph_n] collected from the PDCG of the test sample, where each ph_i is an embedding vector generated by our static representation model, our goal is to choose a subset of important paths on which to collect symbolic execution traces. The objective is to choose representative data points that can reconstruct most of the input PDCG information while preserving the unique features of the sample. In this work, we use the K-means clustering algorithm to model the path features by grouping data points into clusters in the embedding space. The path selection component is trained to find a sample subset that captures the input patterns and preserves the cluster structure of the data.
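The clustering view of path selection can be sketched with a tiny pure-Python K-means over toy 2-D path embeddings (the real embeddings are 100-dimensional vectors from the static representation model); the data points and the first-k initialization scheme are illustrative assumptions.

```python
import math

def kmeans(points, k, iters=50):
    """Plain K-means; centroids are initialized from the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each path embedding to its nearest centroid.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

# Toy path embeddings: two well-separated groups of execution paths.
paths = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.1), (5.0, 5.1), (5.2, 4.9)]
assign, centroids = kmeans(paths, k=2)
```

Paths closest to each centroid are natural representatives of their cluster, which is the intuition the trained selection component exploits.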

Execution path representation.
For each test sample (function), we extract static execution paths from the PDCG. For each static execution path, we first eliminate irrelevant code (and AST nodes) and then feed the resulting code segments to the trained static representation model (fine-tuned when training the end-to-end model) to generate embeddings. Repeating this process for every static path yields a matrix X that serves as the input to the path selection component, where each matrix element is an embedding vector produced by our static representation model. Since path collection involves traversing the PDCG without executing the program, the overhead is negligible.

Network structure.
As our path selection component uses unsupervised learning (we have no labels indicating whether a path is important), using an encoder-decoder network to map the input into a latent space for path selection is a natural choice. We add a selection block between the encoder and decoder to select samples (or paths) from all input paths to be passed to the decoder. As the selection block is differentiable, the selection network is trainable.
Selection block. The selection block comprises two branches, each consisting of a single fully connected layer without bias or nonlinear activation functions. The first branch aims to identify a subset of paths that can effectively approximate all paths in the input matrix X. The second branch first clusters the data in the latent space and then selects a sample subset approximating the resulting cluster centroids.

Unsupervised active learning.
For all input paths of a test sample, our approach constructs two coefficient matrices, Q and P, where each matrix element is a d-dimensional embedding vector produced by the encoder. The matrix Q is constructed by the first branch of the selection block to approximate the input matrix X, while the matrix P, constructed by the second branch, aims to preserve the cluster structure obtained when applying K-means to the latent space learned by the encoder.
Training objectives. Our active learning process refines the matrices to minimize the difference between (1) the input matrix X and its reconstruction, (2) the cluster centroid matrix C, obtained by applying K-means clustering in the latent space, and its reconstruction, and (3) the input and the decoder outputs.
Loss function. Our overall loss function is defined as ℓ = ℓ_1 + αℓ_2 + βℓ_3, where α and β are tradeoff parameters. The terms of the loss function are as follows. Eq. 1 (ℓ_1) approximates the input patterns; the value of p is either 2 (l2-norm) or F (Frobenius norm) for the norm ||·||_p, which is used to measure the informativeness of each feature, f(X) is a nonlinear transformation that maps the input paths X to a new latent representation, and a tradeoff parameter controls the balance between the reconstruction loss and the regularization term. Eq. 2 (ℓ_2) minimizes the cluster-centroid reconstruction loss, corresponding to objective (2), weighted by the tradeoff parameter α. Finally, Eq. 3 (ℓ_3) is the reconstruction loss of the encoder-decoder model, corresponding to the third objective, where g(f(X)) denotes the decoder's output for a given input f(X).
Training process. We iteratively train the selection component on the unlabeled Concoction training data. First, we pre-train the encoder and decoder without the selection block. We then perform K-means on the encoder output and use the resulting K cluster centroids as the centroid matrix C for subsequent sample selection. The number of clusters (K) is determined automatically using the Bayesian information criterion (BIC) [62]. Finally, we use the pre-trained parameters to initialize the encoder and decoder, batch all the data, and minimize the overall loss function using the Adam optimizer with a learning rate of 0.001.

Path selection during deployment.
Once the selection component is trained on the Concoction training data, it yields the two reconstruction coefficient matrices Q and P. To use the selection component, we normalize the columns of Q and P using the l2-norm and scale the values to [0, 1]. This produces two ranking vectors q, p ∈ R^n, which we merge and sort in descending order to identify the top-ranked paths. The fraction of paths to keep can be flexibly set by the user. In this work, we set it to 30%, which is sufficient to cover the important paths of vulnerable functions in our training dataset. If the target code has fewer than 10 paths, we consider all execution paths, as the symbolic execution overhead is then small.
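The ranking step can be sketched as follows: the columns of the two coefficient matrices are l2-normalized into per-path scores, which are merged and sorted to keep the top fraction. The matrix names mirror the description above, but the values, the max-based merge policy, and the matrix shapes are illustrative assumptions.

```python
import math

def column_scores(matrix):
    """Score each path (column) by the l2-norm of its coefficient column,
    then scale the scores into [0, 1]."""
    cols = list(zip(*matrix))
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    top = max(norms) or 1e-9
    return [n / top for n in norms]

def select_paths(Q, P, fraction=0.30):
    """Merge the two ranking vectors and keep the top-ranked fraction."""
    q, p = column_scores(Q), column_scores(P)
    merged = [max(a, b) for a, b in zip(q, p)]  # one possible merge policy
    k = max(1, round(fraction * len(merged)))
    return sorted(range(len(merged)), key=lambda i: merged[i], reverse=True)[:k]

# Toy coefficient matrices for 10 candidate paths (one column per path);
# path 2 dominates Q's reconstruction and path 7 dominates P's clustering.
Q = [[0.9 if j == 2 else 0.1 for j in range(10)],
     [0.8 if j == 2 else 0.1 for j in range(10)]]
P = [[0.9 if j == 7 else 0.1 for j in range(10)],
     [0.7 if j == 7 else 0.1 for j in range(10)]]

chosen = select_paths(Q, P)  # indices of the selected execution paths
```

With a 30% cutoff over 10 candidates, three paths survive, and the two paths with dominant coefficients are guaranteed to be among them.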

Symbolic execution for chosen paths.
We extend KLEE [17] to cover the selected paths of the target function. To do so, we use a compiler-based pass to insert a callback function into the target program that guides KLEE to skip paths not selected by our path selection component. By pruning unwanted execution paths early, we can manage the overhead of symbolic execution effectively. We terminate symbolic execution after a configurable threshold (5 minutes in this work) during evaluation. The symbolic inputs and corresponding reachable code paths produce execution traces that we pass to the dynamic representation model to generate dynamic embeddings. Note that our approach guarantees that symbolic execution traces are always generated for the test function.

Targeted fuzzing for test case generation.
When our detection model identifies a potentially buggy function that symbolic execution fails to expose, we use AFL++ [31] (an AFL extension) to fuzz the functions predicted to be vulnerable. The idea is to generate bug-exposing test cases that help developers analyze and verify the identified vulnerabilities. In this work, we use the AFL++ partial instrumentation mode to steer AFL++ towards the functions Concoction predicts to be vulnerable. To this end, we provide AFL++ with the project's source code, build scripts, the vulnerable functions identified by Concoction, and seed program inputs to be mutated using the default AFL++ configuration. We then ask AFL++ to instrument the compiled assembly code and mutate the seed test inputs to cover the target functions. Fuzzing terminates when AFL++ detects a program crash or exceeds a 12-hour runtime (which may involve fuzzing multiple functions together). We then manually verify the issue and report it to the developers with the information needed to reproduce it. If AFL++ does not trigger any crash, we manually examine the predicted function to check for a vulnerability and file an issue report for each confirmed bug.
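The targeting described above can be approximated with AFL++'s partial-instrumentation allowlist. The file names, the target function name, and the timeout below are illustrative placeholders, and the environment variable and flags should be checked against the installed AFL++ version; this is a configuration sketch, not the paper's actual harness.

```shell
# Allowlist restricting instrumentation to the suspected function
# (AFL++ partial instrumentation; the entry is illustrative).
cat > allowlist.txt <<'EOF'
fun: parse_header
EOF

# Build with afl-clang-fast, instrumenting only allowlisted locations.
export AFL_LLVM_ALLOWLIST="$PWD/allowlist.txt"
afl-clang-fast -g -o target_bin target.c

# Fuzz with seed inputs; -V bounds the total run time in seconds (12 h here).
afl-fuzz -i seeds/ -o findings/ -V 43200 -- ./target_bin @@
```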

Workloads
Open-source projects. We applied Concoction to 20 open-source projects from various domains. These projects, listed in Table 3, were chosen because they are widely used and have been used in related work. We stress that none of these projects was used to train Concoction, and we tested the latest version of each project at the time of testing. We also note that the bugs discovered by Concoction were previously unreported at the time of testing; hence, there were no data leakage issues.
Open datasets. In Sec. 5.2 and 5.3, we compare Concoction with prior work on three datasets used by that work for evaluation purposes. In Sec. 5.2.1 and 5.2.2, we use samples from the SARD [64] and CVE datasets (see Table 1), respectively. In Sec. 5.3, we evaluate Concoction on three open-source projects with known CVEs (see Table 2). We use three-fold cross-validation to evaluate all approaches on these datasets, where samples are split at the project level and samples of a test project are excluded from the training dataset.
Data collection and workload characteristics. Our program representation model was trained and tested on a dataset of over 100K functions from SARD and 9 large C-language open-source projects (Table 1). Four security researchers spent 600 person-hours on manual labeling and cross-verification to collect these samples. Additionally, we spent 200+ machine hours extracting dynamic and static information using KLEE. Figure 6 shows the cumulative distribution functions (CDFs) of the number of lines and execution paths in the test samples in Tables 1 and 3. The SARD dataset consists mainly of short functions, where over 50% have fewer than 40 lines of code and 4 paths, leading to high detection accuracy for the baseline methods. However, the functions from the CVE dataset and open-source projects are much larger, with over 50% containing ≥ 400 lines of code (up to 10K) and 128 paths (up to 12K). This increased complexity reduces the accuracy and recall of our baselines compared to SARD.

Competing Baselines
We evaluate Concoction by comparing it with 16 prior methods. These include (1) eleven state-of-the-art DL-based models, (2) two symbolic execution tools, (3) one fuzzing tool that our approach relies on, and (4) two static analysis tools. Before running the DL baselines on the same datasets, we ensured that our evaluation setup achieves results comparable to those reported in their source publications, for a fair comparison.
DL models based on static information. We compared Concoction against eleven DL models that use static code information. These include Vuldeepecker [57], which utilizes a BiLSTM architecture, as well as Funded [80], Devign [88], ReVeal [21], and ReGVD [63], which employ variants of graph neural networks to learn program representations. We also compared against LineVul [33], LineVD [43], CodeXGLUE [60], and GraphCodeBERT [39], which use the Transformer architecture, and ContraFlow [23], which utilizes contrastive learning to represent the code, followed by an LSTM architecture to identify vulnerabilities.
DL models based on dynamic information. LIGER [81] is a closely related work that learns program representations from symbolic execution traces. However, unlike our approach, LIGER uses a random sampling method for collecting symbolic traces and does not utilize the structured data flow and dependence information available in the static source code. Additionally, it employs a multi-tier recurrent neural network (RNN) architecture, whereas Concoction uses the Transformer architecture. For our evaluation, we used the open-source implementation of LIGER, adapting it for vulnerability detection and training it on the same datasets as Concoction.
Static tools. We compare Concoction with two representative static analysis tools, CodeQL [1] and Infer [4], using the tools' default, recommended configurations.

Symbolic execution engines.
We also compare Concoction with two state-of-the-art symbolic execution tools: KLEE [17] and MoKLEE [16]. The latter is designed to reduce the overhead of symbolic execution by reusing the results of previously explored paths while continuing to explore new ones.

Evaluation Methodology
We applied Concoction to 20 open-source projects and 14K function-level source code samples (vulnerable and benign). Our evaluation is designed to answer the following research questions:
RQ1: Does combining static and dynamic information help detect code vulnerabilities in real-life open-source projects (Sec. 5.1)?
RQ2: How does Concoction compare with prior approaches in detecting function-level vulnerabilities (Sec. 5.2 and 5.3)?
RQ3: How do individual components of Concoction contribute to its overall performance (Sec. 5.4)?
Evaluation metrics. We consider four higher-is-better statistical metrics: accuracy, precision, recall and the F1 score. Accuracy is computed as the ratio of correctly labeled cases to the total number of test cases. Precision is the ratio of correctly predicted samples to the total number of samples predicted to have the same label. It answers the question, "Out of all the samples predicted to contain a vulnerability, how many are correct?" High precision indicates a low false-positive rate, meaning that few of the samples predicted to be vulnerable are in fact bug-free. Recall is the ratio of correctly predicted samples to the total number of test samples belonging to a class. It answers the question, "Of all the vulnerable test samples, how many are actually predicted to be vulnerable?" High recall indicates a low false-negative rate. Finally, the F1 score is the harmonic mean of precision and recall, calculated as 2 × (Precision × Recall) / (Precision + Recall). It is useful when the test data have an uneven label distribution.
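For concreteness, the four metrics can be computed from binary predictions as follows (an illustrative sketch, not the paper's evaluation code; label 1 marks a vulnerable sample):

```python
def binary_metrics(y_true, y_pred):
    # Count the four outcome types for the "vulnerable" (positive) class
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high, which is why it is informative under an uneven label distribution.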

Detect Vulnerabilities in Large-scale Testing
This subsection quantifies Concoction's ability to detect function-level code vulnerabilities in the 20 projects listed in Table 3. For ethical considerations, we first contacted the developers through a private email for vulnerabilities that are likely exploitable (including all those with a CVE ID assigned), and followed their advice. Table 3 reports the distribution of our submitted vulnerability reports across the tested projects. In total, we have submitted 54 reports, of which 53 were confirmed by developers. At the time of submission, 27 vulnerabilities had been fixed, with 37 new, unique CVE IDs assigned and 17 CVE applications pending.

Vulnerability types. Table 4 categorizes the vulnerabilities found by Concoction. The top three security flaw-related categories are presented here. The "other types" category includes six types of vulnerabilities: 'allocation-size-too-big', 'out-of-memory', 'use-after-free', 'memcpy-param-overlap', 'illegal-memory-access', and 'DEADLYSIGNAL'. Of all the detected vulnerabilities, 62.3% are buffer-overflow related, covering both heap- and stack-buffer-overflow. Concoction's ability to detect buffer-overflow vulnerabilities comes from its capability to reason about input value ranges by combining static code structures and carefully selected symbolic traces. During testing, Concoction discovered six vulnerabilities (11.3%) related to SEGV (segmentation violation), which were later confirmed by developers. Concoction identifies SEGV by inferring and verifying bounds on the variable values of array references. Additionally, Concoction submitted four vulnerabilities (7.5%) related to memory leaks. The combination of static and dynamic code features enables Concoction to infer this type of vulnerability by correlating the allocated and released memory buffer sizes within the test function. Listing 1 shows an example of a heap-buffer-overflow vulnerability detected by Concoction, caused by an incomplete bounds-checking pattern. Meanwhile, Listing 2 presents a SEGV example uncovered by Concoction.

Examples of Concoction-detected vulnerabilities.
We present several examples of Concoction-detected vulnerabilities, covering three types of memory-related security flaws. As DL models generally work as a black box [20], to understand Concoction's workings and each vulnerability's root cause, we compare the original buggy code with the developer-generated patch after reporting the issue.

Heap buffer overflow. Listing 1 shows CVE-2022-26181, a heap-buffer-overflow vulnerability identified by Concoction. The vulnerability stems from the value of the data variable. By making the data variable symbolic, Concoction allows the DL model to infer that when data is not null, the execution trace consistently reaches line 6. Concoction successfully identifies the incomplete bounds checking, which may overlook certain non-compliant inputs.

Segmentation violation.
Listing 2 shows a segmentation-violation vulnerability resulting from an out-of-bounds read. The issue arises when the value of dctx->dcmpri.len exceeds dctx->dcmpri.f->len, causing a reference to a memory location beyond the allocated buffer boundary. Concoction could not have discovered this vulnerability without the symbolic trace input.

Memory leak.
Listing 3 shows CVE-2021-3574, a memory leak vulnerability discovered by Concoction. This issue can be detected by inspecting the execution trace from the memory allocation size of samples_per_pixel to the memory-free size of MaxPixelChannels. Concoction detects this vulnerability by learning to compare the sizes of these two variables, because memory leaks are typically caused by allocating more memory than required and subsequently freeing less.
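The intuition behind this comparison can be sketched as a simple heuristic over trace events (illustrative only; Concoction learns this pattern with a DNN rather than applying a hand-written rule):

```python
# Flag a path as a leak suspect when allocation events on its trace are not
# matched by frees of equal total size.
def leak_suspect(trace):
    # trace: list of ("alloc" | "free", size_in_bytes) events from one execution path
    allocated = sum(size for op, size in trace if op == "alloc")
    freed = sum(size for op, size in trace if op == "free")
    return allocated > freed  # more bytes allocated than released
```

A learned model generalizes beyond this rule, e.g. to size expressions that only differ on some paths, but the underlying signal is the same allocated-versus-freed comparison.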

Importance of bugs found.
It is difficult to assess the importance of the bugs we found. Still, we found some evidence of their importance: (1) some of the bugs we found were also reported by other application users later, indicating that the issues we identified are relevant and have occurred in real-world use cases; (2) the fact that most of our newly reported issues were confirmed and fixed by the developers demonstrates their importance; (3) developers promptly welcomed and resolved 14 of our reported issues within 48 hours, further underlining the importance of these issues.

The projects in the CVE dataset (Table 1) are more complex than the SARD dataset. As such, it is more challenging to achieve good performance. However, Concoction outperforms all other methods across all evaluation metrics, as shown in Figure 8. Thanks to the carefully selected execution traces, Concoction can track changes in program states and variables (shown in the examples in Sec. 5.1). This information enhances precision by reducing the false positive rate and helps discover more vulnerabilities with a higher true positive rate than static information alone, resulting in a higher recall. Among the baseline methods, LIGER performs best, but its F1 score is 10.4% lower than that of Concoction. This shows that Concoction strikes a better balance between false positives and false negatives by leveraging the advantages of structured static source code information.

Comparison on Known CVEs
We compare Concoction to the baselines on three open-source projects listed in Table 2. These projects contain 35 CVEs reported by independent users, which were also used by prior work [69, 71]. We apply all methods to the functions associated with a CVE and use the reported CVEs to compute the evaluation metrics (Sec. 4.3). To ensure a fair comparison, we train all methods, including Concoction, on the same training dataset, but exclude these three projects from the training data. For the dynamic methods listed in Sec. 4.2, we allocate 200 hours of machine time for each project.

DL model implementation choices.
We conduct an ablation study [36] on Concoction using the CVE dataset. The study includes the following variants: Static (using the enhanced AST only, Sec. 3), Dynamic (using randomly sampled symbolic traces with 30 minutes of symbolic execution for each project, Sec. 3.2.2), NonCL (without the contrastive learning module, Sec. 3.3), NonSel (omitting the path selection module by using randomly sampled symbolic execution traces with static code information, Sec. 3.4), and Conc (the complete Concoction implementation).
The results are given in Figure 9. Using only static or dynamic representations is insufficient for accurately modeling program structures, with F1 scores of 68.7% and 77.2% for each variant, respectively. In our approach, we employed dropout-based contrastive learning as data augmentation for training our representation model (Sec. 3.3.2). This helps extend our training set and mitigates overfitting [70]. Removing the contrastive learning component led to a 3.7% decrease in the F1 score, reaching 82.4% compared to the full model. Additionally, removing the path selection method resulted in an F1 score drop to 77.6%, since random sampling may not capture crucial path information within a given budget.
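The dropout-based contrastive objective can be sketched as follows (a SimCSE-style illustration under our own simplifications, not Concoction's actual training code): each embedding is passed through two independent dropout masks, and the two views of the same sample act as a positive pair against in-batch negatives.

```python
import numpy as np

def dropout_views(x, rate=0.1, rng=None):
    # Two stochastic "views" of the same embeddings via independent dropout
    # masks, with inverted-dropout rescaling
    rng = rng or np.random.default_rng(0)
    keep = 1.0 - rate
    m1 = rng.random(x.shape) < keep
    m2 = rng.random(x.shape) < keep
    return x * m1 / keep, x * m2 / keep

def info_nce_loss(z1, z2, tau=0.05):
    # Cosine similarities between all view pairs; positives lie on the diagonal
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))  # cross-entropy with diagonal positives
```

Because the only "augmentation" is dropout noise, no vulnerability labels are needed to compute this loss, which is what lets it extend the training set.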

Sensitivity analysis.
To test the sensitivity of Concoction to mislabeled training samples, we introduce mislabeled samples accounting for 20% to 80% of the training samples into Concoction's training dataset, yielding a variant called Concoction-MisSam. Similarly, we randomly remove some of the symbolic execution traces selected by Concoction to simulate a scenario where some traces are missing, yielding another variant named Concoction-IT. The performance of Concoction-MisSam in Figure 9 shows that mislabeled training samples can harm performance: the F1 score of Concoction-MisSam drops from 78.0% to 64.3%. This is expected, as machine learning techniques can suffer from noisy and mislabeled training data [30, 32]. However, the impact of mislabeled samples can be mitigated by increasing the training dataset and using data cleaning methods [13, 66], which are orthogonal to our approach. Similarly, missing execution traces can also negatively impact performance: the F1 score of Concoction decreases from 86.1% to 77.1% in Concoction-IT, which is still at least 2.1% higher than other DL baselines that rely on static code information. Missing symbolic execution traces are likely when testing external libraries where the tool has no access to the source code. This issue is beyond the capability and scope of a source-code-level detection tool. In the worst case, where all symbolic traces are missing, our DNN model can still use static information to detect bugs, albeit less effectively. During deployment, Concoction can complete predictions within minutes, and the fuzzing tool may take several hours to generate a vulnerability-exposing test case. Since Concoction can be integrated with a parallelized overnight build system, the deployment overhead should be acceptable for many software developers.
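The mislabeling setup can be sketched as follows (our illustration of the experimental protocol, not the authors' script): a chosen fraction of the binary labels is flipped before training.

```python
import numpy as np

def inject_label_noise(labels, rate, rng=None):
    # Flip a fraction `rate` of binary labels (0 = benign, 1 = vulnerable)
    # to simulate mislabeled training data
    rng = rng or np.random.default_rng(42)
    labels = np.asarray(labels).copy()
    idx = rng.choice(len(labels), size=int(rate * len(labels)), replace=False)
    labels[idx] = 1 - labels[idx]
    return labels
```

Sampling indices without replacement guarantees that exactly the requested fraction of samples is corrupted, so the 20%-80% noise levels in the study are directly comparable.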

DISCUSSION AND THREATS TO VALIDITY
Naturally, there is room for further work and improvement.We discuss a few points here.
Runtime overhead. The runtime overhead of Concoction mainly comes from collecting symbolic execution traces. The Concoction path selection component is designed to minimize this overhead. There are methods to accelerate symbolic execution through parallelizing symbolic test case generation [14, 28, 65]. These approaches are orthogonal to Concoction.
Large and complex code. During our evaluation of 23 large open-source projects, we used function inlining to remove procedural boundaries for code analysis. Our current implementation does not support analysis of complex code involving recursive functions, pointer aliasing, and function pointers with dynamic dispatching, which remains an open problem. Concoction will benefit from techniques for reasoning about these complex code patterns [24, 49, 52, 55, 68]. Our current implementation parallelizes symbolic execution and fuzzing on individual code regions. We envision that extending Concoction to larger code regions would require new techniques. For example, it would require capturing the data flow across methods and potentially pointer alias analysis [44], as well as techniques to accelerate symbolic execution [14, 28, 65].
We are excited about the potential of Concoction and hope its initial promising results can encourage further research.
Other languages. We showcased Concoction on C, but it can extend to other languages. Doing so requires adapting the source code rewriting tool (see Sec. 3.2.1) and having a tool to generate symbolic traces for the target language. Nevertheless, our DL framework for combining static and dynamic information broadly applies to other languages. Concoction already supports using the Language Server Protocol (LSP) [38] to construct the enhanced AST. LSP currently supports C, C++, Java, JavaScript and Python with a single interface, so Concoction can be easily ported to these languages. Concoction can also use symbolic execution engines built for other languages, including JDart [61] for Java, PyExZ3 [45] for Python, and Jalangi2 [73] for JavaScript.

RELATED WORK
Our work builds on the past foundations of deep learning, source code vulnerability analysis, and static symbolic execution.We apply Transformer-based networks [39,78] to learn program representations and contrastive learning to train our models.
Deep learning-based vulnerability detection. Our research is part of the recent efforts in DL-based software vulnerability detection [21, 57, 59, 60, 63]. Prior studies primarily relied on static information like ASTs and control flow graphs. Concoction incorporates symbolic traces with static code information to capture deeper program semantics, enhancing vulnerability detection.

Symbolic execution. Symbolic execution [17, 41] sidesteps the need for hand-crafted rules by exploiting symbolic values and analyzing their use over the execution tree of a program's source code. However, language constructs like loops and branches can significantly increase the number of execution states, limiting the scalability of the technique to large programs [11].
Contrastive learning. Contrastive learning is popular for its ability to reduce the costs of annotating large-scale datasets [46]. It has been used in computer vision [19, 22, 42], natural language processing [10, 35], and other domains. Concoction uses contrastive learning to address label scarcity in vulnerability detection. This allows us to train our models effectively with unsupervised learning, reducing the need for extensive manual labeling.
Path selection for code embedding. Several techniques use learning-based approaches for path selection [23, 41]. Like ContraFlow [23], Concoction also employs unsupervised active learning to train a path selection network. ContraFlow is designed to pinpoint static value-flow paths that may trigger a vulnerability, but it relies solely on static code features. Unlike [23], we use path selection to minimize the overhead of symbolic execution, whose traces are then combined with static code features to learn a more effective program representation. Our experiments in Sec. 5.2 and 5.3 show that our approach outperforms ContraFlow.

CONCLUSION
We have presented Concoction, a new DL system for detecting vulnerabilities at the source code level. It utilizes structured static code features and dynamic symbolic execution traces to learn program representations, enabling accurate prediction of bugs. We train Concoction by combining unsupervised and supervised learning, minimizing the overhead of symbolic execution with a path selection network. We applied Concoction to detect bugs and vulnerabilities in C programs from 20 open-source projects. In 200 hours of automated concurrent test runs, Concoction successfully detected vulnerabilities in all tested projects, discovering 54 unique vulnerabilities and yielding 37 new, unique CVE IDs. Compared to 16 previous methods, Concoction finds more vulnerabilities with higher accuracy and a lower false positive rate.

Figure 1: This example contains a false "division-by-zero" issue at line 4 because the branch at line 10 will not be taken.

Figure 2 depicts the workflow of using Concoction to detect function-level code vulnerabilities during deployment.

Figure 3 shows the workflow of training the Concoction DL components for program representations and vulnerability detection.

Figure 3: The Concoction DNN architecture and its training workflow.

Figure 5: Rank and select paths for symbolic execution.

Figure 6: CDF of the (log-scale) number of lines (a) and execution paths (b) of our test samples.

work [7], or have active development teams. We stress that none of these projects was used to train Concoction, and we tested the latest version of each project at the time of testing. We also note that bugs discovered by Concoction were previously unreported at the time of testing; hence, there were no data leakage issues.

Figure 7: Evaluation on standard vulnerability databases.Min-max bars show performance across vulnerability types.

Listing 3: Patch of CVE-2021-3574 for fixing a memory leak vulnerability in the ImageMagick project.

Figure 8: Evaluation on the CVE dataset.

Figure 10 compares the training overhead of various DL-based methods. It includes the one-off time spent on feature extraction (e.g., AST construction and symbolic execution) for training samples and the iterative training time using labeled samples.

Table 1 : Open datasets used in training and evaluation
We also use a script to record the addresses, sizes, and names of all variables of the target function during symbolic execution. The script also generates a test driver program for KLEE to facilitate running the engine on the target function.
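The driver-generation step can be sketched as follows (a hypothetical illustration; the paper's actual script, the function name `make_klee_driver`, and the 64-byte buffer size are our assumptions, not details from the source). Given a target function's name and parameter list, it emits a C main() that marks each parameter symbolic via KLEE's `klee_make_symbolic` before calling the function:

```python
def make_klee_driver(func_name, params):
    # params: list of (c_type, name) pairs, e.g. [("int", "len"), ("char *", "buf")]
    decls, args = [], []
    for c_type, name in params:
        if c_type.rstrip().endswith("*"):
            # A fixed-size symbolic buffer stands in for pointer parameters
            # (64 bytes is an assumed, illustrative bound)
            decls.append(f"  char {name}[64];")
        else:
            decls.append(f"  {c_type} {name};")
        decls.append(f'  klee_make_symbolic(&{name}, sizeof({name}), "{name}");')
        args.append(name)
    body = "\n".join(decls)
    call = f"  {func_name}({', '.join(args)});"
    return ('#include "klee/klee.h"\n\n'
            f"int main(void) {{\n{body}\n{call}\n  return 0;\n}}\n")
```

Compiling the emitted driver to LLVM bitcode together with the target function lets KLEE explore the function with fully symbolic inputs.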

Patch for a segmentation-violation vulnerability that reassigns dctx->dcmpri.len to avoid memory overflow.
Figure 7 reports four "higher-is-better" metrics (Sec. 4.3) achieved by Concoction and the baselines on the SARD dataset (Table 1). The min-max bars show variance across cross-validation runs. Concoction outperforms other methods in all metrics and has the most reliable performance across cross-validation runs, with the narrowest min-max bar. While LineVul and LineVD achieve high precision (a low false-positive rate) similar to Concoction, they have lower recall and miss some vulnerable cases. For example, LineVD only detects 70.3% of the CWE-126-typed test cases, whereas Concoction detects all of them. Other baselines show low detection accuracy. Concoction achieved 100% recall in detecting certain vulnerability types like CWE-416 and CWE-789. Other methods, in contrast, failed to detect all of these vulnerabilities. Notably, Concoction is highly effective in detecting use-after-free vulnerabilities by leveraging dynamic traces and static code structures to infer the use of pointers.

Table 5
CVEs within the same test time by guiding the fuzzing engine to focus on potentially buggy code paths. Though highly effective, Concoction missed four vulnerabilities (one example in Sec. 5.3.3) due to incomplete vulnerable execution traces during feature extraction. Addressing this limitation could involve extending the symbolic execution time and improving the path selection model or the number of paths selected.

Case missed by baselines. Listing 4 shows CVE-2020-35523, an integer overflow vulnerability identified by Concoction but missed by all baselines. The other methods missed this bug because they mainly rely on static information, such as tokens typically associated with integer overflow (e.g., malloc), which is insufficient in this case. Concoction detects this vulnerability by symbolizing the tw and w variables, allowing the DL model to infer the absence of a corresponding INT32 bounds check, which aligns with the pattern of integer overflow.

Listing 4: An integer overflow vulnerability in Libtiff.
Case missed by Concoction. Listing 5 shows CVE-2020-35523, a memory leak vulnerability missed by Concoction and all other baselines. The issue is caused by the function returning directly without closing the input file handle in, causing a resource leak. Concoction missed this because the memory leak is introduced within the file handle data structure, and Concoction does not learn such patterns from the training dataset. This can be improved by extending the training dataset to cover a wider range of patterns.

Listing 5: A memory leak vulnerability in Libtiff that occurs because the file handle in is not released.
from Table 1. Training terminates when the loss does not improve within 20 consecutive epochs or when the termination criteria are met. CodeXGLUE and Lin et al. achieved the shortest training overhead, relying mainly on sequence neural networks like Bi-LSTM. However, they have a low F1 score, indicating a limited ability to capture complex code structures. More advanced models that use ASTs required longer feature extraction and training times but showed higher accuracy during evaluation. Additionally, LIGER and Concoction incur a more expensive feature extraction time due to collecting dynamic runtime information through symbolic execution. It is important to note that model training is performed offline and is a one-off cost.