Concrete Constraint Guided Symbolic Execution

Symbolic execution is a popular program analysis technique. It systematically explores all feasible paths of a program, but its scalability is largely limited by the path explosion problem, which causes the number of paths to proliferate at runtime. A key idea in existing methods to mitigate this problem is to guide the selection of states for path exploration, which primarily relies on features that represent program states. In this paper, we propose concrete constraint guided symbolic execution, which aims to cover more concrete branches and ultimately improve overall code coverage during symbolic execution. Our key insight is that symbolic execution strives to cover all symbolic branches while concrete branches are neglected, so directing symbolic execution toward uncovered concrete branches has great potential to improve overall code coverage. Experimental results demonstrate that our approach improves the ability of KLEE both to increase code coverage and to find more security violations on 10 open-source C programs.


INTRODUCTION
Symbolic execution is a prevalent and powerful technique for program testing and debugging that originated in the 1970s [14,25]. Over the years, researchers worldwide have made significant efforts to enhance the performance and scalability of open-source symbolic execution tools. The technique has found applications in various academic fields, including software testing [26,33,34,54], program verification [17,23,40], vulnerability analysis [47,48], and firmware emulation [9,24,57]. In addition, symbolic execution has been adopted in industrial practice to enhance software security in various domains, for example at Baidu [28], Cyberhaven [13], Fujitsu [19], and Trail of Bits [21].
The concept behind classical symbolic execution is to replace concrete inputs with symbolic inputs in order to explore all feasible paths within a program [7]. During symbolic execution, program variables are mapped to symbols, and the constraints formed by these symbols are collected along each path. These constraints are then passed to SMT solvers to find concrete values for program variables that drive execution down new paths [8]. However, this approach forks new states whenever symbolic branching conditions are encountered, leading to a surge in the number of states and resulting in the problem known as path explosion [4]. Consequently, significant time and memory resources are wasted without an efficient mechanism for choosing promising states. To address this issue, it is crucial to represent each state using static and dynamic features specific to the program. Such features can assist in selecting promising states [53], pruning [12], or merging redundant states [27]. Therefore, careful selection of features is essential to enhance the efficacy of path exploration.
Existing methods. The features used to represent program states serve as a proxy for finding new program paths by steering states toward specific program properties. For instance, existing methods select features such as path depth, query time, sub-path, and the number of executed instructions and generated test cases [7,29]. However, a search strategy based on a single feature does not adapt to all programs, since programs differ widely in design and implementation, including code structure, branching conditions, and the dependencies between program variables [11,22]. Therefore, recent works have shifted to using machine learning techniques to model several manually picked features and learn a more comprehensive and robust search strategy for state selection [10,11,22,37]. Yet it is difficult for machine-learning based methods to achieve good performance across all programs in software testing due to the lack of guidance on choosing suitable models [45]. In addition, machine-learning based methods are constrained by their training data and hyper-parameters.
New insight. In general, the variables in branching conditions during symbolic execution can be either symbolic or concrete [5]. When all the variables in a branching condition are concrete, the branching condition is considered a concrete branching condition, e.g., x>1 when the variable x is concrete. Furthermore, we refer to the branching constraint for each branch of such a condition as a concrete constraint. In this paper, we propose to represent each program state by its ability to cover more concrete branches. The idea derives from the fact that enormous effort has been invested in constraint solving techniques that deal with symbolic branching conditions [3,38,46,51]. However, according to our observations in Section 2, concrete branching conditions account for over 95% of branching conditions during symbolic execution, yet nearly 70% of them are only partially covered, i.e., only one branch of the branching condition is covered, in real-world programs. Therefore, there is great potential to enhance symbolic execution by fully covering more concrete branching conditions, thereby improving overall code coverage.
Furthermore, for a partially covered concrete branching condition, the states reaching it must carry different values for the variables involved in the condition so that the constraint of the uncovered branch can be satisfied. However, existing symbolic execution tools face a challenge in this regard. They often rely on other features for state selection while overlooking the correlation between the definitions of these variables and the selection of promising states. Since these variables are usually defined and used at many different program locations, an execution path that covers only one or a few definitions has a rather low probability of satisfying the branching constraints of the uncovered branches. Consequently, these tools frequently waste time feeding unsuitable definitions of these variables into uncovered concrete branching conditions. This not only reduces the chance of covering more concrete branches but also reduces the chance of subsequently covering more symbolic branches, and thus decreases the overall efficacy of path exploration. In this paper, we shift the focus to the variables in concrete branching conditions: we investigate which states have defined them, and whether these definitions can satisfy specific concrete constraints, in order to select more promising states and improve overall code coverage.
Our solution. Overall, we leverage the inter-procedural data dependency of the variables in concrete branching conditions as the core feature for state selection during symbolic execution. By identifying when a state defines a variable in a partially covered concrete branching condition, we can prioritize that state if the definition can satisfy the concrete constraint of the uncovered branch. However, there are two main challenges: first, extracting the inter-procedural data dependency of program variables can be expensive; second, it is not clear how to leverage data dependency for state selection during symbolic execution. Existing inter-procedural static value-flow analysis methods for program variables are mostly expensive in terms of balancing accuracy and performance [39]. Instead, we propose a backward data dependency analysis that identifies the source variables of a given target variable. We then use these source variables as intermediate variables to establish the inter-procedural data dependency. During symbolic execution, we prioritize states whose variable definitions can satisfy the constraint corresponding to the uncovered branch of a partially covered branching condition.
Evaluation. We implement our method on top of a widely-used open-source symbolic execution framework, KLEE [7,36]. To evaluate its efficiency, we use 10 open-source C programs as benchmarks, which are also used in related work on symbolic execution. For code coverage, our method improves branch coverage by 28.8% on average compared to the best of the existing methods and fully covers 15.6 more concrete branching conditions than existing methods. In addition, our method is stable, with low sensitivity to the hyper-parameter settings of its optimizations. Furthermore, our method triggers 6 out of 7 security violations in 5 programs using the generated test cases, while the other methods trigger at most 3.
Contributions. We summarize our main contributions as follows:
• We design an algorithm to extract the inter-procedural data dependencies of the variables involved in branching conditions.
• We propose a concrete constraint guided search strategy to cover more concrete branches to improve the overall branch coverage for symbolic execution.
• We implement and evaluate our method on 10 open-source C programs in terms of branch coverage and security violations, which demonstrates the effectiveness of our method. We make our tool publicly available at https://github.com/XMUsuny/cgs.

MOTIVATION
In this section, we use source code from the subject program grep 3.6 [1] to present an example of a concrete constraint, and then analyze the branching conditions encountered during symbolic execution of real-world programs to motivate this work.
An example of concrete constraint. We illustrate the call chains from grep 3.6, which demonstrate its use for text matching with regular expressions, in Figure 1. After receiving user inputs from the command line, the parse function calls the two functions on the left to assign a value to token->type and then calls the four functions on the right to use this definition in the branching condition of the if statement. Specifically, the peek_token function assigns a concrete value to token->type based on the input variable, and the parse_dup_op function employs different handlers to interpret tokens depending on token->type. Along this call chain, the struct-typed pointer token stays unchanged as a function parameter and is passed directly. This establishes an inter-procedural data dependency between the definitions and usages of token->type in these functions. It is important to note that the branching constraint in the parse_dup_op function is always a concrete constraint, because token->type is consistently assigned a concrete value during symbolic execution. Consequently, a single state can satisfy only one concrete constraint and thus only partially covers this branching condition.
Branch coverage at runtime. To investigate how symbolic and concrete branches are covered during symbolic execution, we conducted an experiment in which we ran four typical real-world programs for 2 hours using KLEE with a random-path search strategy [7]. We report the number of branching conditions of each type and their coverage in Figure 2 and Figure 3, respectively. For simplicity, we only considered conditional statements with LLVM icmp instructions as branching conditions. Note that concrete branching conditions fall into roughly two categories: (1) those originating from input sources that are not symbolic (e.g., environment variables and configuration files),
(2) those constrained by variables that are implicitly affected by symbolic inputs, as shown in Figure 1. Intuitively, the second category plays the major role, so our experiments do not consider the influence of the first.
According to the results in Figure 2, concrete branching conditions account for over 95% of all branching conditions. Furthermore, we find that some branching conditions can be both symbolic and concrete, since the variables involved in their branching constraints may be defined with concrete values along some paths and with symbolic values along others. Consequently, the sum of the two categories exceeds the total number of branching conditions.
Moreover, we collected the coverage of the different types of branching conditions and summarize the results in Figure 3. Our analysis reveals that nearly 25% of the symbolic branching conditions were partially covered, i.e., only one branch of the branching condition is covered, while nearly 70% of the concrete branching conditions were partially covered. In summary, the results suggest a significant disparity in the coverage of branching conditions with different kinds of branching constraints: concrete branching conditions show a much higher rate of partial coverage than symbolic ones.
Concrete constraint guided search strategy. Since concrete branching conditions constitute a significant majority and a large portion of them are only partially covered, we believe there is great potential to improve overall code coverage by guiding symbolic execution toward concrete branches that are reached but not covered during symbolic execution.
In this paper, we leverage static analysis to soundly extract the inter-procedural data dependency of the variables in branching conditions. Then, by analyzing the uncovered concrete branching conditions during symbolic execution, we can identify the states that have defined these variables with values that satisfy the unsatisfied concrete constraints. Finally, we leverage this information to steer symbolic execution to cover more concrete branches.

DESIGN
In this paper, we focus on covering more concrete branches during symbolic execution using the data dependency of the variables in concrete branching conditions. Our approach comprises two modules: the branch dependency analysis module statically extracts the inter-procedural data dependency of these variables, and the concrete constraint guided searcher uses the extracted data dependency for efficient state selection during symbolic execution.

Branch Dependency Analysis
This module constructs the inter-procedural data dependency between branching conditions and the definitions of the variables they use. Specifically, we first identify the source variables of each variable used in the branching conditions and of the addresses of assignment instructions. Then, we match these source variables based on their values or types to construct the inter-procedural data dependency.
Source variable definition. Given a program variable, we define its source variable as the variable from which its value is initially assigned within a function. We describe each source variable by a type and a value. In general, the type of a source variable within a function falls into one of the following categories: global variable, local variable, or function parameter. The value of a source variable depends on its type. For global and local variables, the value is explicitly defined with a unique instruction identifier. However, for function parameters in inter-procedural scope, the values may come from different call instances and are thus ambiguous. In addition, since some function parameters always remain unchanged across functions, we further divide function parameters into three subtypes: non-pointer parameters, struct-typed pointer parameters, and non-struct-typed pointer parameters.
A non-pointer parameter is a non-pointer variable that is usually defined in the caller functions on the function call graph. A struct-typed pointer parameter is a struct-typed pointer variable that is usually defined and used in different functions but always stays unchanged. As shown in Figure 1, it can serve as an intermediate variable to build an inter-procedural data dependency between the variables in different functions. Moreover, sometimes more than one nested struct-typed variable must be traversed to reach the sink variable, which builds a list of struct-typed pointers in which each node is denoted by its type and offset in the nested structs. For instance, the source variable of dfa->nodes_alloc in Listing 1 is recorded as [(regex_t *, 0), (re_dfa_t *, 1)]. In addition, we do not consider non-struct-typed pointer parameters such as string and integer pointers.

re_dfa_t *dfa = preg->buffer;
dfa->nexts = re_malloc(Idx, dfa->nodes_alloc);
...
}
Listing 1: An example of a struct-typed pointer variable

Source variable identification. We propose an algorithm for performing a backward data dependency analysis. The goal of this algorithm is to identify the source variables of a given variable and categorize them into one of the five types mentioned earlier; we refer to this variable as v in Algorithm 1. In general, our approach tracks the LLVM-IR instructions stored in the set inst in a depth-first manner, going backward along the def-use chains and handling each type of instruction differently. The key steps are as follows: ❶ When encountering invalid instructions in lines 7-8 (e.g., call, switch, and icmp), we stop tracking, since we only focus on data dependency. ❷ For a gep instruction, we record the type and offset of the accessed structure member in the struct-typed data st, and add the instruction, typed as a struct-typed pointer, to inst in lines 9-11. ❸ For other instructions, we employ different tracking strategies based on the type of each operand of the current
instruction. Specifically: ① If the operand is a global variable, we directly add it as a source variable in lines 14-15. ② If the operand is a local variable, we first use the getStores function to find all the store instructions that define this local variable in line 17. Since a local variable can be defined from a function parameter at the function entry, we use the findParameter function to find a definition that comes from a function parameter in line 18. If this function parameter is a non-pointer or struct-typed pointer, we create a new source variable in lines 19-20 and use the addSource function to set the type of this source variable from st and then clear st. ③ If no function parameter is found, we add this local variable as a source variable in line 22. Moreover, we use the selectInsts function to find a new store instruction for the next round in line 23. We first attempt to add the store instruction that dominates the current instruction to inst. If none is found, we add the nearest store instruction that defines this local variable to inst in line 25, provided it can reach v on the control flow graph of the current function and the written data is an instruction. ④ Otherwise, if the operand is neither a global nor a local variable, we add it to inst for the next round if it is an instruction, in lines 27-28.
Example. We illustrate the workflow of identifying the source variable of variable %41 in the upper part of Figure 4. The instructions in blue are the traced instructions in inst. In line 63, we add the struct-typed element [(re_token_t *, 1)] to st. For local variable %11, we find the store instruction in line 26, whose written data %3 is a function parameter. Eventually, we set %3 as the source variable of variable %41, with the type struct-typed pointer parameter.
Source variable for usage. A usage is the value of a branching condition. For branch statements, we consider all switch instructions and those br instructions that use the results of icmp instructions as branching conditions. We extract the condition variables from these branch statements and then use Algorithm 1 to extract their source variables. For instance, we identify the source variable of variable %41 in the br instruction in line 68 of Figure 4 as parameter %3 for the parse_dup_op function in Figure 1.
Source variable for definition. A definition is the address of a store instruction. Similarly, we use Algorithm 1 to extract its source variables. For instance, we identify the source variable of variable %25 in the store instruction in Figure 4 as parameter %0 for the peek_token function in Figure 1. For brevity, we omit the related code of the peek_token function in Figure 4, since its tracking workflow is the same as that of the br instruction above.
Branch dependency construction. Overall, we use different strategies to match the source variables of definitions and usages according to their types and instruction identifiers. ❶ For global or ❷ local variables, the matching is direct and sound, since their source variables are globally or locally defined and their instruction identifiers are easily determined. ❸ If the type of the source variable is struct-typed pointer parameter, we use a field-sensitive type-matching method that matches source variables based on whether their types in st overlap. For instance, the types of a->b->c and b->c are matched, but the types of b->c and b->d are not. It is important to note that this matching is not completely accurate, because different instances of the matched type may be used by definitions and usages, respectively. However, this over-approximation is typical in static analysis and is acceptable for higher completeness. For example, APICraft proposes a similar method, called type-based transition, to statically build new inter-procedural data dependencies between library APIs [55]. ❹ For non-pointer parameters, we track them backward on the call graph to locate their source variables in the caller functions. We then refine their source variables into the three types mentioned earlier for further matching. Finally, we build the inter-procedural data dependency between the definitions and usages of the variables in branching conditions, such as the store and br instructions in Figure 4.

Concrete Constraint Guided Searcher
This section presents our concrete constraint guided search strategy, which leverages the results of the branch dependency analysis to select promising states for symbolic execution.
General symbolic execution. The standard workflow of symbolic execution is depicted in Algorithm 2 as the black lines of code. In each iteration of the main loop, the symbolic execution engine chooses a state in line 6 based on the search strategy. If the current state terminates or triggers an error, it is removed, and a new test case is generated if the current path constraints are solvable, in lines 9-10. Otherwise, the current instruction is symbolically interpreted based on its type. For branch instructions such as br or switch with a symbolic branching condition, new states are created via forking in lines 13-14 after the SMT solver determines the solvability of the resulting path constraints. For other instructions, the current state advances one step within the current basic block in line 30. Finally, all states are updated for the next iteration in line 31.
Concrete constraint guided searcher. The goal of our searcher is to select promising states that can satisfy the concrete constraints of uncovered concrete branches. Specifically, a selected state must satisfy the following two properties:
• It has covered the definitions of the variables in at least one partially covered concrete branching condition.
• There exists a covered definition that can satisfy the constraint of an uncovered concrete branch.
To simplify the narration, we refer to such definitions as valid definitions in the remainder of this paper. To this end, we propose a novel strategy that handles each concrete constraint in a three-step cycle, shown in Algorithms 2 and 3. Our additions to Algorithm 2 appear as the blue lines of code, including new handlers for branch instructions with concrete branching conditions in lines 16-24 and a new handler for store instructions in lines 25-28. The metadata for each concrete constraint is recorded in a global map V, where each item consists of three elements: <branch instruction, comparison operator, compared constant>. In addition, Algorithm 3 mainly focuses on how states are updated in each step.
Workflow of our searcher. The definitions of the variables in branching conditions are extracted before symbolic execution, as described in Section 3.1. During symbolic execution, our searcher uses three steps to cyclically handle each concrete constraint of a partially covered branching condition, from when it is first encountered (Step 1) to when it is fully covered (Step 3):
• Step 1: When a concrete branching condition is partially covered for the first time, we identify its unsatisfied concrete constraints, update the states, and start tracking the constraints.
• Step 2: We prioritize the current state if it gives a valid definition to the variable in a tracked concrete constraint.
• Step 3: We update the states when a tracked concrete constraint is finally satisfied, and stop tracking it.

Note that Step 2 is complementary to Step 1, because at the time of Step 1, some definitions are likely not yet covered by any current state. In addition, some prioritized states cannot reach the concrete branching conditions. Next, we combine Algorithm 2 and Algorithm 3 to explain each step in detail.
Step 1. When a concrete branching condition is encountered for the first time, we extract the store instructions associated with this branch instruction in line 16 and add them to the globally tracked set for Step 2 in line 18 of Algorithm 2. Then, we use the extractValidValue function to analyze the covered concrete branching condition and save the metadata of its unsatisfied concrete constraints to V in line 19, by extracting the inverse comparison operator and the value of the compared constant¹. Next, in Algorithm 3, lines 3-5 iterate through all states to find those that have covered the store instructions mentioned above. For each of them, we extract the definition if it is constant and use the isValidValue function to determine whether the current state satisfies the second property listed in this section, in lines 6-7. If so, we prioritize this state in the next round in line 8. Furthermore, we set this branch instruction as the state's target branch and start tracking it in line 9; the target branch denotes the statement that this state steps toward.
Step 2. During symbolic execution, we track each store instruction added in Step 1 for each state and record the written value only if it is constant. Since one state can assign different values to the same variable over time, we maintain a global mapping that links a variable to all the store instructions that define it and always update its definition to the latest one. When we encounter a tracked store instruction, we also use the isValidValue function, in line 26 of Algorithm 2, to determine whether the definition is valid, as in Step 1. If so, we set this branch instruction as the current state's target branch and prioritize the current state in lines 16-17 of Algorithm 3.
Step 3. Once the current state reaches its target branch, it is highly likely that the uncovered concrete branch becomes covered. As a result, other states directed toward this branch instruction become useless, so their target branches are removed and no longer tracked in lines 11-13 of Algorithm 3. We also postpone the execution of a state if it has no target branches, in line 14 of Algorithm 3. Additionally, the target stores that were added in Step 1 are removed in line 23 of Algorithm 2. At this point, the cycle for handling one concrete constraint in our searcher is complete.
Beyond this cycle, we classify all states into two sets based on whether they have target branches. When new states are generated, we always give priority to the states that have target branches. This strategy increases the possibility that these states will fully cover new concrete branching conditions. In addition, for both sets of states, we employ a breadth-first search approach to state selection.
¹ For switch statements, we extract all of the values from the uncovered cases.
Example. We illustrate each step in Figure 5, which comprises one concrete branching condition v>2 and four definitions of variable v.
Step S0 is the initial state before symbolic execution, and steps S1-S3 correspond to the three steps. ❶ In step S1, after one state has defined the variable v with value 1 and partially covered this concrete branching condition, we record the inverse comparison operator of the satisfied concrete constraint as '>' together with the constant value 2. Then we find all the covered definitions of v, including v=2 and v=3, and prioritize the state that covered the definition v=3, since this definition is valid. However, this state fails to reach the target branch, perhaps due to other unsatisfied branching constraints. ❷ Then, during step S2, when a state that had covered the definition v=2 covers the definition v=4, we update the latest definition to v=4 and prioritize this state, since this definition is valid. ❸ Finally, in step S3, this state reaches and covers the uncovered concrete branch, and the cycle for handling this concrete constraint is complete.

Optimizations
According to our observations, some concrete constraints cannot be satisfied during symbolic execution. For example, some branching conditions are controlled by global macros, environment variables, configuration files, or rarely triggered exceptions. In addition, some states that have covered valid definitions may never reach their target branches. Since tracking such target branches can incur large overheads, we cap the number of tracked target branches at a fixed size (e.g., 10) and periodically update them based on the number of instructions executed at runtime (e.g., 300,000).
Specifically, we implement this mechanism by splitting the tracked target branches into two FIFO queues. The first queue (Q1) tracks target branches and has a fixed size. The second queue (Q2) has a variable size and is filled with newly encountered concrete branching conditions only when the first queue is full. When the counter

IMPLEMENTATION
We implement our design in about 3.2k LoC of C++, including 2k LoC for the branch dependency analysis module, and we add 1.2k LoC to KLEE 2.3 for our concrete constraint guided searcher on Ubuntu 18.04. We use the def-use chains and the analysis passes for loops and control flow graphs from LLVM 11.1.0. The code added to KLEE mainly analyzes concrete constraints and manages states that partially or fully cover concrete branches.
Currently, our implementation only supports concrete constraints with integral comparisons. However, extending it to support floating-point values and textual strings is straightforward: it only requires adding new strategies for handling these value types in the extractValidValue function in Algorithm 2. Furthermore, once we identify all the source variables of definitions and usages, we make a trade-off by focusing only on the branching conditions that involve a single source variable, as described in Section 3.1. The rationale is that if a variable in a branching constraint has multiple source variables, its value cannot be determined by checking the definition of only one of them, which can leave target branches uncovered. This trade-off therefore ensures that an uncovered concrete branch is covered when a state that has covered the corresponding definition reaches it. To support multiple source variables in our design, the main effort would be adding a new handler in the isValidValue function in Algorithms 2 and 3. We discuss this issue in depth in Section 7.

EVALUATION
In this section, we evaluate our method mainly to answer the following four questions:
• RQ1: How many instructions are involved in the analysis of inter-procedural branch dependency?
• RQ2: Can our method improve branch coverage and also fully cover more concrete branching conditions?
• RQ3: Can our optimization effectively and stably improve branch coverage?
• RQ4: Can our method trigger more security violations?

Experiment Setup
Benchmarks. We use 10 open-source C programs that are widely used in related work on search strategies for symbolic execution [6,10,12,22]. Details of the benchmarks are listed in Table 1.
As harnesses to test the libraries, we use an example from the source code of expat and an example from the official website of libxml2. Since our method assumes that the program contains sufficiently many branching conditions with only one variable, the tested programs should be large enough for a meaningful comparison with other search strategies; we therefore do not use smaller standard programs such as coreutils.
Before running the programs, we use the methods in Section 3.1 to statically extract the branch dependency information and save it in newly generated programs in LLVM bitcode format, which are then fed into the symbolic execution tool for evaluation.
Baselines. All eleven search methods implemented in KLEE [7] are considered for our evaluation. The eight search methods we choose consist of bfs (breadth first search), rss (random state search), rps (random path search) and five instances of the nurs (non uniform random search) family: nurs:rp (nurs with 1/2^depth), nurs:covnew, nurs:cpicnt, nurs:md2u and nurs:qc. We do not use dfs (depth first search) and nurs:depth since they are similar to nurs:rp, which performs best on average coverage for most programs. In addition, nurs:cpicnt and nurs:icnt both use the number of executed instructions as guidance, so we keep the better one, nurs:cpicnt. We denote our searcher as cgs (concrete-constraint guided searcher). In total, there are eight search methods in our baselines, and we run each of them on each program for evaluation.
Settings. All experiments are conducted on a Linux machine with 4 Intel(R) Xeon(R) Gold 5218 CPUs (2.30GHz) and 128GB RAM, with a total of 16 cores and 64 threads. We test each program 8 times for 8 hours each, with a 4GB memory limit. We use the default definitions of symbolic inputs from KLEE, listed in the last column of Table 1. In addition, we set the -optimize option of KLEE to false, since otherwise the extracted branch dependency information is likely to be removed before symbolic execution.

Branch Dependency
This section shows the results of the branch dependency analysis, including the number of involved instructions and the number of different types of source variables of the involved branches. The branches and stores involved in this module are used by our concrete constraint guided searcher for state selection.
In general, the total number of source variables is greater than the number of branch instructions, because some variables in branching conditions share the same source variables and not all branch instructions can be matched to store instructions. On average, the involved branches account for about 15% ∼ 20% of all branches, comparing the data in Table 1 and Table 2. This proportion seems relatively small because we only focus on variables that have a single source variable in branching conditions. However, as we explain in Section 4, this trade-off guarantees that the concrete constraints can be satisfied when a state that has covered a specific definition reaches them. Our methods therefore strictly limit the branching conditions used by our concrete constraint guided searcher.

Branch Coverage
In this section, we evaluate our search strategy by examining the branch coverage improvements on KLEE [7]. We first present the total branch coverage over time, showing the mean values and 95% standard deviations for each program and search strategy. Next, we report the final number of fully covered concrete branching conditions. This analysis helps shed light on the reasons for the improvement in overall branch coverage achieved by our concrete constraint guided searcher. We collect the results from KLEE's statistics during execution and show them in Figure 6 for each tested program. Overall, the experimental results show that cgs achieves the highest branch coverage for 9 programs and the second best for nasm. After the 8h time budget expires, the average branch coverage of cgs is 2243.5 branches, which outperforms the second ranked strategy bfs (1741.5) by 28.8% and the third ranked strategy nurs:rp (1518.0) by 47.8%. In particular, for grep, gawk, expat, readelf and objcopy, cgs covers notably more branches than the second best strategy. For instance, cgs outperforms the second best strategy bfs on objcopy by 1360 branches on average. Moreover, as shown for readelf and objcopy, the rate of branch coverage increase over time is noticeably higher than that of other strategies.
For nasm, nurs:rp is the best strategy; it finally covers 48 more branches than cgs. The standard deviations for sqlite3 are large for bfs and cgs, since several branches control a large number of branches and neither strategy covers them in some runs. In addition, we find that rss, nurs:covnew, nurs:cpicnt and nurs:md2u always stop executing instructions to solve branching constraints, which leaves so little time for instruction execution that both the final branch coverage and the standard deviations are rather small. Concrete branching conditions. Furthermore, to examine whether cgs improves branch coverage by fully covering more concrete branching conditions, we collect the final number of such conditions for each strategy and program after 8h. We list only the mean values in Table 3. Overall, cgs fully covers on average 46.6 of the concrete branching conditions that we instrument before symbolic execution, compared to 31.0 for bfs, an improvement of 15.6 conditions. In particular, for grep, gawk, expat, sed and objcopy, cgs increases the number of fully covered concrete branching conditions by a rather significant margin over the second best strategies. However, for readelf and make, nurs:cpicnt and nurs:rp perform better than cgs, with rather small gaps of 2.8 and 1.3, respectively.
In general, there are two main reasons why overall branch coverage improves when more concrete branching conditions are fully covered. First, these concrete constraints guard new sections of code that contain numerous new branches, both symbolic and concrete. Second, new definitions for variables in branching constraints lead to more covered branches as the new branch constraints become satisfiable. In summary, the findings presented in Table 3 provide evidence of the effectiveness of our method and explain the significant improvement in branch coverage.

Hyper-parameters
In this section, we evaluate the effectiveness and stability of our optimization in Section 3.3 by setting its two hyper-parameters: the fixed number of tracked target branches and the number of executed instructions after which these branches are updated. We evaluate both whether this setting improves branch coverage and how sensitive the results are to different pairs of the two hyper-parameters. Specifically, for the first goal, we set the pair to (0xffff, 0xffffffff), which tracks all target branches and never updates them, so the optimization is effectively off. For the second goal, we use 4x4 combinations of the two hyper-parameters, [5, 10, 30, 100] for the number of tracked branches and [10000, 100000, 300000, 1000000] for the update interval, to measure the sensitivity of the results. We run each setting three times on all programs for 8h and report the average branch coverage. In Figure 7, the y-axis shows the ratio of the average branch coverage with the optimization to that without it. The number of tracked branches is shown on the x-axis, and the update interval is shown at the upper right of Figure 7. The average branch coverage without the optimization reaches 1900 after running for 8h. In brief, the ratio is greater than 1.0 for all pairs of the two hyper-parameters, so our optimization is effective in improving branch coverage. Moreover, the improvements across all settings of the two hyper-parameters are around 15% and very close to each other. Therefore, our optimization is stable, with low sensitivity to the choice of hyper-parameters.
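A minimal sketch of how such a tracking optimization could work: keep at most a fixed number of uncovered target branches, and rebuild the tracked set after every fixed interval of executed instructions. The class and method names below are hypothetical illustrations, not the paper's implementation.

```python
# Hedged sketch of the two-hyper-parameter tracking optimization: bound the
# tracked target branches (first hyper-parameter) and refresh them after a
# fixed number of executed instructions (second hyper-parameter).

class TargetBranchTracker:
    def __init__(self, max_branches=30, update_interval=300_000):
        self.max_branches = max_branches        # number of tracked target branches
        self.update_interval = update_interval  # instructions between updates
        self.executed = 0
        self.tracked = set()

    def on_instruction(self, uncovered_branches):
        """Called once per executed instruction with the current uncovered set."""
        self.executed += 1
        if self.executed % self.update_interval == 0:
            self.refresh(uncovered_branches)

    def refresh(self, uncovered_branches):
        # Track only a bounded subset of the uncovered concrete branches,
        # so per-state scoring stays cheap as the program grows.
        self.tracked = set(list(uncovered_branches)[: self.max_branches])

    def is_tracked(self, branch):
        return branch in self.tracked
```

Setting the pair to (0xffff, 0xffffffff) makes the bound larger than any realistic branch count and the refresh effectively never fire, which corresponds to disabling the optimization as described above.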

Security Violations
In this section, we replay the test cases generated by all strategies on the programs instrumented with UBSan (Undefined Behavior Sanitizer) checks to evaluate their capability of detecting security violations [2]. We only focus on seven types of UBSan checks, including signed and unsigned integer overflow, shift overflow, out-of-bounds array indexing, pointer overflow and null pointer dereference, because these bugs are relatively common in C programs [20]. We replay the test cases generated from all runs for each program and report the best result. If a UBSan error can be triggered by all four strategies, we do not list it. Table 4 shows the program, the type of UBSan error, the error location, and whether the error can be triggered by the test cases generated after 8h by each strategy, marked with a check mark if successful. Overall, of the 7 triggered errors from five programs, cgs succeeds on six, missing only one in sed, while the other strategies succeed on at most three. In particular, cgs finds both UBSan errors in objcopy, which is reasonable since cgs also achieves a notable increase in branch coverage there. In short, owing to its path exploration ability, cgs enables the symbolic execution engine to find more security violations within the same time budget.
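As a concrete illustration of one of the checked classes, consider signed integer overflow, which is undefined behavior in C. The sketch below mirrors, in Python for illustration only, the condition that UBSan's signed-overflow instrumentation detects for 32-bit `int` addition; it is not KLEE's or UBSan's actual implementation.

```python
# Illustration of the condition behind UBSan's signed-integer-overflow check
# for 32-bit `int` addition in C. Python integers are arbitrary precision,
# so the sum below is exact and can be compared against the `int` range.

INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)

def signed_add_overflows(a, b):
    """True if computing a + b as a C `int` would overflow (undefined behavior)."""
    exact = a + b
    return not (INT32_MIN <= exact <= INT32_MAX)
```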
RELATED WORK
Search strategies for symbolic execution. To solve the path explosion problem, various search strategies with different goals, characteristics and capabilities have been proposed to help symbolic execution tools select the most suitable candidate state. HOMI ranks states using the covered branches that contribute to determining the value of a test case, which can effectively increase branch coverage [12]. DiSE leverages static analysis to compute program difference information, exploring execution paths and generating path conditions affected by the differences [32]. To improve the efficiency of bug reproduction, the error traces of static analysis reports have been used to guide symbolic execution [6]. Ferry focuses on program-state-dependent branches and guides symbolic execution using the behaviors of state variables [56]. In contrast, our work focuses on fully covering more concrete branching conditions to improve overall code coverage.
Data dependency guided software testing. Several lines of research suggest that data dependency information is a suitable candidate for describing program states during software testing. DDFuzz uses data dependency graphs as a new code coverage metric to guide existing fuzzing tools such as AFL++ [16,31]. DGSE relies on the analysis of data dependencies to determine whether different visits to a statement instance produce the same symbolic value during symbolic execution [43]. GREYONE proposes fuzzing-driven taint inference to identify tainted variables, which are used to mutate input bytes during fuzzing [18]. In contrast, our work prioritizes states during symbolic execution based on reliable inter-procedural data dependencies of the variables used in concrete constraints, in order to increase overall code coverage. PATA identifies the variables used in branching constraints and constructs representative variable sequences for input mutation in fuzzing [30].

FUTURE WORK
In this paper, we identify an open issue in our design regarding the handling of multiple source variables for a single variable in one branching constraint. Generally, when a branching constraint's outcome depends on more than one source variable, it is unreliable to prioritize a state based on the definition of only one of them. To tackle this issue, an alternative approach is to synthesize, before symbolic execution, a symbolic function that connects the branching constraint variable with all of its source variables; these synthesized functions are then loaded when symbolic execution starts. When the involved branches are encountered, the source variables are concretized to compute the value of the variable in the branching constraint, which is subsequently utilized by our search strategy for path prioritization.
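This direction can be sketched as follows: a pre-computed expression tree relates the constraint variable to all of its source variables, and evaluating it on concretized source values yields the constraint variable's value. The expression-tree encoding and function names below are hypothetical illustrations, not a proposed implementation.

```python
# Hedged sketch of synthesizing a function that maps concretized source
# variables to the value of a branching-constraint variable.

def synthesize_constraint_fn(expr_tree):
    """Build a callable from an expression tree relating source variables
    to the constraint variable."""
    def fn(env):
        return eval_expr(expr_tree, env)
    return fn

def eval_expr(node, env):
    op, *args = node
    if op == "var":
        return env[args[0]]          # look up a concretized source variable
    if op == "const":
        return args[0]
    left, right = (eval_expr(a, env) for a in args)
    return {"+": left + right, "-": left - right, "*": left * right}[op]

# Example: suppose x = a * 2 + b; once concrete values for the source
# variables a and b are known, the branch condition `x < 10` can be checked.
fn = synthesize_constraint_fn(("+", ("*", ("var", "a"), ("const", 2)), ("var", "b")))
satisfies = fn({"a": 3, "b": 1}) < 10  # 3 * 2 + 1 = 7, and 7 < 10
```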

CONCLUSION
In this work, we introduce Concrete Constraint Guided Symbolic Execution, a novel approach to effectively improving overall code coverage. The core idea of our approach is to extract concrete constraints from partially covered concrete branching conditions and leverage inter-procedural data dependency to improve symbolic execution. By investigating whether a state carries variable definitions that satisfy the constraints of uncovered concrete branch conditions, we achieve better overall branch coverage. The experimental results demonstrate the effectiveness of our method in increasing both overall code coverage and the number of detected security violations during symbolic execution on 10 real-world C programs.

Figure 1 :
Figure 1: Function call chains for parsing inputs in grep 3.6.

Figure 2 :
Figure 2: Number of branching conditions in different types.

Figure 3 :
Figure 3: Coverage of different branching conditions.

Figure 5 :
Figure 5: Workflow of the concrete constraint guided searcher. The executed branches are in red. The uncovered and covered definitions are in light and dark green, respectively. The dashed line denotes that the branching constraint cannot be satisfied.

Figure 6 :
Figure 6: Branch coverage for 10 real-world programs by running KLEE with different strategies for 8h. Mean values and 95% standard deviations over 8 runs are shown as solid lines and shaded areas, respectively.

Figure 7 :
Figure 7: Branch coverage improvement with optimizations by setting different pairs of hyper-parameters.

Table 1 :
The details of the benchmarks used for evaluation. Branches are collected using KLEE's internal coverage. The KLEE format of symbolic inputs is set based on the common usage of the programs.

Table 2 :
Statistics of the branch dependency analysis for 10 real-world programs. The branch and store instructions are the usage and definition sites of inter-procedural data dependency, respectively. For the types of source variables of branches: GV=Global Variable, LV=Local Variable, NPP=Non-Pointer Parameter, SPP=Struct-typed Pointer Parameter.
*The branches also contain switch instructions.

Table 3 :
The number of fully covered concrete branching conditions for 10 real-world programs by running KLEE with different strategies for 8h. Only mean values are listed.

Table 4 :
UBSan errors triggered by the test cases generated using the top-4 strategies on a subset of the programs. UIO=Unsigned Integer Overflow, NPD=Null Pointer Dereference.