Marco: A Stochastic Asynchronous Concolic Explorer

Concolic execution is a powerful program analysis technique for code path exploration. Despite recent advances that greatly improved the efficiency of concolic execution engines, path constraint solving remains a major bottleneck of concolic testing. An intelligent scheduler for inputs/branches becomes even more crucial. Our studies show that the previously under-studied branch-flipping policy adopted by state-of-the-art concolic execution engines has several limitations. We propose to assess each branch by its potential for new code coverage from a global view, taking into account the path divergence probability at each branch. To validate this idea, we implemented a prototype, Marco, and evaluated it against the state-of-the-art concolic executor on 30 real-world programs from Google's Fuzzbench, Binutils, and UniBench. The results show that Marco can outperform the baseline approach and make continuous progress after the baseline approach terminates.


INTRODUCTION
Concolic execution (CE), which conducts concrete and symbolic execution of the program under test (PUT) simultaneously, is a program testing technique used for code exploration and vulnerability detection. Unlike dynamic symbolic execution (DSE), which explores the program space by using symbolic inputs [7], concolic execution is performed with a concrete input exercising a concolic path consisting of branches that are dependent on a subset of input bytes. Each input-dependent branch in the concolic path has two directions: 1) a visited direction that is traversed by the concrete execution; 2) an unvisited direction that can potentially lead to a new path. Concolic execution effectively explores the program space by generating new inputs that traverse these unvisited directions of input-dependent branches.
Although powerful, concolic execution is known to be costly. As a result, many efforts have been made to improve the runtime efficiency of CE over the past few years, in terms of both constraint collection and constraint solving [11,12,35,36,50]. In contrast, another essential component in concolic execution, the branch-flipping policy, has not yet received enough attention. A branch-flipping policy dictates which symbolic branch needs to be flipped to generate a new testcase traversing the flipped branch. State-of-the-art (SOTA) CE engines employ a very restrictive branch-flipping policy: they flip only a very small fraction of symbolic branches, those deemed most likely to reach new code coverage, in order to suppress the testcase/path explosion problem, where the number of generated testcases quickly surpasses CE's processing capacity.
In this paper, we conduct the first study on this branch-flipping policy and make a few unique observations. On the one hand, we show that this policy is too strict and misses many good branches that could lead to much higher code coverage. On the other hand, this policy is not as effective as expected, since only a small fraction (on average 27%) of the branches selected by this policy actually lead to new code coverage (see §2.3). Consequently, we observe that this rigid and nearsighted branch-flipping policy significantly undermines the effectiveness of CE.
Moreover, we show that the rate of path divergence [2] (i.e., the testcase generated by CE does not follow the expected path) can be as high as 50% in practice, and that path divergence is the norm rather than the exception due to the imperfect design and implementation of CE. Therefore, we argue that a good branch-flipping policy needs to model the path divergence on each symbolic branch when selecting the next branch to flip.
To overcome the limitations, we propose a global-view, new-coverage-directed branch scheduling algorithm for concolic execution. To find out which symbolic branch is the best to flip, we estimate the potential of each symbolic branch (i.e., how likely we can reach new code coverage by flipping this branch) and select the branch with the highest potential. Specifically, we model the concolic execution as a Markov process: each branch transition is a probabilistic event, and an execution path is a sequence of branch transitions, and thus a sequence of probabilistic events. To obtain a global view of all testcases, we observe the executions of all testcases and construct a stochastic Concolic State Transition Graph (CSTG) to characterize transition probabilities between states and estimate the probability of a given branch reaching any unvisited states. We refer to this probability as the reachability score. This reachability score is further dampened by the path divergence rate observed on this branch.
To select the best branch (i.e., the one with the highest reachability score) to flip, our branch selection must be asynchronous. When encountering a symbolic branch, the existing CE engines decide synchronously whether to flip it based on the historical information collected up to this point. This decision, however, might not be globally optimal because a seemingly good branch to flip might have already been traversed by the remaining execution of the current testcase or the remaining testcases that have not been processed yet. Therefore, we propose to process all testcases to maintain an up-to-date global view (in the form of CSTG) and then asynchronously select the best branch to flip. To do so, we develop an efficient concolic state saving and restoring mechanism. We save the symbolic expression table and branch dependency information for quick reloading after the highest potential state is identified.
To evaluate the efficacy of this idea, we implement a prototype called Marco, atop SymSan [11]. We evaluated Marco on 16 real-world programs and 71 programs from the DARPA Cyber Grand Challenge (CGC) binary set to demonstrate that Marco, on average, increases edge coverage by 13.03%. For 3 out of the 11 programs where Marco finds more coverage, it also covers all edge coverage found by SymSan. We further evaluate its bug detection efficiency on 14 programs from Unifuzz [29]. The results show that our approach can find 33.52% more unique bugs than the SOTA CE engine SymSan. Furthermore, Marco uniquely finds more than twice as many bugs as SymSan does. On 5 of the tested programs, Marco finds more unique bugs in 12h than any of the seven fuzzers evaluated in UniFuzz (excluding QSYM, which is configured as a hybrid fuzzer) can find in 24h experimental runs.
The contributions of this paper are summarized as follows:
• We evaluate the state-of-the-art branch-flipping policy and reveal several important yet unreported limitations.

BACKGROUND AND MOTIVATIONS
In this section, we provide the background knowledge about symbolic/concolic execution and existing branch-flipping policies and further motivate our work by stating four key observations.

Symbolic Execution
Symbolic Execution (SE) is an automated program testing technique that aims to maximize code coverage by generating specific inputs to satisfy every condition check that is dependent on the input within the program under test. With SE, the program is executed with symbolic expressions instead of concrete values. An SE engine maintains 1) the mapping between program variables and symbolic expressions, and 2) a set of path predicates imposed by the sequence of branches visited along the execution path. Two types of SE are extensively researched: 1) online symbolic execution and 2) concolic execution. Online symbolic execution engines, such as KLEE [7] and S2E [15], explore the program space via state forking: when encountering a branch point (whose direction is dependent on the input), an SE engine will fork a new state to explore the opposite branch direction (if it is feasible). As a result, the number of states grows exponentially, leading to the state explosion problem. To tackle the state explosion problem, some recent works [25,31] resorted to machine learning. Legion [31] leverages Monte Carlo Tree Search (MCTS) to model the state exploration as a sequential decision-making process on the tree-structured program space. Symbolic execution is performed lazily and only when a state is deemed promising. Learch [25] trains a regression model on a set of training programs to learn the state selection policy based on a set of state-describing features. Then, the trained model is used to test unseen target programs.
In contrast, concolic execution (CE) explores the program space iteratively. Given an input, a CE engine (e.g., QSYM [50], SymCC [35], and SymSan [11]) executes the program concretely and simultaneously collects symbolic constraints along its concrete execution path. When a symbolic branch point (whose direction depends on the input) is encountered, the CE engine collects the constraint of the current branch condition to dictate which branch direction is taken by the concrete execution. Additionally, based on a branch-flipping policy, the CE engine may decide to generate a new input that can traverse the untaken branch direction. To do so, the CE engine constructs a constraint set that includes the negated current branch condition and a number of preceding branch conditions and queries an SMT (Satisfiability Modulo Theories) solver for a solution. Then a new input is generated by replacing parts of the original input with the values suggested by the solution. After the CE engine finishes processing the current input, it will pick and process one of the newly generated inputs. Obviously, the branch-flipping policy is essential for concolic execution.
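The iterative flip-and-solve step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the behavior of any particular engine: the two-branch toy program, the names `program_branches` and `flip`, and the brute-force search that stands in for an SMT solver are all hypothetical.

```python
from itertools import product

def program_branches(x, y):
    """Toy program under test (hypothetical). Returns the symbolic branches
    it meets as (predicate, direction_taken) pairs, in execution order."""
    trace = []
    c1 = lambda x, y: 2 * y == x
    trace.append((c1, c1(x, y)))
    if c1(x, y):
        c2 = lambda x, y: x > y + 10
        trace.append((c2, c2(x, y)))
    return trace

def flip(trace, index, domain=range(-32, 33)):
    """Build 'preceding conditions hold AND branch[index] is negated' and
    search for an input that satisfies it. The exhaustive search over a
    small domain is a stand-in for an SMT solver (assumption)."""
    wanted = list(trace[:index])
    pred, taken = trace[index]
    wanted.append((pred, not taken))
    for x, y in product(domain, repeat=2):
        if all(p(x, y) == d for p, d in wanted):
            return (x, y)
    return None

# One concolic iteration: trace a seed, flip its only branch, trace the result.
seed_trace = program_branches(0, 1)
new_input = flip(seed_trace, 0)
next_trace = program_branches(*new_input)
```

Each newly generated input exposes further branches to flip, which is why the choice of which branch to flip next dominates the cost of the whole loop.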

Branch-Flipping Policies
The most naïve branch-flipping policy would be "flip all". As the name suggests, this policy tries to flip all possible branches. This policy can ultimately achieve the highest code coverage, given unlimited computing resources and time. No one has ever adopted this policy because computing resources and time are never unlimited, and many branches are either redundant or unworthy to be flipped.
A more realistic branch-flipping policy is to flip every branch executed through a unique execution path prefix. More specifically, this path prefix consists of a list of symbolic branches along the execution path, while concrete branches are ignored. For this reason, we refer to this policy as the "PP policy". However, even with this policy, the number of branches to be flipped can still be enormous. This is because a program often contains loops and function calls, and one branch that appears in different loop iterations and different calling contexts will be flipped repeatedly due to its unique path prefix in each loop iteration and each calling context. Yun et al. observed that constraints repetitively generated by the same code are useless for finding new code coverage in real-world software [50].
Based on this observation, existing state-of-the-art CE engines (e.g., QSYM [50], SymCC [35], and SymSan [11]) adopt a more restrictive branch-flipping policy, which was first introduced in QSYM. This policy looks at branch bigrams. It will flip the current branch if its bigram (i.e., the pair of the previous symbolic branch and the current one) is new. We refer to this policy as the "BR policy" because it focuses on branches rather than paths. Compared to the PP policy, the BR policy will flip significantly fewer branches because only the last symbolic branch is included in the "context" of the current branch instead of all preceding symbolic branches.
Some CE engines explore branch selection heuristics with respect to branch locality. In particular, SAGE [22] executes inputs in descending order of their code coverage and only flips branches that are located below the point where the current execution trace branches off from its parent trace to avoid redundant exploration. To efficiently explore uncovered branches, CREST [6] proposes a control flow graph (CFG) directed search algorithm that prioritizes branches in close proximity to uncovered branches, using the statically constructed control flow graph and call graph. Specifically, each branch is evaluated by a scalar value obtained by adding up 1) the length of the shortest path to its nearest uncovered branch and 2) the number of flipping attempts devoted to it. CREST then flips branches in ascending order of this scalar value.
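To make the difference between the PP and BR policies concrete, the following Python sketch counts how many branches each policy would flip on a single path containing a loop. It is a simplified illustration: the function names and the string labels for branches are hypothetical, and real engines track prefixes and bigrams with hashes rather than Python tuples.

```python
def pp_policy(paths):
    """Flip a branch when the prefix of symbolic branches leading to it is new."""
    seen, flips = set(), []
    for path in paths:
        for i in range(len(path)):
            prefix = tuple(path[: i + 1])
            if prefix not in seen:
                seen.add(prefix)
                flips.append(path[i])
    return flips

def br_policy(paths):
    """Flip a branch when the (previous branch, current branch) bigram is new."""
    seen, flips = set(), []
    for path in paths:
        prev = None
        for br in path:
            if (prev, br) not in seen:
                seen.add((prev, br))
                flips.append(br)
            prev = br
    return flips

# One loop branch taken in four consecutive iterations, then a loop exit.
path = ["a", "loopT", "loopT", "loopT", "loopT", "exitF"]
pp_flips = pp_policy([path])
br_flips = br_policy([path])
```

On this path the PP policy flips every occurrence, because each one has a distinct prefix, whereas the BR policy stops flipping the loop branch once the bigram (loopT, loopT) has been seen.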

Motivation
To understand the effectiveness of different branch-flipping policies, we conducted a measurement study. We equipped SymCC [35], one of the SOTA CE engines, with both PP and BR policies. We selected four programs in binutils (objdump, size, nm-new, and readelf) and assembled an input corpus of 1000 seeds for each of them. We made the following four observations.
(1) The BR policy is so strict that it filters out many promising branches. We investigate whether the branches discarded by the BR policy are indeed useless for reaching higher code coverage. To answer this question, for each program, we compared the code coverage after SymCC processed the same input corpus and attempted to flip the branches according to the BR and PP policies, respectively. To simplify the evaluation, we did not allow SymCC to further process testcases generated from the initial input corpus. Table 1 lists the results. We can see that for all four programs, the PP policy achieved higher code coverage than the BR policy. For readelf, the PP policy achieved 43% higher code coverage. This evaluation shows that the BR policy can miss promising branches that could lead to higher code coverage (edge coverage is measured with the SanitizerCoverage tool).
(2) The strict BR policy often leads to early termination. Observation 1 tells us that the BR policy filters out promising branches that directly lead to new code coverage. A promising branch may also lead to new code coverage indirectly, after several generations of testcases that are derived from a testcase traversing this branch. Ideally, a good branch-flipping policy would recognize this kind of branch and make continuous progress by iteratively processing newly generated testcases. Therefore, we would like to see how well a CE engine performs when it continuously processes newly generated testcases. Table 2 presents the results of this continuous exploration under the BR policy. We can see that the CE engine terminated within five hours for all four programs (all experiments in this paper are measured in wall time), because it exhausted its attempts to flip all available branches in all initial inputs and generated testcases. Moreover, the final code coverage accounts for only 6.53% to 13.90% of the total coverage.
(3) Branches selected by the BR policy are of low quality. As described above, the BR policy uses branch bigrams to select promising branches to flip. We would expect that most of these selected branches could lead to new code coverage. Table 3 illustrates our findings on the quality of the new inputs generated from these selected branches. In fact, on average, only 27.46% of the inputs generated from the branches selected by the BR policy lead to new coverage. In other words, the majority (72.54%) of the flipping and solving efforts do not immediately translate into code coverage gain. One major reason why the BR policy cannot select high-quality branches is that the branch-flipping decision is made based only on the testcases that have been previously processed and the part of the current testcase that has been processed up to this point. It does not have a chance to examine the remaining execution of the current testcase or the remaining testcases to make a globally optimal decision.
(4) The path divergence (PD) rate of concolic execution is so high that many constraint-solving efforts are wasted. We observe that oftentimes, a generated testcase does not traverse the intended unvisited path. This problem is referred to as the path divergence problem [22]. Table 4 lists our findings with respect to path divergence. We can see that path divergence is very common (as high as almost 50% for size, and on average 28.72%).
We also observe that the path divergence issue is program-specific and branch-specific. Many branches do not have path divergence at all, while other branches constantly lead to path divergence. Unfortunately, current branch-flipping policies do not take this into account, leading to the low performance of CE.
Based on the observations, we are motivated to design a new concolic execution scheme that can overcome the aforementioned limitations for more efficient testing.

DESIGN AND IMPLEMENTATION
In this section, we introduce Marco, a novel stochastic and asynchronous concolic explorer. Specifically, to tackle the first two limitations, unlike SymSan, Marco keeps all path constraints from a unique path prefix and incorporates extra information, including calling context and branch direction, into the branch definition to retain more meaningful branches. To address the third limitation, our system implements a reachability-guided branch scheduler that can accurately assess the potential of finding new code coverage for every branch. The scheduler then conducts asynchronous solving to make sure our decisions are globally optimal. Furthermore, to overcome the PD problem, Marco models the PD rate for each branch and takes it into consideration when making scheduling decisions.

Approach Overview
As shown in Figure 1, Marco comprises three major components: 1) the asynchronous concolic execution engine, 2) the CSTG constructor, and 3) the reachability-guided branch scheduler. At the beginning of the testing process, the asynchronous concolic execution engine receives an initial seed input and a binary program as input. It then performs concrete and symbolic execution simultaneously, without any constraint solving, to collect concolic traces. These traces comprise the symbolic branches encountered, along with the path constraint information needed for branch flipping.
The resulting trace is then passed to the CSTG constructor, which incrementally constructs a CSTG using branch points and branches as nodes, and branch point-to-branch transitions as well as branch-to-branch point transitions as edges. The reachability-guided branch scheduler assesses the potential of each node in the CSTG, calculates a reachability score for each node, and ranks the nodes by score. The highest-ranked node has the greatest potential to lead to new code coverage. The asynchronous CE engine is then invoked to solve a path constraint from the top-ranked node to generate a new testcase. Marco then executes the testcase to collect its trace and repeats the process.

A Running Example
To better explain our design, we will use an example program from [9], illustrated in Listing 1. The program takes two symbolic inputs, x and y, as input parameters for the function testme(). This function contains two symbolic branches located at Lines 7 and 8, respectively. The directions taken at these two branches depend on the values of the symbolic inputs.

Asynchronous Concolic Execution Engine
Unlike synchronous CE engines [11,35,36,50] that perform symbolic tracing and branch flipping simultaneously, Marco takes an asynchronous approach. Specifically, it decouples the branch-flipping logic (which includes path condition collection and new testcase generation) from the symbolic tracing logic and defers it until all uncovered branch points have been assessed and a globally optimal branch choice has been made. It is worth mentioning that although some prior works (e.g., SAGE and CREST) collect execution traces and then replay them offline for branch flipping, the branch selection is made while processing the current trace. In other words, their branch-flipping policy adopts only a local view, whereas Marco evaluates all branches to make a globally optimal selection.
Specifically, the asynchronous CE engine alternates between two modes: 1) the symbolic tracing mode, where it executes the target program with existing testcases to collect data for educated branch prioritization, and 2) the path exploration mode, where it flips a selected branch to find a new path.

Symbolic Tracing Mode.
In this mode, the CE engine takes one Program Under Test (PUT) and one testcase as input and produces a concolic path and an AST table. Since our implementation is based on SymSan [11], the AST table stores all the necessary information for reconstructing symbolic expressions.
A concolic path consists of a list of symbolic branches that follow the execution path and some auxiliary information. Below are related definitions:
Branch Point. A branch point $bp$ is defined as:
$bp = \{addr,\ ctx\}$ (1)
where $addr$ is the address of the branching instruction, and $ctx$ represents the calling context, which is calculated as a hash of all the call sites on the call stack. The context-sensitive definition of a branch point allows Marco to differentiate a branching instruction under different calling contexts and characterize the program exploration status more accurately.
For the running example, we have two branch points {L7, main → testme} and {L8, main → testme} at Line 7 and 8 respectively.For brevity, we refer to them as L7 and L8 in the following discussion.
Branch. A branch $br$ is defined as:
$br = \{bp,\ dir\}$ (2)
where $bp$ is a branch point defined by Definition (1), and $dir$ is the direction taken from the branch point. Each branch point has two branches. We use $T$ and $F$ to denote the "then" and "else" branches, respectively. In the running example, we have four branches denoted as L7T, L7F, L8T, and L8F.
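The two definitions above map naturally onto small immutable records. The Python sketch below is illustrative only: the class names, the use of SHA-1 for the context hash, and the example addresses are assumptions, not details taken from Marco's implementation.

```python
from dataclasses import dataclass
import hashlib

def context_hash(call_sites):
    """Hash all call sites on the call stack into one context value.
    The choice of SHA-1 and the 16-character truncation are illustrative."""
    h = hashlib.sha1()
    for site in call_sites:
        h.update(site.to_bytes(8, "little"))
    return h.hexdigest()[:16]

@dataclass(frozen=True)
class BranchPoint:
    addr: int   # address of the branching instruction
    ctx: str    # hash of the calling context

@dataclass(frozen=True)
class Branch:
    bp: BranchPoint
    direction: bool  # True = "then" (T), False = "else" (F)

# The same instruction reached through two different call stacks
# (hypothetical addresses) yields two distinct branch points.
bp_a = BranchPoint(0x4007, context_hash([0x1000, 0x2000]))
bp_b = BranchPoint(0x4007, context_hash([0x1000, 0x3000]))
```

Making the records frozen keeps them hashable, so they can serve directly as keys in the maps and sets that track exploration status.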
For each symbolic branch, we need to collect essential information about its path constraints. In traditional symbolic execution, the path constraints include all preceding symbolic branches. However, this strategy is often overly strict: generating a new input that follows the exact same path and visits the untaken branch is often impossible [15]. Nevertheless, there may exist a new input that follows a slightly different path and successfully visits the desired branch. EXE [8] presents a constraint independence optimization, which divides path constraints into subsets that depend on disjoint sets of input bytes and solves them separately. This idea is then adopted by QSYM [50] and SymSan [11] for concolic execution. Specifically, when negating a branch, SymSan includes any preceding branch that shares data-flow dependencies with the current branch or another preceding branch already included. The resulting set of branches is referred to as nested branches in SymSan. Since we perform concolic execution asynchronously, we prefer not to collect branch constraints right away. Instead, we just record their nested branches.
Nested Branch Set. We define the Nested Branch Set $NBS(br)$ for a branch $br$ as a set of branches in a recursive manner: if a branch $br_i$ has a data-flow dependency with the target branch $br$, then $br_i \in NBS(br)$; and if $\exists\, br_j \in NBS(br)$ and $br_k$ has a data-flow dependency with $br_j$, then $br_k \in NBS(br)$.
Concolic Path. A Concolic Path $CP$ is defined as a list of 2-tuples:
$CP = [(br_1, NBS(br_1)), (br_2, NBS(br_2)), \ldots, (br_n, NBS(br_n))]$ (3)
where $br_i$ is the $i$-th symbolic branch encountered in the execution trace, and $NBS(br_i)$ is the nested branch set of $br_i$.
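The recursive definition of a nested branch set amounts to a transitive closure over data-flow dependencies. The following Python sketch computes it under one simplifying assumption: two branches are treated as data-flow dependent when they read at least one common input byte. The function name and the `deps` map are hypothetical.

```python
def nested_branch_set(target, preceding, deps):
    """Return the nested branch set of `target`.

    preceding: branches met earlier on the path, in execution order
    deps:      maps each branch to the set of input byte offsets it reads
    Dependency is approximated by shared input bytes (assumption)."""
    nbs = set()
    bytes_in_scope = set(deps[target])
    changed = True
    while changed:              # repeat until no new branch is pulled in
        changed = False
        for br in preceding:
            if br not in nbs and deps[br] & bytes_in_scope:
                nbs.add(br)
                bytes_in_scope |= deps[br]
                changed = True
    return nbs

deps = {"b1": {0, 1}, "b2": {1, 2}, "b3": {5}, "b4": {2}}
result = nested_branch_set("b4", ["b1", "b2", "b3"], deps)
```

Here `b2` is included because it shares byte 2 with the target, `b1` is included transitively through byte 1, and `b3` is left out because it touches unrelated input.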
Loop Pruning Optimization. Furthermore, we employ an optimization to speed up the concolic path collection. Real-world programs often have many loops. Symbolic branches in loops will repeatedly appear in the concolic paths. It takes time to collect their nested branch sets, even though it is much faster than collecting the nested branch constraints. It is also unlikely that all these sets will be iterated through to generate new testcases in the later stage. Therefore, we decide to prune the nested branch sets early on. More specifically, during the execution, we track the visit count of each encountered symbolic branch. For a branch whose visit count is not a power of 2, we do not generate its NBS. In other words, its NBS is ∅.
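A minimal sketch of the power-of-two pruning rule is shown below. The helper names are hypothetical, and the sketch assumes that a visit count of 1 (that is, 2 to the power 0) counts as a power of 2, so the first occurrence of a branch always keeps its NBS.

```python
from collections import Counter

def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... using the single-set-bit test."""
    return n > 0 and (n & (n - 1)) == 0

def occurrences_with_nbs(path):
    """Return (position, branch) for the occurrences whose NBS is collected;
    every other occurrence gets an empty NBS."""
    visits = Counter()
    kept = []
    for pos, br in enumerate(path):
        visits[br] += 1
        if is_power_of_two(visits[br]):
            kept.append((pos, br))
    return kept

# A loop branch visited ten times keeps its NBS on visits 1, 2, 4, and 8 only.
kept = occurrences_with_nbs(["loop"] * 10)
```

The number of retained occurrences grows logarithmically with the iteration count, which bounds the bookkeeping cost for long-running loops.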
In summary, in symbolic tracing mode, the CE engine traces all the testcases in the queue to collect the concolic paths and AST tables. Then it switches into path exploration mode.

Path Exploration Mode.
In this mode, the CE engine invokes the reachability-guided branch scheduler (discussed in §3.5) to find a globally optimal branch choice. With the constraint data of the chosen branch, the CE engine will assemble the path constraint set for traversing this branch and solve it to generate a new testcase.
The constraint data of the chosen branch consists of 1) $br$, the chosen branch; 2) $NBS(br)$, the nested branch set of $br$; and 3) the AST table of the execution where the chosen branch is encountered. We start by initializing $PC$, the path constraint set, as ∅. Then we query the AST table for the branch predicate of $br$ and add it into $PC$. If $NBS(br)$ is not empty, we query the AST table for the branch predicate of each branch in $NBS(br)$ and add them into $PC$. Then we reuse the solving strategy proposed in QSYM [50]. Specifically, if $PC$ is not satisfiable, we resort to optimistic solving, which will only solve the branch predicate of the target branch and disregard any predicates collected from $NBS(br)$. If optimistic solving is not viable either, the branch scheduler will be prompted again to generate another set of constraint data until a new seed is generated.
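The two-stage solving strategy described above can be sketched as follows. This is an illustration under stated assumptions: predicates are plain Python callables over a single byte, and `brute_force_solver` is a hypothetical stand-in for an SMT solver; neither reflects the actual AST-table interface.

```python
def brute_force_solver(preds, domain=range(256)):
    """Stand-in for an SMT solver: first value satisfying all predicates."""
    for value in domain:
        if all(p(value) for p in preds):
            return value
    return None

def solve_for_branch(target_pred, nested_preds, solver=brute_force_solver):
    """Try the full path constraint set first; if it is unsatisfiable,
    fall back to optimistic solving on the target predicate alone."""
    pc = [target_pred] + list(nested_preds)
    model = solver(pc)
    if model is not None:
        return model, "full"
    model = solver([target_pred])
    if model is not None:
        return model, "optimistic"
    return None, "unsat"   # the scheduler would be asked for another branch

full = solve_for_branch(lambda v: v > 10, [lambda v: v % 2 == 0])
optimistic = solve_for_branch(lambda v: v > 200, [lambda v: v < 100])
failed = solve_for_branch(lambda v: v > 300, [])
```

An optimistic solution ignores the nested predicates, so the resulting testcase is more likely to diverge from the intended path, which is one reason a per-branch success rate is worth tracking.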

CSTG Constructor
After collecting concolic paths in the asynchronous CE engine, we seek to construct the CSTG, which in turn enables the reachability-guided branch scheduler. The graph is a directed heterogeneous graph defined as follows.

Concolic State Transition Graph (CSTG).
A CSTG is defined over a set of concolic paths (CPs) as:
$G = (V, E)$ (4)
where $V$ is a set of branch points and branches, and $E$ is a set of vertex transitions. In addition, a virtual root vertex $R$ denotes the program entry point. Each vertex $v \in V$ is associated with a set of attributes including: 1) $v.visit$: the number of concrete visits at $v$; 2) $v.attempt$: the number of branch flipping attempts at $v$; 3) $v.win$: the number of successful branch flipping attempts at $v$; and 4) $v.queue$: the queue of path constraint sets for generating new testcases that will potentially traverse $v$. Note that attributes 2) to 4) only apply to vertices representing branches, not branch points. Each edge $e \in E$ is associated with a concrete visit count $e.visit$.
Algorithm 1 illustrates how Marco constructs the CSTG incrementally. Initially, the graph contains one root node $R$ as the program entry, and the edge set is empty. The procedure takes as input the graph $G$ and a new concolic path $CP$ as defined in (3).
Algorithm 1 (CSTG construction; only part of the listing is recoverable):
    (addr, ctx, dir) = CP.pop()
    if !G.findNode(addr, ctx) then
        bp = G.newNode(addr, ctx)
        br_0 = G.newNode(addr, ctx, dir)
        br_1 = G.newNode(addr, ctx, !dir)
        G.newEdge(<lastnode, bp>, <bp, br_0>, <bp, br_1>)
    ...
        br_0 = G.getNode(addr, ctx, dir)
    ...
    end while
    end procedure
The algorithm first performs a preprocessing step to retain a set of visited branches along with their concrete path prefixes and to remove from $CP$ any branch that is visited through an already observed path prefix. When a new branch (according to Definition 2) is observed for the first time, we insert three nodes (one for its branch point $bp$ and two for the taken and untaken branches $br_0$ and $br_1$) and three edges into the current graph (Ln. 6-10). If the branch has been observed before and thus already exists in the graph, we simply retrieve the existing three nodes (one branch point and two branches) from the graph (Ln. 12-14). Then, the algorithm updates the nodes' attributes defined in §3.4 as needed (Ln. 16-17). Specifically, for $bp$ and $br_0$, we update the visit count $v.visit$. For $br_1$, we update the path constraint queue $br_1.queue$ to include the new path constraint collected from the current execution path, which can potentially force execution down $br_1$. Further, if the currently taken branch $br_0$ is the node that the last scheduling round picked for branch flipping, the testcase generated in that round (i.e., the current testcase) indeed traverses the selected branch. In this case, the win count $br_0.win$ is increased by one.
Initially, the graph contains only a root node R, which denotes the program entry. For the first concolic path {L7F}, since node L7F is a new node, we add three nodes, i.e., L7, L7F, and L7T, and three edges, i.e., (R, L7), (L7, L7F), and (L7, L7T), into the graph. We increase the visit counts of nodes L7 and L7F and of edges (R, L7) and (L7, L7F) from 0 to 1. After processing this concolic path, the graph is as presented in Figure 2(a). Similarly, Marco then takes the second concolic path {L7T, L8F} as input. As the first branch L7T already exists in the graph, we increase the visit counts of nodes L7 and L7T and of edge (L7, L7T) by 1. However, the second branch L8F is not an existing node in the graph. Therefore, we insert three nodes, i.e., L8, L8F, and L8T, and three edges, (L7T, L8), (L8, L8F), and (L8, L8T), into the graph, and we update the visit counts accordingly. After processing this concolic path, the graph is as shown in Figure 2(b). We make similar changes for the third concolic path {L7T, L8T}. Hence, after processing the three concolic paths, Figure 2(c) shows the final CSTG, which will then be used for branch scheduling.
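The walk-through above can be replayed with a small Python sketch. It is a simplification rather than Algorithm 1 itself: nodes are plain strings, the calling context is omitted, and the class and attribute names are hypothetical. The sketch also assumes that the edge from the previous node to the branch point is added (and counted) on every traversal.

```python
from collections import defaultdict

class CSTG:
    """Toy CSTG: nodes are strings such as "L7" (branch point) or "L7T" (branch)."""

    def __init__(self):
        self.nodes = {"R"}                   # virtual root = program entry
        self.edges = set()
        self.node_visit = defaultdict(int)   # concrete visit count per node
        self.edge_visit = defaultdict(int)   # concrete visit count per edge

    def add_path(self, path):
        """path: list of (branch_point, direction), direction in {"T", "F"}."""
        last = "R"
        for bp, direction in path:
            taken = bp + direction
            untaken = bp + ("F" if direction == "T" else "T")
            if bp not in self.nodes:
                # First sighting: one branch-point node and two branch nodes.
                self.nodes.update({bp, taken, untaken})
                self.edges.update({(bp, taken), (bp, untaken)})
            self.edges.add((last, bp))
            for node in (bp, taken):
                self.node_visit[node] += 1
            for edge in ((last, bp), (bp, taken)):
                self.edge_visit[edge] += 1
            last = taken

g = CSTG()
for p in ([("L7", "F")], [("L7", "T"), ("L8", "F")], [("L7", "T"), ("L8", "T")]):
    g.add_path(p)
```

After the three paths, the graph holds the root plus six nodes and six edges, with L7 visited three times and L8 twice.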

Reachability-guided Branch Scheduler
The reachability-guided branch scheduler aims to find the branch that bears the highest potential for new code coverage and gives the path constraint data of the top-ranked node to the asynchronous CE engine for input generation.
Essentially, we assess the potential of a branch by the number of reachable yet unvisited branches deeper in the execution paths that traverse the branch. To do so, we 1) generate a reward score for each untaken branch, 2) treat a concolic trace as a Markov chain and compute transition probabilities that take the path divergence (PD) rate into consideration, and 3) accumulate the rewards to calculate a node reachability score that estimates the potential of every branch in the CSTG, so that the best one can be picked for further exploration. However, one technical challenge is that the transition probabilities between nodes and the estimated rewards of nodes are unknown at the beginning of the testing. In this section, we discuss how Marco tackles this challenge.
Edge Transition Probability Calculation. The transition probability of an edge captures how likely an execution will branch to the end node from the start node. As discussed in §3.4, there are two types of edges in Marco: 1) the branch point-to-branch ($bp$-to-$br$) edges and 2) the branch-to-branch point ($br$-to-$bp$) edges. We calculate their transition probabilities differently.
The transition probability of a $bp$-to-$br$ edge is associated with the success rate of generating a testcase traversing $br$ by solving a path constraint associated with $br$, i.e., the complement of the path divergence rate of this edge. The total solving attempt count at $br$ is $br.attempt$, and the success count is $br.win$. Intuitively, the estimated transition probability is $br.win / br.attempt$. The estimation is relatively accurate when the attempt count at $br$ is high enough. But this assumption does not always hold, especially at the early stage of testing and for the less-explored code regions. For better estimation of the transition probability and to balance between exploration and exploitation, we resort to Thompson Sampling (TS) [39]. The key idea of TS is to sample the success rate of an action from the Beta distribution defined by the outcomes of the past trials. The Beta distribution is defined by two positive parameters: $\alpha$, denoting the win count, and $\beta$, denoting the loss count. It becomes more and more concentrated around the empirical success rate $\alpha/(\alpha + \beta)$ as the number of total trials $(\alpha + \beta)$ grows. For a $bp$-to-$br$ edge, $\alpha$ is $br.win$ and $\beta$ is $(br.attempt - br.win)$. The transition probability of a $bp$-to-$br$ edge is calculated by Equation 5:
$P(bp \rightarrow br) = TS(br.win,\ br.attempt - br.win)$ (5)
where $TS$ denotes Thompson Sampling. Note that each $bp$ has two $bp$-to-$br$ edges, each leading to one viable branch. We normalize the transition probabilities of these two edges to ensure they sum up to one.
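A minimal sketch of this sampling step, using Python's standard `random.betavariate`, is given below. The function names are hypothetical, and the sketch makes one assumption that is not stated in the text: a pseudo-count of 1 is added to both Beta parameters so that they stay positive when a branch has no wins or no losses yet.

```python
import random

def ts(wins, losses, rng=random):
    """Draw one sample from a Beta distribution over the success rate.
    The +1 pseudo-counts keep both parameters positive (assumption)."""
    return rng.betavariate(wins + 1, losses + 1)

def bp_to_br_probabilities(then_stats, else_stats, rng=random):
    """Each stats pair is (win, attempt) for one of the two branches of the
    same branch point. Returns normalized probabilities that sum to one."""
    samples = [ts(win, attempt - win, rng) for win, attempt in (then_stats, else_stats)]
    total = sum(samples)
    return [s / total for s in samples]

rng = random.Random(0)
# "then" branch: 9 wins in 10 attempts; "else" branch: 1 win in 10 attempts.
probs = bp_to_br_probabilities((9, 10), (1, 10), rng)
```

Because each call draws a fresh sample, a rarely attempted branch occasionally receives a high probability, which is how the scheme keeps exploring branches with little history.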
The $br$-to-$bp$ edge transition differs from that of a $bp$-to-$br$ edge in the following aspects: 1) the CE engine cannot actively steer execution from a $br$ node to one particular $bp$ node through path constraint solving; 2) one $bp$ node has two outgoing edges, each leading to one viable branch, while a $br$ node potentially has zero to multiple succeeding $bp$ nodes; 3) each $br$ has only one parent node, which is its branch point, while each $bp$ can potentially have multiple preceding $br$ nodes. Therefore, Equation 5 does not apply to computing the transition probability for a $br$-to-$bp$ edge.
The transition probability of an edge leading from a branch node to a branch point node can be estimated as the success rate of transitioning from the former to the latter. In this case, each visit to the edge is considered a win. The total number of trials for this edge comprises the visit count and the attempt count at the branch node. In other words, each time the branch node is visited or attempted but the subsequent execution does not follow this edge counts as a losing attempt. As before, the estimate can be inaccurate when the total trial count is low. We therefore again leverage TS to compute the transition probability of a branch-to-branch-point edge, with the win count (α) being the visit count of the edge and the loss count (β) being the visit count plus the attempt count of the branch node minus the visit count of the edge, as shown in Equation 6.
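A minimal Python sketch of this second estimate follows. The parameter names `edge_visit`, `branch_visit`, and `branch_attempt` are hypothetical stand-ins for the counters described above, and the +1 prior is an assumption that keeps the Beta parameters positive. It is meant as an illustration, not as Marco's code.

```python
import random


def branch_to_branch_point_prob(edge_visit: int, branch_visit: int,
                                branch_attempt: int, rng: random.Random) -> float:
    """Thompson sample for an edge leading from a branch to a branch point."""
    wins = edge_visit
    # Every visit or attempt at the branch that did not follow this edge is a loss.
    losses = branch_visit + branch_attempt - edge_visit
    if losses < 0:
        raise ValueError("edge visits cannot exceed trials at the branch")
    # Assumption: +1 prior so both Beta parameters are strictly positive.
    return rng.betavariate(wins + 1, losses + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Few trials: samples vary widely, which encourages exploration.
    print([round(branch_to_branch_point_prob(1, 1, 1, rng), 2) for _ in range(5)])
    # Many trials: samples concentrate near the empirical rate 800 / 1000.
    print([round(branch_to_branch_point_prob(800, 900, 100, rng), 2) for _ in range(5)])
```

With few trials the draws spread over most of the unit interval, while with many trials they cluster around the empirical rate, matching the concentration property of the Beta distribution noted earlier.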
In summary, we leverage Thompson Sampling to dynamically refine the estimates of the transition probabilities of the two types of edges in CSTG, which allows us to balance exploration and exploitation.
Node Reachability Score Calculation. We compute a node reachability score for each node in CSTG, which indicates the node's potential for leading to new code coverage in future testing. We then use it to guide path prioritization in concolic execution.
The reachability score of a branch node should capture two aspects: 1) the potential new code coverage reachable from the node and 2) the difficulty of generating a new testcase that visits the node. We measure a node's reachable new code coverage as a numerical value denoted as Coverage Score and compute it by Equation 7 (for leaf nodes) and Equation 8 (for interior nodes).
The coverage score of a leaf node is calculated by Equation 7. In particular, for an unvisited leaf node, the coverage score is affected by the number of solving attempts devoted to it. When the number of attempts grows but the node remains unvisited, the node may be too hard to reach, and our limited resources are better reallocated to other nodes. For a visited leaf node, the visit count also affects its coverage score in addition to the attempt count. Each visit to a node that does not steer the execution into a deeper state is considered a failed attempt, so the exploration should try to avoid such nodes. As the visit count and the attempt count grow, the coverage score of a visited leaf node decreases.
We compute an interior node's coverage score by Equation 8, which is defined over the node's child nodes. Essentially, the coverage score of an interior node is affected by two major factors. First, a node that is adjacent to a large number of unvisited nodes generally has higher potential than a node with only a few unvisited neighbors. Hence, the number of a node's unvisited successors in CSTG strongly indicates its potential for new code coverage. Second, given any path in a program, the number of inputs that go through the child node is less than or equal to the number of inputs that go through the parent node. Consequently, the distance between a node and its unvisited successors also plays an essential role in estimating the potential for new coverage. The coverage score of each node in CSTG is updated periodically to reflect the most recent changes. Note that CSTG can be a cyclic graph, which makes it challenging to update every node's coverage score efficiently. We periodically update the scores of the whole graph by first performing a post-order traversal over the graph to extract all the nodes into an ordered list and then traversing the list to update each node's coverage score. The reachability score for each branch node in the graph is computed by Equation 9.
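The two-step update (post-order extraction, then a linear pass over the list) can be sketched in Python as follows. The traversal is a standard depth-first post-order with a visited set so that it terminates on cycles. The aggregation rule used for interior nodes (a discounted sum of child scores) and the reuse of the previous round's score for back-edge children are assumptions for illustration only; they are not Equation 8.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # node -> list of successor nodes


def post_order(graph: Graph, roots: List[str]) -> List[str]:
    """Iterative DFS post-order; the `seen` set makes it terminate on cycles."""
    order: List[str] = []
    seen = set()
    for root in roots:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, [])))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if child not in seen:
                    seen.add(child)
                    stack.append((child, iter(graph.get(child, []))))
                    break
            else:
                # All children handled: emit the node after its descendants.
                order.append(node)
                stack.pop()
    return order


def update_scores(graph: Graph, roots: List[str], scores: Dict[str, float],
                  leaf_score: Dict[str, float], discount: float = 0.5) -> None:
    """Refresh `scores` in place, children before parents where possible."""
    for node in post_order(graph, roots):
        children = graph.get(node, [])
        if not children:
            scores[node] = leaf_score[node]
        else:
            # Hypothetical rule: discounted sum of child scores. A child reached
            # only through a back edge still carries its score from the last round.
            scores[node] = discount * sum(scores.get(c, 0.0) for c in children)


if __name__ == "__main__":
    g = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": []}  # c -> a closes a cycle
    s: Dict[str, float] = {}
    update_scores(g, ["a"], s, {"d": 4.0})
    print(post_order(g, ["a"]), s)
```

Processing nodes in post-order means most parents see freshly updated child scores within a single pass, and repeating the pass periodically lets values propagate around cycles.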
Essentially, for a branch with a high path divergence rate (i.e., a low in-edge transition probability), it is hard to generate a testcase traversing that branch, which renders its coverage score moot. We then prioritize the branch nodes in CSTG for scheduling by their reachability scores.
Branch Prioritization. After the reachability scores are calculated, the node (i.e., branch) with a better potential of reaching new code has a higher score and is promoted in the scheduling. The path constraint associated with the top-ranked node is sent to the Path Constraint Solver for new input generation. If the path constraint is unsatisfiable, the scheduler is prompted again until a new testcase is successfully generated, and the testing continues.
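The scheduling loop described above can be sketched in Python as follows. The `BranchNode` fields, the `solve` callback, and in particular the reachability formula (coverage score multiplied by the in-edge transition probability) are hypothetical choices made for this illustration. The product only mirrors the qualitative statement that a low in-edge probability discounts the coverage score, and should not be read as Equation 9.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class BranchNode:
    name: str
    coverage_score: float   # potential new coverage reachable from this branch
    in_edge_prob: float     # sampled transition probability of the in-edge
    attempts: int = 0       # solving attempts spent on this branch


def reachability(node: BranchNode) -> float:
    # Hypothetical combination: coverage potential discounted by solving difficulty.
    return node.coverage_score * node.in_edge_prob


def schedule(nodes: List[BranchNode],
             solve: Callable[[BranchNode], Optional[bytes]]
             ) -> Optional[Tuple[BranchNode, bytes]]:
    """Pick the top-ranked branch; on an unsatisfiable constraint, pick again."""
    pending = list(nodes)
    while pending:
        best = max(pending, key=reachability)
        pending.remove(best)
        best.attempts += 1
        testcase = solve(best)        # None models an unsatisfiable path constraint
        if testcase is not None:
            return best, testcase
    return None


if __name__ == "__main__":
    nodes = [BranchNode("a", 10.0, 0.1), BranchNode("b", 4.0, 0.9),
             BranchNode("c", 5.0, 0.5)]
    # Pretend that the constraint for branch "b" is unsatisfiable.
    fake_solver = lambda n: None if n.name == "b" else b"input-for-" + n.name.encode()
    print(schedule(nodes, fake_solver))
```

In the example, branch "b" ranks first but fails to solve, so the scheduler falls through to "c", while the high-coverage but hard-to-reach branch "a" is left for later.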

EVALUATION
In this section, we evaluate the efficacy of our proposed approach by answering the following research questions:
• RQ1: Effectiveness of end-to-end concolic execution. Can our model improve the performance of end-to-end concolic execution?
• RQ2: Effectiveness of design choices. What are the unique contributions of each design choice in Marco?
• RQ3: Vulnerability detection. Can Marco be more effective when detecting vulnerabilities?

Evaluation Plan
To better answer the aforementioned research questions, we use the following configurations:
• SymSan [11], the SOTA CE engine, which adopts the traditional synchronous solving (i.e., the constraint solving is conducted at the time when the branch is encountered) with the same native branch flipping policy as QSYM [50]. This baseline is to directly compare with Marco.
• SymSan-pp, a variant of SymSan that adopts a PP branch-flipping policy, where each branch is defined by its path prefix. The testcases are executed in a First-In-First-Out (FIFO) order, with all branches flipped by the visit order. This baseline is to show that simply selecting more branches to flip will not improve CE performance.
• Marco-rdm, a variant of Marco that defines each branch by its path prefix and picks a random branch from the last visited program path to flip and generate a new testcase. This configuration is to demonstrate the effectiveness of our branch scheduling strategy.
• Marco-cfg, a configuration that applies the CFG-directed searching algorithm of CREST [6] on our dynamically generated CSTG. The assessment of each branch is determined by the branch count between the branch itself and the nearest unvisited branch, as well as the flipping attempt count. This configuration is used to show the effectiveness of our branch scheduling strategy.
• Marco-uv, a variant of Marco which only allows the scheduler to pick from unvisited nodes. This configuration is used to show the necessity of flipping the visited branches.
• Marco-MC, a variant of Marco with Markov Chain modeling but no Thompson Sampling. This configuration evaluates the importance of Thompson Sampling in Marco.
• Marco, our full-fledged system.
To answer RQ1, we compare the code coverage metric of the full-fledged model against SymSan. The experiment is conducted on real-world programs listed in Table 5. For RQ2, we compare the code coverage per path constraint solving for all the configurations listed to showcase the effectiveness of each design choice. Finally, to answer RQ3, we run both Marco and SymSan on the UniFuzz dataset, and compare the number of unique bugs found by them.
Dataset. We collect a dataset consisting of 30 popular real-world programs as shown in Table 5, as well as 71 programs from the DARPA Cyber Grand Challenge (CGC) dataset. To answer RQ1 and RQ2, we conduct experiments on the CGC programs, as well as programs No. 1 to 16 (Binutils and Fuzzbench [1] binaries). To answer RQ3, we further leverage programs No. 17 to 30, which are the Marco-compatible subset of the UniBench dataset proposed in UniFuzz [29]. We configure the experiment to align with the original setup in UniFuzz, including each program's execution option and the initial seed corpus used.

RQ1: Effectiveness of Marco
To demonstrate the effectiveness of Marco in terms of exploring new code coverage, we measure the edge coverage during testing and compare our full-fledged model with SymSan. Each configuration is repeated 10 times to reduce randomness. We collect the edge coverage at the end of each 24h trial and measure the coverage improvement ratio of Marco over the baseline SymSan. For each program, we further investigate the relative code coverage between SymSan and Marco with the formula proposed in QSYM [50]. For code coverage A (Marco) and B (SymSan), we can quantify the coverage difference by using:

$$d(A, B) = \begin{cases} \dfrac{|A - B| - |B - A|}{|(A \cup B) - (A \cap B)|} & \text{if } A \neq B \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

With the coverage difference score $d(A, B)$, we can infer the number of unique edges that A covered, out of the total edge coverage that either A or B can uniquely explore. A positive score means A (Marco) finds more unique coverage than B (SymSan). The value will be 1 if A (Marco) not only finds more coverage than B (SymSan) but also covers all edge coverage explored by B (SymSan).
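As a small worked example, the coverage difference score can be computed over two sets of edge identifiers. The sketch below assumes the QSYM-style definition, i.e., the number of edges unique to A minus the number unique to B, divided by the number of edges covered by exactly one of the two, with a score of 0 when the sets are equal. The function name and the integer edge IDs are illustrative.

```python
def coverage_difference(a: set, b: set) -> float:
    """Coverage difference score d(A, B) in [-1, 1] over two edge sets."""
    if a == b:
        return 0.0
    only_a = len(a - b)   # edges covered by A but not by B
    only_b = len(b - a)   # edges covered by B but not by A
    # |A - B| + |B - A| equals the size of the symmetric difference.
    return (only_a - only_b) / (only_a + only_b)


if __name__ == "__main__":
    marco = {1, 2, 3, 4}
    symsan = {1, 2}
    print(coverage_difference(marco, symsan))       # A is a strict superset of B
    print(coverage_difference({1, 5}, {1, 6, 7, 8}))
```

When A covers everything B covers plus at least one additional edge, the second term of the numerator vanishes and the score is 1, which matches the interpretation given in the text.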
In our experiment, we evaluate the performance of Marco on 16 real-world programs (programs No. 1 to 16 in Table 5). On average, Marco is able to cover 13.03% more edges, with a maximum improvement of 88.56% on readelf. This indicates that our approach improves the effectiveness of concolic testing for real-world programs. Out of the 16 tested programs, Marco finds more edge coverage than SymSan on 11 programs (68.75%). Moreover, Marco dominates the coverage findings on three targets (file, lcms, and sqlite3), where it also covers all the edges found by SymSan.
Further investigation (Table 6) shows that SymSan would terminate within 5 hours on 75% (12/16) of the real-world programs, even though there still exist many edges unexplored. This is due to the overly strict branch definition and ill-advised branch-flipping strategy adopted in SymSan.
To evaluate the scalability of Marco, we investigate the graph size growth and the memory cost for each of the 16 real-world programs during testing. The results show that the number of nodes in CSTG grows sub-linearly during the 24h trials. At the end of each trial, the minimum, maximum, average, and median values of the

RQ2: Effectiveness of Design Choices
As discussed earlier, SymSan terminates very early in 75% of the tested programs, meaning only a limited number of solving attempts have been made. In RQ2, we allocate the same amount of solving attempts for the other configurations and assess their new code coverage. By doing so, we can demonstrate the effectiveness of our design choices in improving the branch prioritization scheme. We look into the edge coverage for SymSan-pp, Marco-rdm, Marco-cfg, Marco-uv, and Marco-MC with the 16 real-world programs and compute the coverage difference scores compared with Marco as defined in Equation 10. The experimental results are displayed in Figure 4. Each row depicts the coverage difference score of A (Marco) and B (the baseline labeled to the left of the row). The blue color indicates that Marco finds more edge coverage than the corresponding baseline, and the red color suggests otherwise. The results exhibit a few major conclusions:

Firstly, Marco outperforms SymSan-pp and Marco-rdm on all 16 programs. On average, Marco covers 55.99% and 86.64% more code than SymSan-pp and Marco-rdm, respectively. This result demonstrates that the novel branch prioritization strategy in Marco, rather than a simple FIFO or random selection, is extremely useful for code exploration.
Secondly, Marco outperforms Marco-cfg on 13 out of 16 tested programs. For the other three programs (vorbis, curl, and woff2), where Marco-cfg finds more coverage, the differences are slight (<1%). Marco is able to find 83.92% more code coverage than Marco-cfg. This result indicates that our branch flipping strategy is better than the CFG-directed approach.
Thirdly, compared with Marco-uv, Marco manages to find more coverage on 15 programs out of 16, with curl as the only exception. On average, Marco finds 29.22% more code coverage than Marco-uv. This shows that it is indeed a good strategy to treat both visited and unvisited nodes as candidates for path constraint solving.
Lastly, Marco outperforms Marco-MC on 12 out of 16 tested programs with an average coverage improvement ratio of 57.21%, indicating that modeling edge transition and reachability score with Thompson Sampling to balance between exploration and exploitation is crucial to making effective branch prioritization decisions.
We further evaluate the statistical significance of the differences in code coverage between Marco and the other configurations in Figure 4 across the 16 tested programs, using p-values from the Mann-Whitney U-Test. We use p-value < 0.05 as the threshold for statistical significance. We observed p-values above 0.05 in only two programs, curl (0.07) and woff2 (0.48), when comparing Marco and Marco-cfg. For the rest of the results, the p-values are below 0.001 in the majority of cases. The result suggests a significant difference between Marco and the other configurations.

RQ3: Vulnerability Detection
Lastly, we showcase the vulnerability detection capability of Marco by using the UniFuzz dataset, which consists of 14 programs. Specifically, we run both Marco and SymSan 5 times for 24 hours and compare the average number of unique bugs detected. According to the results, Marco is able to find 33.52% more bugs (47.8 vs. 35.8) than SymSan. Among them, Marco can uniquely identify 2.41 times the bug count of SymSan (20.5 vs. 8.5). More concretely, Marco finds more unique bugs than SymSan on 7 programs, fewer on 2, and the same number on 5. These numbers show that Marco has unique advantages in finding vulnerabilities compared with state-of-the-art CE engines.
We further cross-check our results with those reported in the UniFuzz paper [29] for 7 fuzzers (AFL [51], AFLFast [5], Angora [13], HonggFuzz [48], MOPT [32], T-Fuzz [34], and VUzzer64 [37]) in 24h. We draw the box plot of all 8 baselines in Figure 5. According to the result, Marco is able to find more unique bugs in 12h than any of the 7 fuzzers can find in the 24h trial on 5 (imginfo, jhead, mp42aac, jq, and tcpdump) out of the 14 tested programs. Marco ranks second on mujs and sqlite3, behind Angora and MOPT, respectively. We further explore the statistical rankings among Marco and the 7 fuzzers by their average unique bug detection counts for each program. Marco beats 6 fuzzers and is second only to MOPT. This demonstrates that Marco can find bugs very efficiently.

DISCUSSION
Marco still has some limitations. First, Marco requires access to the source code for instrumenting the PUT with the symbolic tracing and branch-solving logic through a compiler pass. In practice, however, a real-world PUT can be developed with a set of external libraries whose source code is not accessible. In addition, reconstructing the branch dependency to recover the nested path constraint set can be expensive when the dependency is very complex. We leave the performance optimization as future work. Moreover, in this paper, we do not study how to better coordinate concolic execution with fuzzing. SymSan [14] reported that the existing hybrid fuzzing scheme cannot consistently outperform pure fuzzing such as AFL++ [18]. A recent work by Jiang et al. [27] proposed edge-oriented scheduling to improve the performance of hybrid fuzzing. How a more efficient and intelligent concolic execution engine affects the design of hybrid fuzzing deserves more investigation.

RELATED WORK
In this section, we discuss precedent works closely related to Marco.
Symbolic Execution. Symbolic Execution [21,40] has proven to be a powerful program testing technique for test case generation and bug detection. However, due to the state explosion problem and the large overhead imposed by path constraint solving, it has low scalability for real-world program testing. To tackle this problem, a series of studies have sought to prune states [7,22,43,49], merge states [24,28,41], prioritize states [25,31], and perform constraint reduction [16] and solution caching [7,8] in order to improve scalability.
Seed Scheduling in Fuzzing. Many techniques [4,10,20,23,30,33,45,47,52,53] have been proposed for improving fuzzing. One important optimization is to improve the seed selection [26,38]. AFLFast [5] favors the less explored paths and allocates more energy to them. VUzzer [37] prioritizes test cases that traverse paths more likely to reveal a vulnerability. Entropic [3] leverages information-theoretic entropy for scheduling seeds to optimize coverage gain and bug-finding ability. AFL-Hier [46] implements a reinforcement learning model for scheduling seeds clustered by multi-level coverage metrics. K-scheduler [42] performs graph centrality analysis to promote seeds that have a higher potential for reaching new code coverage.
Markov Chain Modeling. In AFLFast [5], the greybox fuzzing process is modeled as a Markov Chain to identify the high-frequency paths and steer exploration away from those paths. Sparks et al. [44] model the program control flow as a Markov Chain and seek to drive the testing toward less-explored code regions, leveraging a fitness function concerning the path exploration frequency. Other reinforcement learning strategies such as Probably Approximately Correct (PAC) bounds [17,19] were proposed for solving Markov Decision Processes. Integrating these approaches remains a potential future direction for this work.

CONCLUSION
In this work, we propose to model concolic execution as a Markov Chain process and construct a stochastic concolic state transition graph to assess each branch's potential for code coverage from a global view. The states are evaluated by their reachability to new code coverage with respect to the path divergence rate along the execution trace. Evaluation of our prototype Marco shows that the new approach proposed in this paper outperforms the state-of-the-art concolic execution engine in both code coverage and vulnerability detection.

Figure 2: CSTG Construction of Example Program

Figure 3: Coverage Difference Score of Real-world Programs and CGC Binaries

Figure 4: Coverage Difference Score Within Solving Budget

Figure 5: Number of Unique Bugs Detected

Table 1: Code Coverage w/ Single-Pass Exploration

Table 2: Code Coverage w/ Continuous Exploration

Table 3: Quality of Generated Testcases

Table 4: Path Divergence Rate

Table 5: Details of Real-world Applications Evaluated

Table 6: Average Termination Time for SymSan