PEM: Representing Binary Program Semantics for Similarity Analysis via a Probabilistic Execution Model

Binary similarity analysis determines if two binary executables are from the same source program. Existing techniques leverage static and dynamic program features and may utilize advanced Deep Learning techniques. Although they have demonstrated great potential, the community believes that a more effective representation of program semantics can further improve similarity analysis. In this paper, we propose a new method to represent binary program semantics. It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries. More importantly, it ensures that the collected samples are comparable across binaries, addressing the substantial variations of input specifications. Our evaluation on 9 real-world projects with 35k functions and comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% with common settings, outperforming the baselines by 10-20%.


INTRODUCTION
Binary similarity analysis determines if two given binary executables originate from the same source program. It has a wide range of applications such as automatic software patching [4,32,37,42,43,51], software plagiarism detection [7,30,41,45,54], and malware detection [5,8,15,16,21,24,53]. For example, assume a critical security vulnerability has been reported and fixed in a library. It is of paramount importance to apply the patch to other deployed projects that include the library. However, the library may be compiled with different settings in different projects. Binary similarity analysis allows identifying all the variants. Given a pool of candidate binaries, which are usually functions in executable form, a similarity analysis tool reports all the binaries in the pool equivalent to a queried binary. The problem is challenging as aggressive code transformations such as loop unrolling and function inlining in compiler optimizations may substantially change a program and produce largely different executables [39].
Given its importance, there is a large body of existing work. Earlier work (e.g., [6,22]) focuses on extracting static code features such as control-flow graphs and function call graphs. Such techniques are highly effective in detecting binaries that have small variations. Many proposed to use dynamic information instead [10,13,18,46] because it better discloses program semantics. For example, in-memory fuzzing (IMF) [46] uses fuzzing to generate many inputs and collects runtime information when executing the program on these inputs. It then uses the collected information to compute binary similarities. When the fuzzer can achieve good coverage, IMF is able to deliver high-quality results. However, achieving good coverage is difficult for complex programs (see our example in Section 2.1). Recently, Machine Learning and Deep Learning techniques have been used to address the binary similarity problem [26,31,33,34,47,52,54]. These techniques work by training models on a large pool of binaries that have positive and negative samples. The former includes binaries compiled from the same source and the latter includes those that are functionally different. The models are hence supposed to learn (implicit) features that can be used to cluster functionally equivalent programs. However, as shown in Sections 2.2 and 4.2, these models may learn features that are not robust, and in many cases, not semantics oriented, leading to sub-optimal results.
Inspired by the existing works that leverage dynamic information [13,18,46,56], we consider the semantics of a binary to be a distribution of its inputs and their corresponding externally observable values during executions. Observable values are those encountered in I/O operations and global/heap memory accesses. Compared to other runtime values such as those in registers, observable values are persistent across automatic code transformations as compilers hardly optimize such behavior [18,46]. However, since we need to compare arbitrary binaries, ideally, we would have to collect sufficient samples in the input space of all these binaries. Making such samples universally comparable is highly challenging. In Section 2.2, we show that a naive sampling strategy that executes all subject binaries on the same set of seed inputs can hardly work as different binaries take inputs of different formats. For example, a valid input for a program P is very likely an invalid input for programs Q and R. As such, it can only trigger similar error handling logic in Q and R, making them not distinguishable.
In this paper, we propose a sampling technique that can effectively approximate semantics distributions by selecting and interpreting a small set of equivalent paths across different versions of a program. It is powered by a novel probabilistic execution engine. It runs candidate binaries on a fixed set of random seed values. Although many of these seed values lead to input errors, it systematically unfolds the program behavior starting from the execution paths of these seed values, called the seed paths. Specifically, it flips a bounded number of predicates along the seed paths. For instance, flipping a failing input check forces the binary to execute its normal functionality. While predicate flipping is not new [35,53,55], our technique features a probabilistic sampling algorithm. Specifically, we cannot afford to exhaustively explore the entire neighborhood (of the seed paths) even with a small bound (of flipped predicates). Hence, we leverage a key observation that the predicates with the largest and the smallest dynamic selectivity tend to be stable before and after automatic transformations, while other predicates vary a lot (under the transformations). Dynamic selectivity is a metric computed for a predicate instance that measures the distance to the decision boundary. For example, assume a predicate x>y yields true; then x-y denotes its dynamic selectivity. Our theoretical analysis in Section 3.5 discloses that since automatic transformations cannot invent new predicates, but rather remove, duplicate, and reposition them, the likelihood that code transformations change the ranking of predicates with the smallest/largest selectivity is much smaller than that for other predicates. Hence, we sample paths by flipping predicates that have close to the largest and the smallest selectivity, following the Beta-distribution [20] that has a U shape, biasing towards the two ends. Therefore, if two binaries are equivalent, our algorithm can sample a set of corresponding paths in the binaries by flipping their corresponding predicates such that the observable values along these paths disclose the equivalence.
Our contributions are summarized as follows.
• We propose a novel probabilistic execution model that can effectively sample the semantics distribution of a binary and make the distributions from all binaries comparable.
• We develop a path sampling algorithm that is resilient to code transformations and capable of sampling equivalent paths when two binaries are equivalent. We also conduct a theoretical analysis to disclose its essence.
• We propose a probabilistic memory model that can tolerate invalid memory accesses due to predicate flipping while respecting the critical property of having equivalent behavior when the binary programs are equivalent.
• We develop a prototype PEM. We conduct experiments on two commonly used datasets including 35k functions from 30 binary projects and compare PEM with five baselines [13,31,33,34,46]. The results show that PEM can achieve more than 90% precision on average whereas the baselines can achieve 76%. PEM is also much more robust when the true positives (i.e., binaries equivalent to the queried binary) are mixed with various numbers of true negatives (i.e., binaries different from the queried binary) in the candidate pool, which closely mimics real-world application scenarios. Consequently, PEM can correctly find 7 out of 8 1-day CVEs from binaries in the wild, whereas the baselines can only find 2. We upload PEM at [49] for review.

MOTIVATION AND OVERVIEW

Motivating Example
Our motivating example is adapted from the main function of cat in Coreutils. The simplified source code is shown in Fig. 1a. Lines 2 to 10 parse the command line options. Lines 12 to 19 iteratively read the file names from the command line and emit the file contents to the output buffer. The function delegates the main operations to two functions. When some conditions at line 13 are satisfied, a simpler method simple_cat() is called. Otherwise, it calls a more complex function that formats the output according to the full panoply of command line options. For example, at line 22, if the global flag print_invisible is set, the function prints out the ASCII values of invisible characters. Compiler optimizations may substantially transform a program. In Fig. 2b and Fig. 2a, we show the control flow graphs (CFGs) for our motivating example generated by two compilation settings, -O0 (no optimization) and -O3 (all commonly used optimizations applied). The switch statement at line 3 is compiled to hierarchical if-then-else structures with -O0, as shown in the orange circle in Fig. 2b. In contrast, it is compiled to an indirect jump with -O3, as shown in the orange circle in Fig. 2a. The predicate at line 13 corresponds to the blue circle in Fig. 2b. We can see two branches, each consisting of only one basic block. The two delegated functions are called in the two basic blocks, respectively. However, the two functions are inlined in the optimized version, resulting in branches with many more blocks, e.g., 50 blocks in the branch of the complex function, as shown in the blue circle in Fig. 2a.
To better illustrate the challenges, we introduce another function adapted from the main function of touch in Coreutils, as shown in Fig. 1b. The function touch modifies the meta information of files. Lines 2 to 10 parse the command line options and the for-loop at line 11 iteratively performs the touch operation. We can see from Fig. 2c that the syntactic structures of touch and cat are more similar than those between cat with and without optimizations. The observation can be quantified by the statistics of these CFGs shown in the caption.

Limitations of Existing Techniques
Fuzzing-Based Techniques. There are techniques that leverage fuzzing to explore the dynamic behavior of programs and use it in similarity analysis. For example, in-memory fuzzing (IMF) [46] iteratively mutates function inputs and collects traces. Since the parameter specifications for functions in stripped binaries are not available, it is challenging to generate inputs that can achieve good coverage. In our example, IMF can hardly generate legal command line options for the function main_cat. Thus most collected behavior is from the error processing code at line 8. Moreover, it tends to collect similar (error processing) behavior from main_touch. As such, the downstream similarity analysis likely draws the wrong conclusion about their equivalence. Our experiments in Section 4.2 show that IMF can achieve a precision of 76% on complex cases, whereas ours can achieve 96%. Forced-Execution-Based Techniques. To extract more behavior from binary code, there are methods that use coverage as guidance to execute every instruction in a brute-force fashion. A representative work, BLEX [13], executes a function from the entry point. Then it iteratively selects the first unexecuted instruction to start the next round of execution until every instruction is covered. We call techniques of such nature forced-execution-based as they largely ignore path feasibility. There are two essential challenges for these techniques. First, they tend to use code coverage within a function as the guidance for forced execution, which has inherent difficulty in dealing with function inlining [39]. Another challenge is to provide appropriate execution contexts when execution starts at arbitrary (unexecuted) locations. For example, suppose that in the first few rounds, BLEX executes the true branch at line 14 of Fig. 1.
When it tries to cover the false branch at line 17, it uses a fresh execution context, discarding the variables computed at line 11.
According to our experiments in Section 4.2, these techniques can achieve a precision of 69%, whereas our technique can achieve 96%.
Learning-Based Techniques. Emerging techniques [33,34,48,54] leverage Machine Learning models. Some models [48,54] extract static features from CFGs. However, these static features are not robust in the presence of optimizations. Another line of work uses language models [33,34]. Their hypothesis is that these models can learn instruction semantics and hence function semantics. To limit the vocabulary (i.e., the set of words/tokens supported), binaries are often normalized before they can be fed to models. For example, immediate values (i.e., constants in instructions) and constant call targets are replaced with a special token HIMM in SAFE [33], e.g., the token x_call_HIMM around line 8 in Fig. 3 (b) and (c) that corresponds to the function invocation get_cli_opt. While this makes training convergence feasible, much of the semantics is lost. These models may not learn to classify based on instructions essential to function semantics. For example, SAFE leverages an NLP technique called attention [44]. Conceptually, the attention mechanism determines which instructions are important to the output. We highlight the statements and their tokens with the largest attention values in Fig. 3. In these three functions, the first few tokens (in gray) with large attention values are in the function prologues. The corresponding instructions (e.g., push) perform the same functionality, saving register values to memory and allocating space for local variables. In Fig. 3a, the model also pays attention to tokens/instructions related to the switch-case statement. As discussed before, however, static structures are not reliable due to optimizations. In contrast, in Fig. 3b, the model instead emphasizes the normalized function invocation at line 8, which is not distinguishable from the invocation at line 8 in (c), also with a large attention value. From the parts that the model pays attention to, it is easy to explain why SAFE concludes cat@O0 is more similar to touch@O0 than to cat@O3. We visualize the weights of full attention layers in Fig. 25 of the appendix.

Our Technique
We aim to leverage program semantics in similarity analysis. We define the semantics of a binary program P as follows. Intuitively, the joint distribution of inputs and observable values when executing P on the inputs denotes P's semantics. Observable values are hardly altered by code transformations. A Naive Sampling Method. One may not need to collect a large number of samples to model the aforementioned distribution because if two programs are equivalent, executing them on equivalent inputs produces equivalent observable values. Therefore, a naive method is to provide the same set of inputs to all programs such that those that are equivalent must have identical observable value distributions. However, such a simple method is ineffective for the following reasons. First, even equivalent programs might have different input specifications (e.g., different numbers of parameters and different orders of parameters), making it difficult to automatically feed equivalent inputs to them. Furthermore, different programs have different input domains. When the provided inputs are out-of-range (and hence invalid), the corresponding observable value distributions cannot be used to cluster programs. In our example, the valid domain of c at line 3 of main_cat is the set of characters {b,e,s,t,u,v} whereas the domain of c at line 3 of main_touch is {b,c,d,f,h,v}. Without input specifications, which are hard to acquire for binary functions, the naive sampling method may provide a random input, say, c = 173. As a result, the executions of both functions fall into the exception handling paths and the observable values are not distinguishable. Our Method. Instead of solving the input specification problem, which is very hard for binary programs, we propose a technique agnostic to such specifications. Specifically, we propose a novel probabilistic execution model that serves as an effective sampling method to approximate the distribution D denoting program semantics. Given a program P, we acquire its
semantic representation as follows. We execute P on a set X of pre-determined (random) inputs, which is invariant across all programs we want to represent. To address the challenge of input specification differences, we assign the same value v ∈ X to each input variable (for all programs). That is, we feed the same value to all input parameters, making their order irrelevant. We repeat this for all values in X. As an example, for the programs in Fig. 1, we set argc and **argv (all elements in the buffer) in both main_cat and main_touch, as well as *inbuf, insize, and *outbuf in function complex_cat, to 173, acquiring three executions. Then we set them to 97, acquiring another three executions, and so on.
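The uniform seeding above can be sketched as follows. This is an illustrative Python harness over ordinary functions (the toy functions and seed values are ours), not PEM's actual binary interpreter; its point is only that feeding the same value to every parameter makes parameter order irrelevant.

```python
# Hypothetical sketch of PEM's seeding strategy: every input variable of a
# function receives the same seed value, so parameter order does not matter.
SEEDS = [173, 97, 41]  # fixed random seed values, shared by all programs

def run_with_uniform_seeds(func, arity):
    """Call `func` once per seed, feeding the same value to every parameter."""
    return [func(*([s] * arity)) for s in SEEDS]

# Two toy "programs" with different parameter orders but equal semantics.
def prog_a(x, y):
    return x * 2 + y

def prog_b(y, x):
    return x * 2 + y

# Because all parameters get the same value, the order difference vanishes.
assert run_with_uniform_seeds(prog_a, 2) == run_with_uniform_seeds(prog_b, 2)
```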
These random values may not be valid inputs and hence the corresponding executions may not disclose meaningful semantics. We hence further sample k-edge-off behavior. Definition 2.2. Given a program P and an input i, let p be the program path taken with input i; we say a path p′ is k-edge-off (from p) if k predicates along p need to be flipped to the other branch outcomes in order to acquire p′.
For instance, suppose that when executing main_cat with input 173, the path p is 2-3-8. If the branch at line 2 is flipped to line 11, assuming that the following execution path is 13-16-17-19, then 2-11-13-16-17-19 is 1-edge-off from p. The k-edge-off behavior (of an input i) is essentially the observable values encountered in all k-edge-off paths (of i). Observe that for main_cat and main_touch, although the 0-edge-off behaviors (i.e., the original executions) are not distinguishable, the 1-edge-off behaviors are quite different; e.g., the behavior of main_cat includes those from the delegated function at line 17. However, there is a practical challenge: covering all k-edge-off behavior even when k = 2 may be infeasible for complex programs since the number of k-edge-off paths grows exponentially with k. Moreover, controlling the sampling process exclusively by k induces substantial noise due to code optimizations/transformations. Specifically, optimizations substantially change program structures, adding/removing predicates. The k-edge-off behaviors are hence quite different. An example can be found in Section A of the appendix. To suppress the noise introduced by optimizations, we leverage the observation that optimizations rarely change the (selectivity) ranking of predicates with the maximum and minimum dynamic selectivity. For instance, suppose that in an execution, the value of inbuf[i] at line 22 in Fig. 1 is 173. It is then compared with 0x20. The dynamic selectivity of the predicate instance is hence 141 (i.e., |173 − 0x20|). Essentially, the dynamic selectivity reflects how likely a branch predicate evaluates to true [40]. Although automatic code transformations may change dynamic selectivity, the predicate instances with the largest/smallest dynamic selectivity tend to stay as the largest/smallest ones after transformations. We formally explain the observation in Section 3.5 and empirically validate it in Section 4.5. Therefore, we select predicate instances to flip following a Beta-distribution [20] with α = β = 0.03. The distribution has the largest probabilities for predicates with the minimum and maximum selectivity and small probabilities in the middle (a U shape). Intuitively, if two programs are equivalent/similar, their predicates with the largest and the smallest selectivity tend to be the same. By flipping these predicates in the two versions, we explore their equivalent new behavior.
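The U-shaped selection can be made concrete with a minimal sketch using Python's standard library (random.betavariate). Only the α = β = 0.03 parameters come from the paper; the candidate list and index arithmetic are illustrative assumptions.

```python
import random

def pick_predicate(sorted_candidates, alpha=0.03, beta=0.03):
    """Pick an index into a selectivity-sorted candidate list, biased toward
    the two extremes by a U-shaped Beta(0.03, 0.03) distribution."""
    x = random.betavariate(alpha, beta)  # mass concentrates near 0 and 1
    return min(int(x * len(sorted_candidates)), len(sorted_candidates) - 1)

random.seed(0)
picks = [pick_predicate(list(range(100))) for _ in range(1000)]
# Most picks fall near the smallest or largest selectivity ranks.
extreme = sum(1 for i in picks if i < 10 or i >= 90)
assert extreme > 850
```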
In our example, for both the optimized and the unoptimized version of main_cat, the algorithm first flips the predicate at line 2 with a high probability since -1!=c has the largest selectivity on path 2-3-8. Then we achieve the 1-edge-off path 2-11-13-16-17-19 as discussed above. Along the new path, the algorithm flips the predicate with the largest selectivity at line 22 for further 2-edge-off exploration in both versions, exposing similar behavior.
To realize the probabilistic execution model, we develop a binary interpreter that can feed the binary with specially crafted inputs and sample observable values (Section 3.3). It also features a probabilistic memory model that can tolerate invalid memory accesses while ensuring equivalent observable values for equivalent programs (Section 3.6). Compared to traditional forced-execution-based techniques, PEM naturally handles the function inlining problem as our sampling is not delimited by function boundaries and our execution contexts are largely realistic. Compared to fuzzing-based techniques, ours does not rely on solving the hard problem of generating valid inputs. Compared to Machine-Learning-based techniques, our technique focuses on the dynamic behavior of programs, which more accurately reflects program semantics [23].

DESIGN

Overall Workflow
The workflow of PEM is shown in Fig. 4. The input is in the grey box on the left side. It consists of a set of seed inputs (each being an infinite sequence of the same value), the binary executable, and a path sampling strategy that predicts the next path to interpret based on the set of already interpreted paths. The interpreter interprets the subject binary on a seed input, supplying the same value to any input variable encountered during interpretation, to eliminate semantic differences caused by parameter order differences.
The interpretation also strictly follows the path indicated by the path sampling component. When invalid pointer dereferences are encountered, which can be easily detected, the interpreter interacts with the probabilistic memory model to emulate the access outcomes. The emulation ensures that the same sequence of (observable) values is returned for equivalent paths. After sampling, on the right side, the observable value distributions are summarized for later similarity analysis, which simply compares two multi-sets. The remainder of this section is organized as follows. We first model binary instructions using a simplified language. Then we present the semantic rules. After that, we discuss the path sampling method and the probabilistic memory model.
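The final multi-set comparison can be sketched with collections.Counter. The Jaccard-style score below is an illustrative choice of ours, not necessarily the exact metric PEM uses; the observable values are made up.

```python
from collections import Counter

def multiset_similarity(obs_a, obs_b):
    """Compare two multi-sets of observable values:
    score = |intersection| / |union| over multiset counts (Jaccard-style)."""
    ca, cb = Counter(obs_a), Counter(obs_b)
    inter = sum((ca & cb).values())  # multiset intersection: min of counts
    union = sum((ca | cb).values())  # multiset union: max of counts
    return inter / union if union else 1.0

# Observable values sampled from two runs of near-equivalent binaries.
a = [0x20, 0x20, 141, 173, 0]
b = [0x20, 141, 173, 0, 0]
assert multiset_similarity(a, b) > 0.6  # high score for near-identical sets
```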

Language
The syntax of our language is in Fig. 5. A program P consists of a sequence of instructions. There are three categories of instructions. First, there are instructions that move values among registers: r1 = r2 moves the value in r2 to r1; r = v moves a literal value v to r; r1 = r2 ⋄ r3 moves the result of r2 ⋄ r3 to r1. The second category is load and store instructions. The load instruction r1 = [r2] treats the value in r2 as a memory address and loads the value in the specified memory location to r1. Store is similar. There are also instructions that change the control flow. Instruction jmp a jumps to the instruction at address a; jcc r a performs the jump only when the value in r is non-zero; jr r is an indirect jump that uses the value in r as the target address. Instruction done means the interpretation is finished. Although our language does not model functions for simplicity, our implementation supports the full x86 instruction set, including function invocations and returns.
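The language above can be encoded as a small set of instruction types. This is a sketch of ours (the Python class and field names are assumptions, not from the paper), shown only to make the three instruction categories concrete.

```python
from dataclasses import dataclass
from typing import Union

# Minimal sketch of the instruction language of Fig. 5; the encoding is ours.
@dataclass
class MovReg:  # r1 = r2
    dst: str
    src: str

@dataclass
class MovImm:  # r = v
    dst: str
    val: int

@dataclass
class BinOp:   # r1 = r2 <op> r3
    dst: str
    op: str
    lhs: str
    rhs: str

@dataclass
class Load:    # r1 = [r2]
    dst: str
    addr: str

@dataclass
class Store:   # [r1] = r2
    addr: str
    src: str

@dataclass
class Jmp:     # jmp a
    target: int

@dataclass
class Jcc:     # jcc r a: jump to a iff register r is non-zero
    cond: str
    target: int

@dataclass
class Jr:      # jr r: indirect jump to the address in r
    target: str

@dataclass
class Done:    # done
    pass

Instr = Union[MovReg, MovImm, BinOp, Load, Store, Jmp, Jcc, Jr, Done]

# A program P is a sequence of instructions.
prog = [MovImm("r0", 173), MovReg("r1", "r0"), Done()]
assert prog[1].src == "r0" and isinstance(prog[2], Done)
```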

Interpretation
The state domains of the interpreter are illustrated in the upper box of Fig. 6. The register state ρ is a mapping from a register to a value. While in our presentation values are simply non-negative integers, our implementation distinguishes bytes, words, and strings. The memory store μ is a mapping from an address to a value. We use an instruction counter n to identify each interpreted instruction along the execution path. O denotes the observable value statistics. It is a mapping from a value to the number of its observations, that is, how many times the value appears in the current interpretation. In the lower box, we define a number of auxiliary data structures that are immutable during interpretation and a number of helper functions used in the semantic rules. In particular, we use ⊥ to denote an undefined value and π to denote the path to interpret, determined by the path sampling component (for a given seed value v). It is a mapping from an instruction count to an instruction address. For example, a 2-edge-off path for a seed value 994 can be {1000 → 0x804578, 2000 → 0x80a41f}. It means that the predicate instance with instruction count 1000 ought to take the branch starting at 0x804578 when executing the binary with the seed input 994, and the instance with count 2000 should take the branch at 0x80a41f. The helper function decode(a) disassembles the instructions in the basic block starting at a. The function valid(a) determines if an address a is valid. Note that since we enforce branch outcomes and use crafted inputs, the execution states may be corrupted. This function helps detect such corrupted states so that the interpreter can seek help from the probabilistic memory model. The function invalidLd(a) loads a value from an invalid address a.
Part of the semantic rules are in Fig. 7. As shown at the top of Fig. 7, the state configuration is a tuple of five entries. A rule is read as follows: if the preconditions at the top are satisfied, the state transition at the bottom takes place. For example, Rule JccGT says that if there is a branch a′ specified in π for the current instruction count n, the conditional jump is interpreted by taking the specified branch a′ as the continuation. Intuitively, given a seed value v, the interpreter initializes all registers and parameters with the same value v, and starts interpretation from the beginning (Rule Start). The interpretation largely follows concrete execution semantics except the following. First, when it encounters a conditional jump which is indicated by the path descriptor π to take a specific branch, it takes the specified branch (Rule JccGT). Otherwise, it follows the normal semantics (Rules JccT and JccF). Second, when it encounters a load, if the address is valid but the memory location has not been defined, it fills it with v (Rule LdUd); if the address is invalid, it fetches a value from the probabilistic memory model (Rule LdIv); otherwise it loads a value from the memory as usual (Rule LdV). Stores are interpreted similarly. We track all dynamic memory allocations for access validity checks. Details are elided as this is standard.
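These rules can be sketched as a tiny interpreter. The rule names follow the paper, but the instruction encoding, the stand-in for the probabilistic memory model, and the toy program are illustrative assumptions of ours.

```python
# Illustrative sketch of the interpretation rules of Fig. 7 (encoding is ours).

def prob_mem_load(addr, seed):
    """Stand-in for the probabilistic memory model (Rule LdIv): deterministic
    in (addr, seed), so equivalent paths observe equivalent values."""
    return (addr * 31 + seed) % 256

def interpret(prog, seed, pi, max_steps=100):
    """pi is the path descriptor: instruction count -> forced branch target."""
    regs, mem, observed = {}, {}, []
    pc, n = 0, 0                              # program counter, instr count
    while n < max_steps:
        n += 1
        op, *args = prog[pc]
        if op == "mov":                       # r = v
            regs[args[0]] = args[1]; pc += 1
        elif op == "load":                    # r1 = [r2]
            addr = regs.get(args[1], seed)
            if addr < 0:                      # invalid address -> Rule LdIv
                val = prob_mem_load(addr, seed)
            else:                             # undefined cell -> seed (LdUd)
                val = mem.setdefault(addr, seed)
            regs[args[0]] = val; observed.append(val); pc += 1
        elif op == "jcc":                     # jcc r a
            if n in pi:
                pc = pi[n]                    # forced branch (Rule JccGT)
            elif regs.get(args[0], seed) != 0:
                pc = args[1]                  # taken (Rule JccT)
            else:
                pc += 1                       # fall through (Rule JccF)
        elif op == "done":
            break
    return observed

prog = [("mov", "r0", 0),      # n=1
        ("mov", "r2", -8),     # n=2: r2 holds an invalid address
        ("jcc", "r0", 5),      # n=3: candidate for flipping
        ("load", "r1", "r0"),  # faithful path: valid (undefined) load
        ("done",),
        ("load", "r1", "r2"),  # flipped path: invalid load
        ("done",)]

assert interpret(prog, 994, {}) == [994]      # LdUd fills the cell with seed
assert interpret(prog, 994, {3: 5}) == [234]  # LdIv: (-8 * 31 + 994) % 256
```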
We also have a set of logging rules that describe how PEM records the statistics of observable values. We record the frequencies of memory addresses accessed, values loaded/stored, control transfer targets, and predicate outcomes. Due to space limitations, details are presented in Section B of the appendix. Loops and Recursion. Since our goal is to disclose semantic similarity rather than to infer semantics faithful to executions induced by real inputs, following common practice, we unroll each loop and recursive call 20 times.
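The four recorded categories can be sketched as frequency counters. The category names come from the paper; the class layout and method names are ours.

```python
from collections import Counter

class ObservableLog:
    """Sketch of the observable-value statistics: frequencies of memory
    addresses accessed, values loaded/stored, control transfer targets,
    and predicate outcomes."""
    def __init__(self):
        self.addrs = Counter()     # memory addresses accessed
        self.values = Counter()    # values loaded/stored
        self.targets = Counter()   # control transfer targets
        self.branches = Counter()  # predicate outcomes

    def on_access(self, addr, value):   # called on each load/store
        self.addrs[addr] += 1
        self.values[value] += 1

    def on_jump(self, target):          # called on each control transfer
        self.targets[target] += 1

    def on_predicate(self, outcome):    # called on each predicate instance
        self.branches[bool(outcome)] += 1

log = ObservableLog()
log.on_access(0x804578, 173)   # a load
log.on_access(0x804578, 141)   # a store to the same address
log.on_predicate(True)
assert log.addrs[0x804578] == 2 and log.values[173] == 1
```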

Path Sampling
We present the path sampling method in Algorithm 1. It consists of two functions. Function interpret at line 1 interprets the input program and flips the predicates indicated by path, a mapping from instruction count to an address (Fig. 6). Specifically, during interpretation, the algorithm flips a predicate instance to the address indicated in path when the corresponding instruction count is met. The function returns the list of encountered predicate instances.
Function sample iteratively selects a predicate instance to flip (from all the interpretation results of previous steps). Variable candidates denotes the set of candidate predicates for flipping and budget denotes the number of interpretations allowed. To begin with, PEM first interprets a faithful path without altering any branch outcome. It then adds the predicates on this faithful path to the candidates list (line 7).
As shown in the loop at line 8, PEM iteratively selects a predicate to flip (line 10), composes a new path with the outcome of the selected predicate flipped at line 11 (function getBranch() acquires the target address for the true/false branch outcome of a predicate pr), interprets the program according to the new path (line 12), and updates the list of candidates (line 13). Note that at line 10, to select the predicate instance to flip, PEM sorts all the candidate predicates by their dynamic selectivity. Then a real number x ∈ [0, 1] is sampled following the probability density function (PDF) of a Beta-distribution [20]. PEM selects the predicate at the x-percentile of the sorted candidates list, i.e., the predicate at index ⌊x · |candidates|⌋.
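Algorithm 1 can be sketched as follows. The `interpret` argument is a stub standing in for the real binary interpreter, and the predicate record layout (instruction count, selectivity, true target, false target, taken) is an assumption of ours for illustration.

```python
import random

# Illustrative sketch of Algorithm 1 (probabilistic path sampling).
def sample(interpret, budget, alpha=0.03, beta=0.03):
    preds = interpret({})              # faithful path: nothing flipped
    candidates = list(preds)
    explored = []
    for _ in range(budget):
        if not candidates:
            break
        # Sort by dynamic selectivity and pick at the x-percentile of a
        # U-shaped Beta distribution, biased toward the two extremes.
        candidates.sort(key=lambda p: p[1])
        x = random.betavariate(alpha, beta)
        pr = candidates.pop(min(int(x * len(candidates)),
                                len(candidates) - 1))
        n, _, t_tgt, f_tgt, taken = pr
        new_path = {n: f_tgt if taken else t_tgt}  # flip the predicate
        explored.append(new_path)
        candidates.extend(interpret(new_path))     # update the candidates
    return explored

# A toy interpreter: the faithful path meets two predicates; any flipped
# path exposes one new predicate.
def toy_interpret(path):
    if path:
        return [(10, 50, 0xB0, 0xB1, True)]
    return [(1, 5, 0xA0, 0xA1, True), (3, 200, 0xC0, 0xC1, False)]

random.seed(7)
paths = sample(toy_interpret, budget=3)
assert len(paths) == 3 and all(isinstance(p, dict) for p in paths)
```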

Formal Analysis of Path Sampling
The effectiveness of our path sampling algorithm builds on the following theorem. Theorem 3.1. Assume two functionally equivalent programs P and P′. If we interpret them along two equivalent paths and collect the predicate instances during interpretation, the predicate instances with the largest (smallest) dynamic selectivity in both programs have a larger probability to match, compared to those with non-extreme selectivity.
While optimizations (e.g., constraint elimination [29]) may modify predicates to simplify control flow, predicates with the smallest and largest dynamic selectivity are most resilient to optimizations, namely, their selectivity ranking hardly changes before and after optimizations. Modifications to predicates introduced by optimizations fall into two categories: predicate elimination and insertion. A predicate relocation can be considered as first removing the predicate and then adding it to another location. Specifically, the compiler may eliminate a predicate if its outcome is implied by the path condition reaching the predicate. For example, it may eliminate a predicate x > 10 if the path condition includes x > 20. On the other hand, the compiler may introduce new predicates to provide control flow shortcuts. Take Fig. 8 as an example. The compiler inserts a new predicate, x < 10, in Fig. 8b (shown in red). The modification simplifies the control flow when x is less than 10. Note that, in these cases, the dynamic selectivity of an inserted predicate will be close to the dynamic selectivity of an existing one because these inserted predicates are derived from constraints in existing predicates.
The intuition of our theorem is hence that the rankings of predicates with the smallest/largest selectivity do not depend on whether other predicates are modified.In contrast, the predicates ranked in the middle by their selectivity are more likely to have their rankings changed when predicates are removed or added by optimization.Proof Sketch.We formalize the intuition by first reasoning about the predicates having close to the smallest dynamic selectivity.Reasoning for the largest ones is symmetric.Suppose that for each predicate, the compiler has a probability  to eliminate it and a probability  for having a predicate inserted that ranks right before it.In either case,  we say the predicate is modified.The probability that a predicate is not modified is noted as  = 1 −  − .We further denote as P  the probability that the -th smallest predicate is still the -th smallest one after optimization.It is calculated by the following formula: Intuitively, the ranking of the -th smallest predicate is not changed by optimizations if (a) this predicate is not modified and (b) the number of predicates with a smaller dynamic selectivity does not change.In the above formula,  represents condition (a) and the second term represents condition (b).Specifically, (b) is satisfied only when the numbers of removed and inserted predicates that rank before  are equal.Here,  −1 2   −1−2 means an even number (2) of the  − 1 predicates with a smaller ranking are modified, and 2      means half of the modifications are removals and the other half are insertions.We visualize the distribution of P  in Fig. 
9 with three sets of configurations of t and q. We can see that in all setups, P_k monotonically decreases as k increases. □ We also conduct an empirical study to validate our theoretical analysis. The results are visualized in Section 4.5. They show PEM has an 80-90% chance of making correct selections and exploring equivalent paths by deterministically selecting the predicates with the largest/smallest dynamic selectivity. Advantages of Probabilistic Path Sampling Over Deterministic Selection. Note that the probability of predicates with the smallest/largest selectivity having their rankings changed by optimization is not 0, although it is smaller than for others. To tolerate such uncertainty, we employ a probabilistic approach, meaning that we sample from a Beta distribution instead of deterministically selecting the predicates with extreme selectivities for flipping. We further conduct a formal analysis to justify why the probabilistic sampling algorithm is better than the deterministic one. Intuitively, by following a Beta distribution, PEM spends some budget on predicates that do not have the largest or smallest selectivity, but selectivities close to the largest and smallest. These "additional" selections increase the probability that PEM selects the correct path (i.e., the equivalent path) at each step. Taking more correct steps at earlier selections increases the chance that PEM chooses a correct step at later selections because the candidate predicates of later selections come from previously explored paths. The formal proof is shown in Section C of the appendix. Effect of Path Infeasibility. Our algorithm may select infeasible paths. Two possible concerns are (1) whether observable values along infeasible paths in two similar binaries can correctly disclose their semantic similarity; and (2) whether observable values along infeasible paths in two dissimilar binaries may undesirably match, leading to the wrong conclusion of their similarity.
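The ranking-stability probability discussed earlier can be checked numerically. The sketch below is our own Python rendering of the formula P_k = p · Σ_i C(k−1, 2i) · p^(k−1−2i) · C(2i, i) · t^i · q^i (the function name `rank_stable_prob` is our choice); it confirms the monotonic decrease in k for the three configurations plotted in Fig. 9.

```python
from math import comb

def rank_stable_prob(k, t, q):
    """P_k: probability that the k-th smallest-selectivity predicate keeps
    its rank, given per-predicate elimination probability t and insertion
    probability q (so p = 1 - t - q is the unmodified probability)."""
    p = 1.0 - t - q
    total = 0.0
    for i in range((k - 1) // 2 + 1):
        # an even number (2i) of the k-1 smaller-ranked predicates are
        # modified, half by removal and half by insertion
        total += (comb(k - 1, 2 * i) * p ** (k - 1 - 2 * i)
                  * comb(2 * i, i) * t ** i * q ** i)
    return p * total

# the three configurations plotted in Fig. 9
for t, q in [(0.1, 0.05), (0.2, 0.1), (0.2, 0.2)]:
    probs = [rank_stable_prob(k, t, q) for k in range(1, 11)]
    assert all(a > b for a, b in zip(probs, probs[1:]))  # monotonically decreasing
```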
For the first concern, we show that PEM likely selects corresponding paths when two binaries are similar, regardless of the feasibility of the selected paths. That is, although the paths may be infeasible, the sequences of observable values along them are equivalent. We show a proof sketch in Section D.1 and empirical support in Section D.3 of the appendix.
For the second concern, the probability that two equivalent paths are selected by PEM in two dissimilar binaries is very small. In those cases, although the initial seed paths may be undesirably similar (e.g., the error-handling paths), the subsequent flipped (infeasible) paths quickly become substantially different. The formal proof is in Section D.2 and the empirical study in Section D.3 of the appendix.
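For illustration, the Beta-distribution selection described above can be sketched as follows. The Beta parameters, the ranking scheme, and the function name are our own assumptions for this sketch, not PEM's actual configuration.

```python
import random

def pick_predicate(selectivities, rng, a=1.0, b=8.0):
    """Pick a predicate to flip, biased toward the smallest selectivity.

    Instead of deterministically taking the minimum, sample a rank
    position from Beta(a, b) (mass concentrated near 0), so predicates
    with close-to-smallest selectivity also receive some budget."""
    ranked = sorted(range(len(selectivities)), key=lambda i: selectivities[i])
    pos = rng.betavariate(a, b)                       # in [0, 1], biased toward 0
    return ranked[min(int(pos * len(ranked)), len(ranked) - 1)]

rng = random.Random(0)
sels = [0.9, 0.05, 0.4, 0.7, 0.02, 0.3]
picks = [pick_predicate(sels, rng) for _ in range(1000)]
# the two smallest-selectivity predicates (indices 4 and 1) dominate
assert sum(p in (4, 1) for p in picks) > 500
```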

Probabilistic Memory Model
The goal of the probabilistic memory model (PMM) is to handle loads and stores with invalid addresses induced by predicate flipping and the use of (out-of-bound) seed values. A key observation is that the specific values written to or read from the PMM do not matter as long as they can expose functional equivalence. We define the following two properties for a valid PMM. Definition 3.1. We say a PMM is equivalence preserving if the sequence of (invalid) addresses accessed, and the values written to and read from the PMM, are equal for two equivalent paths in two functionally equivalent programs.
This property ensures PEM can place equivalent programs into the same class. Definition 3.2. We say a PMM is difference revealing if the sequence of (invalid) addresses accessed, and the values written to and read from the PMM, are different for two different paths (pertaining to invalid memory accesses) in two respective programs, which may or may not be equivalent. This ensures different programs are not mistakenly placed in the same class. For example, a naive PMM that always returns a constant value for invalid reads and ignores invalid writes is equivalence preserving but not difference revealing.
Our PMM is designed as follows. Before each interpretation run, it initializes a probabilistic memory (PM), which is a mapping Addr → Val of size S such that ∀a ∈ [0, S), PM[a] = random(). An invalid memory read from the normal memory M with address a is forwarded to the PM through the invalidLd(a) function, which returns PM[a mod S]. Similarly, an invalid memory write to the normal memory M with address a and value v is achieved by setting PM[a mod S] = v.
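A minimal sketch of this design (the class and method names are our own; how addresses are classified as invalid is outside this sketch):

```python
import random

class ProbabilisticMemory:
    """Backing store for invalid loads/stores in the PMM design: a mapping
    of size S whose cells are filled with random values before each run."""

    def __init__(self, size=64, seed=None):
        rng = random.Random(seed)
        self.size = size
        self.cells = [rng.getrandbits(64) for _ in range(size)]

    def invalid_load(self, addr):
        # invalid reads are forwarded to PM[addr mod S]
        return self.cells[addr % self.size]

    def invalid_store(self, addr, value):
        # invalid writes land in PM[addr mod S]
        self.cells[addr % self.size] = value

# two equivalent paths issuing the same invalid accesses observe the same
# values when their PMs are initialized identically
pm_a, pm_b = ProbabilisticMemory(seed=7), ProbabilisticMemory(seed=7)
assert pm_a.invalid_load(0xdeadbeef) == pm_b.invalid_load(0xdeadbeef)
pm_a.invalid_store(0x1000, 42)
assert pm_a.invalid_load(0x1000 + 64 * 3) == 42  # aliases via mod S
```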
It can be easily inferred, by induction on the length of program paths, that our PMM satisfies the equivalence-preserving property. Intuitively, the first invalid accesses in two equivalent paths must have the same invalid address. As such, our PMM must return the same random value. This value may be used to compute other identical (invalid) addresses in the two paths, so that the following invalid loads/stores are also equivalent. The PMM also probabilistically satisfies the difference-revealing property. Specifically, different paths manifest themselves through some different invalid addresses, and our PMM likely returns different (random) values for these addresses, rendering the subsequent memory behaviors (with invalid addresses) different. The chance that different paths exhibit the same behavior depends on S. Due to the complexity of modeling memory behavior in real-world program paths, we did not derive a theoretical probabilistic bound for our PMM. However, empirically we find that S = 64 yields very good results (with our loop-unrolling bound of 20). An example can be found in Section E of the appendix.

EVALUATION
We implement PEM on QEMU [38]. Details are in Section F of the appendix. We evaluate PEM via the following research questions: RQ1: How does PEM perform compared to the baselines? RQ2: How useful is PEM in real-world applications? RQ3: Is PEM generalizable? RQ4: How does each component affect the performance?

Setup
We conduct the experiments on a server with a 24-core Intel(R) Xeon(R) 4214R CPU at 2.40GHz, 188G memory, and Ubuntu 18.04. Datasets. We use two datasets. Dataset-I: To compare with IMF and BLEX, which only use Coreutils [9] as their dataset, we construct a dataset from Coreutils-8.32. We compile the dataset using GCC-9.4 and Clang-12, with 3 optimization levels (i.e., -O0, -O2, and -O3). Dataset-II includes 9 real-world projects commonly used in binary similarity analysis [25,31,34]: Coreutils, Curl, Diffutils, Findutils, OpenSSL, GMP, SQLite, ImageMagick, and Zlib. The binaries are obtained from [34]. In total, we have 30 programs with 35k functions, compiled with 3 different options. Details can be found in Table 8 of the appendix. Baseline Tools. We compare with 6 baselines. For execution-based methods (Baseline-I), we use IMF [46] and BLEX [13], which are SOTAs as far as we know. For Deep Learning methods (Baseline-II), we use SAFE [33] and Trex [34]. We use their pre-trained models or train using their released implementations with the default hyperparameters. Also, we compare with the best two models (i.e., GNN and GMN) in How-Solve [31], which conducts a measurement study on Machine Learning methods. Metrics. Following the same experiment setup as IMF and BLEX, for a function compiled with a higher-level optimization option (e.g., -O3), we query the most similar function among all the functions (in the same binary) compiled with a lower-level optimization option. As such, there is only one matched function. We hence use Precision at Position 1 (PR@1) as the metric. Given a function, PR@1 measures whether the matched function scores the highest out of the pool of candidate functions. Many data-driven methods [31,33,34,54] report the Area Under the Receiver Operating Characteristic (ROC) curve. Existing literature [3] points out that a good AUC score does not necessarily imply good performance on an imbalanced dataset (e.g., class 1 having 1 sample and class 2 having 100). We therefore choose PR@1 as our metric, which aligns better with the real-world (imbalanced) use scenario of binary similarity.
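The PR@1 metric itself is simple to compute; a sketch (data layout and names are our own):

```python
def precision_at_1(queries):
    """PR@1: fraction of queries whose ground-truth match scores highest.

    queries: list of (scores, truth_index) pairs, where scores[i] is the
    similarity of the query function against candidate i."""
    hits = sum(
        1 for scores, truth in queries
        if max(range(len(scores)), key=scores.__getitem__) == truth
    )
    return hits / len(queries)

# toy pool: 2 of 3 queries rank their true counterpart first
queries = [
    ([0.2, 0.9, 0.1], 1),   # hit
    ([0.8, 0.3, 0.5], 0),   # hit
    ([0.4, 0.6, 0.7], 0),   # miss: candidate 2 outranks the truth
]
assert abs(precision_at_1(queries) - 2 / 3) < 1e-9
```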

RQ1: Comparison to Baselines
Comparison to Baseline-I. We compare PEM with Baseline-I on Dataset-I. To conduct the evaluation, we first use PEM to sample each function in these binaries and aggregate the distribution of observable values. Then, for each function in an optimized binary, we compute its similarity score against all functions of the same program compiled with a lower optimization level, and use the ones with the highest scores to compute PR@1. Besides PR@1, we also use PR@3 and PR@5 for a more thorough comparison with IMF.
The comparison results with IMF and BLEX are shown in Table 1. The first two columns list the compilers and the optimization flags used to generate the reference and query binaries. Columns 3-5, 6-7, and 8-9 list PR@1, PR@3, and PR@5, respectively. Note that BLEX only reports PR@1 and does not have results for binaries compiled with Clang. PEM outperforms BLEX on PR@1 and outperforms IMF on all 3 metrics under all settings. In particular, for the function pairs (Clang-O0, GCC-O3) and (GCC-O0, Clang-O3), which are the most challenging settings in our experiment, PEM outperforms IMF by about 25%.
Comparison to Baseline-II. We compare PEM with Baseline-II on Dataset-II. Following the setup of How-Solve [31], for each positive pair (of functions), namely, similar functions, 100 negative pairs (i.e., dissimilar functions) are introduced to build up the test set.
The results are shown in Fig. 10. The x axis represents different programs, and the y axis is PR@1. The results of PEM, GNN, and GMN are shown in green, yellow, and red bars, respectively. The average PR@1 of each tool is marked by the dashed line in the corresponding color. Note that GNN and GMN are the best two models out of all 10 ML-based methods in How-Solve [31] (including Trex and SAFE). As Fig. 10 shows, PEM outperforms both baselines.

RQ2: Real-World Case Study
We demonstrate the practical use of PEM via a case study of detecting 1-day vulnerabilities. Suppose that after a vulnerability is reported, a system maintainer wants to know if the vulnerable function occurs in a production system. She can use PEM to search for the vulnerable function among a large number of binary functions and decide whether further actions should be taken (e.g., patching the system). We collect 8 1-day vulnerabilities (CVEs) and use the optimized version of the problematic function to search for its counterpart in the unoptimized binary. The results show that in 7 out of the 8 cases, our tool can find the ground-truth function as the top one, while the other two ML-based methods each can only find 1 of them. Even if we look into the top 30, both of them can only find 2 of these problematic functions. Details can be found in Section I of the appendix.

RQ3: Generalizability
We evaluate the generalizability of PEM from three perspectives. First, we show that PEM is efficient so that it can scale to large projects. Second, we illustrate that PEM has good code coverage for most functions, meaning it can explore enough semantic behavior even for complex functions. Last but not least, besides x86-64, we show that PEM can support another architecture with reasonable human effort, meaning that PEM can be easily generalized to analyzing binary programs from multiple architectures, without the need for substantial effort in building lifting or reverse-engineering tools to recover high-level semantics from binaries. Efficiency. PEM analyzes more than 3 functions per second in most cases. Note that this is a one-time effort. After interpretation and generating semantic representations, PEM searches these representations to find similar functions. PEM compares more than 2000 pairs per second in most cases. The comparison can be parallelized. With 4 processes, we are able to compare 1.7 million function pairs in 4 minutes (wall-clock). We visualize the results in Figure 27 of the appendix. PEM takes 13 minutes to cover more than 95% of the code for all functions in Coreutils (with a single thread). In comparison, the forced-execution based method BLEX takes 1.2 hours. In our experiment, PEM takes 26 minutes to process two Coreutils binaries compiled with different optimization levels, and it takes another 14 minutes to compare all 1.7 million function pairs between these two binaries, yielding a total time cost of 40 minutes. While IMF takes 32 minutes to complete the same task, PEM achieves significantly better precision than IMF. Machine learning models typically have an expensive training time but better performance at test time.
Coverage. The code coverage of PEM on Dataset-II is shown in Fig. 11. The x axis marks the projects and the y axis shows the percentage of functions for which PEM has achieved various levels of coverage, denoted by different colors. As we can see, 90% of the functions in -O0 and 85% of the functions in -O3 have full or close-to-full coverage. The functions with less than 40% coverage have extremely complex control-flow structures, with many inlined callees. For example, the main function of sort in Coreutils has 496 basic blocks, resulting in millions of potential paths. Note that even with such a huge path space, PEM is still able to select similar paths and collect consistent values with a high probability. Cross-arch Support. We add AArch64 [1] support to PEM with only around 200 lines of C++ code and 0.5 person-days of effort. This is possible because our probabilistic execution model is general and does not rely on specialized features of the underlying architecture. PEM achieves a PR@1 of 86.8 for Coreutils (-O0 and -O3) on AArch64, whereas its counterpart on x86-64 is 89.4. In addition, it achieves a PR@1 of 84.9 when we query with functions compiled on x86-64 in a pool of functions compiled on AArch64. Details can be found in Table 7 of the appendix.

RQ4: Ablation Study
Probabilistic Path Sampling. First, we empirically validate our hypothesis that branches with the largest and smallest selectivity are stable before and after code transformations. We collect equivalent interpretation traces from the main functions in Coreutils binaries compiled with different options. Then we analyze the matching traces and check whether the predicates with the largest and the smallest selectivity in these cross-version traces match, leveraging the debug information. In total, we study 636 traces from 6 binaries with a total of 16k predicate instances. We observe that with a probability of more than 80%, our hypothesis holds. The detailed results are shown in Fig. 12. From the two ends of the lines, we can observe that in more than 80% of cases, the predicates with the smallest and the largest selectivity match. In contrast, those in the middle do not have such a property. The median for the max-3 selectivity is even close to 0%. For predicate instances with the smallest/largest selectivity in one trace (e.g., -O3), we further study the selectivity rankings of their correspondences in the other trace (e.g., -O0). The results are visualized in Fig. 13. Observe that in more than 98% of cases, they have the top-3 smallest or largest selectivity in the other trace.
Furthermore, we select the 80 most challenging functions in Coreutils to further study the effectiveness of our path sampling strategy. These functions have more than 150 basic blocks and an average connectivity larger than 3, namely, a block is connected to more than 3 blocks on average. We compare the performance of 3 path sampling strategies. The results are shown in Table 2. The three rows show the PR@1 and the code coverage for -O0 and -O3 functions, respectively. The second column presents a strategy in which PEM flips the last predicate encountered in the previous round with an uncovered branch. The third column denotes a strategy in which

PEM deterministically flips the predicates with the largest and the smallest selectivity at each round. The last column presents our probabilistic path sampling strategy. Observe that the probabilistic strategy substantially outperforms the other two, and both the deterministic and probabilistic strategies achieve good coverage. Code Coverage versus Precision. We run PEM with different round budgets on Coreutils and observe coverage and precision changes. The results are shown in Table 3. Observe that if we only interpret each function once without any flipping, the precision is as low as 70 and the coverage is low too. With more budget, namely, flipping more predicates, both the precision and the coverage improve, indicating PEM can expose equivalent semantics. But the improvement becomes marginal after 200. Probabilistic Memory Model (PMM). We run PEM with different memory model setups on Coreutils to illustrate the benefit of modeling invalid memory accesses. The results are in Table 4.

Figure 12: Predicate Correspondence versus Dynamic Selectivity. Each blue dashed line represents the analysis results of path pairs from two respective binaries compiled differently from a program. The x axis represents selectivity (with min the minimal and max the maximum) and the y axis denotes the percentage of predicate matches. We also compute the median for each selectivity, resulting in the orange line.
Specifically, No-Mem means we do not model invalid memory accesses: we return random values for invalid reads and simply discard invalid writes. The precision of No-Mem is nearly 10% lower than PMM, while their coverage is similar. That is because some dependencies between memory accesses are missed without handling invalid writes. On the other hand, if we allow writes to invalid memory regions but always return a constant value for all invalid reads, as shown in the Const column, the precision is better than No-Mem but still inferior to PMM. This is because returning a constant value makes reads from different invalid addresses indistinguishable. Robustness. We alter system configurations of PEM and run random sampling for each probabilistic component in PEM. The experimental results show that PEM is robust with regard to different configurations and variance in sampling. Details can be found in Section G of the appendix.

RELATED WORK
Binary Similarity. Many existing techniques aim to detect semantically similar functions, driven by static [12,27] and dynamic [13,16,18,46] analysis. A number of representative methods have been discussed in Section 2.2. Other techniques compare code similarity at different granularities, e.g., whole binary [28,50], assembly [11,14,48], and basic block [36]. While our method represents semantics at the function level, the resulting value sets of our system can be used as function semantic signatures and facilitate comparisons at other granularities. Forced Execution. Forced execution [13,35,53,57] concretely executes a binary along different paths by flipping branch outcomes. These techniques typically aim to cover more code in a program and thus use coverage as the guidance. They can hardly select similar sets of paths for the same program compiled with different optimizations. Their focus is on recovering from invalid memory accesses. In contrast, the probabilistic memory model of PEM reveals the different semantics introduced by different invalid accesses with high probability.

CONCLUSION
We develop a novel probabilistic execution model for effective sampling and representation of binary program semantics. It features a path-sampling algorithm that is resilient to code transformations and a probabilistic memory model that can tolerate invalid memory accesses. It substantially outperforms the state of the art.

DATA AVAILABILITY
Our experimental data and the artifact are available at [49].

A 𝐾-EDGE-OFF BEHAVIOR CHANGE ACROSS OPTIMIZATION
Take main_cat in Fig. 1 as an example. The condition check at line 13 is compiled to the structure shown in the blue circle of Fig. 2b. The program first checks the value of flag0. If flag0 is false, it further checks the value of flag1. If either evaluates to true, the program checks the value of format. Note that format always has the same value as flag0, as shown at line 4. Thus the check on format is redundant when flag0 is true. In Fig. 2a, the compiler removes this check from the true-path of flag0. Assume that for some input the values of flag0, flag1, and format are all false. Any path reaching simple_cat in Fig. 2b has to be at least 2-edge-off, since one has to flip either flag0 or flag1 and also flip format.
In contrast, in Fig. 2a, paths going to simple_cat are 1-edge-off, requiring only flipping the branch on flag0 (due to optimization). This suggests the 1-edge-off behaviors of the two programs are quite different.

B LOGGING RULES
The logging rules are shown in Fig. 14. They only update the observable value statistics, without changing any other state. They have a higher priority than the interpretation rules, namely, they fire before the interpretation rules if both have their preconditions satisfied. They only fire once for a unique combination of preconditions and states. This ensures values are properly logged (only once) before they are updated. The first three rules dictate that we record both address and value in memory accesses, including those to the probabilistic memory. PEM also records predicates (Rule LogCC) and jump targets (Rules LogJN and LogJR). Note that although different binaries from the same source may have different numeric values for comparisons and jump targets, PEM normalizes these values to reveal similarity between equivalent programs. Details are discussed in Section F of this material.

C ADVANTAGES OF PROBABILISTIC PATH SAMPLING
In this section, we prove that the probabilistic path sampling strategy outperforms the deterministic one. We formally model the path selection workflow of PEM as follows. Given a query program, PEM first selects a set of paths to interpret and produces a set of observable value traces. Then, it applies the same path selection strategy to the pool of candidate programs, which may include multiple versions of the query program. These versions are similar to the query program and the others are dissimilar. During the path selection for the programs in the pool, we say a predicate selection is correct if the chosen predicate's correspondence in other versions of the program also has the smallest selectivity (we only focus on proving the case of the smallest selectivity as the other end is symmetric). Intuitively, if PEM makes more correct steps, it has better precision in the downstream similarity analysis. On the other hand, the return of growing correct steps becomes marginal when the number of correct steps becomes large, as the analysis already has sufficient information to expose similarity. We hence hypothesize that the relation between the number of correct steps and the similarity analysis precision follows a distribution like in Fig. 15 (and we will empirically demonstrate it later).
Our proof hence shows that the probabilistic selection strategy can increase the density of the green area in Fig. 15 (denoting a middle range of correct steps), at the cost of reduced density in the two orange areas on the two sides, when compared to the deterministic selection. Intuitively, having a higher density in the green area than in the orange area to its left means that the probabilistic selection strategy biases towards a higher precision (by having a larger number of correct steps); having a higher density in the green area than in the orange area to its right means that we sacrifice the chance of having a very large number of correct steps, which is affordable because its gain on precision is marginal.
To formally prove it, we take the following steps. (1) We first empirically show that the relation between precision and the number of correct steps is indeed like Fig. 15, which can be approximated by a Pareto distribution [2]. (2) We then prove that the probabilistic selection strategy improves the density for the mid-range correct steps. (3) Finally, we show that with the new density function and the Pareto distribution, the expected precision is improved.
Step (1): Modeling the Relation between Precision and the Number of Correct Steps. It is difficult to compute the number of correct steps during the execution of PEM. We hence approximate it using a simplified probabilistic model and derive the relation between precision and the expected number of correct steps.
We first consider a deterministic path sampling strategy that exclusively flips the predicate with the smallest selectivity in a single path. We use a pair of numbers (c, e) to denote the state of PEM at each step, with c (correct) denoting the number of previous steps in which PEM correctly selects a predicate, and e (error) denoting the number of previous steps in which PEM misses the true correspondence. At a given system state (c, e), we define the probability that PEM selects the next predicate correctly (from the aggregated pool of paths and predicates from all the previous steps) as follows.

Figure 15: The x axis denotes the number of correct steps; the y axis denotes the PR@1 PEM achieves. The dark blue line depicts the PR@1 of PEM w.r.t. the number of correct steps. The effects of the probabilistic path sampling algorithm are shown via red arrows: the algorithm moves cases from the yellow areas to the green area.

PC(c, e) = c/(c + e) · p_0

Intuitively, PEM first picks a path from a correct previous step (with probability c/(c + e)) for further flipping and then selects the correct predicate to flip on this path (with probability p_0).
The probability that a certain system state (c, e) appears, denoted P(c, e), can be calculated as follows:

P(c, e) = 0, if c < 0 or e < 0
P(1, 0) = p_0 and P(0, 1) = 1 − p_0
P(c, e) = P(c − 1, e) · PC(c − 1, e) + P(c, e − 1) · (1 − PC(c, e − 1)), otherwise

There are three initial cases in the formula. The first case means c and e can never be less than 0. The following two cases correspond to the two outcomes of the first round of interpretation. For other steps, state (c, e) is reached by either selecting a correct predicate at state (c − 1, e) or selecting a wrong predicate at state (c, e − 1). For a given budget n, P(c, n − c) denotes the probability that PEM makes c correct selections in n steps. We treat c as a random variable and study how it impacts the final precision (of function similarity analysis).
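The state-probability recurrence can be evaluated with a simple forward dynamic program. The sketch below is our own illustration; the value of p_0 is an assumed placeholder, not a number from the paper.

```python
P0 = 0.8   # assumed probability of a correct pick on the smallest-selectivity predicate
N = 400    # round budget

def pick_correct(c, e):
    # PC(c, e) = c / (c + e) * p_0
    return c / (c + e) * P0

# P maps state (c, e) -> probability; after round k, all states satisfy c + e == k
P = {(1, 0): P0, (0, 1): 1.0 - P0}
for _ in range(N - 1):
    nxt = {}
    for (c, e), prob in P.items():
        pc = pick_correct(c, e)
        nxt[(c + 1, e)] = nxt.get((c + 1, e), 0.0) + prob * pc          # correct step
        nxt[(c, e + 1)] = nxt.get((c, e + 1), 0.0) + prob * (1.0 - pc)  # wrong step
    P = nxt

# the states form a full distribution; compute the expected correct steps
expected_correct = sum(c * prob for (c, e), prob in P.items())
assert abs(sum(P.values()) - 1.0) < 1e-9
assert 0.0 < expected_correct < N
```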
We use 970 Coreutils functions and run 6 experiments with different budgets. While it is hard to quantify the number of correct steps in a real-world system, we use the probability mass function in Equation (3) to compute the expected number of correct steps for each budget. For example, if the budget is 400, we compute the expected number of correct steps via Σ_{c=0}^{400} P(c, 400 − c) × c. The (empirical) precision and the corresponding (expected) correct steps are shown in Fig. 16. The blue dashed curve fits the data points. Observe that the precision increases sharply at the beginning with the increase of correct steps and then the gains become marginal, which aligns well with Fig. 15. Inspired by the Pareto distribution [2], whose probability density function has the aforementioned property, we model the relation between PR@1 and the number of correct steps c as follows:

PR_1(c) = P_0 + s · (1 − (c_m / c)^α), for c ≥ c_m

In the above equation, c_m is the minimal number of correct steps required to achieve a reasonable precision. The precision achieved with c_m correct steps is denoted by P_0; s and α are two parameters defining the maximum value of the precision and how fast the increment decays as the number of correct steps grows. The blue curve in Fig. 16 has the parameters P_0 = 65, c_m = 10, s = 21, α = 1.4.
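A quick check of this Pareto-style fit. The closed form below is our reconstruction from the parameter descriptions (baseline precision at c_m correct steps, saturation at P_0 + s, decay rate α), so treat it as a sketch rather than the paper's exact formula.

```python
def pr1(c, p0=65.0, cm=10, s=21.0, alpha=1.4):
    """Modeled PR@1 after c correct steps: p0 + s * (1 - (cm / c)**alpha).

    p0 is the precision at c == cm; the curve saturates below p0 + s."""
    assert c >= cm
    return p0 + s * (1.0 - (cm / c) ** alpha)

# precision starts at p0, grows monotonically, and stays below p0 + s = 86
assert pr1(10) == 65.0
assert pr1(10) < pr1(20) < pr1(40) < pr1(10**6) < 86.0
```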
Step (2): Probabilistic Path Selection Improves the Mid-range Correct-Step Density. The essence of probabilistic path selection is to perform additional sampling at each step, flipping predicates with selectivity close to the smallest/largest. Suppose that we spend r steps of the budget on additional sampling. The probability that PEM selects a correct predicate at each step is updated as follows.

PC(c, e) = c/(c + e) · (p_0 + (1 − p_0) · g · r/n)

where n is the budget, and g is a factor representing the probability that PEM indeed selects a correct predicate that does not have the smallest selectivity. Intuitively, when the correct predicate does not have the smallest selectivity (with probability 1 − p_0), PEM has likelihood g of selecting the predicate when given an additional round. For example, the predicate likely has the next-to-smallest selectivity. Note that we assume r is relatively small compared to n, so we can ignore the noise introduced by the additional samplings.
We visualize the distribution difference introduced by the probabilistic sampling algorithm in Fig. 17. We can see that when r = 80, PEM has a larger probability of making 70-230 correct steps (i.e., the denser green area).
Step (3): Probabilistic Path Selection Yields Better Expected Precision. We then show that the overall expected PR@1, denoted E(PR_1(r)), is increased. It is computed as follows:

E(PR_1(r)) = Σ_{c=0}^{n−r} P(c, n − r − c) · PR_1(c)
The possible number of correct steps PEM makes in total ranges from 0 to n − r (i.e., the total number of steps). For each possible number c, P(c, n − r − c) denotes the probability that PEM makes c correct steps in total, and PR_1(c) denotes the expected PR@1 when PEM makes c correct steps. We show in Fig. 18 that as r grows, the expected PR@1 of PEM improves. Thus the probabilistic sampling algorithm achieves better performance than a deterministic one.

D EFFECT OF PATH INFEASIBILITY
In the following, we formally analyze two important properties of PEM related to infeasible paths.

D.1 Selecting Infeasible Paths Does Not Affect the Effectiveness of PEM
Assume two equivalent programs P and P′. We want to prove that PEM can faithfully disclose their equivalence even when it selects infeasible paths. We first consider a simplified scenario in which each path in P, including infeasible paths, has a corresponding path in P′, meaning that the two paths produce the same sequence of observable values. This precludes code-removal types of optimizations. Starting from two seed paths, which are equivalent as they are derived from the same input (on the two equivalent programs), we can derive that the predicates having the smallest and largest selectivity correspond to each other along the two seed paths. As such, flipping them yields two equivalent new paths regardless of their feasibility. This can be turned into a formal inductive proof. Details are elided. Complexity arises when P′ is P with certain (dead) code removed. In such cases, PEM may select a path in P that includes the removed code and does not have a correspondence in P′. We prove the property by leveraging the observation that code removal is rare (due to the difficulty of proving path feasibility/infeasibility). As such, we assume any path in P has a large probability of having an equivalent path in P′. We can hence derive a probabilistic proof similar to the above. Details are elided.

D.2 Selecting Infeasible Paths in Dissimilar Binaries

Consider two dissimilar programs P_1 and P_2. We prove that most infeasible paths that PEM samples in P_1 will not coincide with paths in P_2. The proof consists of three key steps. First, we show that for any path sampled from P_1, the probability that it coincides with a path in P_2 decreases exponentially as it contains more flipped predicates. Then we demonstrate that most paths sampled by PEM contain at least 4 flipped predicates (given a sample budget of 400 paths). Finally, we show that the expected number of coincided paths is only 25, whose effects are limited out of the 400 sampled paths in total.

Recall that we use the term k-edge-off path to denote a path on which k predicates are flipped. We denote by P_c(k) the probability that a k-edge-off path from P_1 coincides with a path in P_2. An upper bound of P_c(k) is computed as follows.
P_c(k) ≤ 1, if k = 0
P_c(k) ≤ P_c(k − 1) · 0.5 + (1 − P_c(k − 1)) · 0, if k > 0        (7)

The first case of Equation 7 shows the probability of coincidence for a seed path (i.e., the faithful path of executing P_1 on seed values without flipping any predicate). We give a pessimistic estimation by assuming the seed path in P_1 is equivalent to a path in P_2 with a probability as high as 1.
The second case of Equation 7 shows the probability that a k-edge-off path in P_1 coincides with a path in P_2 for any k > 0. Note that a k-edge-off path is derived from a (k − 1)-edge-off path. The first term hence depicts the case where the (k − 1)-edge-off path coincides with a path in P_2. In the worst case, the predicate PEM chooses to flip in P_1 has an aligned predicate on the corresponding path in P_2. Flipping the predicate has at most a 0.5 probability of leading the execution to a basic block that is in P_2, resulting in an equivalent edge-off path. On the other hand, if the (k − 1)-edge-off path already contains basic blocks that are not in P_2, the execution context will not be in alignment with any execution context in P_2. Thus the derived k-edge-off path will not coincide with any path in P_2.
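This bound gives P_c(k) ≤ 0.5^k, which is consistent with the numbers discussed in this appendix: with a 400-path budget and at least 4 flipped predicates per path, the expected number of coincided paths is at most 400 · 0.5^4 = 25. A quick check:

```python
def coincide_bound(k):
    """Upper bound on P_c(k): each additional flipped predicate at most
    halves the probability that the path coincides with one in P_2."""
    return 0.5 ** k

assert coincide_bound(4) == 0.0625
# expected coincided paths out of a 400-path budget with >= 4 flips each
assert 400 * coincide_bound(4) == 25.0
```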
We have shown that the probability of coincidence for a k-edge-off path decreases exponentially as k grows. Next we reason about the distribution of k by modeling the sampling behavior of PEM. At each step, PEM flips the predicate with the largest/smallest dynamic selectivity. We assume the probability is the same for each path that a selected predicate comes from that path. The probability of the selected predicate being on a k-edge-off path is then (# k-edge-off paths) / (# all sampled paths).

Consider the two functions findDouble and findPoint; the latter traverses a linked list with a node size of 0x18. Fig. 22 shows a PM by PEM, which has a size of 128 bytes and is filled with random values. Assume in the interpretations of both functions, the pointer to the root of the respective linked list is null (e.g., the preceding linked-list allocation and initialization are bypassed due to predicate flipping). With our PMM, both functions access the invalid address 0x00 and get the same value (because our PMM is equivalence preserving). However, as the interpretation progresses, the two functions access different sets of addresses. findDouble treats the value at address 0x08 as the pointer to the next node. When it tries to access the address 0x33...5418 (the value stored at 0x08), our PMM maps it to address 0x18 in PM via the operation mod 128. It then reads from 0x18 and treats the value at 0x20 as the next pointer, and so on. The cells accessed by findDouble, 0x00 − 0x18 − 0x50, are chained by the blue arrows in the figure. In contrast, although findPoint also accesses 0x00 at the beginning, the pointer field of the node it traverses is at 0x10, resulting in a different access chain 0x00 − 0x30 − 0x60, connected by the red arrows.
As such, the observable values are different for the two functions, denoting their different semantics.In addition, one can easily tell that two versions of findDouble, optimized and unoptimized, would produce the same sequence of observable values.Note that if the two functions traverse a linked list of the same node size, their behaviors are indeed not separable.□
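The divergence above can be sketched as follows; the concrete memory contents, the pointer-field offsets, and the mod-128 mapping are illustrative assumptions modeled after the figure, not PEM's actual PMM implementation:

```python
PM_SIZE = 128  # size of the probabilistic memory (PM) region

# Hypothetical "random" 8-byte values pre-filled in PM, chosen so the
# traversals reproduce the chains in the example.
pm = {
    0x08: 0x33000000_00005418,  # findDouble: next pointer of node 0x00
    0x20: 0x7FF00000_00001450,  # findDouble: next pointer of node 0x18
    0x10: 0x0000DEAD_00002230,  # findPoint:  next pointer of node 0x00
    0x40: 0x12340000_00001260,  # findPoint:  next pointer of node 0x30
}

def traverse(ptr_offset: int, steps: int) -> list:
    """Follow a linked list whose next-pointer field sits at ptr_offset,
    mapping every (possibly invalid) address into PM via mod PM_SIZE."""
    chain, addr = [], 0x00  # both functions start at the null root
    for _ in range(steps):
        chain.append(addr)
        raw = pm.get((addr + ptr_offset) % PM_SIZE, 0)  # PMM read
        addr = raw % PM_SIZE  # equivalence-preserving address mapping
    return chain

double_chain = traverse(0x08, 3)  # 0x00 -> 0x18 -> 0x50 (blue arrows)
point_chain = traverse(0x10, 3)   # 0x00 -> 0x30 -> 0x60 (red arrows)
```

Both chains start at 0x00, so equivalent programs stay aligned, while the different pointer offsets drive the two functions to disjoint cells and hence different observable values.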

F IMPLEMENTATION
As shown in Fig. 4, PEM consists of two key components: a probabilistic execution engine (in the left grey box) that interprets binary programs and collects the observable values, and a value analyzer (in the right grey box) that aggregates the sets of observable values collected from different paths and normalizes values according to their types.
Probabilistic Execution Engine. We use IDA [19] and Ghidra [17] to obtain static information about binary programs, and build the probabilistic execution engine on top of the well-known emulator QEMU [38]. QEMU translates binary programs to an intermediate representation (IR) that is similar to the language defined in Fig. 5.
We then implement our semantic rules (Fig. 7) and logging rules (Fig. 14) on the QEMU IR. PEM sorts the list of candidate predicates by their dynamic selectivity and then selects the predicate to flip.

Value Analyzer. The value analyzer first aggregates the observable values by adding up the numbers of observations of each value across all paths. Then it normalizes the values according to their types. For values that are jump targets, we leverage the dynamic linking information (this information is available even in stripped binary files) to look up the external library functions that are potentially associated with the jump targets; if a jump target is not associated with any external function, we remove it from the observable value set. For values with the string type, we truncate the values at the string terminator (i.e., the value 0). For values collected in predicate comparisons, we compute their dynamic selectivity. All other values remain unchanged. Finally, the value analyzer sorts values by their numbers of observations and only keeps the top 50,000 most frequent values. For similarity comparison, we use the standard metric for set comparison, i.e., the Jaccard index [46].
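The aggregation and comparison steps can be sketched as follows (the function names and toy values are ours; only the top-50,000 cutoff and the Jaccard index come from the text):

```python
from collections import Counter

def aggregate_values(per_path_counts, top_k=50_000):
    """Sum the observation counts of each value across all paths and
    keep only the top_k most frequently observed values."""
    total = Counter()
    for counts in per_path_counts:
        total.update(counts)
    return {value for value, _ in total.most_common(top_k)}

def jaccard(a: set, b: set) -> float:
    """Standard Jaccard index for set similarity."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Toy observable values (value -> #observations) from two binaries.
s1 = aggregate_values([{10: 3, 42: 1}, {10: 2, 7: 1}])
s2 = aggregate_values([{10: 1, 42: 5, 99: 1}])
similarity = jaccard(s1, s2)  # {10, 42} shared out of {7, 10, 42, 99}
```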

G ROBUSTNESS OF PEM
Robustness w.r.t. System Configurations. We alter the system configurations of PEM and analyze how each configuration affects the performance. For each configuration, we run the experiment on Coreutils with 5 alternative values. The results are visualized in Fig. 23. We can see that the performance of PEM does not change significantly in most cases, meaning that PEM is robust to system configuration changes. In the setup where we unroll loops only once, the precision@1 and the coverage for both O0 and O3 binaries drop a bit. That is because compilers usually conduct sophisticated optimizations to unroll loops, resulting in a large number of basic blocks in the loop body. Table 5 lists the detailed results for all the configurations. Note that we also alter how PEM initializes input values for functions, shown in the last 5 rows. Same means PEM initializes all the input parameters with the same value; Rand1-4 means PEM initializes parameters at different positions with different random values, and we conduct the experiments with 4 different random seeds. We can see that initializing parameters differently is slightly inferior to initializing them equivalently, because the order of parameters may change across optimization levels.

Robustness w.r.t. Probabilistic Variance. For each probabilistic component in PEM, we run the related random sampling process with 10 different seed values. As shown in Fig. 24, PEM is robust with regard to variance in sampling.

H PERFORMANCE VARIATION W.R.T. NEGATIVE SAMPLE RATIOS
A dataset is composed of negative samples (dissimilar pairs) and positive samples (similar pairs). A class ratio of N:1 means that in a query, apart from the ground-truth target function, we also sample N other functions as negative data. Existing machine-learning-based work samples at a relatively low ratio. For example, Trex samples at a ratio of 5:1 and achieves good precision. However, 5:1 is not realistic in real-world scenarios. A tool may need to compare a function with many other functions to find the truly similar one. In our setup, there is only one true positive and all the other functions are true negatives. The ratio is usually very large. We argue a ratio around 1000:1 is more realistic. We compute the PR@1 of PEM, Trex, and SAFE with different ratios including 1:1, 5:1, 10:1, 20:1, 50:1, 100:1, 500:1, and ∞:1, where ∞ is the number of all functions in the project. As Fig. 26 depicts, while the PR@1 of all three tools decreases as the ratio increases, PEM has more stable performance. For example, PEM degrades 23% (from 100% to 77%) on OpenSSL as the ratio increases from 1:1 to ∞:1, whereas the precision of the other two tools degrades to less than 20%. This reveals that the problem becomes more challenging as the ratio grows, and that PEM is more resilient to class-ratio changes and likely to perform better in real-world scenarios.
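The PR@1 metric under an N:1 ratio can be sketched as follows; the scoring function and the integer "functions" are toy stand-ins for a real similarity model and binaries:

```python
import random

def pr_at_1(score, queries, ratio, rng):
    """Fraction of queries whose ground-truth target gets the highest
    score in a pool of `ratio` sampled negatives plus the target."""
    hits = 0
    for query, target, negatives in queries:
        pool = rng.sample(negatives, min(ratio, len(negatives))) + [target]
        best = max(pool, key=lambda cand: score(query, cand))
        hits += (best == target)
    return hits / len(queries)

# Toy setup: functions are integers; a perfect scorer prefers the target.
queries = [(i, i, [j for j in range(100) if j != i]) for i in range(10)]
score = lambda q, c: -abs(q - c)
rng = random.Random(0)
low = pr_at_1(score, queries, ratio=5, rng=rng)    # 5:1 pool
high = pr_at_1(score, queries, ratio=99, rng=rng)  # whole-project pool
```

With this perfect scorer PR@1 stays at 1.0 for any ratio; with an imperfect scorer, PR@1 drops as the ratio grows, which is exactly the effect Fig. 26 measures.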

I SEARCHING FOR ONE-DAY VULNERABILITIES
Detailed results are shown in Table 6. PEM ranks the problematic function with the highest score in most cases. For the case where PEM ranks it at 43, we manually check the function and find that it delegates most of its logic to external library functions; thus PEM does not gather enough semantic information. The results could be improved by modeling the behavior of calls to external functions.

void main_cat(int argc, char **argv) {
  while (-1 != (c = get_cli_opt(argc, argv, "bestuv"))) {
    switch (c) {
      case 'b': flag0 = true; ...; format = true; break;
      case 'e': flag1 = true; break;
      case 'v': print("Coreutils v8.30"); break;
      ...
      default: quote("error"); abort(...);
    }
  }
  // define: pageSize, inbuf and insize
  do {
    if ((flag0 || flag1) && format) {
      ret = simple_cat(inbuf, insize);
    } else { ...
      outbuf = xmalloc(outbuf, pageSize);
      ret = complex_cat(inbuf, insize, outbuf);
    }
  } ...

complex_cat(char ...
Figure 1: Motivating Example

Figure 3 :
Figure 3: Our example (Fig. 1) in SAFE. The statements highlighted in yellow have large attention weights (and hence are important). The gray boxes to the right (of the yellow statements) denote the corresponding tokens. The special token HIMM denotes a constant or a constant control-flow target.

We denote by OV(P(i)) the set of externally observable values when executing P on input i. Observable values are those observed in I/O operations and in global and heap memory accesses.

Figure 5 :
Figure 5: Syntax of Our Language

Figure 6 :
Figure 6: State Domains in Interpretation (top) and Auxiliary Data and Functions (bottom)

Figure 8 :
Figure 8: Example of an optimization that provides a control-flow shortcut by inserting predicates. The compiler inserts a predicate x<10 at line 1 in Fig. 8b. When x<10, the execution directly goes to abort() without comparing x with the other values.

Figure 9 :
Figure 9: P w.r.t. k. The x-axis denotes the ranking of predicates by dynamic selectivity; the y-axis denotes the probability that the predicate with the k-th smallest dynamic selectivity after optimization has the same ranking. Each line shows the results for one parameter setting.

Figure 10 :
Figure 10: Comparison with How-Solve. We leverage the best two models (i.e., GNN and GMN) in How-Solve. Each bar denotes a program, whose name is elided. A bar with 1.0 PR@1 means that PEM finds the correct matches for all functions in that program. Dashed lines denote the average PR@1 of each tool.

Figure 11 :
Figure 11: Coverage of PEM

Figure 13 :
Figure 13: Correspondence of Predicates with Min and Max Selectivity. Blue is for min and orange for max. For example, the bar at min+1 means that about 20 predicates with min selectivity in one trace have min+1 selectivity in the other trace.

Figure 15 :
Figure 15: Effects of the Probabilistic Path Sampling Algorithm. The x axis denotes the number of correct steps. The y axis denotes the PR@1 PEM achieves. The dark blue line depicts the PR@1 of PEM w.r.t. the number of correct steps. We show the effects of the probabilistic path sampling algorithm via red arrows. The algorithm moves cases from the yellow areas to the green area.

Figure 16 :
Figure 16: Number of Steps with Correct Predicate Selection w.r.t. Final Function Similarity Analysis Precision. The x axis denotes the expected number of correct steps. The y axis denotes the PR@1 PEM achieves with the related number of correct steps.

Figure 17 :
Figure 17: Distributions of the Number of Correct Steps. The x axis denotes the expected number of correct steps. The y axis denotes the probability. Blue and orange lines are for the settings 0 and 80, respectively. Green and orange areas denote where the probabilistic sampling algorithm has a larger and smaller probability than the deterministic one, respectively.

Figure 23 :
Figure 23: Perf. w.r.t. Different Configurations. Each subfigure presents the experimental results of changing one system configuration. The x axes denote the different values of the related system configuration, and the y axes denote the values of the metrics. The blue lines represent PR@1, the orange lines the coverage for -O0 programs, and the green lines the coverage for -O3 programs.

Figure 24 :
Figure 24: Perf. w.r.t. Different Random Seeds

Table 1 :
Comparison of PEM, IMF, and BLEX. C and G denote Clang and GCC, respectively. Each precision is averaged over the 106 binaries in Coreutils. We use the Area Under Curve (AUC) of the Receiver Operating Characteristic.

Table 2 :
Perf. w.r.t. Different Path Sampling Strategies

Table 4 :
Perf. w.r.t. Different Memory Models

Assume p1 and p2 are an arbitrary pair of predicates from two different binary programs P1 and P2. Since p1 and p2 are different, we suppose that no more than half of the successors of p1 are the same as those of p2.

Figure 18: E(PR1(n)) w.r.t. n. The x axis denotes the value of n (i.e., the total number of additional samplings). The y axis denotes the value of E(PR1(n)) (i.e., the expectation of the reward).

Figure 22: Example of a Probabilistic Memory Region. The numbers on the top denote the lowest 4 bits of memory addresses, and the numbers on the left denote the highest 8 bits. Cells with bold text hold the least significant byte (LSB). Memory cells in blue are exclusively accessed by findDouble and those in red are exclusively accessed by findPoint. Green cells are accessed by both functions and white cells are accessed by neither. Also, cells in bold boxes are interpreted as pointers, with arrows pointing to their targets.

To reduce invalid memory accesses and provide more realistic execution contexts, we also model the local stack of each function and the dynamically allocated heap memory. Recall that for a binary program, PEM interprets it for multiple rounds, with one predicate flipped in each round (see Algorithm 1). The flipped predicates are selected following the probability density function (PDF) of a Beta distribution with shape parameters α = β = 0.03, denoted B0.03. B0.03 has a U shape with large values near 0 and 1 and small values in the middle. To select the predicate to flip from a list of candidates, PEM first samples a random number r ∼ B0.03, and then normalizes r from the range (0, 1) to an index in the candidate list.
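The U-shaped selection can be sketched as follows; the rounding-based index normalization is our assumption, standing in for the paper's exact formula:

```python
import random

def select_flip_index(n_candidates: int, rng: random.Random,
                      alpha: float = 0.03) -> int:
    """Sample an index into a selectivity-sorted candidate list using
    Beta(0.03, 0.03), which is U-shaped and so favors the predicates
    with the smallest or largest dynamic selectivity."""
    r = rng.betavariate(alpha, alpha)     # r in (0, 1)
    return round(r * (n_candidates - 1))  # map to an index in [0, n-1]

rng = random.Random(0)
picks = [select_flip_index(10, rng) for _ in range(1000)]
# Most draws land on the first or last candidate, i.e., the predicates
# with the extreme dynamic selectivities.
extreme_fraction = (picks.count(0) + picks.count(9)) / len(picks)
```

Sorting candidates by dynamic selectivity before indexing is what biases flips toward the most and least selective predicates, as described above.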

Table 5 :
Perf. w.r.t. Different Configurations. The first two columns list the configurations we alter and the values we use for each configuration. Bold texts are the default values for the related configurations. The remaining three columns list the PR@1, the coverage for -O0 programs, and the coverage for -O3 programs, respectively. The highest value in each column is marked in bold.

Table 8 :
Statistics of Dataset. The first two columns are the names of the projects and the optimization flags used to compile the binary files. The 3rd-4th columns are the numbers of functions and basic blocks in the related binary files. Note that we remove functions with no more than 3 basic blocks since they are basically wrapper functions.

Figure 27: Efficiency. The left part depicts the throughput of interpretation and preprocessing; the y axis is the number of functions PEM can process per second. We can see that PEM analyzes more than 3 functions per second in most cases. The right part depicts the throughput of comparison; it illustrates that PEM compares more than 2000 pairs per second in most cases.