CodeExtract: Enhancing Binary Code Similarity Detection with Code Extraction Techniques

In the field of binary code similarity detection (BCSD), when dealing with functions in binary form, the conventional approach is to identify a set of functions that are most similar to the target function. These similar functions often originate from the same source code but may differ due to variations in compilation settings. Such analysis is crucial for applications in the security domain, including vulnerability discovery, malware detection, software plagiarism detection, and patch analysis. Function inlining, an optimization technique employed by compilers, embeds the code of callee functions directly into the caller function. Due to different compilation options (such as O1 and O3) leading to varying levels of function inlining, this results in significant discrepancies between binary functions derived from the same source code under different compilation settings, posing challenges to the accuracy of state-of-the-art (SOTA) learning-based binary code similarity detection (LB-BCSD) methods. In contrast to function inlining, code extraction technology can identify and separate duplicate code within a program, replacing it with corresponding function calls. To overcome the impact of function inlining, this paper introduces a novel approach, CodeExtract. This method initially utilizes code extraction techniques to transform code introduced by function inlining back into function calls. Subsequently, it actively inlines functions that cannot undergo code extraction, effectively eliminating the differences introduced by function inlining. Experimental validation shows that CodeExtract enhances the accuracy of LB-BCSD models by 20% in addressing the challenges posed by function inlining.

LB-BCSD methods primarily calculate similarity at the function level and are divided into two stages [20,45]: 1) function extraction, where code within binary programs is extracted into functions; 2) similarity calculation, relying on neural networks to transform functions into semantic vectors, with the distance between these vectors representing the similarity between different functions. Researchers [13,14,35,50] have conducted extensive studies on enhancing LB-BCSD techniques, focusing mainly on the similarity calculation stage and paying less attention to the function extraction stage.
Function inlining [46] is an optimization strategy employed by compilers, whereby the code of a callee function is integrated directly into the caller function. Because different compilation options (such as O1 and O3) trigger different levels of inlining, significant variations arise in the number of instructions and the control flow structures of functions compiled from the same source [12,47]. This variation substantially impacts the accuracy of LB-BCSD models. Research [24] indicates that function inlining can lead to a reduction in accuracy ranging from 30% to 40%. Consequently, LB-BCSD methods require more refined handling of function inlining to mitigate these impacts.
Function inlining significantly impacts the function extraction phase [3,32,58,62], and previous LB-BCSD methods [8,13] resorted to inline emulation strategies to tackle the issues it raises. Inline emulation [8,13], grounded in the principle of normalization, aims to replicate the compiler's inlining behavior so as to eliminate discrepancies between homologous functions compiled at different optimization levels. However, the rules of inline emulation depend on the sizes of the caller and callee functions, both of which change under function inlining, rendering inline emulation ineffective at normalizing functions. We discovered that although inline emulation can enhance model accuracy by approximately 10%, significant differences still persist between homologous functions even after applying inline emulation rules, owing to the complexity of function inlining.
Given that functions are often called multiple times, function inlining integrates the callee function into each call site, leading to duplicate code fragments in the binary program. In contrast to function inlining, code extraction [27] identifies and removes duplicate code from the program, replacing it with corresponding function calls. Consequently, we developed CodeExtract, a system based on code extraction technology. It identifies and extracts duplicate code fragments within a binary's functions through similarity matching, thereby extracting the code introduced by function inlining. Code that cannot be extracted is proactively inlined into its caller, eliminating the differences introduced by function inlining. Compared to inlining emulation, CodeExtract more effectively reduces the discrepancies introduced by function inlining and brings about a 20% accuracy improvement to SOTA LB-BCSD models.
Our main contributions are summarized as follows: • We explored the impact of function inlining on the accuracy of LB-BCSD models and demonstrated that it significantly affects their accuracy. We pointed out that existing inline emulation solutions cannot adequately address the issues posed by function inlining and analyzed the reasons.

Mitigating Measures for Function Inlining
Function inlining significantly affects the accuracy of LB-BCSD models; many LB-BCSD methods [10,23,55,56] report that their accuracy is impacted by function inlining.
To address the discrepancies introduced by function inlining, BinGo [8] first proposed the concept of inlining emulation, which was further expanded by Asm2vec [13]. Inlining emulation aims to mitigate the differences between homologous functions caused by function inlining by selectively inlining callee functions at the binary level. The goal is to maintain consistency in the number of instructions and control flow graphs of functions processed through inlining emulation across different optimization levels. The heuristic rules followed by inlining emulation are as follows: let f_callee and f_caller denote the callee and caller functions, and let len(f_callee) and len(f_caller) denote their respective instruction counts; define the ratio δ = len(f_callee) / len(f_caller). If δ is less than 0.6, or if the number of instructions in the callee function is fewer than 10, inline emulation inlines the callee function into the caller function.
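The heuristic above can be sketched in a few lines of Python. This is a simplified illustration of the rule as described in the text; the function name `should_inline` and its parameters are illustrative, not taken from BinGo or Asm2vec:

```python
def should_inline(callee_len: int, caller_len: int) -> bool:
    """Inline-emulation heuristic sketch.

    callee_len / caller_len are instruction counts. The thresholds
    (ratio < 0.6, or callee shorter than 10 instructions) follow the
    rule stated in the text.
    """
    delta = callee_len / caller_len  # size ratio of callee to caller
    return delta < 0.6 or callee_len < 10
```

Note how the decision depends on the caller's size: the same callee may be inlined into a large caller but not a small one, which is exactly the sensitivity to optimization level criticized below.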

Limitations in Inline Emulation
Inlining emulation aims to mimic the compiler's inlining behavior so that, after undergoing inlining emulation, homologous functions compiled with different optimization levels can eliminate the discrepancies caused by function inlining. Through a detailed analysis of the inlining emulation rules, we identified two key points: 1) For caller functions with a larger number of instructions, inlining emulation tends to prefer inlining callee functions into the caller. This means that for the same caller function A, compiled with different compilation options (such as O1 and O3) resulting in a different number of instructions, the version of function A with more instructions may inline more callee functions. This prevents inlining emulation from eliminating the discrepancies introduced by function inlining. 2) The inlining emulation rules primarily target callee functions. Similarly, for the same function A compiled with different options O1 and O3, if A@O1 and A@O3 call different callee functions, the inlining emulation rules will inline different callees accordingly, thus failing to eliminate the discrepancies introduced by function inlining.

Our Technique
Existing inline emulation schemes are influenced by the compiler optimization flags when determining whether to inline functions, and thus fail to eliminate the differences caused by inlining between homologous functions compiled at different optimization levels. The core idea of our technique is to process homologous functions compiled at various optimization levels in an optimization-level-independent manner. In CodeExtract, we employ two techniques, code extraction and proactive inlining, to eliminate the discrepancies introduced by function inlining. For a callee function, if its inlining leaves duplicate code in the program, the code extraction technique extracts the duplicate code into corresponding function calls. If no duplicate code exists, the proactive inlining technique inlines the callee function, which is called only once, into the caller.
In this section, we first introduce the impact of function inlining on binary programs, then explain how code extraction and proactive inlining can eliminate the differences introduced by function inlining, and finally analyze the feasibility of solving the function inlining issue using these two techniques.

When Does Function Inlining Introduce Duplicate Code?
Function inlining involves incorporating the callee function directly into the caller. Given that code extraction is effective only when function inlining leaves duplicate code in the program, this section analyzes the circumstances under which function inlining results in duplicate code, based on the number of times the callee function is called.
3.1.1 Callee Function Called Only Once. Figure 1 illustrates the scenario where a callee function, called only once, undergoes function inlining. In this scenario, Function A consists of two basic blocks, A1 and A2, while Function C comprises four basic blocks, C1-C4, and is called by Function A within the binary program. When inlining Function C, compilers typically do not inline the instructions from basic blocks C1 (the prologue) and C4 (the epilogue) into Function A, as they are primarily responsible for the preservation and restoration of registers. Instead, the compiler inlines basic blocks C2 and C3 into Function A, resulting in a new Function A'. After inlining, since there are no other calls to Function C within the program, compilers usually eliminate the original Function C to reduce the amount of code [46]. At this point, as Function C is inlined only once, its basic blocks C2 and C3 exist solely within Function A', introducing no duplicate code, so the code extraction technique cannot be applied. A Function C that is called only once may not be inlined under the O1 compilation option but could be inlined under O3, resulting in differences between Function A@O1 and Function A@O3. To eliminate such discrepancies, we propose a proactive inlining technique, which automatically inlines all functions in the program that are called only once.
3.1.2 Callee Function Called More Than Once. In our previous discussion, we applied proactive inlining to functions that are called only once. Now, we analyze the situation when a callee function that is called more than once undergoes function inlining.
Figure 2: Illustration of the inlining process for functions called more than once, where identical sections of the callee (basic blocks C2 and C3) are inlined at each call site.
Figure 2 depicts a scenario where a callee function is called more than once, leading to the generation of duplicate code. In this example, both Function A and Function B consist of two basic blocks and both call Function C. The compiler inlines Function C into Functions A and B, resulting in the inlined Functions A' and B'. A1, A2, B1, and B2 are the original basic blocks of Functions A and B, while C2 and C3 are basic blocks from Function C. We refer to functions that inline the same basic blocks at different call sites as "inline-stable functions". In this case, both Functions A' and B' contain basic blocks C2 and C3, creating duplicate code. At this point, we can apply code extraction techniques to address these duplicates. If Function C has not been deleted by the compiler, we replace the instructions in basic blocks C2 and C3 with a call to Function C. If Function C has been deleted, the code extraction technique recreates Function C, places basic blocks C2 and C3 within it, and then replaces the basic blocks C2 and C3 in Functions A' and B' with calls to the newly created Function C. However, even if a callee function is inlined at multiple call sites, it may not result in duplicate code, as illustrated in Figure 3.
Unlike the scenario in Figure 2, Function C in this example includes branch instructions. If the branch condition is true, it executes a path containing basic blocks C2 and C3; otherwise, it follows a path with basic blocks C4 and C5. Since the compiler optimizes the inlined code and eliminates dead code, only the basic blocks on the executed path, C2 and C3, are retained in Function A', while basic blocks C4 and C5 are not inlined into Function A'. A similar situation applies to Function B', which retains only basic blocks C4 and C5. We call functions that inline different paths at different call sites "inline-sensitive functions". In this case, although Function C is inlined multiple times, A' and B' retain different basic blocks from Function C, so no duplicate code is introduced into the program (unless the body of Function C itself survives as a duplicate); if the body of Function C is deleted, there is no duplicate code at all. However, our analysis of real-world programs reveals that this situation is uncommon. In Section 3.2, we further discuss the practicality and effectiveness of the code extraction scheme from the perspective of real-world programs.

Feasibility Argument for the Proposed Approaches
In the previous sections, we employed proactive inlining for functions called only once and code extraction for functions called multiple times. It is important to note that code extraction is specifically tailored to handling inlining-stable functions and cannot address inlining-sensitive scenarios. This section delves into a detailed analysis of the inlining phenomenon in actual programs, quantifying the proportion of inlining-sensitive versus inlining-stable functions, and theoretically validating the feasibility of our approach.
Table 1. This table presents the instances of functions being inlined more than once across different programs. The "Num" column represents the number of functions in the program that were inlined more than once. The "Inlining-Stable Num" column displays the count of inlining-stable functions among these, and the "Duplication Ratio (D.R.)" column reflects the proportional relationship between the two.

We chose the representative SPEC CPU 2017 [6] benchmark suite as our evaluation target. By modifying the function inlining-related code within LLVM [1], we compiled the C/C++ programs from SPEC CPU 2017 to extract information relevant to function inlining. Our focus was on determining how many functions in SPEC CPU 2017 were inlined more than once, identifying which of these functions were inlining-stable, and filtering out programs with fewer than 50 instances of inlining. The results are presented in Table 1.
From Table 1, it is evident that, on average, 635 functions were inlined more than once. Among these, a significant 92% of the functions are inlining-stable, with only 8% being inlining-sensitive. This implies that the code extraction approach can cover 92% of the functions that were inlined multiple times, effectively eliminating the discrepancies introduced by function inlining.
In summary, proactive inlining is suited to functions called only once (as shown in Figure 1), while code extraction is appropriate for functions called multiple times (as depicted in Figure 2). Although code extraction can only handle inline-stable functions among those called multiple times, we found that in real programs 92% of such functions are inline-stable. Therefore, our combination of proactive inlining and code extraction is highly practical and feasible for eliminating the discrepancies introduced by function inlining.

Design
Figure 4 illustrates the overall workflow of CodeExtract, comprising two main components: proactive inlining and code extraction. At the top, we depict the fundamental workflow of the LB-BCSD method. Initially, function extraction is performed on both the target binary program and candidate binary programs. Subsequently, the extracted functions undergo preprocessing to eliminate differences introduced by compilers. Previous studies [8,13] utilized inline emulation for preprocessing to obtain normalized functions, while this study employs proactive inlining and code extraction. Following this preprocessing step, the normalized functions serve as inputs to a neural network, which computes the similarity between functions and outputs matched functions.
The input to the CodeExtract system is functions, and the output is normalized functions. Identifying duplicate code relies on similarity calculations at the basic block level, which can be computationally intensive due to the presence of numerous basic blocks in programs. To mitigate computational costs, the Filter module filters basic blocks in functions, identifying basic blocks introduced by function inlining, termed inline basic blocks. Subsequently, similarity calculations are performed on these inline basic blocks to extract highly similar duplicate code, resulting in the output of normalized functions. To better elucidate our design, we first provide the following definitions: Successors (Succ): in a control flow graph, the successors of a node n are the nodes directly reachable from n. Predecessors (Pred): in a control flow graph, the predecessors of a node n are the nodes from which n can be directly reached. Descendants: in a control flow graph, the descendants of a node n are all nodes reachable from n by following a path of edges in the forward direction. Ancestors: in a control flow graph, the ancestors of a node n are all nodes from which n can be reached by traversing edges in the backward direction.
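These definitions can be made concrete with a small sketch over an adjacency-list CFG. The dict-based graph representation and function names below are illustrative, not the system's actual data structures:

```python
from collections import deque

def successors(cfg, n):
    # Nodes directly reachable from n.
    return set(cfg.get(n, []))

def predecessors(cfg, n):
    # Nodes from which n is directly reachable.
    return {m for m, succs in cfg.items() if n in succs}

def descendants(cfg, n):
    # All nodes reachable from n by following edges forward (excluding n).
    seen, work = set(), deque(cfg.get(n, []))
    while work:
        m = work.popleft()
        if m not in seen:
            seen.add(m)
            work.extend(cfg.get(m, []))
    return seen

def ancestors(cfg, n):
    # All nodes from which n can be reached by traversing edges backward.
    return {m for m in cfg if n in descendants(cfg, m)}
```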

Challenge
CodeExtract eliminates discrepancies introduced by function inlining by extracting duplicate code within functions, relying on the calculation of basic block similarity to identify such duplicate code. However, when analyzing the similarity of basic blocks within the same program, we encountered the following challenges: • Challenge 1: Given that the identification of duplicate code is based on the calculation of similarity at the basic block level, and a program may contain a large number of basic blocks, performing pairwise similarity calculations among all basic blocks can be very time-consuming. • Challenge 2: When performing function inlining, the compiler optimizes the instructions within the callee function, leading to significant variations in the instructions of the same callee function's basic blocks at different call sites. This results in considerable similarity differences for the same basic block in different contexts.
To address Challenge 1, we designed a Filter module that avoids the need for pairwise similarity calculations across all basic blocks in the program by analyzing all potential "entry point basic blocks" within the control flow graph.To overcome Challenge 2, we normalized the basic blocks before calculating their similarity to reduce the impact of discrepancies caused by inlining.For detailed methods and technical specifics, please refer to Section 4.3.

Filter Module
Given the vast number of functions in binary programs, each comprising numerous basic blocks, computing similarity for every pair of basic blocks within the program would be exceedingly time-consuming. To circumvent this, the Filter module analyzes the basic blocks within the program, filtering out those not introduced by function inlining. As previously mentioned, functions processed through inlining exhibit a unique characteristic in the control flow graph (CFG): they connect to external instructions solely through a unified entry point, referred to as the "entry basic block". Since only the entry basic block and its descendants could be basic blocks introduced by inlining, limiting similarity calculations to these blocks can significantly reduce computational demands.
The identification of entry basic blocks is based on their distinct characteristic: the other basic blocks stemming from an entry basic block can only have the entry basic block or its successor blocks as predecessors. We traverse all basic blocks within a function, utilizing this feature to determine whether each block is a potential entry basic block. Initially, we consider each basic block as a potential entry basic block and exhaustively enumerate all control flow subgraphs starting from it within the CFG. Subsequently, we assess whether these basic blocks align with the characteristics of an entry basic block, a process elaborated in Algorithm 1. The algorithm accepts a function extracted from a binary program as input and outputs all identified potential entry basic blocks. While traversing each basic block, we employ the isEntryBB(BB, k) function to determine its conformity to the characteristics of an entry basic block and return all possible entry basic blocks (lines 1-6). In this function, we first exhaustively enumerate all control flow subgraphs starting from BB, limiting the size of each subgraph to k basic blocks to ensure that the control flow subgraph of an identified entry basic block contains at least k basic blocks. If an inlined callee function introduces fewer than k basic blocks into the caller, its entry point will not be recognized as an entry basic block.
For each control flow subgraph, we further traverse each node, checking whether its predecessors are the entry basic block or its descendants. As long as some control flow subgraph makes a basic block meet the conditions of an entry basic block, we consider it as such (lines 10-21). In practice, we set k = 3, and in Section 5.3 we discuss the impact of the number of basic blocks in a function on the accuracy of the LB-BCSD model. Functions with fewer than 3 inline basic blocks have a negligible impact on the accuracy of the LB-BCSD method, obviating the need to extract these less frequent duplicate codes.
Once the entry basic blocks are identified, we only need to mark their direct successors as the basic blocks to be matched, rather than all descendants. This is because, in the subsequent duplicate code extraction algorithm, given a pair of similar basic blocks, we continue to match the similarity of other basic blocks near this pair, using them as starting points. This approach significantly reduces the number of basic blocks to be matched, thereby effectively decreasing the computational overhead of basic block similarity calculations.
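A simplified version of the entry-basic-block check might look like the following. This is a sketch, not the paper's Algorithm 1: it grows a single breadth-first region of up to k blocks rather than enumerating all control flow subgraphs:

```python
def is_entry_bb(cfg, bb, k=3):
    """Heuristic check: bb qualifies as a potential entry basic block if,
    within a region of at least k blocks grown from bb, every block other
    than bb is reached only from inside the region."""
    # Build a predecessor map for the adjacency-list CFG {block: [succ, ...]}.
    preds = {n: set() for n in cfg}
    for m, succs in cfg.items():
        for s in succs:
            preds.setdefault(s, set()).add(m)
    # Grow a breadth-first region of up to k blocks starting from bb.
    region, queue = {bb}, list(cfg.get(bb, []))
    while queue and len(region) < k:
        n = queue.pop(0)
        if n not in region:
            region.add(n)
            queue.extend(cfg.get(n, []))
    if len(region) < k:
        return False  # inlined regions with fewer than k blocks are ignored
    # Only the entry itself may have predecessors outside the region
    # (the call site); an external edge into any other block disqualifies bb.
    return all(preds.get(n, set()) <= region for n in region if n != bb)
```

With k = 3 (the value used in practice per the text), a side entry into the middle of a candidate region correctly disqualifies its head block.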

Similarity Calculation Module
After acquiring the basic blocks to be matched, the subsequent task involves calculating the similarity between these blocks. In the domain of LB-BCSD, computing similarity at the basic block level poses a significant challenge. This is primarily because, compared to functions, basic blocks typically contain fewer instructions, making it difficult for LB-BCSD methods to generate accurate semantic vectors for basic blocks. To address this challenge, SOTA LB-BCSD approaches [14,41,57,63] leverage the contextual information of basic blocks to assess their similarity, a strategy that has shown effective results in matching basic blocks across different programs.
However, the main objective of this paper is to identify duplicate code within the same binary program. In this scenario, function inlining results in the callee function being inlined into different call sites, placing originally similar basic blocks in entirely different contexts. Thus, existing LB-BCSD methods based on contextual information at the basic block granularity are not suitable for the basic block matching problem within the same binary program.
We observe that within the same binary program, when a callee function is inlined into different call sites, its control flow structure and instructions largely remain unchanged. This observation leads us to treat the instructions in the basic blocks as strings and evaluate the similarity of basic blocks by calculating the edit distance between these instruction strings. Edit distance [43,53] (also known as Levenshtein distance) measures the minimum number of single-character editing operations required to transform one string into another (insertion, deletion, and substitution). A smaller edit distance implies closer functional proximity between the basic blocks.
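As an illustration, a normalized edit-distance similarity between two basic blocks can be computed as below. This sketch treats each instruction as one token in a sequence; it conveys the idea rather than the paper's exact implementation:

```python
def edit_distance(a, b):
    # Classic Levenshtein dynamic program over two sequences
    # (here: lists of instruction strings, or plain strings).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def block_similarity(bb1, bb2):
    # Normalized similarity in [0, 1]; 1.0 means identical blocks.
    if not bb1 and not bb2:
        return 1.0
    return 1.0 - edit_distance(bb1, bb2) / max(len(bb1), len(bb2))
```

Two blocks would then be reported as similar when `block_similarity` exceeds the threshold of 0.95 mentioned below.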
Empirically, we consider basic blocks with a similarity exceeding 0.95 to be similar. In practice, however, optimizations performed by the compiler during function inlining may change the register names in the callee functions, increasing the edit distance between two otherwise similar basic blocks. We further explore the root causes of this issue and propose corresponding strategies to improve the accuracy of duplicate code identification within the same binary program.

Differences Introduced by Variations in Register Names.
In Figure 5, the same basic block of the callee function is inlined at different call sites 1 and 2. Because the compiler's optimizations merge the callee and caller, different registers are used at different call sites [46]. This variance can be mitigated by renaming the registers within the basic block. The renaming is driven by def-use analysis of the registers: registers exhibiting identical def-use behavior are assigned the same name. In Figure 5(a), call site 1 primarily performs def operations on two registers: the first register's value originates from memory pointed to by a base register and serves as the first operand of a subsequent instruction, while the second register's value comes from memory at an offset of 20 from that base register and serves as the second operand of the same instruction. In Figure 5(b), two different registers exhibit the same def-use behavior as their counterparts in Figure 5(a). Hence, we rename the registers used in Figure 5(b), with the results shown in Figure 5(c). Additionally, to eliminate differences introduced by varying jump-instruction targets in different calling contexts, jump targets are uniformly replaced with a placeholder label.
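A lightweight stand-in for this normalization is sketched below. It renames registers by order of first occurrence rather than performing full def-use analysis, and it assumes an x86-64 register list and a made-up `LOC` placeholder label, so it only approximates the behavior described above:

```python
import re

# An assumed (non-exhaustive) set of x86-64 general-purpose registers.
X86_REGS = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp",
            "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"}

def normalize_block(instrs):
    """Rename registers to reg0, reg1, ... by order of first occurrence and
    rewrite jump targets to a uniform placeholder, so that two inlined
    copies of the same callee block compare equal under edit distance."""
    mapping = {}
    def rename(m):
        r = m.group(0)
        if r not in X86_REGS:
            return r
        mapping.setdefault(r, f"reg{len(mapping)}")
        return mapping[r]
    out = []
    for ins in instrs:
        op = ins.split()[0]
        if op.startswith("j"):            # jmp/je/jne/...: unify the target
            ins = op + " LOC"
        out.append(re.sub(r"\b\w+\b", rename, ins))
    return out
```

After normalization, two copies of the same callee block that merely use different registers and jump targets become textually identical.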

Duplicate Code Extraction Module
After completing the similarity calculations for basic blocks, we identify similar basic blocks. For instance, if basic block A is similar to basic blocks B and C, this suggests that A, B, and C might be the result of the same function being inlined at different call sites. To identify all inlined basic blocks, CodeExtract does not immediately extract code from these similar basic blocks. Instead, based on control flow, it continues to analyze the similarity of neighboring basic blocks to the similar ones until no new similar basic blocks are identified, and only then does it perform the extraction of duplicate code. For example, in Figure 2, once we confirm that basic block C2 in Function A' and Function B' is similar, we mark basic block C2 as duplicate code. Next, we assess the predecessors and successors of this duplicate code, checking whether the predecessor and successor nodes of basic block C2 are similar. This search continues until no more similar basic blocks are found. The entire process is described by Algorithm 2.
Algorithm 2 takes as input a function f and a set S of similar basic blocks within f, and outputs the modified function f'. The algorithm iterates through each basic block b in S, searching for all instances of duplicate code starting from b, and then removes these codes from function f (lines 1-8), inserting a call instruction at the appropriate location to invoke the removed duplicate code. The search function takes two basic blocks as inputs and recursively searches for similar basic blocks within their predecessors and successors, adding them to the collection of duplicate codes (lines 12-20). If the number of basic blocks in the collection is less than k, it returns an empty set; otherwise, it returns the collection. The similarity measure is the edit distance mentioned earlier, for which we empirically set the threshold to 0.95. In practice, we set k = 3, and in Table 3 we demonstrate that functions with fewer than 3 basic blocks have a negligible impact on the accuracy of the LB-BCSD model. Therefore, during code extraction, we focus only on those inlined functions with a number of basic blocks greater than or equal to 3.
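The search step described above can be sketched as a worklist expansion from a seed pair of similar blocks. In this illustrative Python sketch, `pred`, `succ`, and `similar` are assumed callbacks (neighbor lookups and the thresholded block-similarity test), not the paper's actual API:

```python
def grow_duplicate_region(pred, succ, similar, b1, b2, k=3):
    """Expand from a seed pair of similar blocks (b1, b2) through matching
    predecessors/successors until no new pair matches. Returns the blocks
    of the duplicated region on the b1 side, or an empty set if the region
    has fewer than k blocks (mirroring the k-block cutoff in the text)."""
    pairs, work = {(b1, b2)}, [(b1, b2)]
    while work:
        x, y = work.pop()
        # Try every neighbor of x against every neighbor of y.
        for nx in pred(x) + succ(x):
            for ny in pred(y) + succ(y):
                if nx != ny and (nx, ny) not in pairs and similar(nx, ny):
                    pairs.add((nx, ny))
                    work.append((nx, ny))
    region = {x for x, _ in pairs}
    return region if len(region) >= k else set()
```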

Evaluation
CodeExtract is implemented on the angr [49] platform. To comprehensively evaluate our technique, we set the following research questions (RQs) for in-depth exploration: RQ1: How much does function inlining affect the accuracy of the LB-BCSD model? RQ2: What is the impact of the number of basic blocks in a function on the LB-BCSD model? RQ3: Can CodeExtract significantly improve the accuracy of the LB-BCSD model when facing function inlining? RQ4: Why does CodeExtract enhance the accuracy of the LB-BCSD model?

Experiment Setup
We conduct the experiments on a server with an Intel Xeon Gold 6132 CPU at 2.60 GHz, 256 GB of memory, 8 Tesla V100 GPUs, and Ubuntu 18.04.

Baseline Models.
We selected three SOTA LB-BCSD models as baselines: a) SAFE [35] employs an RNN structure with a self-attention mechanism to generate semantic embeddings for functions; b) Asm2vec [13] utilizes the PV-DM model combined with the program's control flow information to generate semantic embeddings for functions; c) JTrans [50] encodes control flow information into a Transformer architecture to produce semantic embeddings for functions. These three models are open-source, and we utilized the implementations provided by the authors for the binary similarity matching task.

Test Datasets.
The SPEC CPU 2017 suite, which gathers programs covering a broad spectrum of computational and application scenarios, has emerged as the benchmark of choice for numerous research studies [42]. Leveraging this, we selected C/C++ programs from the SPEC CPU 2017 [6] test suite to create our dataset, excluding those with fewer than 50 instances of function inlining. As a result, our dataset encompasses 11 real-world projects: blender_r, cpugcc_r, cpuxalan_r, imagick_r, leela_r, nab_r, omnetpp_r, perlbench_r, povray_r, x264_r, and xz_r. In alignment with previous research [13,24,50], we used programs compiled with the O1 and O3 optimization options of llvm-10.0 [1] for our test set.

Metrics.
In this experiment, we employed top-10 accuracy as our evaluation metric. This metric measures whether the correct answer (i.e., the function compiled from the same source code as the queried binary function) appears within the top 10 ranked options among all predictions when iteratively querying a set of binary functions against the candidate function pool. Furthermore, we also analyzed the false positive and false negative rates of CodeExtract.
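For reference, top-k accuracy over a pool of function embeddings can be computed as follows. This is a generic sketch using cosine similarity; the embedding matrices are hypothetical inputs, not tied to any particular baseline model:

```python
import numpy as np

def top_k_accuracy(query_vecs, pool_vecs, true_idx, k=10):
    """Fraction of queries whose ground-truth pool entry appears among
    the k most similar candidates by cosine similarity."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = q @ p.T                            # queries x pool similarities
    topk = np.argsort(-sims, axis=1)[:, :k]   # best k pool indices per query
    hits = [true_idx[i] in topk[i] for i in range(len(true_idx))]
    return sum(hits) / len(hits)
```

In the experiments below, each query has exactly one ground-truth match in the 500-function pool, so `true_idx` holds a single correct index per query.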

RQ1: Impact of Function Inlining on Model Accuracy
In this section, we evaluate the impact of function inlining on model accuracy. For each baseline model, we input binary programs compiled with function inlining enabled and disabled, respectively. By comparing the accuracy of the models with function inlining turned on and off, we measure the influence of function inlining on model accuracy. For every binary program in our dataset, we randomly sample 500 functions from the O0 binary and query them one by one in a pool composed of the corresponding 500 functions from the O3 binary. In other words, for each query, there is only one similar function in the function pool. This setup is consistent with the literature [22,33,52,54].
The results are shown in Table 2. LB-BCSD models exhibit high top-10 accuracy on programs without function inlining, with JTrans reaching an accuracy of up to 84%. This demonstrates that existing LB-BCSD models can correctly identify the relationships between instructions and accurately extract the semantic information of functions. However, the accuracy of existing LB-BCSD models decreases by about 31% on programs with function inlining, with JTrans having the highest accuracy at 53%. This result is consistent with the literature [24], indicating that function inlining significantly impacts the precision of LB-BCSD models. Compared to JTrans and Asm2vec, the impact of function inlining on SAFE is minimal, at only 27%. This is mainly because function inlining significantly affects the control flow structure of programs, and Asm2vec and JTrans encode the program's control flow information into the function semantic vector, whereas SAFE does not, making SAFE's accuracy less affected by function inlining. In summary, function inlining significantly affects the accuracy of LB-BCSD models, causing an average accuracy drop of 30%.

RQ2: Impact of Callee Function Size on LB-BCSD Model Accuracy
In this experiment, we investigated the impact of the callee function's basic block count on the accuracy of the LB-BCSD model. To this end, we modified the LLVM source code while keeping the original inlining rules unchanged. That is, when the compiler decides to inline a function, we additionally check the number of basic blocks in that function: if the count is less than or equal to X, the function is inlined; otherwise, it is not.
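The modified decision rule can be sketched as follows. The real change is a patch inside LLVM's inliner; here `wants_to_inline` and the `basic_blocks` field are hypothetical stand-ins for the compiler's original heuristic and the callee's CFG size.

```python
def should_inline(callee, wants_to_inline, x):
    """Modified inlining rule for RQ2: keep the compiler's original decision,
    but veto inlining when the callee has more than X basic blocks."""
    if not wants_to_inline(callee):      # original LLVM heuristic, unchanged
        return False
    return callee["basic_blocks"] <= x   # extra size gate added for the experiment

# Hypothetical callees: a tiny helper vs. a function with complex control flow.
small = {"name": "min2", "basic_blocks": 1}
large = {"name": "parse", "basic_blocks": 5}
always = lambda f: True                  # heuristic that always wants to inline
assert should_inline(small, always, x=2) is True
assert should_inline(large, always, x=2) is False
```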
The experimental results, as shown in Table 3, reveal that when function inlining is disabled, the accuracy of the LB-BCSD model is 84%. However, when the compiler only inlines functions with a basic block count of two or less, the accuracy of the LB-BCSD model decreases by only 3%. This decline is small primarily because such small functions lack complex control flow structures and have few instructions, and thus have a minimal impact on the accuracy of the LB-BCSD model.
Conversely, when the compiler only inlines functions with a basic block count of three or less, the accuracy of the LB-BCSD model drops by 15%. From this, we can infer that the accuracy of the LB-BCSD model is significantly affected only when the inlined functions have a basic block count of three or more.

RQ3: Accuracy Enhancements Brought to LB-BCSD Models by CodeExtract
In this section, we evaluated the accuracy improvements brought by inline emulation and CodeExtract to the models. Since inline emulation algorithms are only implemented in Asm2vec, to fairly assess the impact of inline emulation on Asm2vec, SAFE, and JTrans, we modified the function inlining-related code in LLVM. When compiling the SPEC benchmarks, functions were inlined according to the rules of inline emulation discussed earlier. The other settings in this experiment are the same as in Section 5.2.
The experimental results, as shown in Table 4, indicate that inline emulation brings a 9%-12% accuracy improvement for the three models, while code extraction leads to an 18%-22% improvement. Inline emulation can, to some extent, make homologous functions compiled with O1 and O3 more similar, but as discussed in Section 2.1, it cannot fully resolve the issues introduced by function inlining. In contrast, CodeExtract extracts the inlined basic blocks introduced by function inlining, keeping the function's own functionality relatively pure and thus better eliminating the discrepancies introduced by function inlining. In the following experiments, we systematically evaluate the advantages and disadvantages of inline emulation and CodeExtract in terms of the number of instructions, false positive rates, and false negative rates.

RQ4: False Positive and False Negative Analysis
In Section 5.3, we discovered that the presence of three or more basic blocks in inlined functions significantly influences the accuracy of the LB-BCSD model. Consequently, this experiment analyzes the false positive and false negative rates of both inline emulation and CodeExtract, across all functions and specifically those with a basic block count of three or more in the respective programs. To ensure the precision of our assessment, we benchmarked against the compilation outcomes obtained using LLVM with the O3 optimization level.
Table 5. Analysis of the false positive rate (FPR) and false negative rate (FNR) of the inline emulation and CodeExtract methods on program functions. In the table, "Functions" refers to all functions within a program, while "Functions-BB3" specifically denotes functions with a number of basic blocks greater than or equal to 3.

The experimental results, presented in Table 5, demonstrate that CodeExtract exhibits higher false positive and false negative rates than inline emulation across all functions in the respective programs. The increased false negative rate is attributed to CodeExtract's method of only extracting functions with a basic block count of three or more, ignoring those with fewer than three blocks. The rise in false positives is due to CodeExtract's proactive inlining strategy, which inlines all functions called only once, leading to a higher incidence of false positives. However, CodeExtract shows lower false positive and false negative rates for functions with a basic block count of three or more, indicating its superior handling of such functions. Given that the accuracy of the LB-BCSD model is significantly affected only when functions with three or more basic blocks are inlined, employing CodeExtract proves more efficacious in enhancing the LB-BCSD model's accuracy than relying solely on inline emulation.
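The two rates are computed in the standard way. A small sketch follows, where the boolean labels are hypothetical and, as described above, the LLVM O3 compilation outcome serves as ground truth:

```python
def fpr_fnr(predicted, actual):
    """FPR = FP / (FP + TN); FNR = FN / (FN + TP).
    `predicted` and `actual` are parallel boolean lists
    (True = code attributed to function inlining)."""
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tp = sum(p and a for p, a in zip(predicted, actual))
    return fp / (fp + tn), fn / (fn + tp)

# Toy example: one false positive and one false negative out of four samples.
pred = [True, True, False, False]
true = [True, False, True, False]
fpr, fnr = fpr_fnr(pred, true)
assert (fpr, fnr) == (0.5, 0.5)
```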

Figure 1. Diagram illustrating the inlining of a function called only once. Function C, exclusively invoked by Function A, is transformed into Function A' following the inlining procedure.

Figure 3. Illustration of the inlining process for functions called more than once, showcasing the inlining of different parts of the callee function. Function A' has inlined basic blocks C1 and C2, while Function B' has inlined basic blocks C4 and C5.
(a) Basic block at call site 1 (b) Basic block at call site 2 (c) Normalized basic block

Figure 5. Due to compiler optimizations, the same basic block of a callee function uses different registers at different call sites, shown in (a) and (b). Through data flow analysis, these basic blocks from different call sites have been normalized, as illustrated in (c).
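The normalization idea behind Figure 5 can be sketched as follows: registers in each basic block are renamed to canonical placeholders in order of first appearance, so the same callee block compiled with different register allocations maps to one normalized form. The register pattern and textual instruction format here are illustrative, not the paper's actual data-flow implementation.

```python
import re

REG = re.compile(r"\b(r[a-z0-9]+|e[a-z]{2})\b")  # crude x86 register matcher

def normalize_block(instructions):
    """Rename registers to reg0, reg1, ... in order of first appearance, so
    blocks that differ only in register allocation compare equal."""
    mapping = {}
    def rename(match):
        reg = match.group(0)
        mapping.setdefault(reg, f"reg{len(mapping)}")
        return mapping[reg]
    return [REG.sub(rename, ins) for ins in instructions]

# The same callee block at two call sites, using different registers (cf. Figure 5):
site1 = ["mov rax, rbx", "add rax, 8"]
site2 = ["mov rcx, rdx", "add rcx, 8"]
assert normalize_block(site1) == normalize_block(site2)
```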

Function inlining significantly affects the accuracy of the LB-BCSD model, and previous methods based on inline emulation have not been able to fully address the issues it introduces. Therefore, we propose a system called CodeExtract, based on code extraction and proactive inlining techniques. Compared to inline emulation, CodeExtract better eliminates the discrepancies introduced by function inlining. Experiments have shown that, in addressing function inlining issues, CodeExtract can improve the accuracy of the LB-BCSD model by 20%.

• By analyzing the behavior of function inlining, we designed a code extraction-based approach.

Table 2. The impact of function inlining on model accuracy. NI represents the top-10 accuracy of the model with inlining disabled, WI represents the top-10 accuracy of the model with inlining enabled, and Infl. denotes the influence of function inlining on model accuracy.

Table 3. Analysis of the impact of the number of basic blocks in inlined functions on the top-10 accuracy of the JTrans model. "NI" stands for function inlining disabled, "WI" represents function inlining enabled, and "BBX" indicates that functions will not be inlined if their number of basic blocks exceeds X.

Table 4. The impact of two techniques, inline emulation (IE) and CodeExtract (CE), on model accuracy. Ori. represents the original top-10 accuracy of the model. IE/CE represent the top-10 accuracy of the model after applying inline emulation and CodeExtract, respectively. Correspondingly, IE/CE Impr. shows the specific improvement each technique brings to model accuracy.