Khaos: The Impact of Inter-procedural Code Obfuscation on Binary Diffing Techniques

Software obfuscation techniques can prevent binary diffing techniques from locating vulnerable code by obfuscating the third-party code, to achieve the purpose of protecting embedded device software. With the rapid development of binary diffing techniques, they can achieve more and more accurate function matching and identification by extracting the features within the function. This makes existing software obfuscation techniques, which mainly focus on the intra-procedural code obfuscation, no longer effective. In this paper, we propose a new inter-procedural code obfuscation mechanism Khaos, which moves the code across functions to obfuscate the function by using compilation optimizations. Two obfuscation primitives are proposed to separate and aggregate the function, which are called fission and fusion respectively. A prototype of Khaos is implemented based on the LLVM compiler and evaluated on a large number of real-world programs including SPEC CPU 2006&2017, CoreUtils, JavaScript engines, etc. Experimental results show that Khaos outperforms existing code obfuscations and can significantly reduce the accuracy rates of five state-of-the-art binary diffing techniques (less than 19%) with lower runtime overhead (less than 7%).


Introduction
Embedded devices are widespread in many fields of modern life, such as wearables, traffic lights, and autonomous driving vision sensors, and their total number is expected to reach 30 billion by 2025 [61]. In recent years, the number of vulnerabilities disclosed in embedded device software has been on the rise, and attacks targeting embedded devices have increased more than fivefold in the past four years [46]. Once a vulnerability in an embedded device is exploited, it can lead to the collapse of the backbone network [4], while vulnerabilities in medical devices such as pacemakers are life-threatening [11,49].
In addition to directly writing flawed code, the reuse of vulnerable third-party code is another important reason for the widespread existence of vulnerabilities in embedded devices [14,33,51]. For example, Cui et al. [14] found that 80.4% of LaserJet printers used third-party libraries with known vulnerabilities. However, vulnerabilities in these embedded devices cannot be patched in time because of fragmentation: similar code exists in multiple versions of various products due to the fast replacement of embedded devices [70]. For example, for the QualPwn vulnerability [63] in Qualcomm's WiFi controller, which ships in millions of Android phones, nearly 6 months passed between the vulnerability disclosure and the patch released by Qualcomm, and OEMs took even longer to patch all devices across all versions.
Unfortunately, this situation favors attackers, who can search for known vulnerabilities instead of laboriously hunting for 0-day vulnerabilities. Since most embedded device software is not open source, attackers usually use binary diffing techniques [3, 6-10, 15-17, 20, 21, 23-32, 34, 35, 42, 44, 47, 54, 55, 66, 67, 69, 71, 73, 75-81] to locate the vulnerable code reused in a binary by comparing the binary with the third-party code. With the introduction of machine learning, binary diffing techniques have made great progress in recent years, which greatly helps attackers locate existing vulnerabilities in binaries. For example, David et al. [17] searched for common vulnerabilities in mobile devices, wearable devices, and medical devices, and were able to locate 373 existing vulnerabilities.
Software obfuscation techniques [5,12,36,41,68,72] transform the program code so that the characteristics of the binary code change even though the source code is the same. They can be used against binary diffing techniques, preventing attackers from locating existing vulnerabilities and thus protecting software. However, recent research has shown that software obfuscation techniques are no longer effective against state-of-the-art binary diffing techniques [20,44,47,69]. The main reason is that most software obfuscation techniques focus on intra-procedural code obfuscation, which does not fundamentally change the semantics of functions, while binary diffing techniques extract features within functions more and more accurately to obtain their semantics.
Based on the above observations, we argue that inter-procedural code obfuscation should be emphasized because of its ability to change function semantics at the binary level. To this end, we propose an inter-procedural code obfuscation technique, Khaos, which moves code across functions and utilizes the compiler's optimizations to transform (obfuscate) the code. The core idea of Khaos is that once the code is restructured among functions, the binary code generated after compilation optimizations can be very different. To achieve inter-procedural code obfuscation, Khaos moves code across functions by separating a function into sub-functions and aggregating functions into one.
Transforming arbitrary functions in Khaos is non-trivial because of the challenges posed by performance, correctness, and obfuscation effect. For example: 1) choosing which code blocks within a function (or which functions) to separate (or aggregate) so as to balance the obfuscation effect with the performance overhead; 2) rebuilding all control flow and data flow among functions after the transformations (especially handling indirect function calls in the fusion); 3) aggregating functions deeply without affecting the functionality of each function.
To address these challenges, Khaos provides two obfuscation primitives: the fission primitive and the fusion primitive. The fission separates a function into sub-functions, and the fusion aggregates several functions into one. The two primitives are complementary: the fission obfuscates a function using only its own code, whereas the fusion obfuscates a function using the code of other functions. Furthermore, the two primitives can be combined to improve the obfuscation effect; that is, the sub-functions separated by the fission can be aggregated with other functions again.
The fission partitions code regions into sub-functions along the control flow of the function, using the dominator tree as the granularity, and combines this with static cold/hot code analysis to achieve lower performance overhead. Since the define-use relationships of variables change from within a function to across functions, the fission needs to rebuild the data flow by passing parameters. To minimize the performance degradation caused by parameter passing, we also propose a data-flow reduction mechanism that reduces the number of parameters of the sub-functions. The control flow (including the exception control flow) is rebuilt by inserting calls to the sub-functions and encoding the return values in the sub-functions.
The fusion selects two functions with compatible return values and no variadic parameters for the aggregation. Two types are compatible if converting between them does not lose precision. The parameter list of the aggregated function is merged from these two functions. To avoid the inefficiency of passing parameters through the stack, we propose a parameter list compression mechanism to reduce the number of parameters. To rebuild the control flow completely, we propose a tagged pointer mechanism, which attaches control bits to function pointers to decide which code is executed when the aggregated functions are called indirectly. We also propose a trampoline mechanism to handle function calls across modules. To further improve the obfuscation effect, we propose a deep fusion method that aggregates innocuous basic blocks, whose execution does not affect the global memory state, from different functions within the aggregated function.
Khaos was implemented based on the LLVM framework. The experimental evaluations were conducted on the Linux/X86_64 platform using SPEC CPU 2006 & 2017 C/C++ programs, CoreUtils, and 5 common embedded device software packages containing vulnerabilities. Five state-of-the-art binary diffing tools [20,21,26,45,81] were used to evaluate the effectiveness of Khaos. The results show that Khaos is both effective and efficient: the effectiveness experiments show that the accuracy of these binary diffing tools was reduced to less than 19% and that the ranking of the vulnerable functions decreased significantly; the performance experiments show that Khaos incurs less than 7% overhead on average.
In summary, our contributions are as follows:
• A novel inter-procedural code obfuscation mechanism. We point out that inter-procedural code obfuscation is necessary against binary diffing techniques, and propose a new obfuscation mechanism, Khaos, which obfuscates code across functions.
• The fission and the fusion primitives. We propose two obfuscation primitives in Khaos to move code across functions. The fission separates a function into multiple functions, and the fusion aggregates multiple functions into one.
• New insights from implementation and evaluation. We implement and evaluate a prototype of Khaos, and the results show that it outperforms existing obfuscators against state-of-the-art binary diffing techniques. Our study suggests that binary diffing techniques should focus more on extracting inter-procedural code features.
Background and Motivation

Binary Diffing
Binary diffing is a technique for visualizing and identifying differences between two binaries. It quantitatively measures the differences between two given binaries and gives matching results at a predefined granularity (e.g., function). It has been widely used in software vulnerability search [10, 15, 17, 18, 23-26, 54, 55, 73, 78, 80], security patch analysis [75,76,79], malware detection [9,28,34,47,71], code clone detection [3,8,30,31,44], etc. The workflow of binary diffing can be divided into two stages: offline feature extraction and online code search. In the offline stage, tools extract features from binaries; which features to extract is the focus of recent research. In the online stage, tools calculate the similarity of the given binaries using the extracted features. For example, BinDiff [81] extracts the number of basic blocks, control flow edges, and function calls within a function as the function's identity. Then, it combines this with a control flow graph matching algorithm to search for similar functions.
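The two-stage workflow can be illustrated with a deliberately simplified sketch. It assumes a hypothetical feature triple (basic blocks, control flow edges, calls) per function and a nearest-neighbor match under L1 distance; the function names and numbers are made up, and this is not BinDiff's actual algorithm, which additionally relies on control flow graph matching.

```python
from typing import Dict, Tuple

# Hypothetical per-function feature triple: (basic blocks, CFG edges, calls).
Features = Tuple[int, int, int]

def match_functions(target: Dict[str, Features],
                    reference: Dict[str, Features]) -> Dict[str, str]:
    """Pair each target function with the reference function whose
    feature triple is closest under L1 distance."""
    matches = {}
    for t_name, t_feat in target.items():
        best_name, _ = min(
            reference.items(),
            key=lambda item: sum(abs(a - b) for a, b in zip(t_feat, item[1])),
        )
        matches[t_name] = best_name
    return matches

# Offline stage: features extracted from a known library (made-up numbers).
reference = {"parse_header": (12, 17, 3), "checksum": (4, 5, 0)}
# Online stage: features of stripped functions in the binary under analysis.
target = {"sub_401000": (12, 17, 3), "sub_401200": (5, 5, 0)}

print(match_functions(target, reference))
```

Because all three features are counted inside a single function, any transformation that moves basic blocks or calls into another function changes the triple, which is the property that inter-procedural obfuscation exploits.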

Software Obfuscation
Software obfuscation transforms a program without changing its functionality to make it hard to analyze. It can be used to hide vulnerabilities, protect intellectual property, etc. In fact, there is an arms race between software obfuscation and binary diffing: software obfuscation aims to prevent binary diffing techniques from matching un-obfuscated code with obfuscated code, whereas binary diffing aims to match them anyway. In recent decades, various software obfuscation techniques have been proposed, and they can be classified into data obfuscation, static code rewriting, and dynamic code rewriting [59].
Data obfuscation techniques [13] transform the format of data to prevent it from being matched directly. Since most binary diffing techniques utilize the features of the code, obfuscating data is less effective against binary diffing.
Various dynamic code rewriting approaches follow the concept of packing [50,58], which hides code by encoding or encrypting it as data. However, packed code is easy to unpack automatically [1,53] or to dump from memory [19,22,60,65], which defeats the obfuscation. Code virtualization is another popular obfuscation technique [37]. It translates code into specific intermediate representations (IRs) instead of native instructions and then uses an engine to interpret the IRs at runtime. This technique sacrifices much performance (at least a 10x slowdown [38]) in exchange for more powerful obfuscation. Therefore, dynamic code rewriting is not suitable for fighting against binary diffing because it is either insufficiently effective or too costly.
In contrast, static code rewriting is a promising technique against binary diffing. It modifies program code during obfuscation without further runtime modifications, which is similar to compiler optimization. Researchers have proposed many techniques for static code rewriting. For ease of introduction, we categorize them by obfuscation granularity:
Instruction level: Instruction substitution [12,36] replaces the original instruction with equivalent instruction(s), such as replacing an "add" instruction with two "sub" instructions. O-LLVM [36] designed 10 different substitution strategies for arithmetic and logical operations. To increase the complexity of conditional branch instructions, opaque predicate techniques [12,36,48,64,72] were proposed. They add conditions that are permanently true or false (e.g., x^2 != -1) and do not affect the original control flow; such conditions are frequently used against analysis techniques such as symbolic execution.
Basic block level: Bogus control flow [12,36,52] inserts dead code into the original control flow and often utilizes opaque predicates to prevent this code from being optimized away or executed, thereby preserving the original functionality of the program.
Function level: Control flow flattening [12,36] converts the control flow of the function into a "switch-case" form, which is hard to analyze, and maintains the original jump relationship by controlling the values of the cases. To prevent the flattened code from being recovered into the original control flow, the "case" relationship is also obfuscated (encrypted).
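The intra-procedural techniques above can be made concrete with a small sketch. It is illustrative only and written at source level rather than on compiler IR; the function names are hypothetical, and the particular substitution and predicate are examples chosen here, not necessarily ones used by O-LLVM.

```python
def add_substituted(a: int, b: int) -> int:
    """Instruction substitution: a + b rewritten as two subtractions."""
    neg_b = 0 - b      # first "sub"
    return a - neg_b   # second "sub"

def guarded_add(x: int, a: int, b: int) -> int:
    """Bogus control flow protected by an opaque predicate."""
    # The square of an integer is never -1, so the condition is always
    # true and the else branch is dead code that only misleads analysis.
    if x * x != -1:
        return add_substituted(a, b)
    else:
        return a * b   # bogus block, never executed

print(add_substituted(3, 4))
print(guarded_add(5, 3, 4))
```

All of these changes stay inside one function: the set of functions, their call relationships, and what each function computes are unchanged, which is the limitation discussed in the motivation that follows.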

Motivation
As binary diffing techniques continue to advance, many static code rewriting techniques (referred to as code obfuscation in the rest of the paper) with intra-procedural granularity (i.e., instruction, basic block, and function) are no longer effective [20,44,47,69]. The main reason is that intra-procedural code obfuscations do not fundamentally change the semantics of each function, while most binary diffing techniques are increasingly capable of extracting features within functions to understand their semantics.
Therefore, we argue that inter-procedural code obfuscations should be emphasized because of their ability to change function semantics at the binary level, which is the key to defeating binary diffing techniques: 1) For binary diffing works that only consider intra-procedural information, inter-procedural code obfuscation can fundamentally defeat them because the code structures, along with the semantics, are significantly changed; 2) For binary diffing works that take inter-procedural information into account, inter-procedural code obfuscation can also defeat them because the extracted inter-procedural information, such as the types of function calls, the number of function calls, and the call graph, is also significantly changed after the obfuscation.

Overview
To achieve inter-procedural code obfuscation, Khaos first changes the amount of code within a function by moving code across functions and then utilizes the compiler's optimizations to transform (obfuscate) the code. The idea behind it is that once the code is restructured among functions, the binary code generated after compilation optimizations (especially intra-procedural optimizations) can be very different.
In detail, we propose two obfuscation primitives: the fission primitive and the fusion primitive. The fission primitive separates a function into multiple sub-functions, thus making the function thinner. The fusion primitive aggregates functions into one, thus making the function fatter. The two primitives can also be used together to make more in-depth changes to a function; that is, the separated sub-functions can be aggregated with other functions.
For the convenience of discussion, we denote a function before the transformation as an oriFunc (short for original function), and denote the new function formed after the fusion as the fusFunc (short for fused function). The new function formed by the separated code during the fission is denoted as the sepFunc (short for separated function), and the function formed by the remaining code is denoted as the remFunc (short for remnant function). Figure 1 gives an example of how the fission and the fusion are performed on a function named cal_file(). The function counts the occurrences of a special character in a given file. It first checks the file name and opens the file (lines 4-7), then reads the content and counts the occurrences (lines 9-11). The fission separates two basic blocks (② and ③) into sepFunc-1, and four basic blocks (⑤-⑧) into sepFunc-2. To maintain correctness, the fission inserts three trampoline basic blocks in the remFunc-1 to create the call relationship with the two sepFuncs. Basic block ⓓ is used to return the different return values of sepFunc-1 (detailed in Section 3.2, The Fission Primitive). On top of the fission, the fusion aggregates the log() function and the sepFunc-2 into a fusFunc-1. The entry basic block ⓔ is inserted into the fusFunc-1 to select the aggregated code blocks.
Changing functions by recombining basic blocks from different functions is not trivial, and it still faces several challenges from performance, correctness, and obfuscation.
• Challenge-1: Choosing which basic blocks (or functions) to separate (or aggregate) seriously affects both the performance overhead and the obfuscation effect, and balancing the two is difficult. For example, separating each basic block as a sepFunc would favor the obfuscation but would bring unacceptable overhead.
• Challenge-2: Completely rebuilding all control flow and data flow among functions after the transformation (especially the fusion) is difficult. For example, once several functions participate in the fusion, we need to handle all pointers to the oriFuncs so that they correctly jump to the fusFunc when de-referenced.
• Challenge-3: Simply merging functions makes them each other's junk code and has a limited obfuscation effect, because the compiler will still optimize the code of the different functions separately. Binding the control flows and data flows that belong to different functions in the fusFunc can prevent that, but doing so without changing the functionality of the functions is challenging.
In the following subsections, we will detail the fission and the fusion design, and how we address the above challenges.

The Fission Primitive
The fission first identifies the regions (each region is a basic block set) that need to be separated, then composes these regions into sepFuncs, and finally rebuilds the control flow and the data flow among sepFuncs and remFunc.

Partitioning Regions to Form sepFunc.
In general, a function has a single entry and multiple exits. Hence, as long as a code region satisfies this property, it can be separated to become a new function. More precisely, as long as a code region forms a dominator tree [2] on the control flow graph, it can be extracted into a sepFunc. The fission creates call relationships among the sepFuncs and the remFunc to ensure correctness. If the fission generates too many sepFuncs, the newly created function calls in the remFunc bring additional overhead (especially new function calls inside a loop). However, if the number or size of the sepFuncs is small, the oriFunc is not significantly changed. Therefore, designing a reasonable region identification algorithm is the key to reducing the overhead and improving the obfuscation effect.
The core idea. We abstract the code region partitioning problem as a graph cutting problem. The function's control flow graph can be regarded as a directed graph, in which the edge weight represents the frequency of execution and thus indicates the cold/hot information. Partitioning a code region can be regarded as cutting the graph, where the weight of the cut edge is the performance cost and the number of nodes in the sub-graph measures the obfuscation effect.
The region identifying algorithm. Based on the above idea, we design the region identifying algorithm (Algorithm 1) on top of the directed weighted graph cut algorithm [62] to balance the performance overhead and the obfuscation effect. The algorithm takes the function code as input and first performs dominator tree analysis [40] (line 2). To avoid separating the whole function body into a sepFunc, we remove the dominator tree of the function itself (line 3) and identify the regions from the rest of the trees. We use the number of basic blocks in a tree to represent the effect of the fission on obfuscation (line 7). We use the execution frequency of the root node of the dominator tree, obtained by block frequency analysis [43] (line 8), together with the loop count (if the region is in a loop, the number of calls to the sepFunc increases) as the cost of the cut, which represents the effect of the fission on performance (lines 8-12). We iteratively select the most cost-effective dominator tree (i.e., the one that maximizes the ratio of effect to cost) to separate until the tree set is empty (lines 13-16).
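The greedy selection described above can be sketched as follows. This is a minimal illustration, not a reproduction of Algorithm 1: the `Region` record, the `loop_penalty` factor, and the rule that a chosen region removes every overlapping candidate are assumptions introduced here, and the underlying graph cut algorithm [62] is not modeled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class Region:
    root: str                 # root block of the dominator tree
    blocks: FrozenSet[str]    # blocks dominated by root, root included
    root_freq: float          # estimated execution frequency of the root
    loop_depth: int           # 0 if the root is not inside a loop

def cost(region: Region, loop_penalty: float = 10.0) -> float:
    # Assumed cost model: hot roots and roots inside loops are expensive
    # to cut, because each execution becomes a function call.
    return max(region.root_freq * loop_penalty ** region.loop_depth, 1e-9)

def select_regions(candidates: List[Region], entry: str) -> List[Region]:
    # Drop the tree rooted at the function entry (the whole function body).
    pool = [r for r in candidates if r.root != entry]
    chosen = []
    while pool:
        # Most cost-effective tree: largest ratio of effect (size) to cost.
        best = max(pool, key=lambda r: len(r.blocks) / cost(r))
        chosen.append(best)
        # Assumption: overlapping trees cannot be separated as well.
        pool = [r for r in pool if not (r.blocks & best.blocks)]
    return chosen

candidates = [
    Region("b1", frozenset({"b1", "b2", "b3", "b4", "b5", "b6"}), 1.0, 0),
    Region("b2", frozenset({"b2", "b3"}), 0.5, 0),
    Region("b3", frozenset({"b3"}), 0.5, 0),
    Region("b5", frozenset({"b5", "b6"}), 1.0, 1),
]
selected = select_regions(candidates, entry="b1")
print([r.root for r in selected])
```

In this toy input the cold two-block region rooted at `b2` is chosen before the region rooted at `b5`, whose root sits in a loop and therefore carries a higher cut cost.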

Data-flow Rebuild.
In addition to identifying regions as the function bodies of sepFuncs, we also need to identify the inputs and the outputs of these regions to construct the parameters and return values of the sepFuncs. A variable used in a region is an input if its definition point is outside the region; similarly, a variable defined in a region is an output if it has a use point outside the region. For example, as shown in Figure 2, the fd and n variables are inputs because their definition points are outside the region, and the value variable is an output because it has a use point outside the region. For variables whose define-use relationships cross regions, we use function parameters to pass pointers to them. We do not pass a region's output variables through the return value of the sepFunc because a region may have multiple output variables.
Data-flow reduction. In general, the local variables of a function are defined in the entry basic block. Therefore, if an identified region needs to use local variables, these variables need to be passed into the sepFunc through parameters. However, if some local variables are only used by a sepFunc, they do not need to be passed into the sepFunc; they can be defined directly in the sepFunc. This shortens the parameter list of the sepFunc, avoids unnecessary variable transmission, and further improves performance. To achieve this, we propose a lazy allocation strategy: if a local variable is only used in the region, we move the variable definition into the sepFunc. For example, the n variable in Figure 2 is initially defined in the oriFunc but is redefined and only used in region-2, which becomes the sepFunc-2 function, so the definition point of the variable can be delayed into the sepFunc-2 function.
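The input/output classification and the lazy allocation rule can be sketched at basic-block granularity. The sketch assumes that each variable is summarized by the set of blocks that assign it (`defs`, excluding its declaration in the entry block) and the set of blocks that read it (`uses`); the block and variable names are made up, and a real implementation would need liveness information to recognize an initial definition that is overwritten before use.

```python
from typing import Dict, Set, Tuple

def classify(region: Set[str],
             defs: Dict[str, Set[str]],
             uses: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (inputs, outputs, movable) for one candidate region."""
    inputs, outputs, movable = set(), set(), set()
    for var in defs.keys() | uses.keys():
        d = defs.get(var, set())
        u = uses.get(var, set())
        touched = d | u
        if touched and touched <= region:
            # Only ever accessed inside the region: declare it in the
            # sepFunc instead of passing it (lazy allocation).
            movable.add(var)
            continue
        if (u & region) and (d - region):
            inputs.add(var)    # used inside, defined outside
        if (d & region) and (u - region):
            outputs.add(var)   # defined inside, used outside
    return inputs, outputs, movable

region = {"B3", "B4"}
defs = {"fd": {"B1"}, "value": {"B3"}, "tmp": {"B3"}}
uses = {"fd": {"B3"}, "value": {"B5"}, "tmp": {"B4"}}

inputs, outputs, movable = classify(region, defs, uses)
print(sorted(inputs), sorted(outputs), sorted(movable))
```

Only the variables in `inputs` and `outputs` become pointer parameters of the sepFunc; the variables in `movable` disappear from the parameter list.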

Control-flow Rebuild.
We extract the basic blocks of each identified region into a sepFunc. The jump relationship between the regions in the oriFunc is transformed into a function call-return relationship after the fission. Creating a function call is simple: we only need to insert the function call at the location of the entry basic block of the region before extraction and set the parameters that need to be passed into the sepFunc.
The handling of function returns is more complex. If a region has multiple exits, the corresponding sepFunc needs to encode the exit taken into its return value, so that the remFunc can use this information to select the corresponding code to execute. As Figure 2 shows, for the two exits (0 and 1) in region-1, when sepFunc-1 returns from exit 0, the control flow should go to BB5, and when it returns from exit 1, the control flow should go to BB9.
We use the return value of the sepFunc to tell the remFunc which direction to take: we first number each exit of the sepFunc and use the number as its return value, and then insert a basic block at the call-site of this sepFunc in the remFunc (e.g., ⓐ in Figure 1) to transfer control flow based on the return value.
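The call-return rebuild can be mimicked at source level. In this sketch, `sep_func` and `rem_func` are hypothetical stand-ins for a sepFunc and its remFunc, and a one-element list stands in for a stack slot whose address is passed as a parameter; the actual transformation operates on compiler IR.

```python
def sep_func(text: str, needle: str, count_out: list) -> int:
    """Extracted region with two exits; the return value is the exit number."""
    if not text:
        return 1                           # exit 1: nothing to scan
    count_out[0] = text.count(needle)      # output written through the "pointer"
    return 0                               # exit 0: normal fall-through

def rem_func(text: str, needle: str) -> int:
    count = [0]                            # local variable passed by reference
    exit_id = sep_func(text, needle, count)   # inserted call
    # Inserted dispatch block: continue at the successor that the
    # region would have jumped to before the fission.
    if exit_id == 1:
        return -1                          # original error path
    return count[0]                        # original normal path

print(rem_func("banana", "a"))
print(rem_func("", "a"))
```

The return value is reserved for the exit number, which is why outputs such as `count` travel through pointer parameters instead.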

Handling the Exception Control-flows.
During program execution, some exception control flows deviate from the usual function call and return, including setjmp/longjmp and C++ exception handling (EH for short). The fission requires special handling of them.
Handling the setjmp/longjmp. Programmers can call setjmp() in a function to record the current context in a jmp_buf structure. They can then call longjmp() in any subroutine on the call chain of this function to go back to the place the jmp_buf points to, i.e., the call-site of the setjmp(). This requires that the setjmp() and the longjmp() using the same jmp_buf be in the same call chain. Therefore, the call-site of the setjmp() cannot be separated into any sepFunc: the stack frame of the function that calls setjmp() must not have been freed when the corresponding longjmp() is executed, whereas the frame of a sepFunc is freed as soon as the sepFunc returns. Otherwise, the longjmp() would direct control flow to an unknown location.
Handling the C++ exception. The EH mechanism is a C++ feature that lets developers capture exceptions raised in a try block by writing catch statements. Since the fission moves part of the code into a sepFunc, a try-catch pair may be broken, making the EH information inconsistent. Simply skipping the exception-relevant functions would reduce the obfuscation effect. Therefore, when identifying a code region, if it contains any code that may generate an exception, we locate the corresponding catch code and place it in the same region.

The Fusion Primitive
The fusion selects functions to form a fusFunc and rebuilds the control flow and the data flow to ensure correctness. In theory, the fusion can aggregate any number of functions. To balance the performance overhead and the obfuscation effect, we choose to aggregate two functions to form a fusFunc.

Selecting Functions to Form fusFunc.
The fusion cannot select functions arbitrarily; it needs to select functions whose return types are compatible. Two types are incompatible if converting between them loses precision. For example, when the return value of one function is an integer and that of the other is a float, these two functions cannot be aggregated.
In fact, several conditions exclude functions from the selection: 1) variadic functions, e.g., printf(...); 2) two functions with incompatible return types; 3) two functions that have a direct calling relationship. The first two constraints are designed for correctness, and the last one is designed for performance, to avoid generating many recursive fusFuncs. Functions that satisfy these constraints are randomly aggregated in pairs.

Data-flow Rebuild.
Once the two functions to be aggregated are determined, the function prototype of the corresponding fusFunc can be determined immediately. For example, as shown in Figure 3 (a) and (b), the bar() and the foo() are aggregated into int bar_foo_fusion(). The ctrl parameter is used to select between the function bodies aggregated from the bar() and the foo(). Determining the function prototype of the fusFunc is crucial to rebuilding the data flow, which involves setting the parameter list and the return value.
Parameter list compression. Simply merging the parameter lists of the two functions makes the parameter list of the fusFunc too long, which degrades the performance of calling the fusFunc. This is because in the X86_64 calling convention, the first six parameters are passed in registers and the remaining parameters are passed on the stack, which is inefficient. To achieve efficient parameter passing, we propose a parameter list compression mechanism: if the types of two parameters from the two functions are compatible, we compress them into one. We can do this because when a fusFunc is called, only the parameter list of one of the functions participating in the aggregation is used. For example, as Figure 3(c) shows, both the bar() and the foo() have an integer parameter (short a and int m), so we compress them into one integer parameter (int x).
If a parameter cannot participate in the compression, it is copied into the parameter list of the fusFunc, so the number of parameters increases after the fusion. In the worst case, none of the parameters can be compressed and the fusFunc takes the sum of the parameters of the two functions. To avoid using the stack to pass parameters as much as possible, we preferentially select functions whose total number of parameters is less than six for the fusion.
Return value determination. Determining the return type of the fusFunc is relatively simple: 1) if the return type of one function is void, the return type of the fusFunc is the return type of the other; 2) if neither return type is void, the compressed type is used as the return type of the fusFunc, similarly to the parameter list compression mechanism.
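The prototype construction can be sketched as follows. The type names, the widening ranks, the greedy pairing order, and the position of the `ctrl` parameter are all assumptions made for illustration; the text only requires that compatible types convert without losing precision.

```python
from typing import List, Optional

# Assumed widening order within each family; integer and floating-point
# types are treated as incompatible with each other.
INT_RANK = {"i8": 0, "i16": 1, "i32": 2, "i64": 3}
FLOAT_RANK = {"f32": 0, "f64": 1}

def merge_type(a: str, b: str) -> Optional[str]:
    """Return the wider of two compatible types, or None if incompatible."""
    for rank in (INT_RANK, FLOAT_RANK):
        if a in rank and b in rank:
            return a if rank[a] >= rank[b] else b
    return a if a == b else None

def compress_params(params_a: List[str], params_b: List[str]) -> List[str]:
    """Greedily pair each parameter of A with a compatible parameter of B."""
    merged, remaining = [], list(params_b)
    for ta in params_a:
        for i, tb in enumerate(remaining):
            m = merge_type(ta, tb)
            if m is not None:
                merged.append(m)
                del remaining[i]
                break
        else:
            merged.append(ta)          # no partner: copied as-is
    return ["ctrl"] + merged + remaining

def fused_return(ret_a: str, ret_b: str) -> Optional[str]:
    """Return type of the fused function, or None if the pair cannot be fused."""
    if ret_a == "void":
        return ret_b
    if ret_b == "void":
        return ret_a
    return merge_type(ret_a, ret_b)

print(compress_params(["i16"], ["i32"]))
print(compress_params(["i16", "f64"], ["i32", "ptr"]))
print(fused_return("void", "i32"), fused_return("i32", "f32"))
```

The first call mirrors the `short a` / `int m` case from the text: the two integer parameters collapse into a single 32-bit slot, so the fused prototype has one data parameter plus `ctrl`.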

Control-flow Rebuild.
Once the fusFunc is created, the two involved oriFuncs need to be removed, and all call-sites of the oriFuncs need to be replaced with calls to the fusFunc. As mentioned before, a ctrl parameter is added to the parameter list of the fusFunc to select the code block aggregated from the oriFuncs. The value of this parameter is 0 or 1, which is set according to the original call-site of the oriFunc. Since the fusFunc parameter list includes the parameters of both oriFuncs, we only need to pass the parameters required by the oriFunc to the fusFunc at the call-site of this oriFunc. Unused parameters are set to 0.
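A source-level sketch of the direct-call rewriting is given below. The names bar, foo, and bar_foo_fusion follow the example in the text, but the function bodies and the second parameter `s` are invented for illustration, as is the choice of which function maps to which `ctrl` value.

```python
# Original functions (bodies are hypothetical):
#   def bar(a):     return a + 1
#   def foo(m, s):  return len(s) * m

def bar_foo_fusion(ctrl, x, s):
    """Fused function: ctrl selects which original body runs."""
    if ctrl == 0:
        return x + 1          # body aggregated from bar(); s is unused
    return len(s) * x         # body aggregated from foo()

# Rewritten call-sites: each passes only the parameters its original
# callee needed and sets the unused ones to 0.
r1 = bar_foo_fusion(0, 5, 0)      # was: bar(5)
r2 = bar_foo_fusion(1, 3, "ab")   # was: foo(3, "ab")
print(r1, r2)
```

Because the caller is known statically at a direct call-site, the constant for `ctrl` can be chosen at compile time; indirect calls lack this information.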
Handling indirect function calls. Indirect function calls are more difficult to handle than direct function calls because we do not know where the oriFunc will be called. Figure 4 (a) shows an example that calls two functions by de-referencing a function pointer. The corresponding data flow is given in Figure 4 (b). When aggregating the bar() and the foo(), we need to make the function pointer point to the fusFunc and then replace the function call with a call to this fusFunc. However, we do not know what value the ctrl parameter should be set to, because at compile time we do not know whether the original function pointer fptr points to the bar() or the foo().
To address the above problem, we propose a tagged pointer mechanism, which is similar to the low-fat pointer [39]. The core idea is to encode the information about which oriFunc the original function pointer points to (called the tag) into the new function pointer; when the new function pointer is de-referenced to make a call, the value of the ctrl parameter can be determined dynamically by parsing the new function pointer. In detail, whenever the address of a function participating in the aggregation is taken, we perform the encoding operation. Since the tag is encoded into the function pointer, it propagates along with the function pointer. When the function pointer is de-referenced to make a call, we extract the tag from the pointer and set the ctrl parameter according to the tag.
The tag requires two extra bits: one bit indicates whether the pointer points to a fusFunc, and the other bit records the value of the ctrl parameter. For example, as shown in Figure 4 (c), if the pointer fptr points to the bar(), the value of the tag will be set to 11b. When the pointer fptr is de-referenced to make a call, we insert code to first check whether the tag is empty. If not, the code extracts the ctrl parameter and calls the fusFunc. Otherwise, no additional operations are required.
We choose the 2nd bit and the 3rd bit of function pointers to place the tag. This is because functions are usually 16-byte aligned for performance reasons, so the lowest 4 bits of the function pointer can be used (more reasons and considerations are detailed in A.1).
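The tagged pointer mechanism can be illustrated with a small simulation. This is only a sketch under stated assumptions: addresses are modelled as plain integers, the bit assignment (bit 2 as the "points to a fusFunc" flag, bit 3 as the ctrl value, counting from 0) is our assumption because the text does not fix which bit carries which role, and the address table, `fus_func`, and `plain_func` are hypothetical.

```python
# Assumed layout (not specified in the text): bit 2 flags "points to a fusFunc",
# bit 3 carries the ctrl value. Addresses are modelled as integers.
FUSED_BIT = 1 << 2
CTRL_BIT = 1 << 3
TAG_MASK = FUSED_BIT | CTRL_BIT


def encode(fus_addr, ctrl):
    """Build a tagged pointer when the address of a fused oriFunc is taken."""
    assert fus_addr % 16 == 0, "tagging relies on 16-byte alignment"
    return fus_addr | FUSED_BIT | (CTRL_BIT if ctrl else 0)


def decode(ptr):
    """Return (address, ctrl); ctrl is None when the tag is empty."""
    if (ptr & FUSED_BIT) == 0:
        return ptr, None
    return ptr & ~TAG_MASK, (ptr & CTRL_BIT) >> 3


def fus_func(ctrl, x):
    # Hypothetical fused function: ctrl selects the original body.
    return ("bar", x + 1) if ctrl == 1 else ("foo", x * 2)


def plain_func(x):
    # Hypothetical function that did not take part in the fusion.
    return ("plain", x)


# Hypothetical address-to-code table standing in for the text segment.
CODE = {0x401000: fus_func, 0x402000: plain_func}


def indirect_call(ptr, x):
    """Code inserted at an indirect call site: check the tag, then dispatch."""
    addr, ctrl = decode(ptr)
    target = CODE[addr]
    if ctrl is None:
        return target(x)      # untagged pointer: call as before
    return target(ctrl, x)    # tagged pointer: pass the extracted ctrl


fptr_bar = encode(0x401000, 1)   # tag bits become 11b, as in the bar() example
fptr_foo = encode(0x401000, 0)
```

With this layout, a pointer to bar() carries the tag 11b in bits 3 and 2, and an untagged pointer is called without any extra work.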

Handling function calls across modules.
There are two cases of cross-module function calls: in one, a function pointer of a module is propagated to other modules; in the other, a module directly calls functions exported by other modules. If either case involves a fusFunc, we need to process all involved modules to ensure the fusFunc can be called correctly. In some cases, however, we cannot process all the modules (e.g., some libraries may have no source code).
To address this problem, we propose a trampoline mechanism so that not all modules need to be processed. In detail, we traverse the data flow conservatively and identify all function pointers that may propagate outside the module. We then modify these function pointers to point to a piece of trampoline code instead of the fusFunc. When the external module calls through these function pointers, the control flow transfers to the trampoline code first, and the trampoline code reorganizes the function parameters on behalf of the external caller and calls the fusFunc. For exported oriFuncs, the method is similar: the oriFunc's function body is replaced with the trampoline code.
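A minimal sketch of the trampoline idea follows. It is an illustration rather than the Khaos implementation: `fus_func`, `make_trampoline`, and `external_module` are hypothetical names, and padding unused parameters with 0 is our assumption about how the parameter list might be reorganized.

```python
def fus_func(ctrl, a, b):
    """Hypothetical fusFunc aggregating add(a, b) and neg(a)."""
    if ctrl == 0:
        return a + b   # body of the hypothetical oriFunc add(a, b)
    return -a          # body of the hypothetical oriFunc neg(a); b is unused


def make_trampoline(ctrl, arity, fused_arity=2):
    """Create a stub that keeps the oriFunc's original calling interface."""
    def trampoline(*args):
        if len(args) != arity:
            raise TypeError("wrong number of arguments for the original function")
        # Reorganize the parameters: pad the slots this oriFunc never used.
        padded = list(args) + [0] * (fused_arity - arity)
        return fus_func(ctrl, *padded)
    return trampoline


# Pointers that may escape the module refer to trampolines, not to fus_func.
add = make_trampoline(ctrl=0, arity=2)
neg = make_trampoline(ctrl=1, arity=1)


def external_module(callback, *args):
    """Stands in for unprocessed code that only knows the original signature."""
    return callback(*args)
```

The external caller needs no knowledge of the ctrl parameter or of the fused parameter list; the stub supplies both.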

The Deep Fusion.
To further improve the obfuscation effect, we propose a deep fusion method to aggregate as many basic blocks as possible between the two parts of the code during the fusion process.
We have observed that some basic blocks can be executed many times without affecting the normal function. The characteristic of these basic blocks is that their execution does not affect the global memory state; we call them innocuous basic blocks in this paper. The concept is very similar to the reentrant function [56], in that such a block can be re-executed without affecting the functionality of the program. Innocuous basic blocks from different oriFuncs can be aggregated together within the fusFunc. The innocuous analysis of each basic block is conservative. For example, 1) if it cannot be determined whether the data modified by a memory write in a basic block is local or global, then this basic block is not innocuous; 2) if a basic block contains a call to an external function, this basic block is not innocuous.
We give a simplified example from 464.h264ref in the SPEC CPU 2006 benchmark. As shown in Figure 5, Update() and UMV() are aggregated into Fusion(). The basic block (BB) ③ of Update() first redefines the local variable delta, then loads the value of the global variable Current, and finally writes two local variables, tmp1 and tmp2. Since these operations do not affect the global memory state, BB ③ is determined to be innocuous, and so is BB ⑥ of UMV(); thus we aggregate them into one block, BB ⑧. This deep fusion method modifies the control flow graph and the data flow graph of the fusFunc at the same time, adding data dependencies and control dependencies so that the fusFunc cannot simply be separated back into the two functions.
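The conservative innocuous check can be sketched as follows. The instruction encoding (tuples of an operation and a target kind) is a hypothetical simplification, and rejecting every call, not only external ones, is our assumption; the text only states the two rules for undetermined writes and external calls.

```python
def is_innocuous(block):
    """Conservatively decide whether a basic block may be re-executed.

    block: list of (op, kind) pairs, where op is "load", "store" or "call"
    and kind is "local", "global", "unknown", "external" or "internal".
    """
    for op, kind in block:
        # A write is only acceptable when it provably targets local data.
        if op == "store" and kind != "local":
            return False
        # Assumption: any call is rejected, which covers external calls.
        if op == "call":
            return False
    return True


# Modelled after BB 3 of Update(): redefine delta, load Current, write tmp1/tmp2.
bb3_update = [
    ("store", "local"),   # delta
    ("load", "global"),   # Current (a read leaves global memory unchanged)
    ("store", "local"),   # tmp1
    ("store", "local"),   # tmp2
]
```

Loads from global memory are accepted because they do not change the global memory state, which matches the treatment of the variable Current in the example.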

Combining the Fission and the Fusion
The fission and the fusion can be used together to further enhance the obfuscation effect. There are three combinations as follows:
• FuFi.sep: Only aggregating the sepFuncs generated by the fission. In this case, the issue of handling indirect function calls no longer exists;
• FuFi.ori: Only aggregating the oriFuncs that are not processed by the fission, e.g., the functions with only one basic block. This combination could balance the obfuscation effect and the performance overhead well, and is suitable for software in most real-world scenarios;
• FuFi.all: Aggregating the fission-generated sepFuncs and the fission-unprocessed oriFuncs uniformly and randomly. In this combination, the obfuscation effect is prioritized, followed by the performance overhead. It is suitable for programs that require a high obfuscation effect.
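The three combinations differ only in which functions are offered to the fusion. A minimal sketch, assuming functions are identified by name and that "uniformly and randomly" can be modelled by a seeded shuffle (the helper `fusion_candidates` and all function names are hypothetical):

```python
import random


def fusion_candidates(mode, ori_funcs, fissioned, sep_funcs, seed=0):
    """Select the functions handed to the fusion under each combination.

    ori_funcs: all original functions; fissioned: those processed by the fission;
    sep_funcs: the functions generated by the fission.
    """
    untouched = [f for f in ori_funcs if f not in fissioned]
    if mode == "FuFi.sep":
        return list(sep_funcs)
    if mode == "FuFi.ori":
        return untouched
    if mode == "FuFi.all":
        pool = list(sep_funcs) + untouched
        random.Random(seed).shuffle(pool)   # mix both kinds before pairing
        return pool
    raise ValueError("unknown combination: " + mode)


ori = ["parse", "hash", "tiny"]
fissioned = {"parse", "hash"}
seps = ["parse.sep1", "hash.sep1", "hash.sep2"]
```

In this toy input, `tiny` stands for a function with a single basic block that the fission leaves unprocessed.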

Evaluation
We implemented Khaos on top of LLVM 9.0.1. The fission and the fusion are implemented as middle-end passes, and the fission pass is scheduled before the fusion pass. We run Khaos on Ubuntu 20.04 (Kernel v5.4.0) on an Intel(R) Xeon(R) Gold 5218 CPU with 128 GB of memory. This section evaluates Khaos in terms of effectiveness and performance, and answers the following questions:
• (Q1) How is the performance of the obfuscated programs?
• (Q2) How does Khaos work against the state-of-the-art binary diffing techniques?
• (Q3) How good is Khaos at hiding real vulnerable code?
Test Suites. We used three test suites to evaluate Khaos: 1) all C/C++ programs in the SPEC CPU 2006/2017 benchmarks with the ref input (denoted as T-I); 2) all 108 programs in CoreUtils 8.32 (denoted as T-II); 3) five programs commonly used in embedded devices, each with at least one vulnerability, including two popular IoT JavaScript engines (JerryScript and QuickJS), OpenSSL-1.1.1, BusyBox-1.33.1, and libcurl-7.34.0 (denoted as T-III). The performance evaluation was performed on T-I (Q1); the effectiveness against binary diffing techniques was evaluated on T-I and T-II (Q2); the ability to hide vulnerable code was evaluated on T-III (Q3). Since software developers typically link programs into a single binary in embedded devices, we compiled and obfuscated these test suites in the same way under O2 with link-time optimization (LTO).
Comparison targets. To compare with existing obfuscators, we choose the popular compiler-level obfuscation tool O-LLVM [36] as our comparison target because it is open-sourced and compiler-based (same as Khaos). O-LLVM [36] contains three obfuscation methods: instruction substitution (Sub), bogus control flow (Bog), and control flow flattening (Fla). Prior works [5,20,57,69] in the software engineering, systems security, and programming languages fields all use it in their experiments. To ensure the consistency of the evaluation environment, we upgrade the LLVM version of O-LLVM [36] to 9.0.1, the same version used by Khaos. We also choose BinTuner [57], an iterative compilation tool that uses compiler options to transform the code and enlarge the difference between binaries, as another target, in order to compare Khaos with compiler options.

Performance Overhead after Obfuscation
We separately evaluated the performance overhead of the fission, the fusion, and the three combination modes introduced in subsection 3.4 on T-I. As shown in Figure 6, the geometric-mean performance overheads of the fission and the fusion are 5% and 6%, respectively. Some cases (e.g., 456.hmmer) have a negative performance overhead for two reasons: after the fission separates part of the code, the remFunc can be further inlined into its callers, and the fusion improves the code locality of the aggregated functions. The results demonstrate that obfuscations compliant with the compiler optimizations can have good performance advantages.
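For reference, an aggregated overhead of this kind can be computed as the geometric mean of per-benchmark runtime ratios minus one. The sketch below assumes that definition (the text does not spell out the formula), and the timings used with it are made-up placeholders, not measurements from the paper.

```python
import math


def geomean_overhead(timings):
    """timings: list of (baseline_seconds, obfuscated_seconds) pairs.

    Returns the geometric mean of the slowdown ratios minus 1, so 0.05 means 5%.
    """
    ratios = [obfuscated / baseline for baseline, obfuscated in timings]
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return math.exp(log_mean) - 1.0
```

A benchmark that becomes faster after obfuscation contributes a ratio below 1 and therefore pulls the aggregate down, which is how individual negative overheads enter the mean.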
Compared with FuFi.ori, the other two combinations have a higher overhead because the fission generates many sepFuncs, and aggregating them all incurs non-negligible performance overhead. For example, 502.gcc_r contains many recursive functions; the sepFuncs generated from these functions are aggregated into fusFuncs that are themselves recursive. Since the stack frames of fusFuncs are larger, they put considerable pressure on the stack.

Compared with O-LLVM [36]. We compared the performance overhead of Khaos with Sub, Bog, and Fla. As shown in Figure 7, Khaos has overhead comparable to Sub and Bog. Due to the high overhead of Fla, we reduce its obfuscation ratio to 10% (Fla-10); all others are at 100%.

Table 1 (partial rows): VulSeeker [26] | function | N | Y | Y | Y; Asm2Vec [20] | function | N | N | N | Y; Safe [45] | function | …

The Effectiveness against Binary Diffing
Comparing binary diffing works is challenging because their similarity measurements differ widely [57], such as graph edit distance or statistical significance. Simply comparing their similarity scores does not provide accurate information. For the commercial binary diffing tool BinDiff [81], we normalized its similarity score to [0, 1]. For the other tools, which are open-sourced by academia, we normalized their results by computing the ratio of true matching function pairs that are also the top-ranked matching candidates (i.e., Precision@1).
Pairing success judgment method. Since Khaos changes the number of functions, we relax the requirements for Precision@1. For the fission, if the oriFunc is paired with the remFunc or with any sepFunc generated from it, this pairing is recognized as successful. For the fusion, if the fusFunc is paired with any function before the fusion, this pairing is recognized as successful. For DeepBinDiff [21], since its result is basic block to basic block, the pairing is recognized as successful as long as the functions the blocks belong to are matched, even if the two basic blocks are not truly matched. It is worth noting that the above setting is looser than the one originally used in these tools and is therefore more challenging for Khaos.
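The relaxed Precision@1 can be sketched as follows. The data layout is an assumption made for illustration: `derived` maps each original function to the set of obfuscated functions that count as a successful pairing for it, and `top1` holds the top-ranked candidate reported by a diffing tool. All function names are hypothetical.

```python
def relaxed_precision_at_1(top1, derived):
    """Fraction of original functions whose top-1 candidate is an accepted match.

    top1: {original_function: top-ranked candidate in the obfuscated binary}
    derived: {original_function: set of accepted obfuscated functions}
    """
    hits = sum(1 for func, cand in top1.items() if cand in derived.get(func, set()))
    return hits / len(top1)


derived = {
    "parse": {"parse.rem", "parse.sep1"},   # fission: remFunc or any sepFunc counts
    "bar": {"fus_bar_foo"},                 # fusion: the fusFunc counts for bar()
    "foo": {"fus_bar_foo"},                 # ... and for foo()
}
top1 = {"parse": "parse.sep1", "bar": "fus_bar_foo", "foo": "unrelated"}
```

In this toy input, two of the three original functions are paired successfully under the relaxed rule.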
Test suite adjustment. The characteristics of the binary diffing tools we used are summarized in Table 1. The column "symbol relying" indicates whether using un-stripped binaries affects the result; for example, BinDiff usually uses function names to reduce the searching space. The columns "time consuming" and "memory consuming" indicate that the diffing process takes a long time or requires a lot of memory. The column "call-graph lacking" indicates whether the call graph is missing from the features a tool uses. The test suites for VulSeeker [26] and DeepBinDiff [21] need to be adjusted because these tools cannot produce results on the full suites. VulSeeker [26] takes more than 1 day to diff two large binaries.
Results. We evaluated the accuracy of these tools by comparing obfuscated and un-obfuscated (un-stripped) binaries on T-I and T-II. As Figure 8 shows, higher accuracy means lower adversarial effect. Since BinDiff [81] takes advantage of function names, its result is much higher than the others. Although DeepBinDiff [21] uses the basic block level instead of the function level as its granularity, the feature vector of a basic block still encodes the control flow graph and the call graph, both of which are changed by Khaos; this is why Khaos can defeat it. With comparable overhead, Khaos achieves a much better adversarial effect than O-LLVM [36].
Compared with compiler options. We follow the comparison method of BinTuner [57] to calculate the similarity score of BinDiff [81] under different compiler settings. For the BinTuner part, we set the O0 binary code (the same setting as in the paper [57]) as the baseline during its iterative compilation. For the Khaos part, we use binaries generated by FuFi.all. As shown in Figure 9, Khaos has a much lower similarity score under different compiler options. We also compared the overhead of programs generated by BinTuner against the baseline of Khaos (O2 with LTO). The overhead is 30.35%.

The Ability of Hiding Vulnerable Code
We use T-III to further evaluate the ability to hide real-world vulnerable code. Each program contains at least one vulnerability (detailed in A.2). In this experiment, we only used VulSeeker [26], Asm2Vec [20], and SAFE [45] to calculate the escape@n ratio of vulnerable functions (the ratio of vulnerable functions whose truly matched pair is ranked outside the top n of the matched result). BinDiff and DeepBinDiff were not used because they only give the top-1 matched result. We calculated the escape@1/10/50 ratios of vulnerable functions. For example, as shown in Figure 10, the escape@50 ratio of FuFi.all on Asm2Vec is over 0.8. The escape ratio reflects the ability of different obfuscations to hide the vulnerable code. With the same precision and binary diffing tool (e.g., escape@50-Asm2Vec), FuFi.sep and FuFi.all are better than FuFi.ori, and all of them are better than Sub, Bog, and Fla in O-LLVM. This ratio also reflects the diffing ability of binary diffing tools. With the same precision and obfuscator settings (e.g., escape@50-FuFi.all), Asm2Vec is more accurate than SAFE, and both of them outperform VulSeeker. The experimental results show that Khaos can not only fight against binary diffing tools, but also significantly lower the pairing rank of vulnerable functions, achieving the purpose of hiding vulnerable code.
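The escape@n ratio can be computed from the rank of the true match of each vulnerable function. The sketch below assumes 1-based ranks and treats a function whose true match never appears in the result as escaped; the ranks and function names are made-up placeholders, not results from the paper.

```python
def escape_at_n(true_ranks, n):
    """Ratio of vulnerable functions whose true match is not within the top n.

    true_ranks: {function: 1-based rank of the true match, or None if absent}
    """
    escaped = sum(1 for rank in true_ranks.values() if rank is None or rank > n)
    return escaped / len(true_ranks)


# Hypothetical ranks for five vulnerable functions.
ranks = {"vuln_a": 1, "vuln_b": 7, "vuln_c": 42, "vuln_d": 300, "vuln_e": None}
```

A higher escape@n means that more vulnerable functions stay hidden from an analyst who inspects only the first n candidates.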

The Statistics of Khaos Internals
We collected some internal information on T-I and T-II to demonstrate the effectiveness of Khaos. We used the objdump tool to disassemble all the binaries and calculated the histogram of opcodes. After that, we calculated the vector distance between the original and obfuscated binaries. Since different programs contain different amounts of code, we used the maximum distance over all obfuscated programs as the baseline to normalize these distances. As shown in Figure 11, the opcode distance of FuFi.all is the largest, followed by FuFi.sep and FuFi.ori.
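The opcode-distance measurement can be sketched as follows. Two points are assumptions: the text does not name the vector distance, so Euclidean distance is used here, and the opcode sequences are passed in directly instead of being parsed from objdump output. The sample opcode lists are hypothetical.

```python
import math
from collections import Counter


def histogram_distance(h1, h2):
    """Euclidean distance between two opcode histograms (assumed metric)."""
    keys = set(h1) | set(h2)
    return math.sqrt(sum((h1[k] - h2[k]) ** 2 for k in keys))


def normalized_distances(original_ops, obfuscated_variants):
    """Distance of each obfuscated variant to the original, scaled by the maximum.

    original_ops: list of opcode mnemonics of the original binary
    obfuscated_variants: {variant_name: list of opcode mnemonics}
    """
    base = Counter(original_ops)
    raw = {
        name: histogram_distance(base, Counter(ops))
        for name, ops in obfuscated_variants.items()
    }
    largest = max(raw.values())
    return {name: (d / largest if largest else 0.0) for name, d in raw.items()}


original = ["mov", "add", "ret"]
variants = {
    "same": ["mov", "add", "ret"],
    "changed": ["mov", "mov", "add", "call", "ret"],
}
```

Scaling by the largest distance maps every variant into [0, 1], which makes programs of different sizes comparable.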
We also calculated the statistics of the fission and the fusion individually, without the combination. For the fission, we counted the fission ratio (#sepFuncs / #oriFuncs), the average number of basic blocks in sepFuncs (#BB), and the reduced ratio of oriFuncs after the fission (RR). For the fusion, we counted the fusion ratio (the ratio of functions aggregated successfully), the number of parameters reduced by parameter list compression (#RP), and the number of innocuous basic blocks of each function (#HBB). The statistical results are shown in Table 2.
These internal statistics show that Khaos can obfuscate the oriFuncs thoroughly. For example, the Fusion Ratio is 97-99%, which means almost all functions are aggregated. They also show that both the optimizations for runtime overhead (e.g., data-flow reduction) and the obfuscation enhancements (e.g., innocuous analysis) have worked effectively.

Discussion and Future Work
Aside from obfuscation techniques, we found that existing obfuscators have limitations in their implementations. In O-LLVM [36], the transformations of Sub can be optimized away under the LLVM O3 option, which leads us to choose O2 as our baseline. Bog and Fla skip the exception-relevant functions. For Tigress [12], we were unable to evaluate it in the same way as O-LLVM due to compilation errors.
The diffing process can be seen as a feature searching process. After we separate and aggregate these features, the searching difficulty increases and the searching accuracy decreases. From our summary in Table 1, the lack of call-graph consideration makes these tools unable to adapt to inter-procedural obfuscation. We believe our study will raise awareness of the effect of inter-procedural obfuscation on binary diffing. Smaller diffing granularity brings higher diffing costs. One way to reduce the cost is to use context information to narrow the searching space. Previous works pay much more attention to control flow than to data flow. From the diffing perspective, data flow is harder to capture and encode. But from the obfuscation perspective, data flow is harder to change, too. Therefore, we predict that the potential of data flow representation can be further tapped.

Conclusion
Binary diffing techniques can be used by attackers for 1-day/n-day vulnerability searching. In this paper, we propose an inter-procedural obfuscation technique, Khaos, to protect software against state-of-the-art binary diffing. We design two obfuscation primitives: the fission and the fusion. Experimental results show that Khaos is not only effective, but also efficient. We hope our study will not only help developers protect their software, but also promote the development of binary diffing techniques in turn.

A Appendix
A.1 The tag bits choice. As mentioned in subsection 3.3, the tagged pointer is used to select the code block aggregated from different oriFuncs. On the X86_64 architecture, only 48 bits of the virtual address are effective, so the upper 16 bits of the function pointer are unused and can be used to place the tag information for the fusion. But this approach is expensive when handling statically initialized pointers, such as global static function pointers and virtual function tables. For a position-independent executable, the values of these pointers need to be relocated to point to the actual function at load time. To attach the tag information to these pointers, we would need to add initialization code that rewrites these pointers after the relocation, which makes the program load more slowly.
To address the above problem, we choose to use the lowest bits of function pointers. This is because the addresses of functions are usually 16-byte aligned for performance reasons, so the lowest 4 bits of the function pointer can also be used to place the tag information. In fact, the clang compiler already uses the least significant bit to identify whether a function pointer points to a virtual function or not, so currently only 3 bits remain unused. Instead of rewriting statically initialized pointers after the relocation, we utilize the relocation mechanism directly by adding the tag's value to the addend field (which is used to add an offset when relocating) of the relocation item, so the tag information is attached to the pointer during the relocation. This method cannot be applied to a tag in the upper bits because such a value exceeds the range supported by the addend field, i.e., (−2^31, +2^31].
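The addend-based tagging can be illustrated with simple arithmetic. This sketch models a relocation as load base plus symbol offset plus addend, uses the addend range exactly as stated in the text, and picks a hypothetical load base, symbol offset, and tag values; it is not a model of any particular relocation type.

```python
# Range of the addend field as stated in the text: (-2^31, +2^31].
ADDEND_LOW = -(2 ** 31)
ADDEND_HIGH = 2 ** 31


def fits_addend(value):
    """Check whether a tag value can be carried by the addend field."""
    return ADDEND_LOW < value <= ADDEND_HIGH


def relocate(load_base, symbol_offset, addend):
    """Simplified relocation: the loader adds the addend to the final address."""
    return load_base + symbol_offset + addend


LOW_TAG = 0b1100          # tag placed in the low, alignment-guaranteed bits
HIGH_TAG = 0b11 << 48     # the same tag placed above the 48 effective address bits

# Hypothetical load base and 16-byte aligned function offset.
tagged_ptr = relocate(0x555555554000, 0x1230, LOW_TAG)
```

Because the function address is 16-byte aligned, adding the low tag during relocation cannot carry into the address bits, and masking the lowest 4 bits recovers the original address.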

A.2 CVE Detail
As discussed in subsection 4.4, we use Test Suite III to further evaluate the ability to hide real-world vulnerable code. As shown in Table 3, each program contains at least one vulnerability.

Figure 1. An example of obfuscating a function by using Khaos.

Figure 2. The control-flow and data-flow graphs of cal_file() in Figure 1.

Figure 3. An example of performing the fusion on two functions.

Figure 4. Function reference and indirect call processing.

Figure 5. A real-world example of the deep fusion method.

Algorithm 1 (excerpt).
3: … ← … \ …  ⊲ We won't separate the whole function
6: for dominator tree t in … do
7:   … ← basic block count of …
8:   … ← frequency of …'s head
9:   if … is in loop then
10:    … ← the innermost loop where … is located
11:    … ← loop count of … × …

Table 1. Summary of the chosen diffing works.
VulSeeker [26] often gets killed due to the memory limit. To speed up VulSeeker, we group the related functions into small groups (30 functions per group) to manually reduce the searching space, which is unfavorable to Khaos because the smaller the group size, the easier the diffing. DeepBinDiff [21] requires too much memory (sometimes more than 10 TB) due to its representation of basic blocks. Since its diffing process is tightly coupled with binary size, we decide not to modify it and only use programs of less than 40k lines. Even with the reduced test suite, it is still time consuming (e.g., over 1 week to diff the binaries of 508.namd_r). It is worth mentioning that this is also unfavorable to Khaos: because Khaos uses original functions to obfuscate each other, a lack of material reduces the obfuscation effect. The other binary diffing tools still use the normal test suites.

Table 2. Statistics of the fission and the fusion.
An escape@50 ratio over 0.8 means that more than 80% of vulnerable functions cannot be found within the top-50 ranked functions. Moreover, this time we set the obfuscation ratio of Fla in O-LLVM to 100%, which would bring unacceptable overhead in a real scenario.

Table 3. Vulnerable functions of Test Suite III.