Black-box Attacks Against Neural Binary Function Detection

Binary analyses based on deep neural networks (DNNs), or neural binary analyses (NBAs), have become a hotly researched topic in recent years. DNNs have been wildly successful at pushing the performance and accuracy envelopes in the natural language and image processing domains. Thus, DNNs are highly promising for solving binary analysis problems that are hard due to a lack of complete information resulting from the lossy compilation process. Despite this promise, it is unclear that the prevailing strategy of repurposing embeddings and model architectures originally developed for other problem domains is sound given the adversarial contexts under which binary analysis often operates. In this paper, we empirically demonstrate that the current state of the art in neural function boundary detection is vulnerable to both inadvertent and deliberate adversarial attacks. We proceed from the insight that current generation NBAs are built upon embeddings and model architectures intended to solve syntactic problems. We devise a simple, reproducible, and scalable black-box methodology for exploring the space of inadvertent attacks – instruction sequences that could be emitted by common compiler toolchains and configurations – that exploits this syntactic design focus. We then show that these inadvertent misclassifications can be exploited by an attacker, serving as the basis for a highly effective black-box adversarial example generation process. We evaluate this methodology against two state-of-the-art neural function boundary detectors: XDA and DeepDi. We conclude with an analysis of the evaluation data and recommendations for how future research might avoid succumbing to similar attacks.

Instantiations of these capabilities in the form of deep neural networks (DNNs) have generated substantial interest in recent years. Neural binary analyses (NBAs) are seemingly well-matched to the problem domain, where inference is necessary due to the lossy compilation process. Recent work has shown great promise for performing accurate disassembly [54, 76], function boundary detection [17, 54, 64, 76], and static binary similarity detection [22-24, 41, 42, 46, 72, 77, 78] that is simultaneously more efficient than deterministic methods.
Despite this promise, questions remain as to how resilient NBAs are in practice when confronted with the incredible diversity of binary code found in the wild as well as motivated adversaries seeking to actively evade or confuse detection techniques that make use of binary analysis. Adversarial attacks against DNNs have been intensely investigated in other problem domains [13, 53], but most of these attacks have been developed for continuous domains (e.g., images), whereas NBAs operate in a discrete domain. Furthermore, due to the issue of problem space mapping [55], one must develop specific black-box attacks against NBAs.
Recent work has criticized the size and scope of data used to train and evaluate NBA techniques published to date. For instance, Kim et al. [36] studied NBAs that perform static similarity detection using BinKit, a large dataset of programs compiled with a variety of toolchains and compiler options. Using this dataset and a simple baseline similarity detector called TikNib, they show that NBAs do not necessarily outperform simpler, explainable methods such as the one implemented by TikNib. Marcelli et al. [45] performed a similar study, also focused on static similarity detection NBAs, and show that published results do not necessarily hold when the systems-under-test are trained and evaluated on larger, more representative datasets. Other recent work has demonstrated that DNNs used for static malware detection on binary programs are prone to adversarial attacks [43], though this work relies on traditional adversarial ML techniques, using either white-box gradient descent or black-box hill climbing to find evading transformations.
In this paper, we consider the heretofore unexplored question of NBA attack resilience in the context of function boundary detection. We focus on two exemplars of the state of the art occupying two representative points in the design space: XDA [54], which directly applies the well-known Transformer model architecture [68] and is intended to be robust to compiler optimization level [54, §4], and DeepDi [76], which employs a relational graph convolutional network and is explicitly advertised as intended for binary analysis in adversarial contexts such as malware analysis (e.g., as part of a malware analysis pipeline after dynamic analysis has been used to unpack a sample). Observing that current systems are largely based on DNN components developed to solve syntactic problems from other domains, we conjecture that these systems can be evaded using syntactic mutation. Building on this insight, we define a simple, reproducible black-box methodology to identify misclassifying inputs to these state-of-the-art function boundary detection NBAs at scale. Then, we demonstrate how an attacker can systematically leverage these misclassifications to either evade function detection or overwhelm a downstream analysis with false detections via at-will injection of false negatives and false positives.
Our analysis of the data produced by these techniques leads us to several conclusions.
(1) Sophisticated searches for adversarial examples using gradient descent are not required to significantly degrade the accuracy of NBA-based function boundary detection systems.
(2) Function boundary detection systems that build on embeddings and model architectures intended for solving syntactic problems should be viewed in a similar light as syntactic approaches for attack detection, such as first-generation antivirus and signature-based intrusion detection; that is, with healthy skepticism. This likely holds for other binary analysis tasks as well.
(3) It is critical that future work is evaluated on large, representative, and openly available datasets that include a range of compiler configurations as well as adversarial examples; building on existing foundations [36, 45] or this work would be a good starting point. Otherwise, it is difficult to extrapolate published evaluation results to actual operational performance.
We note that despite these conclusions, we do not intend to completely dismiss the promise of neural binary analyses.We discuss potential avenues for future research to mitigate the attacks found using our methodology in §6.
In summary, the contributions of this paper are the following.
(1) We propose a simple, reproducible black-box methodology for evaluating the resilience of function boundary detection NBAs to attacks at scale.
(2) We demonstrate the susceptibility of the current state of the art, represented by XDA [54] and DeepDi [76], to attacks that feed overwhelming numbers of false negatives and false positives to downstream binary analyses.
(3) We discuss and synthesize conclusions from an analysis of the evaluation data, and suggest several paths forward to mitigate similar attacks against neural binary analysis.
The source code and datasets are available at https://osf.io/bcdxq/.

PROBLEM STATEMENT AND MOTIVATION

2.1 Binary Analysis
The term "binary analysis" encompasses a wide range of techniques that all attempt to extract information from programs that have been compiled to a native instruction set architecture (ISA). These techniques range from fundamental analyses such as disassembly [52, 54, 76] and function boundary detection [9, 17, 54, 76] to downstream tasks that build on prior analyses such as static similarity detection [24, 32, 42, 77, 78], type recovery [40], malware detection [3, 33, 57, 70], and full decompilation [25, 62, 73]. Designing accurate and efficient binary analyses is substantially more difficult than for source code due to the inherently lossy compilation process. That is, compiler toolchains discard much of the higher-level abstractions present in source code when lowering to an ISA. Thus, binary analyses must operate with incomplete information and are virtually always unsound. Compounding this difficulty is that binary analyses are often, though not always, performed under a strong threat model [59] in which active adversaries attempt to evade or otherwise confuse those analyses.
While binary analyses have traditionally employed deterministic methods, the lack of source code naturally suggests inference methods as a promising approach for improving both accuracy and performance. In that light, it should come as no surprise that deep neural networks (DNNs) have come to the fore as a basis for binary analysis research. Table 1 presents an overview of recent work in this vein, to which we refer hereinafter as neural binary analyses or NBAs.
Each entry in Table 1 lists the input, embedding, model architecture, and the binary analyses implemented. An embedding is simply a procedure for mapping input data to a representation on which the model performs training and inference. Common choices of embeddings are one-hot encoding of byte sequences, or text embeddings such as word2vec [47] applied to the token stream produced by a disassembler. The model architecture, on the other hand, is the neural network proper; that is, the set of layers, interconnections, and weights responsible for inference. It is common for NBAs to repurpose model architectures developed for natural language or image analysis tasks; examples include recurrent neural networks (RNNs), convolutional neural networks (CNNs), and the Transformer architecture [68].
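The one-hot byte embedding mentioned above can be sketched in a few lines. This is an illustrative reconstruction, not code from any of the cited systems: each byte maps to a 256-dimensional vector with a single 1 at the position given by the byte's value.

```python
# Illustrative sketch of a one-hot byte embedding, as used by byte-level NBAs
# such as BiRNN and XDA: each byte becomes a 256-dimensional vector whose
# single 1 sits at the index equal to the byte's value.
def one_hot_bytes(code: bytes) -> list[list[int]]:
    vectors = []
    for b in code:
        v = [0] * 256
        v[b] = 1
        vectors.append(v)
    return vectors

# A two-byte x86-64 prologue fragment: push rbp (0x55), then the REX prefix
# 0x48 that begins mov rbp, rsp.
emb = one_hot_bytes(b"\x55\x48")
assert emb[0][0x55] == 1 and sum(emb[0]) == 1
assert emb[1][0x48] == 1
```

The embedding carries no semantic information about the instructions; it only records which byte values occur at which positions, which is central to the syntactic-focus argument made later in the paper.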
Motivation. Recent prior work has studied the accuracy of NBAs for static similarity detection [36, 45] and malware detection [43]. However, to the best of our knowledge, the question of NBA attack resilience has not been studied in the context of function boundary detection.

Function Boundary Detection
Function boundary detection is a fundamental binary analysis that typically occurs directly after, or even in tandem with, disassembly [52]. Identifying functions is crucial for many downstream tasks. For instance, most static similarity algorithms consider pairs of functions when computing distances. Functions are also important inputs to recursive descent disassemblers as starting points for recursive disassembly or as possible callees of indirect call sites.
If function detection is performed as part of a manual process (e.g., interactive reverse engineering provided by tools like IDA Pro [29], Ghidra [50], or Binary Ninja [69]), excessive false positives could lead to user fatigue and, in turn, an unusable tool [8]. False negatives, on the other hand, are perhaps even more concerning, since failing to identify functions could directly lead to evasion opportunities for attackers that aspire to elude detection.
Function boundary misclassifications can also have a large impact on the accuracy and utility of an automated analysis pipeline.For instance, it is common to combine successive rounds of static and dynamic analysis to, e.g., first unpack a malware sample in a sandbox so that an efficient static analysis can be performed on an unobfuscated dropped or in-memory binary [75].False negative function detections in this scenario could again lead to detection "blind spots," while false positives could degrade the efficiency or accuracy of downstream analyses.
More formally, we can think of a function boundary detection NBA as a procedure that learns a mapping from bytes or instructions in a binary, depending on the embedding, to one of three labels: S for function entry points, E for function exit points, and N for all other points. Let B be the set of binary inputs and N be the set of possible byte or instruction indices in each binary. We can then denote this mapping as F : B × N → {S, E, N}.

Early work in the NBA space heavily borrowed from DNNs built to tackle natural language processing (NLP) problems. The first system to adopt this approach was BiRNN [64], which treated the byte sequences comprising binaries as tokens in a language. BiRNN converts each input byte into an R^256 vector using a one-hot encoding, where a byte's value is indicated by the position of the single 1 in the vector. Encoded bytes are then fed to a bi-directional RNN, where the use of two RNNs allows for prediction of a byte label using both preceding and succeeding bytes as context. XDA [54] built upon BiRNN's approach to function boundary detection by adapting another powerful model architecture from the NLP literature: Transformer [68]. Transformer pioneered the concept of self-attention, where an attention layer allows the model to process sequential data out of order. This allows Transformer-based models to flexibly learn and infer meaning from context as well as parallelize better than prior architectures like RNN, LSTM, and GRU. Transformer-based models such as BERT [21] (Bidirectional Encoder Representations from Transformers) and the GPT family [14] (Generative Pre-Trained Transformer) represent the state of the art in NLP model architectures.
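The S/E/N labeling just described can be made concrete with a small sketch. This is a hedged reconstruction: it assumes function boundaries are available as (start, end) byte offsets, where end denotes the offset of the function's last byte.

```python
# Hypothetical sketch of the labeling F : (binary, index) -> {S, E, N},
# assuming each function is described by a (start, end) pair of byte offsets
# (end = offset of the function's last byte).
def label(index: int, functions: list[tuple[int, int]]) -> str:
    for start, end in functions:
        if index == start:
            return "S"   # function entry point
        if index == end:
            return "E"   # function exit point
    return "N"           # neither

funcs = [(0, 9), (16, 31)]
assert label(0, funcs) == "S"
assert label(9, funcs) == "E"
assert label(12, funcs) == "N"
```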
XDA's implementation [35] directly applies a popular implementation of BERT called RoBERTa (provided by Facebook's Fairseq [51] library) to the binary disassembly and function boundary detection tasks. Binaries are processed in 512-byte chunks, and a one-hot encoding is used to produce R^256 vectors to be processed by the network. In addition to byte values, the input vocabulary defines five additional tokens representing padding, start-of-sequence, end-of-sequence, unknown, and mask (not all are used by XDA). In the first of two phases, the model is pre-trained using masked language modeling (MLM), which essentially teaches the model to predict byte values given surrounding context. The resulting model is then fine-tuned in the second phase to transfer the knowledge learned in the first phase to a particular binary analysis task such as function boundary detection.
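The MLM pre-training objective can be sketched as follows. This is a simplified, assumption-laden toy (the masking rate, token id, and chunk handling are illustrative, not XDA's exact values): a fraction of positions in a byte chunk is replaced by a MASK token, and the model must recover the original byte values at those positions.

```python
# Toy sketch of masked language modeling over byte tokens (details are
# assumptions, not XDA's actual hyperparameters): mask random positions in a
# chunk and record the original bytes the model must predict.
import random

MASK = 256  # a token id outside the 0..255 byte range, per XDA's extra tokens

def mask_chunk(chunk: bytes, rate: float = 0.15, seed: int = 0):
    rng = random.Random(seed)
    tokens = list(chunk)
    targets = {}  # position -> original byte value to be predicted
    for i in range(len(tokens)):
        if rng.random() < rate:
            targets[i] = tokens[i]
            tokens[i] = MASK
    return tokens, targets

tokens, targets = mask_chunk(bytes(range(32)))
assert all(tokens[i] == MASK for i in targets)
assert all(0 <= v < 256 for v in targets.values())
```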
DeepDi [76] is a state-of-the-art example of an NBA-based disassembly and function boundary detection system (though DeepDi only recovers function entry points, and so is more accurately called a function start detection system). While DeepDi follows in the tradition of BiRNN and XDA by building upon existing model architectures, in this case R-GCN [61] (Relational Graph Convolutional Network), it improves on prior work in several ways. First, it eschews the use of deep learning altogether for the initial disassembly step, choosing instead to rely on superset disassembly [10] to recover all possible instructions contained in an input binary. The instruction superset, in the form of 4-tuples of ⟨opcode, mod_rm, scale_index, rex_prefix⟩, is then converted into a fixed-dimension embedding using a learned embedding layer. Each embedding is concatenated with the following two instruction embeddings, and the result is fed to an RNN to arrive at a final instruction representation. These representations then serve as input to the R-GCN, which models various relationships between instructions using an Instruction Flow Graph (IFG) in order to weed out invalid instructions from the superset and retain only the "true" disassembly.
To identify function entry points, DeepDi first collects a set of candidate entry points by applying a set of heuristics to instructions identified as valid from the superset. Each candidate instruction is packed with the three preceding and three succeeding instructions and then fed to the entry point recovery model. This model consists of an embedding layer, a GRU layer, and a two-layer perceptron classifier. The authors of DeepDi note that while the model achieved an average F1 score of 98.6% on the function start detection task in their evaluation, their heuristics-based approach "will miss tail jumps and functions with unseen prologues" [76, p. 7].
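The context packing described above (a candidate with its three preceding and three succeeding instructions) amounts to a simple window extraction. The sketch below is an illustrative reconstruction, not DeepDi's actual code:

```python
# Sketch of DeepDi-style context packing: group a candidate entry-point
# instruction with its k preceding and k succeeding instructions (k = 3 in the
# paper's description), truncating at the ends of the instruction list.
def context_window(instrs: list, idx: int, k: int = 3) -> list:
    lo = max(0, idx - k)
    return instrs[lo: idx + k + 1]

instrs = list("abcdefghij")
assert context_window(instrs, 5) == list("cdefghi")  # 3 before + self + 3 after
assert context_window(instrs, 0) == list("abcd")     # truncated at the start
```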

Semantics, or Merely Syntax?
While systems like XDA make repeated reference to "learning semantics," these representations do not encode the semantic outcome of the input when executed on a system. We conjecture that this limitation stems from training only on disassembled instructions or raw sequences of bytes extracted from binaries, as correspondence is limited to patterns of bytes or textual tokens presented during training. Absent semantic meaning, code isomorphisms that appear drastically different syntactically might well not be detected as semantically equivalent.
To illustrate, Listing 1 presents a naïve addition function and its compilation to x86_64 assembly using two commonly available optimization levels: O0 and O3. The resulting code, while semantically equivalent, has radically different syntactic forms, and systems relying only on detecting sequences of bytes or instructions would fail to identify the optimized version if they have not encountered a similar example during training. While an argument can be made for generating comprehensive datasets that contain both versions (and indeed virtually all current methods do try to include these common compiler optimizations), we argue that such an approach cannot scale to include every possible combination of all available compiler flags. In essence, this places a hard constraint on what is possible for NBA models to learn in the absence of semantic information.
In the remainder of this paper, we build upon this insight to demonstrate systematic attacks against neural binary analyses for function boundary detection.

ATTACKING NEURAL FUNCTION BOUNDARY DETECTION
In our evaluation of neural function boundary detection, we focus on black-box attacks. These attacks are so-named since no information about the model-under-test (MUT), such as its internal weights or structure, is assumed. Black-box attacks are advantageous because they do not require a deep understanding of MUTs; instead, only the ability to issue queries and observe results is needed. However, as the search process is unguided by model information, black-box techniques can fail to discover latent vulnerabilities that a white-box adversarial search such as projected gradient descent (PGD) [44] might otherwise uncover. In this sense, the results of our methodology should be considered a lower bound on the vulnerability of MUTs to which it is applied.
We define a general black-box vulnerability search procedure with the goal of uncovering and exploiting function boundary misclassifications when performing inference on binary programs. The search proceeds in several phases: (i) input generation, (ii) ground truth generation, (iii) training and inference, and (iv) misclassification analysis.
Input Generation. In the first phase, we gather a corpus of benchmark program source code S. Each benchmark is compiled by an array of compiler toolchains C, each of which is equipped with an attack configuration A consisting of a set of compiler flags and code transformations. Given k = |C| compilers, we obtain a benchmark binary corpus B consisting of k separate compilations of S with each toolchain and attack configuration tuple ⟨c, a⟩.
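The input generation phase is a Cartesian product of benchmarks with ⟨toolchain, attack configuration⟩ tuples. The sketch below is illustrative only; the benchmark, toolchain, and flag names are placeholders, not the paper's actual configuration set.

```python
# Minimal sketch of input generation: compile every benchmark source under
# every <toolchain, attack configuration> tuple. All names are illustrative.
from itertools import product

benchmarks = ["coreutils", "binutils"]
toolchains = ["gcc-8", "gcc-11", "clang-13"]
attack_configs = ["-fstack-protector-all", "-fcf-protection=full"]

corpus = [(src, cc, cfg)
          for src, (cc, cfg) in product(benchmarks,
                                        product(toolchains, attack_configs))]
# One binary per benchmark per <toolchain, configuration> pair.
assert len(corpus) == len(benchmarks) * len(toolchains) * len(attack_configs)
```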
Ground Truth Generation. Each configuration ensures that debugging information is generated while simultaneously preventing compiled binaries from being stripped of symbol information. Thus, we can post-process each binary and use these sources of information to construct a ground truth mapping G of the form (1); that is, a function that labels each byte of code in each binary as to whether it is a function start, a function end, or neither.
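Ground truth generation from unstripped symbols can be sketched as follows. This is a hedged reconstruction assuming function symbols are available as hypothetical (address, size) pairs; real binaries would require an ELF/DWARF parser.

```python
# Hedged sketch of ground-truth labeling: given (address, size) entries for
# function symbols (a simplifying assumption), label each code byte as a
# function start (S), function end (E), or neither (N).
def ground_truth(symbols: list[tuple[int, int]], code_size: int) -> list[str]:
    labels = ["N"] * code_size
    for addr, size in symbols:
        labels[addr] = "S"              # first byte of the function
        labels[addr + size - 1] = "E"   # last byte of the function
    return labels

gt = ground_truth([(0, 4), (8, 4)], 16)
assert gt[0] == "S" and gt[3] == "E" and gt[5] == "N" and gt[8] == "S"
```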
MUT Training and Inference. In parallel, we split B into training and inference sets. The MUTs M are individually trained and evaluated on these sets. The result for each MUT m ∈ M is an inferred mapping F_m. Since we extracted a ground truth labeling G in the previous phase, we can directly compare F_m and G to identify misclassifications in the form of false positives FP_m and false negatives FN_m, where FP_m contains the indices that F_m labels S or E but G labels N, and FN_m contains the indices that G labels S or E but F_m labels N.

Misclassification Analysis. In the last phase, for each MUT we process its misclassification sets FP_m and FN_m to identify attack inputs that can reliably produce function boundary misclassifications in arbitrary binary programs. To do so, we rank-order misclassifications for each model from highest incidence to lowest. The ranked attack inputs then serve as seeds for an adversarial search, where they are each injected in turn into targeted functions of B to produce a mutated corpus B′. A separate attack validation round is then carried out by having the MUT m perform inference on B′ to confirm that the intended misclassifications are replicated in the targeted functions.
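The comparison and ranking steps above can be sketched concretely. This is an illustrative reconstruction (function names and the 4-byte seed width are assumptions, not the paper's parameters): collect the false positive and false negative indices, then rank the byte sequences observed at misclassified offsets by how often they recur.

```python
# Sketch of misclassification analysis: compare inferred labels against ground
# truth to collect false positives/negatives, then rank the byte sequences at
# misclassified offsets by incidence to produce attack seeds.
from collections import Counter

def misclassifications(predicted: list[str], truth: list[str]):
    fp = [i for i, (p, t) in enumerate(zip(predicted, truth))
          if p in ("S", "E") and t == "N"]   # spurious boundary reported
    fn = [i for i, (p, t) in enumerate(zip(predicted, truth))
          if p == "N" and t in ("S", "E")]   # real boundary missed
    return fp, fn

def rank_seeds(code: bytes, offsets: list[int], width: int = 4) -> list[bytes]:
    # width is an illustrative choice for the seed sequence length
    counts = Counter(code[i:i + width] for i in offsets)
    return [seq for seq, _ in counts.most_common()]

fp, fn = misclassifications(list("SNNNEN"), list("SNNNNE"))
assert fp == [4] and fn == [5]
```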

Attack Techniques and Threat Models
The vulnerability search procedure relies upon collecting a set of attack techniques in the form of compiler flags and code transformations. Each of these techniques specifically targets function prologues and epilogues. A function prologue is responsible for (i) saving the contents of any registers that the function uses and that it is responsible for preserving on behalf of its caller under a given calling convention; and (ii) allocating space on the current thread's stack for any local variables the function uses. An epilogue, on the other hand, is responsible for reversing the effects of the prologue as well as (optionally) returning a value to the caller. Our attack techniques modify function prologues and epilogues because neural function boundary detection models focus on bytes or instructions that comprise (or are adjacent to) these code regions.
However, while each attack technique targets the same code regions, not all techniques are created equal. Some attacks are inadvertent; that is, an unintended deficiency of the data representation, network architecture, or training set causes the model to misclassify a benign input during inference. Other attacks, however, are inherently adversarial: an adversary intentionally transforms their input to actively attack the MUT. In this case, the only restriction is that the transformation must preserve the intended functionality of the attack.
Inadvertent Threat Model. Inadvertent attacks represent a weak threat model, in that there is no active adversary and, instead, the MUT or training set is suboptimal with respect to a "naturally occurring" binary that has been emitted by a standard compiler toolchain on benign code. While this threat model is weaker, Ren et al. show that in practice "adversaries explore non-default compiler settings to amplify malware differences" [59].
Adversarial Threat Model. In this threat model, no assumptions are made about how the binary was produced. Binaries can be obfuscated or encrypted, and in such cases must be unpacked in a malware sandbox prior to the use of a static analysis on an unobfuscated dropped or in-memory executable image, as is common industry practice [75].
We classify the attack techniques employed by the search procedure according to whether they are inadvertent or adversarial. The criterion we use for this classification is whether or not the code resulting from applying a technique can be emitted by an unmodified compiler toolchain given a legal configuration.
An overview of the vulnerability search procedure is shown in Figure 1.In the following, we describe each inadvertent and adversarial attack technique.

Inadvertent Attacks
Inadvertent attacks result from misclassified binary code emitted by a benign compiler toolchain under any possible configuration. However, systematically exploring the entire space of possible compiler configurations in terms of combinations of compiler flags is daunting, to put it lightly. To illustrate, clang v13.1.6 (arm64-apple-darwin21.4.0) advertises 1013 distinct command-line options in its default help message when invoked using clang -help. Thus, if we denote the set of possible options as O, a rough estimate of the number of possible combinations is |P(O)| = 2^1013. An exhaustive exploration of this space is clearly intractable in practice, and so we use domain knowledge to select a small number of compiler options that we conjecture will have an effect on function boundary prediction. We describe each of these classes of options below.

Stack Protector. "Stack protector" is gcc and clang's modern name for canary/cookie-based anti-stack-smashing defenses [18]. This defense injects an unpredictable guard value onto the stack as part of the function prologue. In an epilogue, the injected copy of the guard is compared to a global copy. If the values do not match, then a stack smashing attack is assumed to have occurred and the program is terminated before the attacker can gain control of execution via, e.g., a corrupted return address. Otherwise, execution continues as normal.
The defense relies upon several assumptions: for instance, that the guard value carries sufficient entropy to make guessing infeasible, that the guard value is not leaked to the adversary, that the global copy of the guard cannot be modified by the adversary, and that all stack-allocated data that can be leveraged to hijack code execution is protected by the stack guard. The defense can in fact take several forms depending on the compiler version and particular flags used, such as: whether all functions are protected by a guard or rather only those that allocate a buffer on the stack; whether some or all stack variables are protected by the guard, which can involve variable reordering; and the offset of the global guard copy.
Our inclusion of these compiler options is based on the observation that stack guard injection and verification requires modifications to function prologues and epilogues. NBAs that rely on particular byte or instruction sequences comprising prologues and epilogues for function boundary detection might thus be confused by these modifications. Listing 4 (§A) illustrates a typical example.
Stack Clash Protection. Stack clash vulnerabilities arise when an attacker is able to grow either the stack or another memory region such that the two memory segments overlap [56]. While OS kernels such as Linux can inject a guard page to separate the stack from other regions, prior research has shown that guard pages are nevertheless circumventable. Thus, modern compilers include options to enable stack clash mitigations in emitted code. The most popular form of this mitigation centers on breaking large stack allocations into page-sized chunks, and either implicitly or explicitly probing each chunk to ensure that it has not clashed with an existing memory allocation [38].
The impetus for our inclusion of stack clash protection compiler options as an attack technique is the modified allocation pattern for large stack buffers in function prologues and the requirement for explicit probe injection if the compiler deems it necessary. Listing 5 (§A) illustrates one form of these modifications to function prologues (epilogues are not affected in this case).
Control Flow Integrity. Control flow integrity (CFI) is a general software hardening approach based on the principle that code must execute control transfers if and only if those transfers were intended by the programmer [1]. Forward-edge CFI in particular has become a standard feature of production compilers like clang and gcc [67], efficiently protecting indirect calls and jumps through computed pointers of various forms. Architectural support for a weak form of return-edge CFI has also become available in recent x86/x86-64 processor generations in the form of Intel CET [63]. While forward-edge CFI checks such as IFCC [67] are typically inserted at call sites, Intel CET enforcement depends on instrumenting valid indirect branch targets with special instructions (endbr32, endbr64) as well as ensuring that these instructions do not accidentally appear anywhere else in an executable memory region. Since this instrumentation can result in modified function prologues, we include CFI enforcement options as a separate attack technique. Listing 6 (§A) illustrates function prologue modifications resulting from Intel CET and indirect branch tracking enforcement.
SafeStack. SafeStack is another stack-based buffer overflow defense developed as part of code-pointer integrity (CPI) [37] that relies upon separating stacks into safe and unsafe stacks. Security-relevant data such as return addresses, register spills, and local variables are stored on the safe stack. Accesses to the safe stack are always checked via runtime instrumentation for safety. All other stack-allocated data is stored on the unsafe stack, ensuring that buffer overflows cannot corrupt any safe stack data.
Since the compiler must emit code to manipulate two stacks when enforcing SafeStack, this introduces modifications to both the prologue and epilogue of affected functions. Thus, we include SafeStack as a distinct attack technique. An example of SafeStack prologue and epilogue modifications is shown in Listing 7 (§A).
Function Alignment.Compilers provide a number of options to control the alignment of functions in memory.Aligning functions to particular address boundaries can be advantageous from Listing 2: Adversarial attack sequence injection example using compiler-emitted NOP sequences (additions in green).In the prologue, a relative jump is injected to bypass an instruction containing an attack sequence encoded as an immediate value.In the epilogue, an attack sequence is directly injected verbatim; it will not be executed due to the unconditional return at line 10.
a performance perspective for architectural reasons, and optimal alignment varies depending on the target architecture.On the other hand, compilers can also be instructed to eschew optional alignment constraints in favor of optimizing for size.In this case, functions will be tightly packed and not conform to an alignment scheme.The reason we include function alignment as an attack technique is two-fold.First, tightly packing functions will remove any interstitial padding between adjacent functions, effectively creating a large change in instruction bytes preceding function prologues and succeeding function epilogues.Second, varying the requested alignment will cause compilers to emit different sequences of padding instructions.This leads to a similar, albeit weaker, change in prologue and epilogue-adjacent instructions.An example of this phenomenon is shown in Listing 8 ( §A).

Adversarial Attacks
In addition to the inadvertent attack techniques we just described, we also separately consider adversarial attack techniques. Consistent with our two-tier threat model introduced in §3.1, adversarial attacks go beyond the inadvertent evasive or false positive-inducing inputs that can be emitted by common compiler toolchains and configurations. Instead, under this stronger threat model, an adversary can use arbitrary techniques to craft a binary that will induce misclassifications by a function boundary detection NBA.
The power to arbitrarily modify binaries does not by itself imply the ability to easily discover input byte and instruction sequences that produce misclassifications. However, we find that an unguided search over bounded byte sequences is wholly sufficient to quickly find adversarial inputs that produce significant numbers of false positive or false negative misclassifications in the state of the art.
In particular, we explore the simple technique of injecting arbitrary byte sequences into function epilogues for this purpose. In principle, one could use a binary rewriting framework [71] to perform the injections on arbitrary binaries in a functionality-preserving manner, e.g., one that makes the necessary modifications to account for the increased size of the code sections of the mutated binary. However, we take the comparatively simpler approach of recompiling the binary corpus with a compiler configuration that causes NOP sequences of a desired length to be emitted in all epilogues. This renders it straightforward to inject the necessary code to perform attack validation in a length-preserving manner via purely local modifications.
We employ and evaluate two forms of adversarial injection in terms of content: (i) injecting a relative jump over a mov instruction that loads a register with the attack sequence as an immediate value, and (ii) injecting the attack sequence as-is into a function epilogue after a return instruction. Due to the unconditional jump or return that prefaces each form of the injected attack sequence, there is no realistic possibility that the attack sequence will be executed in either form. Listing 2 presents an example of this technique in action. We note that while the injected code sequences could perhaps be identified as dead code and removed, the ability to do so reliably degrades quickly as more complicated instruction sequences are injected (up to the level of an opaque predicate). We revisit this point in §6; however, we do not believe it to be straightforward to identify and remove attack sequences injected by a determined adversary.
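The two injection forms can be sketched at the byte level as follows. This is a simplified mock-up assuming a 4-byte attack sequence and standard x86 encodings (0xEB for `jmp rel8`, 0xB8 for `mov eax, imm32`, 0xC3 for `ret`); it is not the exact encoder used by our framework.

```python
def jump_over_mov(payload: bytes) -> bytes:
    """Form (i): a 'jmp' over a 'mov eax, imm32' whose immediate embeds
    the 4-byte attack sequence, so the mov is never executed."""
    assert len(payload) == 4
    mov = b"\xb8" + payload                    # mov eax, imm32 (5 bytes)
    return b"\xeb" + bytes([len(mov)]) + mov   # jmp rel8 skipping the mov

def after_return(payload: bytes) -> bytes:
    """Form (ii): raw attack bytes placed directly after a 'ret' (0xC3),
    where control flow can never reach them."""
    return b"\xc3" + payload
```

In both forms the payload bytes are visible to a static model but unreachable at runtime, which is exactly what confuses a syntax-driven boundary detector.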

IMPLEMENTATION
The inadvertent attack search is implemented using an augmented version of BinKit [36]. This framework provides scripts to reproducibly build a number of independent compiler toolchains (i.e., several versions of gcc and clang) as well as to download and compile numerous open source software packages using a variety of configurations. We modified BinKit to support more compiler versions and configurations, and discuss the resulting experimental setup and data in §5.
Our adversarial attacks are implemented via a binary rewriting framework [2] that is in turn based upon open source code drawn from pyelftools [11] and Capstone [15]. The framework operates on all ISAs present in BinKit.
We consider the binary rewriting procedure safe since it simply overwrites a number of NOP instructions placed in function epilogues by the compiler via the -fpatchable-function-entry option. This preserves the existing binary layout in terms of addressing, and thus all jump and call targets remain valid. Additionally, all injected code is protected by jumps or pre-existing return instructions that guard its execution. Nevertheless, we manually spot-checked the rewriting procedure and ran existing test suites on modified binaries when available.
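The safety property described above, i.e., that only compiler-emitted NOP padding is ever overwritten, can be captured by a guard like the following sketch. The function name and the single-byte 0x90 NOP assumption are ours for illustration; a full implementation would also need to recognize multi-byte NOP encodings.

```python
def patch_epilogue(code: bytearray, offset: int, payload: bytes) -> None:
    """Overwrite compiler-emitted NOP padding (0x90) in place with an
    attack payload, refusing to touch anything else so that the binary
    layout, and thus all jump and call targets, is preserved."""
    window = code[offset:offset + len(payload)]
    if len(window) != len(payload) or any(b != 0x90 for b in window):
        raise ValueError("refusing to patch: target bytes are not NOP padding")
    code[offset:offset + len(payload)] = payload
```

Because the payload exactly replaces padding of the same length, the rewrite is purely local and no relocation or address fix-up is needed.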

EVALUATION
In this section, we present the results of our evaluation of two representative state-of-the-art neural binary analyses for function boundary detection.
Models Under Test. We selected the commercial standard IDA Pro v7.7 as a baseline deterministic disassembler. As exemplars of state-of-the-art neural binary analyses for function boundary detection, we selected XDA [54] and DeepDi [76], both of which we previously introduced in §2. XDA was selected for evaluation because its design is heavily inspired by the Transformer [68], and its implementation on top of Fairseq's implementation of BERT [21] reflects this. As such, it is a perfect example of an NLP-based approach to neural function boundary detection. DeepDi, on the other hand, was selected as an example of a function boundary detection system that incorporates some semantic information in the form of a graph model of instruction dependencies. Finally, both of these systems publish a public artifact for evaluation: source code in the case of XDA [35], and a binary distribution in the case of DeepDi [20]. We thank the authors of these systems for their commitment to open science.
Although the BinKit corpus includes a substantial combination of compiler versions, optimization levels, and specific flags, one cannot assume that compiler options are completely isolated for any particular binary or dataset. For example, one might assume that the NoInline dataset would not include code that had been compiled with the flag -fgnu89-inline, which causes inlining, or that a binary compiled at the O0 optimization level would not include code compiled at a different optimization level. Unfortunately, this is not the case due to the existence of compiler-generated code and code that is statically linked in from compiler support libraries. We found that binaries compiled with -fstack-protector-strong included code compiled with -fno-stack-protector, although the presence of the latter was dominated by the former. In some cases, such as the xorriso binary with 3000 functions, compiler support code is dominated by the software library code, and thus the presence of code compiled with different flags would have a minor impact on training and evaluation. On the other hand, a package like coreutils is composed of many small utilities, where the ratio of compiler support code to library code is much higher. We do not believe that this phenomenon has a significant impact on our results, but we do note that it is non-trivial to ensure uniform compiler configurations on absolutely all code in each dataset, and that we did not attempt to achieve this.
Metrics. We report precision, recall, and the balanced F-score (F1 score) with the standard definitions. In Table 3, we report the mean and standard deviation of the precision, recall, and F1 score as a statistical summary calculated per binary in each dataset. We choose to report the mean and standard deviation because performance within a particular dataset can exhibit high variance, as we discuss later.
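For reference, the per-binary metrics and the dataset-level summary reduce to the standard formulas; a minimal sketch follows (the helper names are ours, not part of the evaluation harness).

```python
import statistics

def prf1(tp, fp, fn):
    """Precision, recall, and balanced F1 from per-binary counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def summarize(per_binary):
    """Dataset summary as reported in Table 3: (mean, standard deviation)
    for each of precision, recall, and F1, computed per binary."""
    cols = list(zip(*(prf1(*counts) for counts in per_binary)))
    return [(statistics.mean(col), statistics.stdev(col)) for col in cols]
```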
Computational Resources. All experiments were performed on a dedicated server with a 64 core AMD Ryzen 3995WX CPU @ 4.3GHz, three RTX A6000 GPUs, 1TB memory, and a 4TB SSD.

RQ1: Inadvertent Attacks
In our first experiment, we subjected XDA and DeepDi to our augmented version of BinKit to evaluate their resilience to the full set of inadvertent evasions described in §3.2. We additionally include IDA Pro in this experiment as a baseline representing the state of the art in deterministic function boundary detection. Table 3 presents summary statistics in terms of precision, recall, and F1 score, along with the standard deviation of each metric, broken out by the individual datasets comprising BinKit.
From the data, IDA Pro consistently performs best with respect to precision, with little variance. XDA, however, dominates with respect to recall and F1 score. DeepDi produces F1 scores that are very close to those of XDA and takes the top spot for exactly one dataset, Obfuscate. There is also clearly some variance across all metrics. However, in this respect the summary statistics do not tell the full story.
Figure 2 presents a series of precision-recall plots for each system. Each point represents one binary, colored according to membership in each of BinKit's constituent datasets. In each plot, the optimal point is the upper-right, indicating perfect precision (all detections were true positives) and perfect recall (all functions were detected). Points towards the x-axis indicate lower precision and thus a higher proportion of false positives. Points towards the y-axis (left) indicate lower recall and thus a higher proportion of false negatives.
One can immediately observe a marked difference between the operating characteristics of the deterministic baseline represented by IDA Pro and the NBA systems. IDA Pro consistently achieves near-perfect precision, i.e., when it detects a function, it is highly likely to be a true positive. However, in some cases it is prone to missing functions entirely. In the worst case, IDA Pro dips below 0.4 recall.
Both XDA and DeepDi, however, exhibit much stronger variance in both precision and recall. XDA in particular exhibits apparent clusters, i.e., precision-recall performance that correlates with individual datasets. XDA performs particularly poorly on the CFI dataset, colored in red. Other datasets are biased towards either precision or recall failures. For instance, Obfuscate, colored in purple, tends towards lower recall and false negatives. SizeOpt failures, in contrast, are biased towards lower precision and false positives.
In comparative terms, XDA performs slightly better across the board than DeepDi, and both exhibit better recall than IDA Pro on this data. However, the scatterplot makes it clear that there is a sizable number of outliers in both precision and recall. Thus, we investigated a sample of these outliers.
One such outlier point is shown in Listing 3. On the gcal-4.1 benchmark, DeepDi issued >3000 distinct false positives from multiple occurrences of a single instruction. The instruction, highlighted in red, subtracts 8 bytes from the stack pointer. This is an operation that is often performed in a function prologue to allocate space for local variables on the stack. However, this particular example occurs when marshalling arguments for a call to fprintf in gcal's main function. The reason this occurs is that this particular call to fprintf, a variadic function, has more than six arguments. The SysV ABI dictates that the first six arguments are passed in registers, while any further arguments are passed on the stack. Stack arguments, however, must be aligned to a 16-byte boundary. This causes the compiler, which was configured to operate at O0 in this case, to directly adjust the stack pointer prior to pushing the seventh argument to fprintf. It appears that the DeepDi model we evaluated never observed this particular pattern in its training set.
One can argue that if the failures we observe are restricted to accidental outliers, then their overall impact should be low. Unfortunately, as we demonstrate next, these inadvertent misclassifications can be systematically exploited by an adversary to build effective adversarial attacks.

RQ2: Adversarial Attacks
To evaluate adversarial attack efficacy, we recompiled the Normal dataset with different optimization options (O0, O3, Os) and the -fpatchable-function-entry=4,4 flag, which inserts 4 NOP instructions after the original function epilogue. The effects of this on F1 score are presented in Figure 3.
Both XDA and DeepDi successfully handle the insertion of simple "NOP sleds", preserving high F1 scores. Unfortunately, when adversarial mutations are introduced following the methodology described in §3.3, both systems diverge significantly from their published accuracy. Interestingly, we observe that XDA is more resilient to epilogue mutation at the O0 optimization level than at O3 and Os. DeepDi's performance is degraded across the board, with median F1 scores well below 0.25 at all optimization levels. IDA Pro, however, is largely unaffected by epilogue mutations, as evidenced by the near-identical F1 distributions across both datasets.

RQ3: Systematic Attacks
Our results to this point highlight that both XDA and DeepDi are vulnerable to seemingly simple adversarial byte sequence injection, causing them to misclassify significant portions of the functions present, using the same sequence across all binaries with no attempt to adapt it to a given program. Furthermore, during our evaluation we unearthed several cases where the patterns used were particularly effective, leading to almost complete evasion. Specifically, XDA only managed to recover 36 out of the 145 functions in one such case. We believe that these cases are due to these particular adversarial patterns being especially effective on the characteristic layout of those programs. Our adversarial attacks could potentially be improved further by tailoring them to particular binaries, paving the way for a novel, insidious way to attack NBAs relying only on static information. As is evident in the DeepDi case, such adversarially mutated binaries would be virtually invisible to detectors that rely on a vulnerable NBA as part of their analysis pipeline. While the targeted attacks we speculate about here are beyond the scope of this work, we believe that this is a promising line of inquiry and plan to explore it in future work.

RQ4: Expanded Training Sets
To investigate whether inadvertent attacks can be mitigated with additional training, we next conducted a step-wise experiment with XDA. For the inadvertent samples, we chose to evaluate XDA with the CFI dataset, as it was the most difficult dataset to classify for all three systems under evaluation. Starting with a very limited subset of the ASE18 dataset, we trained XDA with increasingly more diversity in the number of compilers, compiler versions, and compiler options. The results are shown in Table 4. With only one version of GCC and four optimization levels in the training data, XDA achieved a reasonable F1 score of 0.855 on the CFI dataset. Adding four versions of Clang and GCC yielded a modest improvement in F1 score (0.865). Notably, adding Clang-compiled binaries to the training set reduced XDA's performance.
We then expanded the original ASE18 dataset by including newer compilers, namely two versions of Clang and GCC, which increased the F1 score to 0.923. Finally, by adding the Os optimization level, XDA achieved a score of 0.924, which is better than both DeepDi and IDA Pro. This demonstrates that XDA's performance can in fact be improved by expanding the training dataset, which is expected, but also that XDA is quite sensitive to the compiler versions and options present in the training data.

RQ5: Adversarial Training
In the final experiment, we evaluate whether MUTs can be made resilient to the adversarial attacks we describe by adopting adversarial training. In order to train XDA on these crafted attacks, we created a new dataset based on the NOP dataset described in §5.3. In this dataset, we replaced each 4-byte NOP epilogue with a randomly chosen evasion pattern that is also a valid 4-byte x86 instruction sequence. We then fine-tuned XDA on this expanded dataset and evaluated it on both the CFI and Evade-ep4 datasets. With the new model, XDA's performance on the Evade-ep4 dataset improved from 0.198 to 0.938, a significant improvement. Unfortunately, XDA's performance on the CFI dataset degraded from 0.924 to 0.810. This suggests that while adversarial training can partially mitigate evasion, it comes at a significant cost in accuracy on benign samples.
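The dataset construction step can be sketched as a byte-level substitution pass. The pattern pool below is hypothetical (the real pool consists of heavy-hitter sequences found by our search), and matching a literal 4-byte NOP sled is a simplification of the actual epilogue locations recorded during compilation.

```python
import random

# Hypothetical pool of valid 4-byte x86 evasion patterns, e.g.,
# 'sub rsp, 8' and 'jmp +2; xor eax, eax'.
EVASION_PATTERNS = [b"\x48\x83\xec\x08", b"\xeb\x02\x31\xc0"]

def adversarialize(code: bytes, nop_sled=b"\x90" * 4, seed=0) -> bytes:
    """Build an adversarial-training variant of the NOP dataset by
    replacing each 4-byte NOP epilogue with a random evasion pattern."""
    rng = random.Random(seed)
    out = bytearray()
    i = 0
    while i < len(code):
        if code[i:i + 4] == nop_sled:
            out += rng.choice(EVASION_PATTERNS)  # length-preserving swap
            i += 4
        else:
            out.append(code[i])
            i += 1
    return bytes(out)
```

Because each substitution is length-preserving, the resulting binaries keep the same layout as the originals, matching the constraint discussed in §4.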
In addition, it is also unclear whether training on adversarial examples represents a trustworthy mitigation. To illustrate, we performed an additional round of adversarial attack search to demonstrate the inherent limitation of training against adversarial techniques. Repeating our 4-byte evasion search, we were able to reduce XDA's performance to 0.488 (SD 0.317) when trained on the Evade-ep4 dataset. Additionally, we studied two alternative attacks using a 3-byte and 8-byte NOP dataset, producing F1 scores of 0.427 and 0.430, respectively. Thus, while one would hope that training on adversarial examples would produce a model that is robust against many different evasion patterns, our experiments show that this is unlikely to be the case, as we were able to degrade XDA's performance again without significant effort.

DISCUSSION
Black-box attacks are powerful enough. As is hopefully clear from our evaluation, black-box attacks are sufficiently powerful to discover numerous false positive- and false negative-inducing inputs to current generation function boundary detection NBAs. Sophisticated white-box searches for adversarial examples that rely on gradient descent might well find more attacks. However, it is unclear how one might adapt existing searches while preserving the functionality of the mutated binary due to the discrete problem space. Nevertheless, this is an interesting direction to explore.
Inadvertent attacks break pure NLP-based systems. As should also be clear from the evaluation, inadvertent attacks significantly degrade function boundary detection approaches that directly reuse NLP embeddings and models, as XDA does. Another way to view this finding is that such approaches do not generalize well to examples that are not observed during training. In retrospect, this naturally follows from our conjecture that syntactic representations are not a sound basis for binary analysis, where semantics is virtually always what actually matters. One could argue that simply including misclassified examples in the training set is sufficient mitigation, and there is likely some truth to that. However, in our opinion a realistic counterargument is that anticipating and training on a sufficiently large permutation of compilers, compiler versions, and compiler configurations is combinatorially difficult. To make matters worse, that mitigation does not take adversarial attacks into account.
Domain-specific embeddings and graph models are a marginal improvement. The evaluation shows that DeepDi's domain-specific embedding and use of an R-GCN to model instruction dependencies improve its resilience to inadvertent attacks. This is clear evidence that incorporating even a small amount of the latent semantic information present in an instruction stream has utility. However, this improvement is tempered by DeepDi's performance against adversarial examples, motivating our next observation.
Focus on semantics instead of syntax. The overarching conclusion we draw from the evaluation is that syntactic representations are unlikely to be a reliable basis for binary analyses. In a way, this is unsurprising, since syntactic approaches to attack detection, such as signature-based IDS and first-generation anti-virus based on pattern matching against byte sequences, were criticized for similar deficiencies long ago. While these techniques can of course be useful, they cannot be relied upon in isolation. Instead, mirroring attack detection's move from static pattern matching to dynamic behavioral analysis more than a decade ago, we argue that future work in this space should emphasize semantics over syntax to avoid similar pitfalls.
Evaluation quality is important. In tandem with the semantics question, we believe it is crucial that the research community hews to a standard of evaluation on large, representative, public datasets. This data should include a range of programs with varying functionality, as well as different compilers, compiler versions, and compiler configurations. As shown in our experiments, testing on a comprehensive dataset such as BinKit [36] rather than the smaller, less representative datasets used in the original papers can help identify areas of improvement for the underlying models, such as a lacking understanding of semantic isomorphisms. Finally, we believe that these benchmark corpora should include adversarial examples generated using techniques such as those described herein to directly test whether future work is susceptible to similar attacks. This inclusion would serve both to increase robustness in possible security-related use cases and to help models learn patterns of adversarial perturbation that exploit syntactic versus semantic model understanding. A substantial bonus of following such a standard would be easier reproducibility and comparative evaluation.
Detecting adversarial code is not easy. Finally, we readily acknowledge that the adversarial code we inject as part of our methodology and evaluation is likely to be easy to detect and strip before performing classification. However, we believe that focusing on this is misguided. Code obfuscation is well within the threat model of many contexts in which binary analyses such as function boundary detection operate. In that light, it is reasonable to suspect that if an adversary wished to do so, they could easily obfuscate the fact that the injected code will never be executed by relying on computed control transfers and opaque predicates. Indeed, if a defender were able to perfectly identify dead code, then a large part of the debloating problem would be solved, which, to our knowledge, is not the case. Instead, as with so many other problems in this space, detecting and removing adversarial code runs up against Rice's Theorem [60]. Thus, we believe it is safe to conclude that this is not likely to be a fruitful research direction.

RELATED WORK
Neural binary analysis. Binary analysis is a long-studied and expansive research area. Disassembly is a fundamental task that traditionally has been solved using deterministic algorithms that can be broadly classified as either linear disassembly (provided by tools like objdump from GNU binutils) or recursive descent disassembly (provided by tools such as IDA Pro [29], Ghidra [50], or Binary Ninja [69]). These tools typically also incorporate algorithms for function boundary detection using some combination of symbol table information, debug information, and pattern-based heuristics. Work such as ByteWeight [9] specifically investigated learning-based approaches to function boundary detection. Other common binary analysis tasks include measuring similarity between snippets of binary code [32], recovering source code types [40], and decompilation [62,73]. In recent years, applying deep learning techniques to binary analysis problems has become a popular topic of study due to the success of DNNs in solving image and text processing tasks, among others. Shin et al. [64] were the first to apply a DNN to a binary analysis problem; in this case, detecting function boundaries using a bi-directional recurrent neural network (BiRNN). The strategy of repurposing embeddings and model architectures originally developed to solve NLP or image processing problems became de rigueur in a way. Numerous NBAs for disassembly [54,76], function boundary detection [17,54,76], value set analysis [31,42], static code similarity [22,41,42,46,72,74,77,78], decompilation [25], and malware analysis [3,33,70] directly use embeddings (e.g., word2vec [47], PV-DM [39]) or models (e.g., RNN, CNN, Transformer [68], BERT [21]) developed for the NLP or image problem domains. One of the conclusions we draw in this paper is that while it is tempting to build on techniques that have been successful in other areas, binary analysis is a strikingly different research area with a different threat model and much stronger accuracy requirements for downstream tasks (see the discussion in §6). For NBAs to be resilient against adversaries that seek to evade or confuse binary analyses, choices of embeddings and model architectures should reflect these requirements.
We are not the first to independently evaluate NBA systems for other tasks. Kim et al. [36] studied NBAs that perform static similarity detection using BinKit, a large dataset of programs compiled with a variety of toolchains and compiler options; we build on BinKit to carry out our own evaluation. Using this dataset and a simple baseline similarity detector called TikNib, they show that NBAs do not necessarily outperform simpler, explainable methods such as the one implemented by TikNib. Marcelli et al. [45] performed a similar study, also focused on static similarity detection NBAs, and show that published results do not necessarily hold when the systems under test are trained and evaluated on larger, more representative datasets. Finally, Lucas et al. showed that DNNs used for static malware detection on binary programs are prone to adversarial attacks [43]. This work contrasts with our own not only in its specific problem domain but also in its use of traditional adversarial ML techniques, i.e., white-box gradient descent or black-box hill climbing, to find evading transformations.
Adversarial machine learning. Substantial research has studied the problem of crafting adversarial examples [13,53]. Traditionally, this research has been conducted on semi-continuous spaces, here defined as spaces where adjacent values carry semantic information, e.g., pixel values for image classification. In these approaches, attacks use a variety of derivative-based techniques to optimize loss over some non-convex objective function [7,12,16,27,30,49,65]. In our case, we examine executable binaries, where we must work under more difficult constraints. First, adjacent values in binary code do not carry semantic meaning. For instance, 0x8F is the binary encoding of the x86 pop instruction, whereas 0x90 is the semantically unrelated nop instruction. This difference is non-trivial, as it presents a much harder, discrete combinatorial problem than optimization over semi-continuous spaces [34]. Pierazzi et al. [55] provide detailed insight into how different problem spaces under which adversarial machine learning is conducted, such as using binary code as the input to a DNN, require specific black-box attacks because traditional gradient-based approaches fail. Another constraint we must satisfy is producing valid executable binaries. These constraints are similar to those necessary in any attack that attempts to modify binary code [4,43].
As stated in the previous subsection, other work on using deep learning for malware analysis has looked at the problem of classifying binaries as either malicious or benign [57,58]. In turn, various works have aimed to attack this type of machine learning model and others like it [4,43]. However, this paper presents the first exploration of the robustness of deep learning models against both inadvertent attacks and crafted adversarial examples.

CONCLUSIONS AND FUTURE WORK
We presented the first study of the resilience of neural function boundary detectors to inadvertent and adversarial attacks. Our methodology demonstrates that straightforward black-box search using a large dataset and toolchain array is sufficient to identify numerous adversarial examples for two state-of-the-art systems, XDA [54] and DeepDi [76]; sophisticated white-box search algorithms are unnecessary. Our conjecture, which we believe is validated by our evaluation, is that these systems are susceptible to attack because they rely on embeddings and model architectures intended for syntactic inference, and do not sufficiently consider the semantics of the ISAs they operate on. This is not to say that this research direction should be abandoned. To the contrary, we believe there remains significant potential for applying deep learning to binary analysis problems. However, future research might well benefit from focusing on instruction semantics rather than syntactic representations. In addition, future work should ensure that evaluations are based on large, representative datasets that include adversarial examples intended to exploit syntactic dependence. An intriguing research question is whether effective embeddings and model architectures can be developed specifically for binary analysis tasks. We plan to investigate this question in our future work, and hope others will as well.

Listing 3: One example of a single instruction that causes DeepDi to issue >3000 false positives for the gcal-4.1 benchmark.

Figure 2: Overview of precision versus recall per binary from the BinKit corpus. IDA Pro consistently performs best with respect to precision, with little variance. XDA, however, dominates with respect to recall. It also wins out on F1 score in all but one case, Obfuscate, where DeepDi is best. Variance in these metrics is somewhat apparent, but better observed in Figure 3.

Table 1: Summary comparison of various neural binary analysis systems. Note that all of these systems are at least in part built on embeddings and model architectures developed to solve problems in the NLP or image processing domains.

Figure 1: Overview of the NBA function boundary detection vulnerability search procedure. In the first phase, benchmark source code is compiled by an array of compiler toolchains and configurations, resulting in a benchmark binary corpus. Function boundary ground truth is extracted from the corpus. In parallel, the set of models-under-test is trained and evaluated on one or more training and inference splits of the corpus. Finally, misclassifications in the form of false positives and negatives are collected by comparing model output against the ground truth. Heavy hitters are identified and injected into the corpus for attack evaluation.

Table 3: Results per dataset. For each dataset and metric (precision, recall, F1), the maximum value is highlighted in green while the minimum value is highlighted in red. Large standard deviations (SD) are set in bold.

Table 4: Improving resilience through training.