jTrans: jump-aware transformer for binary code similarity detection

Binary code similarity detection (BCSD) has important applications in various fields such as vulnerability detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFGs) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control-flow information of binary code into Transformer-based language models, by using a novel jump-aware representation of the analyzed binaries and a newly-designed pre-training task. Additionally, we release to the community a newly-created large dataset of binaries, BinaryCorp, which is the most diverse to date. Evaluation results show that jTrans outperforms state-of-the-art (SOTA) approaches on this more challenging dataset by 30.5 percentage points (i.e., from 32.0% to 62.5%). In a real-world task of known vulnerability searching, jTrans achieves a recall that is 2X higher than existing SOTA baselines.


INTRODUCTION
Binary code similarity detection (BCSD), which can identify the degree of similarity between two binary code snippets, is a fundamental technique useful for a wide range of applications, including known vulnerability discovery [8-11, 18, 20, 21, 24, 33, 40, 49, 50, 55, 59], malware detection [4] and clustering [30, 31, 37], detection of software plagiarism [41, 42, 52], patch analysis [32, 35, 60], and software supply chain analysis [27]. Given the continuously expanding number of binary programs and the fact that binary analysis tasks are widespread, there is a clear need to develop BCSD solutions that are both more scalable and more accurate.
Prior to the use of machine learning in the field, traditional BCSD solutions heavily relied on specific features of binary code, i.e., control flow graphs (CFGs) of functions, which capture the syntactic knowledge of programs. Solutions such as BinDiff [64], BinHunt [23] and iBinHunt [46] employ graph-isomorphism techniques to calculate the similarity of two functions' CFGs. This approach, however, is both time-consuming and volatile, since CFGs may change based on compiler optimizations. Studies such as BinGo [5] and Esh [8] achieve greater robustness to CFG changes by computing the similarities of CFG fragments. However, these approaches are based on manually crafted features, which have difficulty capturing the precise semantics of binary code. As a result, these solutions tend to have relatively low accuracy.
With the rapid development of machine learning techniques, most current state-of-the-art (SOTA) BCSD solutions are learning-based. In general, these solutions embed target binary code (e.g., functions) into vectors, and compute functions' similarity in the vector space. Some solutions, e.g., Asm2Vec [14] and SAFE [43], model assembly language (of machine code) using language models inspired by natural language processing (NLP). Other studies use graph neural networks (GNNs) to learn the representation of CFGs and calculate their similarity [59]. Some studies combine both approaches, learning representations of basic blocks with NLP techniques and further processing basic block features in a CFG with a GNN, e.g., [44, 62]. Despite their improved performance, existing methods have several limitations.
First, NLP-based modeling of assembly language only considers the sequential order of instructions and the relationships among them; information regarding the program's actual execution (e.g., control flows) is not considered. As a result, methods that rely solely on NLP will lack semantic understanding of the analyzed binaries, and will also not adapt well to the possibly significant changes in the code that result from compiler optimization.
Secondly, relying solely on CFGs misses the semantics of the instructions in each basic block. Genius [21] and Gemini [59] propose to expand the CFG with manually extracted features (e.g., the number of instructions). However, such features are still insufficient to fully capture the code semantics. Furthermore, these solutions generally use GNNs to process CFGs, which only captures the structural information. GNNs are also generally known to be relatively difficult to train and parallelize, which limits their real-world application.
Thirdly, the datasets on which existing solutions are trained and evaluated are not sufficiently large and/or diverse. Due to the lack of a common large benchmark, each study creates its own dataset, often from small repositories such as GNUtils, coreutils, and openssl. These small datasets have similar code patterns, and therefore lack diversity, which in turn can lead to over-fitted models and a false impression of high performance. Furthermore, the evaluations of existing solutions often do not reflect real-world use cases. The majority of studies did not conduct experiments on a large pool of candidate functions, which is common in the real world. Under more realistic conditions, the performance of many SOTA solutions drops significantly, as we show in our experimental results in Section 6.
In this paper, we present jTrans, a novel Transformer-based model designed to address the aforementioned problems and support real-world binary similarity detection. We combine NLP models, which capture the semantics of instructions, together with CFGs, which capture the control-flow information, to infer the representation of binary code. Since previous work [62] has shown that a simple combination of NLP-based and GNN-based features does not yield optimal results, we propose to fuse the control-flow information into the Transformer architecture. To the best of our knowledge, we are the first to do so.
We modify the Transformer to capture control-flow information by sharing parameters between the token embeddings and position embeddings of each jump target of instructions. We first use unsupervised learning tasks to pre-train jTrans to learn the semantics of instructions and the control-flow information of binary functions. Next, we fine-tune the pre-trained jTrans to match semantically similar functions. Note that our method is able to combine features from each basic block using language models, without relying on GNNs to traverse the corresponding CFG.
In addition to our novel approach, we present a large and diversified dataset, BinaryCorp, extracted from ArchLinux's official repositories [2] and the Arch User Repository (AUR) [3]. Our newly-created dataset enables us to mitigate the over-fitting and lack of diversity that characterize existing datasets. We automatically collect all the C/C++ projects from the repositories, which contain the majority of popular open-source software, and build them with different compiler optimizations to yield different binaries. To the best of our knowledge, ours is the largest and most diversified binary program dataset for BCSD tasks to date.
We implement a prototype of jTrans and evaluate it on real-world BCSD problems, where we show that jTrans significantly outperforms SOTA solutions, including Gemini [59], SAFE [43], Asm2Vec [14], GraphEmb [44] and OrderMatters [62]. When using the full-sized BinaryCorp for the task of finding the matching function in pools of 10,000 functions, which is close to real-world scenarios, jTrans assigns the correct matching function the highest similarity score (denoted as Recall@1) with a probability of 62.5% on average, while the best of the SOTA solutions only achieves 32.0%. For the less realistic (and easier) scenario of pools with 32 functions, our approach outperforms its closest competitor by 10.6 percentage points (from 84.3% to 94.9%) on the same Recall@1 metric. Furthermore, when evaluated on a real-world vulnerability searching task, jTrans achieves a recall score that is 2X higher than the SOTA baselines.
In summary, our study offers the following contributions: • We propose a novel jump-aware Transformer-based model, jTrans, which is the first solution to embed control-flow information into the Transformer. Our approach is able to learn binary code representations and support real-world BCSD. We release the code of jTrans at https://github.com/vul337/jTrans.

PROBLEM DEFINITION
BCSD is a fundamental task that calculates the similarity of two binary functions. It can be used in three types of scenarios, as discussed in [26]: (1) one-to-one (OO), where the similarity score of one source function to one target function is returned; (2) one-to-many (OM), where a pool of target functions is sorted based on their similarity scores to one source function; (3) many-to-many (MM), where a pool of functions is divided into groups based on similarity. Without loss of generality, we focus on OM tasks in this study. Note that we can reduce OM problems to OO problems by setting the size of the target function pool to 1. We could also extend OM problems to MM by taking each function in the pool as the source function and solving multiple OM problems. Stated formally, the goal of BCSD is to develop a solution that calculates the similarity score of two binary functions, where two functions compiled from two source code functions that are the same or similar to some extent (e.g., one is a patched version of the other) should receive a high similarity score.
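To make the OM setting concrete, here is a minimal sketch of pool ranking with cosine similarity. The embedding values and pool names are hypothetical; any BCSD embedding model could supply the vectors.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_pool(query_emb, pool_embs):
    # One-to-many (OM): sort a pool of target functions by their
    # similarity score to a single source function.
    scores = [(name, cosine(query_emb, emb)) for name, emb in pool_embs.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Hypothetical embeddings: g1 is near the query, g2 is not.
ranking = rank_pool([1.0, 0.0], {"g1": [0.9, 0.1], "g2": [0.0, 1.0]})
```

Under this formulation, Recall@1 simply asks whether the ground-truth match is ranked first.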

RELATED WORK

3.1 Non-ML-based BCSD Approaches
Prior to applying machine learning, traditional binary code similarity techniques included static and dynamic methods. Under the assumption that logically similar code shares similar run-time behavior, dynamic analysis methods measure binary code similarity by analyzing manually-crafted dynamic features. This type of solution includes works such as BinDiff [16], BinHunt [23], iBinHunt [46], and Genius [21], which are based on CG/CFG graph-isomorphism (GI) theory [16, 22]. These works compare the similarity of two binary functions using graph matching algorithms. ESH [8] employs a theorem prover to determine whether two basic blocks are equivalent. This approach, however, is not applicable to the case of different compiler optimizations, due to basic block splitting. BinGo [5], Blex [17] and Multi-MH [49] use randomly sampled values to initialize the context of the function and then compare the similarity of the collected I/O values. The main shortcoming of these dynamic methods is that they are not suitable for large-scale binary code similarity detection, as they are computationally expensive and require a long running time to analyze the whole binary code.
Static methods for BCSD are based on the identification of structural differences in binary code. Methods such as BinClone [19], ILine [34], MutantX-S [31], BinSign [47], and Kam1n0 [13] use categorized operands or instructions as static features for the computation of binary similarity. Tracelet [11] and BinSequence [33] compare the similarity of two binary functions based on the edit distance between instruction sequences. TEDEM [50] and XMATCH [20] compute binary similarity using the graph/tree edit distance of basic block expression trees. While static methods are more efficient than dynamic ones, they generally achieve lower accuracy, as they only capture the structural and syntactic information of the binary, and neglect the semantics of and relationships between instructions.

Learning-based BCSD Approaches
The study of learning-based BCSD has been inspired by recent developments in natural language processing (NLP) [39, 45, 56], which uses real-valued vectors called embeddings to encode the semantic information of words and sentences. Building upon these techniques, previous studies [14, 15, 24, 40, 43, 44, 51, 59, 61-63] applied deep learning methods to binary similarity detection. Shared by many of these studies is the idea of embedding binary functions into numerical vectors, and then using vector distance to approximate the similarity between different binary functions. As shown in Figure 2, these methods use deep learning training algorithms to bring the vectors of logically similar binary functions closer together. Most learning-based methods use a Siamese network [6], which requires ground-truth mappings of equivalent binary functions for training. Diff [40], for example, learns binary function embeddings directly from the sequence of raw bytes using a convolutional neural network (CNN) [38]. INNEREYE [63] and RLZ2019 [51] regard instructions as words and basic blocks as sentences, and use word2vec [45] and LSTM [29] to learn basic block embeddings. SAFE [43] uses a similar approach to learn the embeddings of binary functions, while Gemini [59], VulSeeker [24], GraphEmb [44] and OrderMatters [62] use GNNs to build a graph embedding model that learns the attributed control-flow graph (ACFG) of binary functions. Gemini and VulSeeker encode basic blocks with manually-selected features, while GraphEmb and OrderMatters use neural networks to learn the embeddings of basic blocks. Another approach is proposed by DEEPBINDIFF [15] and Codee [61], which use neural networks to learn the embeddings of generated instruction sequences instead of embedding the ACFG of binary functions.
Unsupervised learning (learning without labels) has also been explored in the field of BCSD. One representative solution is Asm2Vec [14]. This approach also generates instruction sequences from the CFG, but does not rely on ground-truth mappings of equivalent binary functions. Asm2Vec uses an unsupervised algorithm to learn the embedding of binary functions. However, its performance is not as good as that of the state-of-the-art supervised learning methods.
Overall, learning-based approaches are suitable for large-scale binary code similarity detection, as binary code functions can be transformed into vectors. Similarity can then be computed using vector distance, which is computationally efficient. However, existing techniques have limitations. Some approaches [24, 59] ignore the semantics of instructions and basic blocks, as they only use manually-selected features to represent basic blocks. Other approaches [14, 15, 40, 43, 51, 61, 63] neglect some or all of the structural information of binary functions, as they either do not use control-flow information or generate instruction sequences from the CFG using random walks. Finally, methods such as [44, 62] learn basic block embeddings, and use a GNN to learn the embedding of the attributed control-flow graph (ACFG) of binary functions. While these approaches are effective in some scenarios, they neglect the co-occurrence relationships between instructions in different basic blocks.

METHODOLOGY

4.1 Overview
To address the challenges we discussed in Section 1, we propose a novel model, jTrans, for automatically learning the instruction semantics and control-flow information of binary programs. jTrans is based on the Transformer-Encoder architecture [57], and incorporates several significant changes designed to make it more effective for the challenging domain of binary analysis.
The first change we propose to the Transformer architecture is designed to enable jTrans to better capture the code's jump relationships, i.e., the control-flow information. To this end, we first preprocess the assembly code of the input binary so that it contains the program's jump relationships. Next, we modify the embedding of the individual input tokens of the Transformer so that the origin and destination locations of the jumps are "semantically" similar.
The second change we propose relates to the training of our model. In view of the similarity between natural language and programs in terms of data flow, we chose to use the commonly used and effective Transformer training approach of the Masked Language Model (MLM) [12]. MLM-based training requires the model to predict the content of masked tokens based on the content of their neighbors, thus forcing the model to develop a contextual understanding of the relationships among instructions.
Furthermore, to encourage the model to learn the manner in which jumps are incorporated in the code, we propose a novel auxiliary training task that requires the model to predict the target of a jump instruction. This task, which we call Jump Target Prediction (JTP), requires an in-depth understanding of the semantics of the code, and as shown in Section 6.4, it contributes significantly to the performance of our model.

Binary Function Representation Model
jTrans is based on the BERT architecture [12], which is the state-of-the-art pre-trained language model for many natural language processing (NLP) tasks. In jTrans, we follow the same general approach used by BERT for the modeling of texts, i.e., the creation of an embedding for each token (i.e., word), and the use of BERT's powerful attention mechanism to effectively model the binary code.
However, binary code differs from natural language in several respects. First, the vocabulary of binary code (e.g., constants and literals) is extremely large. Second, binary code contains jump instructions. For a jump instruction, we denote its operand token as the source token, which specifies the address of the jump target instruction. For simplicity, we denote the mnemonic token of the target instruction as the target token, and represent this jump pair as <source token, target token>.
Therefore, we have to address two problems in order to apply BERT:
• Out-of-vocabulary (OOV) tokens. As in the field of NLP, we need to train jTrans on a fixed-size vocabulary that contains the most common tokens in the analyzed corpus. Tokens that are not included in the vocabulary need to be represented in a way that enables the Transformer to process them effectively.
• Modeling jump instructions. After preprocessing, the binary code has little information left about the source and target tokens of a jump pair, so BERT can hardly infer the connection between them. This problem is exacerbated by the possibly large distance between the source and target, which makes contextual inference even more difficult.
We propose to address these challenges as follows.

Preprocessing instructions.
To mitigate the OOV problem, we use the state-of-the-art disassembly tool IDA Pro 7.5 [28] to analyze the input binary programs and produce sequences of assembly instructions. We then apply the following tokenization strategies to normalize the assembly code and reduce its vocabulary size: (1) We use the mnemonics and operands as tokens.
(2) We replace the string literals with a special token <str>.
(3) We replace the constant values with a special token <const>.
(4) We keep the names of external function calls and labels as tokens, and replace the names of internal function calls with <function>1 . (5) For each jump pair, we replace its source token (which was an absolute or relative address of the jump target) with a token JUMP_XXX, where XXX is the order of the target token of this jump pair (e.g., 20 and 14 in Figure 3). In this way, we remove the impact of the random base addresses of binaries.
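The normalization rules above can be sketched roughly as follows. This is a simplified illustration only: real preprocessing operates on IDA Pro output, and the jump-mnemonic set and address pattern used here are assumptions.

```python
import re

# Hypothetical address pattern and jump-mnemonic set (assumptions for the sketch).
HEX = re.compile(r"-?(0x)?[0-9a-fA-F]+\Z")
JUMPS = {"jmp", "je", "jne", "jz", "jnz", "jg", "jge", "jl", "jle", "ja", "jb"}

def normalize(instrs, internal_funcs, jump_targets):
    """instrs: list of (mnemonic, operand-list); jump_targets maps a target
    address to the order of its target token (rule 5)."""
    out = []
    for mnem, ops in instrs:
        out.append(mnem)                                  # rule (1): mnemonic as token
        for op in ops:
            if op.startswith('"'):
                out.append("<str>")                       # rule (2): string literal
            elif mnem in JUMPS and op in jump_targets:
                out.append(f"JUMP_{jump_targets[op]}")    # rule (5): jump source token
            elif HEX.match(op):
                out.append("<const>")                     # rule (3): constant value
            elif op in internal_funcs:
                out.append("<function>")                  # rule (4): internal call
            else:
                out.append(op)                            # rule (1)/(4): keep operand
    return out
```

For example, `normalize([("jz", ["0x401020"])], set(), {"0x401020": 14})` yields the tokens `["jz", "JUMP_14"]`.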

Modeling jump instructions.
We now address the challenge of representing jump instructions in a manner that enables jTrans to better contextualize their bipartite nature (and capture the entire control-flow information as a whole). We chose to use the positional encodings, which are an integral part of the Transformer architecture. These encodings enable the model to determine the distance between tokens. The implicit logic of this representation is that larger distances between tokens generally indicate weaker mutual influence. Jump instructions, however, bind together areas of the code that may be far apart. Therefore, we modify the positional encoding mechanism to reflect the effect of jump instructions.
[Figure 3: The raw assembly code is first tokenized and normalized. Then, each token is converted to a token embedding and a position embedding, and its final input embedding is the sum of the two. For each jump pair, the source token's embedding (e.g., E_JUMP_14), also called the jump embedding, shares parameters with the target token's position embedding (e.g., P_14).]
Our changes to the positional encodings are designed to reflect the fact that the source and target of a jump instruction are not only as close as two consecutive tokens (due to the order of execution), but also have a strong contextual connection. We achieve this goal through parameter sharing: for each jump pair, the source token's embedding (see E_JUMP_14 in Figure 3) is used as the positional encoding of the target token (see P_14).
This representation achieves two important goals. First, the shared embedding enables the Transformer to identify the contextual connection between the source and target tokens. Secondly, this strong contextual connection is maintained throughout the training process, since the shared parameters are updated for both tokens simultaneously. It is worth noting that we only focus on direct jump instructions in jTrans. We hypothesize that the control-flow information carried by indirect jumps would further improve the performance of jTrans. However, recognizing the targets of such jumps is a well-known open challenge and beyond the scope of our current work. If a solution for recognizing indirect jump targets is proposed in the future, we can embed the operand of the indirect jump as the fusion of the positional embeddings of all its jump targets.
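The parameter-sharing mechanism can be illustrated with a toy NumPy sketch (not the actual jTrans implementation; the dimensions and the vocabulary id of the JUMP_14 token are made up). The key point is that the jump source token's embedding and the target position's embedding are a single parameter object, so an update to one is seen by the other:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Per-row parameter lists, so two entries can literally share one array.
token_emb = [rng.normal(size=d_model) for _ in range(100)]  # vocabulary embeddings
pos_emb = [rng.normal(size=d_model) for _ in range(32)]     # positional embeddings

# Parameter sharing for a jump pair: the (hypothetical) vocab id of the
# JUMP_14 source token reuses position 14's embedding object.
JUMP_14_ID, TARGET_POS = 50, 14
token_emb[JUMP_14_ID] = pos_emb[TARGET_POS]   # same array object, truly shared

def input_embedding(token_ids):
    # Final input embedding of each token = token embedding + position embedding.
    return np.stack([token_emb[t] + pos_emb[i] for i, t in enumerate(token_ids)])

pos_emb[TARGET_POS] += 1.0      # an in-place update (as an optimizer step would do)
# ...is automatically reflected in the jump token's embedding as well.
```

Because both views point at the same parameters, training updates keep the source token and the target position representation aligned, as described above.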

4.2.3 The Rationale of Our Proposed Approach. By sharing the parameters between the source and target tokens of a jump pair, we create a high degree of similarity in their representations. As a result, whenever the attention mechanisms that power jTrans assign a high attention weight to one of these tokens (i.e., determine that it is important to the understanding/analysis of the binary), they will automatically also assign high attention to its partner. This representation therefore ensures that both parts of the jump instruction, and the instructions near them in the code, will be included in the inference process.
We now provide a formal analysis of the similarity of a jump pair's tokens, and demonstrate that the similarity within this pair is higher than the similarity of either token with any other token. For a given binary function f = {t_1, ..., t_n}, where t_i is the i-th token of f, all tokens are converted into mixed embedding vectors {E(t_1), ..., E(t_n)} before being fed into jTrans, where each embedding E(t_i) is the sum of the token embedding T_{t_i} and the position embedding P_i. We apply the multi-head self-attention mechanism [57] to the mixed embedding vectors {E(t_1), ..., E(t_n)}. Denoting the embedding at the l-th layer as X^l, we first project X^l to queries, keys and values for each head h: Q_h^l = X^l W_{Q,h}^l, K_h^l = X^l W_{K,h}^l, V_h^l = X^l W_{V,h}^l. We then use scaled dot-product attention to obtain the attention output
Attention(Q_h^l, K_h^l, V_h^l) = Softmax(Q_h^l (K_h^l)^T / sqrt(d_model)) V_h^l,
where W_{Q,h}^l, W_{K,h}^l, W_{V,h}^l are affine transformation matrices of the l-th layer, d_model is the dimension of the embedding vectors, and Softmax(·) yields the attention weight matrix. We denote the embedding updated by head h as head_h^l = Attention(Q_h^l, K_h^l, V_h^l). Assuming H attention heads, we obtain the updated embedding X^{l+1} as
X^{l+1} = FFN^l(Concat(head_1^l, ..., head_H^l) W_O^l),
where W_O^l ∈ R^{H·d_head × d_model} is the output transformation matrix of the l-th layer and FFN^l is the feed-forward network of the l-th layer.
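For reference, the scaled dot-product attention at the core of this computation can be sketched in NumPy. This is a single-head sketch with random weights; the real model uses multiple learned heads plus an output projection and a feed-forward network.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = Softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                # 5 token embeddings, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
```

Each row of the attention weight matrix sums to 1, distributing one token's attention over all tokens in the function.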
The final output of jTrans is taken from the last layer of the model. We produce the function embedding E_f as follows:
E_f = E(<CLS>) W_f,
where W_f ∈ R^{d_model × d_f} is the output transformation matrix, d_f is the dimension of the function embedding, and E(<CLS>) is the last-layer embedding of the <CLS> token.
Next, we show how jTrans conveys the control-flow information of the program. Consider three tokens t_i, t_j and t_k in the given function, with corresponding embeddings E_i, E_j and E_k. Assume that <t_i, t_j> is a jump pair, while t_k is an arbitrary other token. We prove that the expectation of the attention score between t_i and t_j minus the attention score between t_i and t_k is positive, which can be formulated as
E[Attn(t_i, t_j) − Attn(t_i, t_k)] > 0.
This inequality shows that t_i generally pays more attention to t_j than to t_k, which is the internal explanation of the jump embeddings. The detailed proof is in Appendix 8.

Pre-training jTrans
The BERT architecture, on which jTrans is based, uses two unsupervised learning tasks for pre-training. The first task is the masked language model (MLM), in which BERT is tasked with reconstructing randomly-masked tokens. The second unsupervised learning task is designed to hone BERT's contextual capabilities by requiring it to determine whether two sentences are consecutive. We build upon BERT's overall training process, while performing domain-specific adaptations: we preserve the MLM task, but replace the second task with one we call jump target prediction (JTP). As shown in Section 4.3.2, the goal of the JTP task is to improve jTrans's contextual understanding of jump instructions.

4.3.1 The Masked Language Model Task. Our MLM task closely follows the one proposed in [12], and jTrans uses BERT's masking procedures: 80% of the randomly-selected tokens are replaced by the mask token (indicating that they need to be reconstructed), 10% are replaced by other random tokens, and 10% are left unchanged. Following the notation in Section 4.2.3, we define the function f = {t_1, ..., t_n}, where t_i is the i-th token of f and n is the number of tokens. We first select a random set of positions of f to mask out, denoted m_x.
Based on these definitions, the MLM objective of reconstructing the masked tokens can be formulated as
L_MLM = Σ_{i ∈ m_x} −log P(t_i | f_masked),
where m_x contains the indices of the masked tokens and f_masked denotes the function after masking. An example of the masking process is presented in Figure 4. We mask the rsp, rdi and call tokens, and task jTrans with reconstructing them. To succeed, our model must learn basic assembly syntax and its contextual information. Successfully reconstructing the rdi token, for example, requires that the model learn the calling convention of the function, while the rsp token requires an understanding of continuous execution.
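The BERT-style 80%/10%/10% masking recipe described above can be sketched as follows. This is a simplified illustration; the mask ratio, vocabulary and token names are assumptions, and real tokenization follows the preprocessing rules described earlier.

```python
import random

MASK = "<MASK>"
VOCAB = ["mov", "add", "push", "rdi", "rsp", "<const>", "call"]  # toy vocabulary

def mlm_mask(tokens, mask_ratio=0.15, seed=0):
    # Select ~15% of positions; of those, 80% become <MASK>, 10% become a
    # random vocabulary token, and 10% are left unchanged (BERT's recipe).
    rng = random.Random(seed)
    tokens = list(tokens)
    n = max(1, int(len(tokens) * mask_ratio))
    m_x = sorted(rng.sample(range(len(tokens)), n))  # masked-out positions
    for i in m_x:
        r = rng.random()
        if r < 0.8:
            tokens[i] = MASK
        elif r < 0.9:
            tokens[i] = rng.choice(VOCAB)
        # else: keep the original token; the model must still predict it
    return tokens, m_x

masked, m_x = mlm_mask(["push", "rbp", "mov", "rbp", "rsp", "sub", "rsp", "<const>"])
```

The model is then trained to recover the original tokens at the positions in `m_x`.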

Jump Target Prediction.
The JTP task is defined as follows: given a randomly selected jump source token, our model is required to predict its corresponding target token. This task, which is difficult even for human experts, requires our model to develop a deep understanding of the CFG. This in turn leads to improved performance for jTrans, as we show in Section 6.4. JTP is carried out by first selecting a random subset of the available jump source tokens. These tokens are then replaced with the token <LOC>.
Let l_x denote the set of positions of the selected jump source tokens. JTP's objective function can be formulated as
L_JTP = Σ_{i ∈ l_x} −log P(t_i | f_replaced),
where f_replaced denotes the function after the selected jump source tokens have been replaced with <LOC>. An example of the JTP task is presented in Figure 5, where we replace the JUMP_20 token with <LOC>, and the model is tasked with predicting the index of the jump target. An analysis of jTrans's performance on the JTP task is presented in Table 6, which shows that our model achieves an accuracy of 92.9%. Furthermore, an ablation study that evaluates JTP's contribution to jTrans's overall performance is presented in Section 6.4. These results provide a clear indication that this training task improves jTrans's ability to learn the control flow of the analyzed functions.
The overall loss function of jTrans in the pre-training phase is the sum of the MLM and JTP objective functions.

Fine-tuning for Binary Similarity Detection
Upon completion of the unsupervised pre-training phase, we fine-tune our model for the supervised learning task of function similarity detection. We aim to train jTrans to maximize the similarity between pairs of similar binary functions, while minimizing the similarity of unrelated pairs. As shown in Section 4.2.3, we use Equation 4 to represent a function f. Our chosen metric for calculating function similarity is cosine similarity. We use the following notation: let F and G be two sets of binary functions with a ground-truth mapping between similar functions. For any query function f ∈ F, let g+ ∈ G be a function that is similar to f (e.g., compiled from the same source code). Furthermore, let g− ∈ G be an arbitrary function unrelated to f. We denote the embedding of function f as E_f. Finally, we define D as the set of all generated triples (f, g+, g−).
The objective function for our fine-tuning process is trained using contrastive learning [25, 53] and can be formulated as a margin-based loss over the triples in D:
min_θ Σ_{(f, g+, g−) ∈ D} max(0, m − cos(E_f, E_{g+}) + cos(E_f, E_{g−})),
where θ represents the parameters of the model and m is a margin hyperparameter, usually chosen between 0 and 0.5 [48]. Once the fine-tuning process is complete, we can measure the similarity score of two functions f_1, f_2 by calculating the cosine similarity of their embeddings.
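A hinge-style contrastive objective of this kind, for a single triple (f, g+, g−), can be sketched in plain Python. This is an illustration consistent with the description above, not the actual training code; the margin value is an arbitrary choice within the stated range.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_loss(e_f, e_pos, e_neg, margin=0.2):
    # Push cos(f, g+) above cos(f, g-) by at least `margin`; once the gap
    # is wide enough, the triple contributes no loss.
    return max(0.0, margin - cosine(e_f, e_pos) + cosine(e_f, e_neg))
```

Minimizing this loss over all triples in D pulls similar functions together and pushes unrelated ones apart in the embedding space.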

Large-Scale Dataset Construction
We build our dataset for binary similarity detection based on the ArchLinux [2] official repositories and the Arch User Repository [3]. ArchLinux is a Linux distribution known for its large number of packages and rapid package updates. ArchLinux's official repositories contain tens of thousands of packages, including editors, instant messengers, HTTP servers, web browsers, compilers, graphics libraries, cryptographic libraries, etc. The Arch User Repository contains more than 77,000 packages uploaded and maintained by users, greatly enlarging the dataset. Furthermore, ArchLinux provides a useful tool, makepkg, for developers to build their packages from source code. makepkg compiles the specified package from source by parsing the PKGBUILD file, which contains the required dependencies and compilation helper functions. The binary code similarity task requires large amounts of labeled data, so we use this infrastructure to construct our dataset.

Projects Filtering.
For compilation compatibility reasons, we select C/C++ projects in the pipeline to build the datasets. If the build function in a PKGBUILD file contains calls to cmake, make, gcc or g++, the package is very likely a C/C++ project. Conversely, if the depends variable in the PKGBUILD file contains rustc, go or jvm, the package is unlikely to be a C/C++ project, and we can remove it before compiling.
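This heuristic can be sketched as follows. It is a rough illustration only; a real implementation would parse the build() function and the depends array of the PKGBUILD separately rather than scanning the whole file.

```python
import re

def is_probably_c_project(pkgbuild_text):
    # Tokenize the PKGBUILD text into words (keeping "+" so that g++ survives).
    words = set(re.findall(r"[A-Za-z0-9_+]+", pkgbuild_text))
    # Dependencies on other toolchains suggest a non-C/C++ project.
    if words & {"rustc", "go", "jvm"}:
        return False
    # Calls to the usual C/C++ build tools suggest a C/C++ project.
    return bool(words & {"cmake", "make", "gcc", "g++"})
```

For example, a PKGBUILD whose build() runs cmake and make passes the filter, while one that depends on go is rejected before compilation.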

Compilation pipeline.
In our pipeline, we wish to specify any optimization level we want for each build. The toolchains of some projects do not consume the environment variables CFLAGS and CXXFLAGS, making it impossible to change the optimization level easily. However, because most projects invoke the compiler via CC or CXX, we point these environment variables at self-modified versions of gcc, g++, clang and clang++. The modified compiler rewrites the command-line parameters related to the optimization level to the expected compilation parameters, and also appends the expected compilation arguments to the original parameters. These two mechanisms ensure that compilation is performed with the expected optimization level.
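The flag rewriting done by such a wrapper can be sketched as follows. This illustrates the idea only; the real wrappers are modified compiler front-ends, and the set of recognized flags here is an assumption.

```python
OPT_FLAGS = {"-O0", "-O1", "-O2", "-O3", "-Os", "-Ofast"}

def rewrite_compiler_args(argv, desired="-O3"):
    # Drop whatever optimization level the project's build system passed,
    # then append the level the pipeline wants, so the desired -O flag wins.
    kept = [a for a in argv if a not in OPT_FLAGS]
    return kept + [desired]

args = rewrite_compiler_args(["-c", "foo.c", "-O2"], desired="-O0")
```

Appending the desired flag last also covers toolchains that pass no -O flag at all, since the wrapper's flag is then the only one present.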

Label Collection.
To collect labels, we first need to obtain unstripped binaries and the offsets of their functions. We found that many real-world projects call strip during compilation, so specifying parameters in the PKGBUILD alone does not solve this problem. We therefore replaced strip with a modified version that does not strip the symbol table, regardless of the passed-in parameters.

EXPERIMENTAL SETUP

5.1 The BinaryCorp Dataset
We now present BinaryCorp, the dataset we created to evaluate large-scale binary similarity detection. BinaryCorp consists of a large number of binaries produced by our automatic compilation pipeline: based on the official ArchLinux packages and the Arch User Repository, we use gcc and g++ to compile 48,130 binary programs with different optimization levels, and follow the approach proposed in SAFE [43] to filter duplicate functions. The statistics of our datasets are shown in Table 1. While many previous works use Coreutils and GNUtils as their dataset, Table 1 clearly shows that BinaryCorp-26M operates at a different scale: our newly-created dataset has approximately 26 million functions, compared to GNUtils' 161,202 and Coreutils' 76,174. BinaryCorp-26M is therefore more than 160 times the size of GNUtils and more than 339 times the size of Coreutils.
The size of our new dataset prevents the use of some of the existing methods, due to their insufficient scalability. We therefore also provide a smaller dataset, named BinaryCorp-3M, which contains 10,265 binary programs and about 3.6 million functions. The number of functions in our smaller dataset is about 22 times that of GNUtils, and 47 times that of Coreutils.
BinKit [36] is the largest binary dataset, consisting of 36,256,322 functions. However, BinKit used 1,352 different compilation options to generate 243k binaries from only 51 GNU packages, resulting in many similar functions in the dataset. We, on the other hand, use only 5 different compilation options to compile nearly 10,000 projects. While our number of binaries is smaller, our dataset is more diverse than BinKit in terms of developers, project size, coding style, and application scenarios. We argue that our newly generated dataset offers a more diverse, and therefore more realistic, basis for learning and evaluation. Using our datasets, we can evaluate the scalability and efficacy of jTrans (and the other baselines) on a new and larger scale. As mentioned in our contributions in Section 1, we will make our datasets and trained models available to the community.

Baselines
We compare jTrans to six top-performing baselines:

Genius [21]. This baseline is a non-deep-learning approach. Genius extracts raw features in the form of an attributed control flow graph and uses locality-sensitive hashing (LSH) to generate numeric vectors for vulnerability search. We implemented this baseline based on its official code 2 .
Gemini [59]. This baseline extracts manually crafted features for each basic block and uses a GNN to learn the CFG representation of the analyzed function. We implemented this approach based on its official TensorFlow code 3 and used its default parameter settings throughout our evaluation.

SAFE [43]. This baseline employs an RNN architecture with attention mechanisms to generate a representation of the analyzed function; it receives the assembly instructions as input. We implemented this baseline based on its official PyTorch code 4 and its default parameter settings.
Asm2Vec [14]. This method uses random walks on the CFG to sample instruction sequences, and then uses the PV-DM model to jointly learn the embeddings of the function and the instruction tokens. This approach is not open source, so we used an unofficial implementation 5 with its default parameter settings.
GraphEmb [44]. This baseline uses word2vec [45] to learn the embeddings of the instruction tokens. Next, it uses an RNN to generate independent embeddings for each basic block, and finally uses structure2vec [7] to combine the embeddings and generate a representation of the analyzed function. To make this baseline scalable to datasets as large as BinaryCorp-26M, we re-implemented the authors' original TensorFlow source code 6 using PyTorch.
OrderMatters [62]. This method combines two types of embeddings. The first type uses BERT to create an embedding for each basic block; these embeddings are then combined using a GNN to generate the final representation. The second type is obtained by applying a CNN to the CFG. The two embeddings are then concatenated. This method is not open source, and its online black-box API 7 cannot satisfy the needs of this study, so we implemented it on our own using the reported hyperparameters.

Evaluation Metrics
Let there be a binary function pool F and its ground-truth binary function pool G. We denote a query function f ∈ F and its corresponding ground-truth function g ∈ G. In this study we address the binary similarity detection problem, and our goal is therefore to retrieve the top-k functions in the pool G that have the highest similarity to f. The returned functions are ranked by a similarity score, and Rank(g) denotes the position of g in the list of retrieved functions. The indicator function I(x) equals 1 if the condition x holds and 0 otherwise. The retrieval performance can be evaluated using the following two metrics:

MRR = (1/|F|) Σ_{f ∈ F} 1 / Rank(g),
Recall@k = (1/|F|) Σ_{f ∈ F} I(Rank(g) ≤ k).
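A minimal sketch of computing these two metrics from the per-query ranks (the list-based interface is our own illustration):

```python
def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/Rank(g) over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose ground truth appears in the top-k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# ranks[i] is the (1-indexed) position of the ground-truth function
# for the i-th query in its retrieved list.
ranks = [1, 2, 1, 5, 1]
```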

EVALUATION
Our evaluation aims to answer the following questions.
• RQ1: How accurate is jTrans in BCSD tasks compared with other baselines? (§6.1)
• RQ2: How well do jTrans and the baselines perform on BCSD tasks with different function pool sizes? (§6.2)
• RQ3: How effective is jTrans at discovering known vulnerabilities? (§6.3)
• RQ4: How effective is our jump-aware design? (§6.4)
• RQ5: How effective is the pre-training design? (§6.5)

Throughout our experiments, all binary files were initially stripped to prevent information leakage. We used IDA Pro to disassemble and extract the functions from the binary code in all of the experiments, thus ensuring a level playing field. For baselines that did not use IDA Pro, we used their default disassembly frameworks for preprocessing after extracting functions with IDA Pro. All training and inference were run on a Linux server running Ubuntu 18.04 with a 96-core 3.0GHz Intel Xeon CPU with hyperthreading, 768GB of RAM, and 8 Nvidia A100 GPUs.

Binary Similarity Detection Performance
We conduct our evaluation on our two datasets, BinaryCorp-3M and BinaryCorp-26M. Additionally, we use two function pool sizes, 32 and 10,000, so that jTrans and the baselines can be evaluated at varying degrees of difficulty. It is important to note that we randomly assign entire projects to either the train or the test set in our experiments, because recent studies [1] have shown that randomly allocating binaries may result in information leakage.
The results of our experiments are presented in Tables 2-5. jTrans outperforms all the baselines by considerable margins. For poolsize=32 (Tables 2 and 4), jTrans outperforms its closest baseline competitor by 0.07 on the MRR metric and by over 10% on the recall@1 metric. The difference in performance becomes more pronounced when we evaluate the models on the larger pool size of 10,000. For this setup, whose results are presented in Tables 3 and 5, jTrans outperforms its closest competitor by 0.26 on the MRR metric and by over 27% on the recall@1 metric.
The results demonstrate the merits of our proposed approach, which utilizes both the Transformer architecture and a novel approach for representing and analyzing the CFG. We significantly outperform multiple SOTA approaches such as SAFE, which uses an RNN and performs a less rigorous analysis of the CFG, and OrderMatters, which uses a Transformer but analyzes each basic block independently.
Another important aspect of the results is the greater relative degradation in the performance of the baselines for larger pool sizes. In the next section we explore this subject further.

The Effects of Poolsize on Performance
The results of the previous section highlight the effect of the poolsize variable on the performance of binary similarity detection algorithms. These results are particularly significant given that previous studies use relatively small poolsizes, between 10 and 200, with some studies [43, 59] employing a poolsize of 2. We argue that such setups are problematic, because for real-world applications such as clone detection and vulnerability search, the poolsize is often larger by orders of magnitude. Therefore, we now present an in-depth analysis of the effects of the poolsize on the performance of SOTA approaches for binary analysis.
The results are presented in Figure 6. We conducted multiple experiments with a variety of poolsizes (2, 10, 32, 128, 512, 1,000, and 10,000) and plotted the results for various optimization pairs. The results clearly show that all baselines' performance degrades relative to jTrans's as the poolsize increases. Furthermore, our approach does not display sharp drops in performance (note that the X-axis in Figure 6 is logarithmic), while the baselines' performance generally declines more rapidly once poolsize=100 has been reached. This suggests that our approach is affected less by the poolsize itself than by the classification problem becoming more challenging due to the large number of candidates.
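The poolsize sweep described above can be sketched as follows; the cosine-similarity scoring and the toy embeddings are our own illustrative assumptions, not the models' actual outputs:

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def recall_at_1(query_embs, gt_embs, poolsize, rng):
    """For each query, rank its ground truth against poolsize-1
    randomly sampled distractor functions; count top-1 hits."""
    n = len(query_embs)
    hits = 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        pool = [i] + rng.sample(others, min(poolsize - 1, len(others)))
        best = max(pool, key=lambda j: cosine(query_embs[i], gt_embs[j]))
        hits += (best == i)
    return hits / n
```

Sweeping `poolsize` over 2, 10, 32, ..., 10,000 with fixed embeddings reproduces the experimental protocol: the query set stays the same while the number of distractors grows.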
Finally, we would like to point out that for very small poolsizes (e.g., 2), the performance of SOTA baselines such as SAFE and Asm2Vec is almost identical to that of jTrans, with the latter outperforming by approximately 2%. We deduce that evaluating binary analysis tools on small poolsizes does not provide a meaningful indication of their performance in real-world settings.

Real-World Vulnerability Search
Vulnerability detection is considered one of the main applications of computer security. We wish to evaluate jTrans's performance on the real-world task of vulnerability search. In this section, we apply jTrans to a known-vulnerabilities dataset with the task of searching for vulnerable functions.
We perform our evaluation on eight CVEs extracted from a known-vulnerabilities dataset [54]. We produce 10 variants of each function by using different compilers (gcc, clang) and different optimization levels. Our evaluation metric is recall@10. To simulate real-world settings, we use all of the functions in each project as the search pool. The number of functions per project varies from 3,038 to 60,159, with the latter being highly challenging.
Figure 7 presents the recall@10 results for each of our queries. We compare our approach to the two leading baselines from Sections 6.1 and 6.2, SAFE and Asm2Vec. It is clear that for most of the CVEs, jTrans's performance is significantly higher than that of the two baselines. For example, on CVE-2016-3183 from the openjpeg project, which contains 3,038 functions, our approach achieved a top-10 recall of 100%, meaning that it successfully retrieved all 10 variants, while Asm2Vec and SAFE achieved recall@10 values of 36.9% and 28.6%, respectively. Our results demonstrate that jTrans can be effectively deployed as a vulnerability search tool in real-world scenarios, as a result of its ability to perform well on large pool sizes.

The Impact of Our Jump-aware Design
In this section, we test our hypothesis that our jump-aware design significantly contributes to jTrans' ability to analyze the CFG of the binary code. To this end, we train a standard BERT model that does not use our representation of the jump information and compare it to our approach. The hyperparameters used by both models are identical. We evaluate the standard BERT model and jTrans on BinaryCorp-3M using recall@1, with poolsize=10,000. The results of our evaluation are presented in Figure 8. They clearly show that standard BERT performs significantly worse than jTrans, with lower performance on every optimization pair. On average, jTrans outperforms BERT by 7.3%. These results clearly show that incorporating control-flow information into assembly-language sequence modeling is highly beneficial for our model.
To further explore the efficacy of our jump-aware design, we analyze the ability of our pre-trained model to predict the masked jump addresses in the binary code. We conduct this experiment on BinaryCorp-3M. For each function in the evaluation set, we randomly sample jump positions with 15% probability and replace them with <LOC> in the function. We then analyze the probability of the model correctly predicting the jump target position for each masked jump target. Our results, presented in Table 6, show that jTrans is highly capable of predicting jump positions. Our pre-trained model can predict the target of a jump instruction with a top-1 accuracy of 92.9% and a top-10 accuracy of 99.5%. This accuracy is quite high, particularly for top-1, as there are 512 possible jump positions. These results indicate that our pre-trained model successfully captured the contextual instruction information of the binary.
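Top-k accuracy over the masked jump targets can be computed as in the following sketch; the score lists standing in for the model's logits over candidate positions are hypothetical:

```python
def topk_accuracy(scores_per_mask, targets, k):
    """scores_per_mask[i] holds the model's scores over all candidate
    jump positions for the i-th masked <LOC> token; targets[i] is the
    index of the true jump target position."""
    correct = 0
    for scores, target in zip(scores_per_mask, targets):
        # Indices of the k highest-scoring candidate positions.
        topk = sorted(range(len(scores)), key=lambda j: scores[j],
                      reverse=True)[:k]
        correct += (target in topk)
    return correct / len(targets)
```

In the paper's setting each score list would have 512 entries, one per possible jump position; k=1 and k=10 give the reported top-1 and top-10 accuracies.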

Evaluating the Efficacy of Pre-training
As in the original BERT, pre-training is a critical component of our model. Its main advantage is that it can be performed on unlabeled data, which is much easier to obtain in large quantities. To evaluate the effectiveness of the pre-training approach (MLM and JTP), we evaluated a version of our model that does not perform any fine-tuning. We follow the same approach as in zero-shot learning [58], where we use binaries without label information in the pre-training phase. Then, without fine-tuning the pre-trained model, we immediately apply it to the task of binary similarity search. The results of this model, denoted as jTrans-zero, are presented in Tables 2-5.
The results clearly show the efficacy of our pre-training approach. Even without fine-tuning, jTrans-zero outperforms all the baselines for poolsize=10,000: on BinaryCorp-26M, compared to the closest baseline, jTrans-zero improves by 0.1 on the MRR metric and by 10.6% on the recall@1 metric. In the poolsize=32 setup, jTrans-zero outperforms all baselines except SAFE, with the latter outperforming our approach by 11.4%. It is important to note, however, that poolsize=32 is far less indicative of real-world scenarios, and in the more challenging poolsize=10,000 setting, even our pre-training-only approach performed significantly better.

DISCUSSION
We focus on training jTrans on one architecture (i.e., x86) in this paper, but the proposed technique can be applied to other architectures as well. jTrans provides a novel solution to binary code similarity detection tasks, outperforming state-of-the-art solutions. It can be applied to many applications, including discovering known vulnerabilities in unknown binaries [11, 18, 49, 50], malware detection [4] and clustering [30], detection of software plagiarism [52], patch analysis [32, 60], and software supply chain analysis [27]. For instance, due to the rapid deployment of IoT devices, code reuse is very common in IoT development. BCSD solutions like jTrans could help detect whether IoT devices contain vulnerabilities revealed in open-source libraries. In the blockchain scenario, a huge number of blockchains and smart contracts have been developed in the past 5 years, often through code cloning and forking. However, the security risks of blockchains and smart contracts are severe, and a large portion of them are vulnerable. The code dependencies between different blockchains and smart contracts make this issue even worse. jTrans could be used to efficiently detect vulnerabilities in blockchains and smart contracts.
Existing deep learning-based works, including jTrans, embed individual binary functions into numerical vectors and compare the similarity between vectors. As a result, their accuracy drops as the pool size grows. As shown in Figure 6, the accuracy of most existing solutions drops below 20% when the pool size is 10,000. In real-world scenarios, the pool size would be much larger. A model that directly takes two binary functions as input could better capture inter-function relationships and further improve BCSD performance, even with a large pool. However, training a model to directly compare two functions would incur higher overhead. We leave balancing accuracy and overhead when applying jTrans to real-world BCSD tasks as future work.
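The overhead gap between the two designs can be illustrated with a back-of-the-envelope count of model forward passes; the query and pool sizes below are illustrative, not taken from our experiments:

```python
def bi_encoder_passes(num_queries, poolsize):
    # Embedding-based designs (like jTrans): embed the pool once and
    # each query once; comparison is a cheap vector operation, not a
    # model forward pass.
    return poolsize + num_queries

def cross_encoder_passes(num_queries, poolsize):
    # A model that takes two functions as input needs one forward
    # pass per (query, candidate) pair.
    return num_queries * poolsize

# e.g. 1,000 queries against a pool of 10,000 functions
print(bi_encoder_passes(1000, 10_000))     # 11,000 passes
print(cross_encoder_passes(1000, 10_000))  # 10,000,000 passes
```

This is why the pairwise design, despite potentially capturing inter-function relationships better, is costly at realistic pool sizes.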

CONCLUSION
In this work we propose jTrans, the first solution to embed control-flow information into Transformer-based language models. Our approach utilizes a novel jump-aware architecture design that does not rely on GNNs. A theoretical analysis of self-attention shows the soundness of our design. Experimental results demonstrate that our method consistently outperforms state-of-the-art approaches by a large margin on BCSD tasks. Through intensive evaluation, we also uncover weaknesses in the evaluation methodology of current SOTA methods. Additionally, we present and release to the community a newly-created dataset named BinaryCorp. Our dataset contains the largest collection of diversified binaries to date, and we believe it can serve as a high-quality benchmark for future studies in this field.

Figure 1 :
Figure 1: An example control flow of a binary function. The left part is the linear-layout assembly code with jump addresses, and the right part is the corresponding control-flow graph.

Figure 2 :
Figure 2: When using embeddings for binary code similarity detection, the query function and the candidate functions in the function pool are mapped into a semantic vector space. The embeddings of the query function, a similar function, and a dissimilar function are denoted by separate symbols.

Figure 3 :
Figure 3: Input representation of jTrans. The raw assembly code is first tokenized and normalized. Then, each token is converted to a token embedding and a position embedding, and its final input embedding is the sum of these two. For each jump pair, the source token's embedding (e.g., E_JUMP_14), also called the jump embedding, shares parameters with the target token's position embedding (e.g., P_14).

Figure 6 :
Figure 6: The performance of different binary similarity detection methods on BinaryCorp-3M.

Table 1 :
Statistics on the number of projects, binaries, and functions of the datasets. A project refers to binaries compiled from the same source code.