CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking

Transformer-based code models have impressive performance in many software engineering tasks. However, their effectiveness degrades when symbols are missing or not informative. The reason is that the model may not learn to pay attention to the right correlations/contexts without the help of symbols. We propose a new method to pre-train general code models when symbols are lacking. We observe that in such cases, programs degenerate to something written in a very primitive language. We hence propose to use program analysis to extract contexts a priori (instead of relying on symbols and masked language modeling as in vanilla models). We then leverage a novel attention masking method that allows the model to attend only to these contexts, e.g., bi-directional program dependence transitive closures and token co-occurrences. Meanwhile, the inherent self-attention mechanism is utilized to learn which of the allowed attentions are more important than others. To realize the idea, we enhance the vanilla tokenization and model architecture of a BERT model, construct and utilize attention masks, and introduce a new pre-training algorithm. We pre-train this BERT-like model from scratch, using a dataset of 26 million stripped binary functions with explicit program dependence information extracted by our tool. We apply the model in three downstream tasks: binary similarity, type inference, and malware family classification. Our pre-trained model improves the SOTAs in these tasks from 53% to 64%, 49% to 60%, and 74% to 94%, respectively. It also substantially outperforms other general pre-training techniques for code understanding models.


INTRODUCTION
Transformer models [1,7,18,62] have substantially advanced the state of the art of Natural Language Processing (NLP) applications. They are also the technique behind Large Language Models (LLMs), which have demonstrated unprecedented generalization and reasoning ability. These models feature the self-attention mechanism [72], which allows them to learn correlations/contexts that are distant in the input space, and self-supervised pre-training methods such as Masked Language Modeling (MLM), which produce high-quality embeddings by masking parts of the input and forcing the model to predict the masked parts from their contexts. Although Transformer models were initially introduced to improve NLP applications, recent research has shown that they can be used in many software engineering tasks, such as automated program repair [10,24,59,77], software testing [25], vulnerability detection [54,69], and so on [26,35,44,79], outperforming traditional methods. An underlying reason for their impressive performance is that software has rich natural language artifacts (also called symbols in this paper for terminological simplicity), such as comments, documents, and variable and function names. These artifacts make programs "understandable" to Transformer models, just as they are to human developers. For example, the correlation between a statement that defines a variable and another statement that uses the variable can be naturally captured by the attention mechanism due to the common variable name, just like how developers infer dataflow from variable names.
However, when symbols are not available, such as in stripped binary executables, or not informative, such as in obfuscated software, programs become extremely difficult for models to understand, just as they are hard for developers. In particular, they are in a very primitive language in which tokens no longer have rich semantics. For example, in x86 executables, variables are denoted by registers and memory locations dereferenced through registers or constants; the same register may be allocated to multiple variables. As such, the definition of a register and a use of the register may not suggest a dependence, and neighboring tokens may belong to completely independent contexts/computations. Such context interleavings are difficult to unfold without the help of symbols. As shown by our experiments, code models based on vanilla Transformer architectures and the vanilla MLM pre-training have degraded performance when symbols are lacking.
To mitigate the problem, researchers have proposed various methods. Trex [57] used micro-executions to acquire the input and output operand values of x86 instructions and then leveraged such values to train a Transformer model to precisely represent instruction-level semantics. JTrans [74] used jump target prediction together with MLM to pre-train a Transformer model and used contrastive learning to force the model to learn embeddings that can distinguish similar and dissimilar binary functions. DiEmph [81] further improved JTrans by removing biases (e.g., undesirable code pattern distributions) introduced by compilers. GraphCodeBERT [34] aimed to enhance Transformer code model pre-training by leveraging dataflow information. It expanded the raw input sequence with additional tokens denoting variables and introduced extra training losses to force the model to learn data-flow between these tokens. While these proposals demonstrated great improvements over vanilla models, most of them focused on single downstream tasks instead of general pre-trained models. Some (e.g., GraphCodeBERT) require symbols. Note that there is also a body of work that treats programs as graphs (e.g., control flow graphs and dependence graphs) and leverages Graph Neural Networks (GNNs) for software engineering tasks [22,30,40,70,83]. However, their performance also degrades when symbols are lacking, e.g., due to their need for labeled data in supervised learning and difficulties in capturing correlations that are multiple edges away in graphs [78]. More can be found in Section 2.
In this paper, we aim to develop a new technique for pre-trained general code models that targets programs without meaningful symbols, binary executables in particular. It does not fine-tune an existing pre-trained model. Instead, it pre-trains a model from scratch using a new method that regularizes attention. It produces high-quality embeddings that encode program dependences and disentangle interleaving contexts, and hence enables better performance in downstream tasks. Its intuition is the following. Without meaningful symbols, a programming language degenerates to a very primitive one like an arithmetic language [60], in which the meaning of a variable/statement can only be derived from the direct and transitive computations that produce and use the value of the variable/statement. For example, in a function where all statements in the function body cohesively compute a final return value, the embedding of the return value should reflect the computation of the whole function; a function computing multiple orthogonal output values shall have embeddings reflecting such orthogonality in spite of interleavings of the sub-computations. The language is dissimilar to a natural language, and humans rarely speak in such a primitive fashion. As such, the vanilla MLM method can hardly help the model produce the right attention during pre-training without the help of symbols, because MLM mainly leverages the correlations between individual tokens and their left and right contexts (shown in Section 2).
To address the challenge, we propose to use program analysis to derive possible dependences between instructions and then construct attention masks from such dependences. The masks enable self-attention between instructions that have dependences and preclude attention among those that are independent. This aligns well with the aforementioned primitive language. The pre-training then helps the model determine which dependences are more important than others and hence deserve more attention. Besides the dependence masks, additional masks are created to explicitly direct self-attention to correlations other than program dependences, such as token co-occurrences. We also enhance the MLM pre-training by masking part of the program dependences and introducing spurious dependences, and then forcing the model to correctly predict/classify such dependences. During inference, CodeArt takes the subject binary, generates the corresponding attention masks, and feeds both the input tokens and the masks to the model to produce output embeddings. Note that the dependence analysis and mask construction are deterministic and transparent to users. Our contributions are summarized in the following.
• We propose a new method to pre-train Transformer code models when symbols are lacking. Inspired by existing works that utilize program analysis to enhance Deep Learning models [8, 9, 11, 12, 14, 19-21, 31, 42, 51, 64, 65, 67, 68, 73], our method analyzes program dependences and uses them to guide self-attention. Different from many existing techniques that focus on improving the performance of individual downstream tasks, our pre-trained models are general, serving a large number of applications, and use masks to regulate attention.
• We address a number of technical challenges, including enhancements of tokenization and model architecture, a new pre-training method that masks dependences, transforming transitive dependences to connectivity relations to avoid undesirable decay, and new training objectives. Built on top of the BinaryCorp dataset [74], we construct a large-scale training dataset with 26 million stripped binary functions containing explicit dependence information.
• We develop a prototype CodeArt (Better CODE models by Attention regulaRizaTion when symbols are lacking) and use it to pre-train a BERT-like general model from scratch. The pre-training converges in four days on an 8×A100 GPU cluster. To demonstrate the generalization of the pre-trained model, we use it in three downstream tasks: binary similarity analysis, malware classification, and binary type inference. We improve the SOTAs of these tasks from 53% to 64% (Recall@1 with a pool size of 500), from 49% to 60% (LRAP), and from 74% to 94% (F-1 score averaged over different optimizations), respectively. We empirically compare with other general code pre-training approaches like GraphCodeBERT on binary code and show that our model is much more effective. We also conduct an ablation study to justify our design choices.

MOTIVATION
We use an example to discuss the limitations of existing techniques and illustrate our technique.
For better readability, we present the example mostly in its source code form. Our technique works on stripped binaries without any symbol information.
Example. Fig. 1(a) shows a code snippet computing the mean, variance, and a percentile for an array of sorted data. The code blocks in different colors denote the three sub-computations. Fig. 1(b) shows a version equivalent to (a) with statements reordered. Fig. 1(c) is highly similar to (a) except that its line 11 has a buggy operation. We mix (b) and (c) with a set of 5 random functions to form a pool of candidate functions and use (a) to query the most similar function from the pool. We first use a similarity analysis built on the CodeT5 model over source code. The analysis easily identifies that (a) is similar to (b) and not to (c). However, when we use a binary similarity analysis, JTrans, that operates on stripped binaries (without symbols), JTrans mistakenly considers (a) similar to (c) instead of (b). In contrast, a similarity analysis built on our model reports the correct result even without symbols. In the following, we explain these different results.
Transformer Models. Recent research has shown that Transformer-based code models, including Large Language Models (LLMs), deliver superior performance in many software engineering tasks [28,48,63,76]. Transformer was initially introduced for NLP applications [18,61,62,72], and the rich natural language artifacts in software make it suitable for these tasks. However, in tasks where symbols are precluded (e.g., binary analysis) or not informative (e.g., in obfuscated software), code becomes dissimilar to natural language products, causing the model to have sub-optimal attention and hence degraded performance. Fig. 2(a) shows part of the attention map for lines 9-12 of the code in Fig. 1(a), produced by the CodeT5 model that works on source code. Observe that with the help of the variable name mean, the model correctly correlates statements with program dependences, e.g., the strong attention between mean at line 11 and mean at line 9. In contrast, Fig. 2(b) shows the attention map of the corresponding binary code by the JTrans model [74]. Here, rax@A0 denotes variable i, rax@B0 denotes mean, and rax@C0 denotes i. Observe that the model has undesirable (strong) attention among the rax's. Moreover, the binaries for the programs in Fig. 1(a) and (c) are syntactically very similar. The JTrans model hence produced highly similar (but problematic) attention maps for the two, causing it to misclassify them as similar functions. In Section 4.1, our results show that CodeArt outperforms the JTrans model on the binary similarity task by over 30% in zero-shot settings.
GNN Models. GNNs are also widely used in software engineering tasks [3,6,16,36,43]. The basic idea is that programs have explicit graph structures, such as the control flow graph (CFG), abstract syntax tree (AST), and program dependence graph (PDG), which can be leveraged by GNNs. In GNNs, the embedding of a node (e.g., a basic block in a CFG) can be derived from the embeddings of neighboring nodes and the content of the node itself. They are hence a plausible solution when symbols are missing. However, as pointed out in [74,78], standard GNN-based code models struggle to capture long-range dependences [85]. Although scaling GNNs in depth and width may mitigate the problem, it may in the meantime cause optimization instabilities and representation oversmoothing [5,13,41]. Our experiments in Section 4.1 show that CodeArt outperforms GNN-based code models on the binary similarity task by over 20%.
Our Method. Our key observation is that when symbols are lost or not informative, code becomes dissimilar to natural language products and exhibits its own characteristics. First, the semantics of a statement is the aggregation of all the statements that directly or transitively contribute to it and those that it contributes to. For example, the meaning of variable mean at line 11 in Fig. 1(a) is determined by lines 8-11 and lines 13-14, which form a context for mean. In an extreme case where all statements in a function are devoted to computing a final return value, the embedding of the value shall reflect the meaning of the whole function. This is quite different from how humans use natural languages. Second, absolute code positions shall be de-emphasized when meaningful symbol information is not present to help disentangle interleaving contexts. Program code tends to have interleaving contexts, which may not be a problem when symbols are present to help disentanglement. For example, in Fig. 1(b), the context of mean interleaves with the context for computing the percentile, namely, lines 11-13. Although this may not be a problem when variable names are present to disentangle the contexts, it becomes a lot more challenging when symbols are lost.
Based on these observations, we propose a new technique that is based on Transformer and enhanced with a new pre-training method that explicitly regulates self-attention using masks. In particular, it uses program analysis to determine possible dependences between instructions, including both data and control dependences. The masks ensure that a token can only pay attention to its dependence context, which includes the instructions that it directly or transitively depends on and those that directly or transitively depend on it. As we will show in Section 3.4, additional masks are also derived to regulate attention in other types of contexts, such as instruction-local contexts that only allow a token to pay attention to other tokens within the same instruction, and global contexts that model token co-occurrences (within and across instructions). The pre-training further helps the model learn which of these allowed attentions ought to be strong. As shown in Fig. 3(a), the attention map by CodeArt for the binary version of Fig. 1(a) closely resembles that of the source code (Fig. 2(a)). Moreover, the attention map by CodeArt for the binary version of Fig. 1(b) (i.e., the reordered but equivalent version) is also highly similar (see Fig. 3(b)), despite their syntactic differences. In fact, CodeArt produces similar attention maps for all three code snippets in Fig. 1. However, the different operations at line 11 of Fig. 1(a) and (c) cause different final embeddings, allowing a binary similarity analysis built on our model to correctly recognize the similar functions.
Comparison with GraphCodeBERT. GraphCodeBERT [34] is a source code representation model based on the BERT architecture that leverages data-flow information in pre-training. By augmenting (appending) the [comment, source code] input with additional variable tokens, forcing the variable tokens to align with their corresponding tokens in the source code part of the input, and forcing them to have the intended data-flow, GraphCodeBERT aims to learn better embeddings. Although GraphCodeBERT has demonstrated advantages over the vanilla CodeBERT [28], porting it to handle binaries without symbols is challenging. In particular, its variable token alignment largely relies on variable name equivalence. At the binary level, the multiple occurrences of a variable may have completely different register/memory tokens, rendering the alignment much more difficult. Without proper alignment, the data-flow training is infeasible. This is supported by our evaluation of a binary version of GraphCodeBERT in Section 4.5.

DESIGN
Overview. Fig. 4 shows the pipeline of the CodeArt encoder. Given a binary executable, CodeArt first disassembles the code (step (a)). It then tokenizes the disassembly (step (b)) and performs dependence analysis to derive both data and control dependences (step (c)). In (d), the dependence transitive closures are computed for individual instructions, by traversing dependence edges in both the forward and backward directions. The closures are further transformed to connectivity graphs in which an edge is introduced between two nodes if one is reachable from the other. In the Mask Builder (e), the connectivity graph and the instruction tokens are leveraged to construct an attention mask and a relative distance matrix that measures the dependence distance between two connected instructions. The mask and the distance matrix, together with the token sequence, are then fed to the encoder (f).
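To make the pipeline concrete, the sketch below strings steps (a)-(f) together in Python; the helper names (disassemble, tokenize, dependence_analysis, transitive_closure, build_masks) are illustrative placeholders rather than CodeArt's actual API.

# Illustrative sketch of the CodeArt encoding pipeline (hypothetical helper names).
def encode_binary_function(binary_path, func_addr, model):
    asm = disassemble(binary_path, func_addr)          # (a) disassembly, e.g., via IDA Pro
    tokens, inst_of = tokenize(asm)                    # (b) <INST>-delimited token sequence
    dep_graph = dependence_analysis(asm)               # (c) data and control dependences
    con_graph, dist = transitive_closure(dep_graph)    # (d) connectivity graph and distances
    attn_mask, rel_dist = build_masks(tokens, inst_of, con_graph, dist)  # (e) mask builder
    return model(tokens, attn_mask, rel_dist)          # (f) regularized Transformer encoder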

Tokenization
CodeArt directly works on binary executables with symbols stripped. Given an executable, CodeArt first disassembles it using IDA Pro. The disassembled code is then tokenized. The tokenizer breaks each (x86-64) instruction into multiple tokens. Specifically, the opcode and operands are denoted by separate tokens. A compound operand may be further broken down into multiple tokens. For example, a memory read instruction mov rdx,[rbx+4*rax], which reads a value from the address denoted by rbx+4*rax into rdx, is tokenized to the sequence 'mov', 'rdx', ',', '[', 'rbx', '+', '4', '*', 'rax', ']'. In addition, we use a special delimiter token <INST> to denote the beginning of an instruction. This is critical to the attention regularization, which we will discuss in Section 3.4. Formally, a tokenized sequence can be described as $S = [\langle\text{INST}\rangle_1, t_{1,1}, \dots, t_{1,m_1}, \dots, \langle\text{INST}\rangle_n, t_{n,1}, \dots, t_{n,m_n}]$, where $\langle\text{INST}\rangle_i$ denotes the delimiter for the $i$-th instruction and $t_{i,j}$ the $j$-th token of the $i$-th instruction. Similar to other BERT-based models [18,28,46], we prepend a [CLS] token to the token sequence. The final input sequence is hence $X = [\text{[CLS]}, S]$.
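As a concrete illustration, the following Python sketch reproduces this instruction-level tokenization for the mov example; the regular expression and function names are ours, not CodeArt's actual tokenizer.

import re

INST = "<INST>"  # instruction delimiter token

def tokenize_instruction(inst):
    # Split one x86-64 instruction into opcode/operand tokens, keeping punctuation
    # ('[', ']', ',', '+', '*', ...) as separate tokens.
    return [INST] + re.findall(r"[A-Za-z_.$@]\w*|0x[0-9a-fA-F]+|\d+|[\[\],+*:-]", inst)

def tokenize_function(instructions):
    tokens = ["[CLS]"]  # prepended classification token
    for inst in instructions:
        tokens.extend(tokenize_instruction(inst))
    return tokens

# tokenize_function(["mov rdx,[rbx+4*rax]"]) yields:
# ['[CLS]', '<INST>', 'mov', 'rdx', ',', '[', 'rbx', '+', '4', '*', 'rax', ']']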

Dependence Analysis
Given binary executables, CodeArt first uses program analysis to determine both control dependences and data dependences between instructions, and then uses such information to regulate attention during training and inference. In particular, CodeArt employs IDA Pro [37] to construct the control flow graph (CFG) of each function. Using these CFGs, we resort to a conventional algorithm [29] to determine control dependences. We additionally tailor a source-code data-flow analysis [4] to facilitate the analysis of data dependences in binaries. This data-flow analysis begins by gathering the collection of variables accessed by each statement (or, in the context of binary analysis, an instruction) and subsequently determines the def-use relationships among these variables. Precisely identifying the variables accessed by a binary instruction is notably challenging [84], given that all variables are compiled into plain registers and memory locations without any symbol information. To this end, we adopt an approach that overestimates the memory regions an expression can potentially reference (and hence, the variables that an instruction with the expression can potentially access), as follows.
• Expressions denoting stack memory addresses with statically decidable offsets (e.g., [rsp + 0x20]) are interpreted to reference merely the corresponding stack locations.
• Expressions denoting stack memory addresses with statically undecidable offsets (e.g., [rsp + rax * 8 + 0x30] with rax holding an input parameter) are presumed to reference the entire stack frame of the corresponding function.
• Expressions pointing to addresses not statically associated with stack pointers (e.g., [rbx + rax] denoting some heap access) are conservatively considered to reference the entire memory space of the given binary. Note that [rcx] is assumed to possibly access the entire memory space, which encompasses [rsp + 0x10].
It is worth noting that our dependence analysis is largely standard, and we include it for completeness. We do not claim contributions on the analysis.
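The following Python sketch illustrates this three-tier overestimation; the operand representation and region labels are simplifications introduced here for exposition only.

from dataclasses import dataclass
from typing import Optional

@dataclass
class MemOperand:
    base: str                       # base register, e.g., 'rsp', 'rbx'
    index: Optional[str] = None     # optional index register, e.g., 'rax'
    offset: int = 0                 # constant displacement

def memory_region(op: MemOperand, frame_size: int):
    """Overestimate the memory region a memory operand may reference."""
    if op.base == "rsp" and op.index is None:
        # statically decidable stack offset, e.g., [rsp + 0x20]
        return ("stack_slot", op.offset)
    if op.base == "rsp":
        # stack address with statically undecidable offset, e.g., [rsp + rax*8 + 0x30]
        return ("whole_stack_frame", frame_size)
    # not statically tied to the stack pointer, e.g., [rbx + rax] or [rcx]
    return ("whole_memory", None)

# Two instructions are considered possibly dependent if their regions may overlap;
# ("whole_memory", None) overlaps everything, including ("stack_slot", 0x10).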

Model Architecture
Fig. 4(f) shows our model architecture, which is a multi-layer Transformer encoder [72], with the Regularized Multi-head Attention (RMA) component containing the main differences from a standard BERT architecture [18,46]. As shown in the figure, the encoder takes a sequence $X$ of $n$ tokens and acquires a sequence of input vectors $H^{(0)} \in \mathbb{R}^{n \times d_h}$ by summing the token embeddings and the corresponding trainable absolute position embeddings (the sum operation is not shown in the figure for brevity). Here, $d_h$ is the dimension of the hidden states. The output embedding is then obtained by applying $N$ layers of transformer blocks, each regulated by the CodeArt attention mask $M \in \mathbb{R}^{n \times n}$ and the relative distance matrix $R \in \mathbb{N}^{n \times n}$, as illustrated by the block in Fig. 4(f). Formally, the hidden states after layer $l$ are denoted as $H^{(l)} = \mathrm{transformer\_block}(H^{(l-1)}, M, R)$, $l \in \{1..N\}$. A transformer block (Fig. 4(f)) is formally defined as follows.
$H$ is the number of heads, $d_k$ the head dimension, and $W_O$ the weight of the linear projection applied after concatenating the multiple head outputs. $M$ denotes a main difference between our architecture and vanilla Transformer models. It is the multi-head attention mask, where $M_{ij}$ is set to either $0$ or $-\infty$ to determine whether token $i$ is allowed to attend to token $j$ or not, respectively. Intuitively, by setting parts of the mask to 0, we only allow the model to learn self-attention for the corresponding token pairs and preclude spurious attention. We will further explain how to construct the mask later in this section.
As shown in Equation 3 and Fig. 5(b), another main difference between RMA and standard attention is that RMA incorporates a relative positional bias $B$ to explicitly encode the positional relationship between <INST> delimiter tokens (not between other tokens internal to an instruction). Intuitively, it denotes the dependence distances between instructions, as their absolute positions/distances could be misleading when symbols are missing. The relative positional bias is integrated as a bias term in Equation 3. It is computed as follows.
$B^{(h)}_{i,j} = b_h(\min(R_{i,j}, d_{\max}))$   (5)
Here $R$ is the relative distance matrix pre-computed by our analysis (details in the next section), and $b_h$ is the embedding function that maps a distance value to a trainable parameter for head $h$. The relative position relationship is learned through these parameters during training. Variable $d_{\max}$ is the maximum relative distance. Since larger relative distances are less frequent in practice, the model can hardly learn the relative bias for large relative distances. The maximum distance ensures that when the distance is too large, the bias becomes indistinguishable from that at distance $d_{\max}$. As such, the model understands that the distance is large and potentially out of distribution.
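Since Equations 1-3 are not reproduced in this text, the following is our reconstruction (in LaTeX, with notation of our choosing) of one regularized attention head, under the assumption that RMA follows the standard scaled dot-product formulation with the mask $M$ and the bias $B$ added before the softmax:

% Our reconstruction of one head h of Regularized Multi-head Attention (RMA).
\[
  \mathrm{RMA}^{(h)}(H^{(l-1)}) = \mathrm{softmax}\!\left(
      \frac{Q_h K_h^{\top}}{\sqrt{d_k}} + M + B^{(h)} \right) V_h,
  \quad Q_h = H^{(l-1)} W_Q^{(h)},\; K_h = H^{(l-1)} W_K^{(h)},\; V_h = H^{(l-1)} W_V^{(h)},
\]
\[
  M_{ij} \in \{0, -\infty\}, \qquad
  B^{(h)}_{ij} = b_h\!\left(\min(R_{ij}, d_{\max})\right), \qquad h \in \{1..H\}.
\]
% The H head outputs are concatenated and projected by W_O, as in a standard Transformer block.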

Attention Regularization by Masking
In this section, we explain how we construct the attention mask $M$. As discussed earlier, without symbols, model training tends to determine contexts by the names of primitive operands (e.g., register names) and absolute positions, which could be misleading. We leverage masking to direct the self-attention to the right places such that correct contexts can be extracted. Fig. 6 provides a conceptual illustration. In (a), a small program consisting of six instructions is shown in the first row (in their source code form for readability), and each is broken down into multiple tokens. For example, in the second row, the green blocks with numbers show the <INST> tokens, each of which is followed by the (grey) tokens of the corresponding instruction. The entire sequence is preceded by a [CLS] token. Assume we are interested in the embedding of the second token of instruction 5 (the 'edx' token in red). First of all, we want the model to pay attention to the co-occurrences of all the tokens, without focusing on any specific ones. We call it the global context (Fig. 6(b)). Second, the model should pay attention to all the tokens in the same instruction as the red token when encoding it. We call it the local context of the red token (Fig. 6(c)). Additionally, the model should pay attention to the sub-computation that the red token is in, namely, instructions 1, 3, 4, and 6, due to program dependences. We call it the dependence context (Fig. 6(d)). Finally, the neighboring tokens of the red token should be measured by their dependence distances (i.e., how many dependence edges away) instead of their absolute positions. We call it the dependence-based relative positioning (Fig. 6(e)). Intuitively, it rearranges the instructions by their dependence distances. For example, instruction 5 is closer to instructions 1, 4, and 6 than to instruction 3.
In standard Transformer models, the self-attention (without masking) allows a query $q_i \in \mathbb{R}^{1 \times d_k}$ (a row of $Q$ in Equation 2, corresponding to a token) to attend to individual keys $k_j \in \mathbb{R}^{1 \times d_k}$, $j \in \{1..n\}$ (rows of $K$ in Equation 2), by computing a scaled dot-product $q_i k_j^{\top} / \sqrt{d_k}$ as the attention score, which is further normalized by a softmax function to obtain attention weights $a_{ij}$ for aggregating the values $v_j \in \mathbb{R}^{1 \times d_k}$, $j \in \{1..n\}$, as $o_i = \sum_{j=1}^{n} a_{ij} v_j$ (similar to Equation 3). The aforementioned multi-step context extraction is achieved by attention regularization using masking. As shown in Equation 3, an attention mask $M$ is added to the dot-product before the softmax; each entry contains a $0$ value (to enable attention) or a $-\infty$ value (to disable attention). Note that adding $-\infty$ to the dot-product ensures that the attention weight $a_{ij}$ after the softmax becomes 0, so that no information can be aggregated from value $v_j$ into the hidden state. When the mask value is 0, the attention weight is the same as the default one, allowing self-attention. $M$ consists of three kinds of masks, corresponding to the aforementioned global, local, and dependence contexts.
Global Attention Mask $M^{\text{Gl}}$. As shown in Fig. 6(i), in between the first and the second rows on the left, the yellow [CLS] token is used to facilitate learning the global context. Observe that it attends to all the tokens, and vice versa. This allows the model to learn co-occurrences. For example, the tokens in two separate instructions could indirectly attend to each other through the [CLS] token. The yellow cells in Fig. 6(f) show the 0 values in $M^{\text{Gl}}$ for our small program in the form of a heat-map, with the legends denoting the tokens and the black cells denoting $-\infty$.
Local Attention Mask $M^{\text{Lo}}$. $M^{\text{Lo}}$ ensures that a token inside an instruction has attention to all tokens in the same instruction. Each instruction is broken down into an <INST> delimiter followed by a sequence of instruction tokens denoting the opcode and operands (e.g., the 5th instruction in Fig. 6(b)). As shown in Fig. 6(ii), each token in the 5th instruction attends to all the tokens within the instruction. Moreover, any cross-instruction attention is forbidden in this mask. The green cells in Fig. 6(f) show the 0 values in $M^{\text{Lo}}$ for our small program.
Dependence Attention Mask $M^{\text{Dep}}$. A straightforward method to construct the dependence mask is to directly reflect the directed program dependence graph $G_{\text{dep}} = (V, E_{\text{dep}})$, obtained by our dependence analysis, through the <INST> tokens, namely, if an instruction $j$ is dependent on another instruction $i$ (i.e., there is a dependence edge $j \rightarrow i$), we allow $j$'s <INST> token to pay attention to $i$'s. However, such a simple design suffers from two problems: (1) one layer can only pass information from 1-hop neighbors, such that signals from multi-hop neighbors become undesirably weak; (2) the dependence attention is uni-directional, whereas bidirectional attention has proved to be better in Transformer pre-training [18]. Thus, we use the Floyd-Warshall algorithm [17] to transform the dependence graph into a connectivity graph $G_{\text{con}} = (V, E_{\text{con}}, D)$, where an undirected edge between two nodes denotes that one is reachable from the other in the original graph, and $D: V \times V \mapsto \mathbb{N}$ is the distance function that maps a connectivity edge to its path length in the original graph. Then, we construct the mask to enable bidirectional attention if a connectivity edge exists. Fig. 6(iii) shows the dependence mask. Observe that since instruction 5 is dependent on instructions 1 and 4, transitively dependent on 3, and instruction 6 is dependent on 5, symmetric attention is allowed between instruction 5 and instructions 1, 3, 4, and 6. The blue cells in Fig. 6(iii) show the 0 values in $M^{\text{Dep}}$ for our small example. Note that although we only allow direct attention between <INST> tokens, tokens internal to an instruction attend to their <INST> token and vice versa, such that individual internal tokens can attend to other internal tokens in the dependence context through an additional layer of information propagation.
The final mask $M$ is the union of the above three masks, as in Fig. 6(f). The distance function $D$ of the connectivity graph is further used to construct the relative distance matrix $R$ in Equation 5. Fig. 6 shows the relative positional biases for the 5th instruction (i.e., the annotations on attention edges). Fig. 6(g) shows the constructed $R$ with different colors denoting different distances.
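A minimal sketch of how the three masks can be combined, assuming the connectivity edges have already been computed (e.g., with Floyd-Warshall); the variable names are illustrative, not CodeArt's implementation.

import numpy as np

NEG_INF = float("-inf")

def build_attention_mask(n_tokens, cls_pos, inst_pos, inst_of, con_edges):
    """inst_pos[i]: position of the <INST> token of instruction i;
    inst_of[t]: instruction id owning token t (None for [CLS]);
    con_edges: undirected (i, j) instruction pairs in the connectivity graph."""
    M = np.full((n_tokens, n_tokens), NEG_INF)
    # Global mask: [CLS] attends to every token and vice versa.
    M[cls_pos, :] = 0.0
    M[:, cls_pos] = 0.0
    # Local mask: tokens within the same instruction attend to each other.
    for t1 in range(n_tokens):
        for t2 in range(n_tokens):
            if inst_of[t1] is not None and inst_of[t1] == inst_of[t2]:
                M[t1, t2] = 0.0
    # Dependence mask: bidirectional attention between <INST> tokens of connected instructions.
    for i, j in con_edges:
        M[inst_pos[i], inst_pos[j]] = 0.0
        M[inst_pos[j], inst_pos[i]] = 0.0
    # The relative distance matrix R can be filled analogously with D(i, j) for connected pairs.
    return M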

Model Pre-training
The pre-training of CodeArt consists of two parts: the traditional Masked Language Modeling (MLM) that masks parts of the input tokens, and Masked Dependence Modeling (MDM) that masks existing dependence edges and introduces spurious dependence edges. The essence of MDM is that well-trained contextual embeddings shall be able to predict existing dependence edges (called positive edges) and preclude spurious edges (called negative edges).
Masked Language Modeling. We perturb the input sequence $X$ following the MLM in RoBERTa [46]. In particular, we sample 15% of the tokens in $X$. We then replace them with a [MASK] token 80% of the time, with a random token 10% of the time, and leave them unchanged the remaining 10% of the time. During pre-training, the model is supposed to predict the masked tokens. Let $\tilde{X}$ be the perturbed sequence. The MLM training loss is formally defined as follows.
where $\tilde{M}$ and $\tilde{R}$ are the perturbed attention mask and relative position matrix, respectively, which we will discuss next.
Masked Dependence Modeling. In each pre-training step, we randomly sample 40% of the nodes in the connectivity graph $G_{\text{con}} = (V, E_{\text{con}})$. The sampled set is denoted as $V_s \subset V$. Let $E_{\text{con}}(V_s)$ be the set of edges in the connectivity graph that have at least one node in $V_s$. It denotes the sampled positive edges. We then sample an equal number of pair-wise relations in $V_s \times V \cup V \times V_s - E_{\text{con}}$, which denote the negative edges. These two edge sets form a balanced set $E_s$ of positive and negative samples. We then force the model to learn to correctly classify the positive and negative edges during pre-training. Let $\tilde{M}$ and $\tilde{R}$ be the perturbed attention mask and relative distance matrix after the sampled positive edges are removed from $E_{\text{con}}$ and the negative edges added, and $h_u$, $h_v$ the hidden states of nodes $u$ and $v$. The training loss is a binary cross-entropy over $E_s$ computed from these hidden states. Note that even though MDM is similar to GraphCodeBERT's edge prediction and node alignment losses [34], in that they are all in the form of binary cross-entropy, there is a core difference: MDM predicts edges between <INST> tokens instead of variable tokens from the original code as in GraphCodeBERT. This avoids the risk that the edge prediction loss may affect the representation of variable tokens learned by the MLM loss in an undesirable way. The final loss is the sum of $\mathcal{L}_{\text{MLM}}$ and $\mathcal{L}_{\text{EdgePred}}$.
Example. Fig. 7 presents how we use the motivating example in pre-training. The left shows the source code with the sub-computation related to mean highlighted. The x86 instructions corresponding to two of these statements are also shown. On the right, we show the model and the two types of masked modeling. The original input sequence is shown at the bottom, with the edges denoting connectivity (derived from dependences). The row above shows the perturbed sequence and the perturbed edges. In particular, a positive edge (in blue) is selected and a negative edge (in red) is introduced. On top of the encoder, the output embeddings can be used to correctly classify the positive/negative edges.
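A minimal sketch of the MDM edge sampling described above, assuming the connectivity graph is available; the function and variable names, and the sigmoid edge classifier mentioned in the comment, are our illustrative choices rather than CodeArt's exact formulation.

import random

def sample_mdm_edges(nodes, con_edges, ratio=0.4):
    """nodes: list of instruction ids; con_edges: set of undirected connectivity edges (u, v)."""
    sampled = set(random.sample(nodes, max(1, int(ratio * len(nodes)))))
    # Positive edges: connectivity edges touching at least one sampled node.
    positives = [e for e in con_edges if e[0] in sampled or e[1] in sampled]
    # Negative edges: an equal number of node pairs that are NOT connected.
    negatives = []
    while len(negatives) < len(positives):   # assumes the graph is not (near) complete
        u, v = random.choice(nodes), random.choice(list(sampled))
        if u != v and (u, v) not in con_edges and (v, u) not in con_edges:
            negatives.append((u, v))
    # Positive edges are removed from the perturbed mask/distance matrix, negatives are added,
    # and the model is trained with binary cross-entropy to classify each sampled pair
    # (e.g., via sigmoid(h_u . h_v)) as real or spurious.
    return positives, negatives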

EXPERIMENT
The implementation and pre-training details of CodeArt can be found in Section A of the supplementary material. We evaluate CodeArt on three downstream tasks to demonstrate the effectiveness of the pre-training: binary similarity, malware family classification, and type inference for binary executables. We aim to answer the following research questions.
• RQ1: How does CodeArt perform on binary similarity analysis?
• RQ2: How does CodeArt perform on malware family classification?
• RQ3: How does CodeArt perform on binary type inference?
• RQ4: Model analysis and ablation study of CodeArt.

RQ1: Performance on Binary Similarity Analysis
Setup. Given an input binary function, a binary similarity analysis queries it against a pool of candidate functions and tries to identify the function that is compiled from the same source code as the query function [74]. It plays a critical role in many security-related tasks, such as one-day vulnerability detection [23,75], automatic software patching [66], and software plagiarism detection [49]. Machine-learning-based binary similarity tools typically encode the query function and all the candidate functions into their embeddings. After that, the cosine similarity between the embeddings of the query function and each candidate function is computed. The candidate functions are then ranked by the similarity values. The function with the largest value is considered similar to the query function.
Baseline. We compare the performance of CodeArt with the SOTA GNN-based model [53], a Graph Matching Network (GMN)-based model [43], and two SOTA Transformer-based models, JTrans [74] and DiEmph [81]. GMN is a GNN-based model that takes as input a pair of programs and outputs a similarity value. It is worth noting that GMN does not generate embeddings for individual functions and can only be used for pair-wise similarity analysis. Note that although there are recent proposals that achieve very good performance in binary similarity using advanced dynamic analysis, such as [75,82], these techniques require executing the functions (using seed inputs). They are hence not directly comparable.
Training Details. For both CodeArt and JTrans, we fine-tune their pre-trained models on the BinaryCorp-3M dataset [74]. Specifically, each training data sample is a triplet consisting of two binary functions compiled from the same source code function and another binary function compiled from a different source code function. The training loss is the triplet loss, enforcing a large cosine similarity between similar function pairs and a small similarity between dissimilar ones. For DiEmph, the GNN-based model, and the GMN-based model, we follow the scripts provided by the authors [53,81] to train the models. For all models, we use the Coreutils project as the validation dataset to select the best checkpoints, and report the performance on the remaining 6 projects.
Metrics. Following previous work [74], we use recall@1 as the metric to evaluate a binary similarity model. Specifically, suppose that we make $q$ queries; recall@1 is computed as the number of queries for which the function compiled from the same source code is correctly returned as the most similar function, divided by $q$. We also adopt the setup of previous work to evaluate a binary similarity model with different sizes of candidate function pools. Intuitively, a larger pool size means a more challenging setup for a model.
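For concreteness, a small NumPy sketch of this embed-rank-and-score evaluation (illustrative only, not the authors' evaluation harness):

import numpy as np

def recall_at_1(query_embs, pool_embs, ground_truth):
    """query_embs: (q, d) query function embeddings; pool_embs: (p, d) candidate pool;
    ground_truth[i]: pool index compiled from the same source code as query i."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = q @ p.T                     # pair-wise cosine similarities
    top1 = sims.argmax(axis=1)         # most similar candidate for each query
    return float((top1 == np.asarray(ground_truth)).mean())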
Results. The results are shown in Fig. 8. We can see that CodeArt outperforms the GNN-based model and the GMN-based model by a large margin in all setups. The improvement is largely due to Transformer models' better capability of capturing long-range dependences in a data-rich scenario, compared to GNNs. For most projects, CodeArt significantly outperforms the Transformer-based models in challenging setups (i.e., a pool size larger than 100). The improvement demonstrates that CodeArt is able to encode program semantics more precisely. For Putty, CodeArt achieves comparable performance to the previous SOTA model DiEmph. We investigated these cases and found that functions in Putty are shorter than those in other projects. For example, more than 75% of the functions in Putty have fewer than 50 instructions, while more than half of the functions in Binutils are longer than that. For those relatively simple functions, the problem of spurious correlations (caused by the lack of symbols) in the baseline models is not as severe, and thus the improvement introduced by CodeArt is not significant. Also, in simpler setups (i.e., a pool size smaller than 100), both DiEmph and CodeArt achieve good performance.
To conclude, the results on the binary similarity task indicate CodeArt can generate embeddings that encode program semantics more precisely, benefiting more realistic use scenarios [74].
Zero-Shot Performance. The performance on the binary similarity task relies heavily on the quality of the function embeddings generated by a pre-trained model. We thus further evaluate CodeArt on the binary similarity task in a zero-shot setup to measure the effectiveness of pre-training. Specifically, we directly use the embeddings generated by the pre-trained CodeArt model without fine-tuning it on the binary similarity task. We compare its performance to the pre-trained JTrans model, which is pre-trained on the same dataset as CodeArt. The results are shown in Table 1. We can see that CodeArt demonstrates strong zero-shot performance, achieving over 30% higher performance in all setups. It indicates that function-level semantics are learned by CodeArt during the pre-training process. We also conduct t-tests between CodeArt's and JTrans' performance, and the results show that the improvement is statistically significant, with p-values of 1e-6, 6e-8, 1e-6, 2e-9, and 3e-9 for pool sizes 32, 50, 100, 200, and 500, respectively.

RQ3: Performance on Binary Type Inference

Binary type inference recovers type information that helps reverse engineers understand binary code. Moreover, the recovered type information is often an input for other analyses such as vulnerability detection [45], decompilation, and legacy code hardening [27].
Machine-learning-based binary type inference tools typically formulate the type inference problem as a sequence labeling task [15,56]. Following the setup of StateFormer [56], a SOTA binary type inference model, we define 35 common types as labels and include a special label no-access for tokens without ground-truth types.

RQ4: Model Analysis and Ablation Study of CodeArt

To study how each component of CodeArt contributes to the final performance, we alter the pre-training process with different setups and observe how the performance changes. As detailed in Section 4.1, the zero-shot performance on the binary similarity task reflects the quality of the embeddings generated by a model and thus demonstrates the effectiveness of pre-training. Therefore, we leverage the zero-shot binary similarity performance on Coreutils (with a pool size of 100) as the metric for pre-training. Due to the limitation of computation resources, we conduct the ablation study on the BinaryCorp-3M dataset. Note that all other evaluations are conducted on the CodeArt model pre-trained on the larger BinaryCorp-26M dataset.
The results are in Table 3. Each row denotes a variant of CodeArt, and the first row lists the performance achieved with the default setup (denoted by CodeArt-3M). We first remove the local attention masks from CodeArt (denoted by w/o local mask). Without the local masks, CodeArt degenerates to a pure masked language model, since all tokens can freely attend to each other and there is no mechanism to ensure that <INST> nodes are aligned with the corresponding instructions, rendering the dependence modeling ineffective. Hence, the performance drops dramatically.
To demonstrate the necessity of modeling transitive dependences in CodeArt, we implement a variant that does not include the transitive dependence closure when constructing the dependence attention masks (denoted by w/o trans-closure). That is, the variant mimics the behavior of neighborhood-local aggregating GNNs, meaning that an instruction can only attend to its direct neighbors in the dependence graph. It weakens the perception of transitive dependences. Observe that the performance degrades by 4.8%. It demonstrates that computing transitive closures and using connectivity graphs indeed helps.
We further limit the maximum dependence distances when computing the transitive closures. Specifically, max-trans-closure 4 and max-trans-closure 6 denote variants that only include nodes reachable within 4 and 6 dependence edges, respectively. We can see that the default CodeArt (which does not limit the maximum distance) surpasses the two variants.
Moreover, we remove the relative positional bias from CodeArt (denoted by w/o rel-pos-bias). That is, for a given instruction, all the instructions in its dependence context are considered to have the same distance to it. We can see that the recall@1 degrades by 17%. It validates that the relative position bias indeed helps the model distinguish instructions with different distances, enabling the model to learn more precise semantics.
In addition to zero-shot binary similarity, we present the ablation study results for the other tasks that require fine-tuning in Table 4, where "MFC" denotes "Malware Family Classification" and "TI" denotes "Type Inference". We can see that CodeArt's default setting demonstrates the best overall performance, indicating its strong generalizability across all tasks.

RQ5: Comparison with GraphCodeBERT-like Pre-training
As discussed in Section 2, GraphCodeBERT (GCB) leverages data-flow information to enhance the pre-training process. We compare the pre-training of CodeArt to GCB's in terms of optimization stability and zero-shot performance on the binary similarity task.
Training Details. Since the GCB pre-training code is not publicly available and the original GCB only works on source code, we implement a pre-training pipeline following GCB's design and use BinaryCorp-3M [74] to pre-train it on binary code. We detail how we implement the GCB loss on binary code in Section C of the supplementary material. Specifically, we implement two variants of the training pipeline: (1) GCB-like-default, which faithfully follows the loss design in GCB; (2) GCB-like-weaker-reg, which down-weights the additional losses, i.e., weaker regularization on the MLM loss.
Optimization Stability. CodeArt has much better stability. The training curves of both CodeArt and the GraphCodeBERT-like pre-training are shown in Fig. 11 in Section C of the supplementary material.
Performance Difference. As shown in Table 5, GCB-like-weaker-reg performs better than GCB-like-default due to its more stable optimization. However, neither variant is comparable to CodeArt, which demonstrates the effectiveness of our approach when symbols are lacking.

RELATED WORK
Language Models for Code. A large body of work focuses on pre-training Transformer-based language models on source code [2,19,28,39,76]. They typically perceive source code as a sequence of tokens and utilize the rich natural language artifacts in software, such as symbol names and code comments, to help models understand code semantics. Due to the intrinsic structure of code, researchers have proposed to enhance pre-training with structural information, such as Abstract Syntax Trees (ASTs) [33]. A few works further explore ways to leverage program semantics to improve the quality of language models. For example, OSCAR [58] enhances an IR-level code model with the operational semantics of programs. It augments model inputs with abstract program states obtained from static analysis. These efforts are orthogonal to ours. GraphCodeBERT [34] enhances MLM with data-flow graph structure, and we provide a detailed comparison in Section 4.5. Unlike the structure-enhanced approaches, our work is the first to understand code structure from a language perspective. Our encoding pipeline and pre-training methods explicitly model the characteristics of this code language. Additionally, there are some Transformer-based models on binary code designed for specific downstream tasks [38,56,57,74]. In contrast, CodeArt is designed to pre-train a general model that supports various downstream tasks.
Neural Networks for Explicit Code Structure. Apart from Transformer-based language models, numerous code models explicitly encode code structures for specific software engineering tasks, such as program differencing [30], aligning code across platforms [40], disassembling binary code [83], software maintenance [22], and code completion [70]. While these techniques leverage graph structures to derive embeddings, they usually require supervised training. Our empirical results demonstrate that integrating self-supervised learning and self-attention in Transformers with dependence graphs is not only feasible but can also lead to superior performance. Moreover, adapting these models for pre-training tasks requires non-trivial engineering effort [5,13,41]. In contrast, CodeArt proposes a unique attention regularization method that is fully compatible with existing efficient and scalable implementations for NLP Transformers. Thus, the pre-training of CodeArt can easily scale to larger datasets.
Binary Program Analysis. CodeArt works at the level of assembly code, leveraging disassemblers [83] to decode binary files into textual assembly code. There exists a body of research focusing on binary program dependence analysis [84], which could enhance the analysis components of CodeArt. Our work is built upon these fundamental binary program analyses.

CONCLUSION
We introduce a novel method for pre-training Transformer-based code models in scenarios where symbols are lacking. This method features an innovative attention regularization technique that leverages program analysis to derive potential dependences between instructions, subsequently forming attention masks. Our pre-trained model is general and can serve a wide range of applications. The empirical results show that our technique substantially outperforms the state of the art, as well as GraphCodeBERT-like pre-trained models, in three downstream tasks: binary similarity, malware family classification, and type inference for binary executables.

Fig. 1. Code examples calculating statistics: (a) and (b) are equivalent but have different statement orders; (c) is buggy at line 11 with a wrong operation.

Fig. 2. Attention maps of variable mean at line 11 of Fig. 1(a) (in red), by a source code model (a) and by a binary code model (b). In (a), we show attention between source code tokens, corresponding to lines 9-11 in Fig. 1(a). In (b), we show attention between binary code tokens. For readability, we also include the corresponding source code and part of the assembly. The variable mean is stored in the register rax at line B0 (in red). A line with a darker color denotes a larger attention value. In an instruction, the first operand is the destination and the second the source, and the comment "rax<-i" means that register rax stores variable i.

Fig. 3. Attention maps by CodeArt for variable mean in the code snippets in Fig. 1(a) and (b). The instructions in bold are in the dependence context of the variable, and the blue arrows denote dependences.

Fig. 7. Pre-training in CodeArt.


Table 4. Ablation Study on Other Downstream Tasks that Require Fine-tuning.