Evaluating the Effectiveness of Deep Learning Models for Foundational Program Analysis Tasks

While deep neural networks provide state-of-the-art solutions to a wide range of programming language tasks, their effectiveness in dealing with foundational program analysis tasks remains underexplored. In this paper, we present an empirical study that evaluates four prominent models of code (i.e., CuBERT, CodeBERT, GGNN, and Graph Sandwiches) in two such foundational tasks: (1) alias prediction, in which models predict whether two pointers must alias, may alias, or must not alias; and (2) equivalence prediction, in which models predict whether or not two programs are semantically equivalent. At the core of this study is CodeSem, a dataset built upon the source code of real-world flagship software (e.g., Linux Kernel, GCC, MySQL) and manually validated for the two prediction tasks. Results show that all models are accurate in both prediction tasks, especially CuBERT, with accuracies of 89% and 84% in alias prediction and equivalence prediction, respectively. We also conduct a comprehensive, in-depth analysis of the results of all models in both tasks, concluding that deep learning models are generally capable of performing foundational tasks in program analysis even though their weaknesses are also evident in specific cases. Our code and evaluation data are publicly available at https://github.com/CodeSemDataset/CodeSem.


INTRODUCTION
Riding on the stunning advancements of deep learning in recent years, deep neural networks have demonstrated remarkable capabilities in inferring semantic program properties (e.g., bug detection [Allamanis et al. 2018; Wang et al. 2020], program repair [Chen et al. 2019; Dinella et al. 2019], code documentation generation [Feng et al. 2020; Wang and Su 2020]). In light of the growing integration of neural technologies into programming language (PL) research, a key question remains under-addressed: how far can we push deep learning models in the field of program analysis; are they even capable of solving tasks that are traditionally in the realm of foundational static analysis? By "foundational", we mean static analysis that serves as the foundation for various client analyses. The response to this question has a profound impact on the programming language community. If the answer is yes, neural networks can become a viable alternative for developing static analysis applications, perhaps with their own advantages (e.g., easier to create, higher efficiency). Conversely, if the answer is no, it can shed light on the deficiencies of existing code models, motivating future research to build finer models for program analysis. In this paper, we aim to answer this question. Specifically, we undertake a systematic and rigorous evaluation of the efficacy of deep neural networks in performing foundational program analysis tasks.
We face two primary challenges: (1) how to define prediction tasks that correspond to the foundational tasks in program analysis; and (2) how to curate a dataset at scale that serves the tasks being defined. For the first challenge, we draw on two foundational problems in PL research, alias analysis and equivalence checking, to derive alias prediction and equivalence prediction, respectively. Since the two analyses are among the most well-established and important static analyses, with an extensive literature and a broad range of applications, we design our evaluation tasks in the mold of alias analysis and equivalence checking. Alias prediction requires models to classify whether two pointers must alias, may alias, or must not alias. Equivalence prediction is about predicting whether or not two programs are semantically equivalent. It is worth mentioning that equivalence prediction differs in fundamental ways from clone detection [Horwitz 1990], a well-known problem in software engineering. Specifically, much of the clone detection work focuses on syntactic similarity of source code [Golubev et al. 2021; Jiang et al. 2007; Kamiya et al. 2002; Sajnani et al. 2016; Wang et al. 2018; Yuan and Guo 2012], whereas equivalence checking requires models to predict the semantic equivalence of programs and is thus clearly a better fit for the central theme of this work.
To overcome the second challenge, our key idea is to leverage the results of the corresponding program analysis methods so that models can be trained on a potentially unlimited amount of labeled data without any human effort in data labeling. However, the quality of such labeled data poses a risk, given that the results of any program analysis method are almost certain to contain noise: static analysis must involve approximation due to Rice's theorem, hence the issue of false positives, and dynamic analysis cannot reason about all possible program behaviors, hence the issue of false negatives. In response, we propose a key adjustment to a prominent, two-stage learning approach encompassing pre-training and fine-tuning. This adjustment allows models to exploit the labeled data despite its inherent noise.
Our training approach consists of three phases: generalized pre-training, specialized pre-training, and fine-tuning. While generalized pre-training and fine-tuning directly correspond to the two stages of the conventional pre-training and fine-tuning approach, specialized pre-training, a new, in-between stage, trains models to approximate downstream prediction tasks using the results of the corresponding program analysis methods. Even though the data used at this stage carries noise, models can take advantage of it to learn a coarse decision boundary. This boundary is subsequently refined during the fine-tuning stage, utilizing clean, ground-truth data. Take alias prediction as an example. In specialized pre-training, we deploy models to learn from the results of two alias analyses [Kastrinis et al. 2018; Zheng and Rugina 2008], which produce three classes of alias pairs: ground-truth must-alias and must-not-alias, and noisy may-alias (details are provided in Section 5.2.1). By learning from real must-alias, real must-not-alias, and noisy may-alias pairs, models first solve an easier sub-task of separating must-alias and must-not-alias in the specialized pre-training stage. Then, models conquer the entire task by refining the decision boundary between may-alias and must-alias/must-not-alias using ground-truth data in the fine-tuning stage. We note that our training workflow, especially the last two phases, resembles an influential learning strategy called curriculum training [Bengio et al. 2009], under which models are trained on data ordered by ascending difficulty level. To a certain degree, this correlation to curriculum training validates our training approach.
Our Goals. In this work, we set out to explore what is achievable within the machinery of deep neural networks in program analysis. Given that deep learning models are fundamentally based on pattern recognition, an approach that is not considered particularly suitable for simulating classical static analysis algorithms, we believe such an exploration is a worthwhile pursuit. Technically, our primary goal is to evaluate the performance of four influential code models (i.e., CuBERT, CodeBERT, GGNN, and Graph Sandwiches) in two foundational tasks of program analysis: alias prediction and equivalence prediction. It is worth noting that our aim is not to propose novel solutions to alias or equivalence prediction, or even to replace existing algorithms for alias analysis or equivalence checking with deep neural networks. Both of these can be steps too far given the current landscape of deep models of code, and there is no sufficient evidence that deep neural networks are even applicable to the kind of foundational static analysis considered in this paper. As a secondary goal, we aim to benchmark the performance of graph models (i.e., GGNN and Graph Sandwiches) w.r.t. different graph representations of code, including Abstract Syntax Tree-based, Control Flow Graph-based, and Program Dependence Graph-based representations.
Advantages over Existing Benchmarks. Compared to existing benchmarks for code models [Husain et al. 2019; Lu et al. 2021; Puri et al. 2021], our work offers two crucial advantages. First, our dataset consists of code extracted from programs of real-world flagship software rather than from coding platforms on which code is written to solve specific algorithmic problems. The specialized, non-standard programs found on such platforms (e.g., implementing division with subtraction only; testing whether a given number is a palindrome) are not particularly relevant to the real programming settings to which code models are designed to apply. Therefore, results of code models on those benchmarks are unlikely to generalize to useful end goals in the real world. Second, those works evaluate deep learning models in tasks that are well-explored by prior works (e.g., variable misuse prediction [Allamanis et al. 2018], method name prediction [Alon et al. 2019]). Therefore, their findings provide limited new insights into the strengths and weaknesses of deep learning models. In contrast, we define alias prediction and equivalence prediction, derived from two foundational static analyses, as the prediction tasks in our evaluation.
Summary of Main Findings. Our results show that all models are accurate in both prediction tasks. In particular, CuBERT, the most accurate model, achieves 89% accuracy in alias prediction and 84% accuracy in equivalence prediction. We observe that program representation is a key factor in the performance of graph models. In both prediction tasks, the accuracy of the exact same model can vary significantly depending on the type of graph in which programs are represented. However, we find that the program dependence graph-based representation enables GGNN and Graph Sandwiches to achieve their highest accuracy in both prediction tasks. This underscores the power and versatility of this program representation for graph models. Regarding the comparison between our training approach and the existing pre-training and fine-tuning approach, models trained with our approach are almost always more accurate than those undergoing only pre-training and fine-tuning, demonstrating the effectiveness and generality of our approach to training models towards solving the foundational program analysis tasks.
Next, we perform an in-depth analysis of the models' results in order to gain a deeper understanding of what models have learned in the two prediction tasks. In alias prediction, we examine the nature of the alias pairs that models detected from the locality aspect, that is, whether the detected alias pairs are generally very local or span a number of statements. Our analysis reveals that all models achieve sustained accuracies even when the instructions that establish aliasing relations become increasingly distant. In equivalence prediction, we first investigate whether models have resorted to a simple, template-matching approach to determining the semantic equivalence of two programs. That is, did models simply memorize equivalent program pairs in the training set as templates, which they then use to match test pairs by accommodating shallow, syntactic variations? We find that only 11% of equivalent program pairs in the test set can be matched to a pair of training programs with minor syntactic modifications, indicating that the learning capacity of code models goes far beyond such a template-matching approach. To further demonstrate the challenges that code models have overcome to achieve high accuracies, we consider three well-established equivalence checking tools (i.e., trace alignment [Churchill et al. 2019], ARDIFF [Badihi et al. 2020], and Rêve [Felsing et al. 2014]) as reference points and assess their performance on CodeSem. Quite surprisingly, all tools perform exceedingly poorly, with the top performer achieving under 25% accuracy. More importantly, we find that all code models have coped comfortably with the very issues that severely hinder those equivalence checking tools (e.g., limitations in aligning execution traces, difficulties in solving complicated path constraints).
Overall, we conclude that, in general, the evaluated code models are capable of handling even the foundational tasks in program analysis. On the other hand, we also identify specific areas in which models still have significant headroom to improve. For example, models can be imprecise in dealing with pointers with complicated def-use patterns in alias prediction; in addition, they display limited capability to perform sophisticated reasoning about program behavior in equivalence prediction. With regard to the weaknesses of code models, we point out potential directions of future research for continuously improving deep learning models in programming language tasks.
In summary, we make the following contributions in this paper:
(1) We define two learning tasks, alias prediction and equivalence prediction, for evaluating code models in performing foundational program analysis tasks.
(2) We assemble CodeSem, a dataset for alias and equivalence prediction, using code exclusively extracted from the programs of real-world flagship software.
(3) We propose a general, novel, three-stage training approach that leverages the results of program analysis tools to train models towards alias prediction and equivalence prediction.
(4) We present the results of an extensive quantitative and qualitative evaluation of four prominent models of code (i.e., CuBERT, CodeBERT, GGNN, and Graph Sandwiches) in alias and equivalence prediction.

PREDICTION TASKS
This section defines alias prediction and equivalence prediction, two downstream tasks that we use to evaluate deep learning models.

Alias Prediction
Alias prediction is derived from alias analysis, which checks whether two pointer variables refer to the same memory location at a particular program point. Effective alias analysis plays an essential role in nearly all program analyses for object-oriented programs [Sridharan et al. 2013]. For example, computing a precise inter-procedural control-flow graph, a prerequisite for many program analyses, often requires significant reasoning on pointer aliasing to resolve virtual dispatch. Furthermore, any program analysis attempting to discover non-trivial properties of an object must reason about mutations to that object through pointer aliases. As a foundation for many client analyses, alias analysis facilitates typestate analysis [Phulia et al. 2020], taint analysis [Tripp et al. 2013], and value flow analysis [Shi et al. 2018].
Unlike the classical may-alias analysis (for determining may-alias or must-not-alias pairs) or must-alias analysis (for identifying specifically must-alias pairs), alias prediction classifies whether two pointer variables must alias, may alias, or must not alias all at once, a more elegant approach to identifying aliasing relations. While the conventional definitions of must-alias and must-not-alias [Altucher and Landi 1995; Balatsouras et al. 2017; Fink et al. 2006] still apply to our prediction task (Definitions 2.1 and 2.2), we make a minor modification to the definition of may-alias [Altucher and Landi 1995; Horwitz 1997]. In particular, we add an additional constraint (the "but not all" clause in Definition 2.3) to exclude must-alias from may-alias.
Definition 2.1. (Must Alias) Two variables $v_1$ and $v_2$ are must-alias at a program point $\ell$ if, for all executions to $\ell$, $v_1$ and $v_2$ refer to the same location at $\ell$.
Definition 2.2. (Must-not Alias) Two variables $v_1$ and $v_2$ are must-not-alias at a program point $\ell$ if $v_1$ and $v_2$ do not refer to the same location at $\ell$ on any execution to $\ell$.
Definition 2.3. (May Alias) Two variables $v_1$ and $v_2$ are may-alias at a program point $\ell$ if, for some but not all executions to $\ell$, $v_1$ and $v_2$ refer to the same location at $\ell$.
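To make the three definitions concrete, the following is a minimal Python sketch, not part of the original study. It assumes a hypothetical oracle that has already recorded, for every execution reaching the program point, whether the two variables refer to the same location. Real programs rarely allow such an enumeration, which is exactly why the relation has to be approximated or predicted; the names `AliasRelation` and `classify_alias` are illustrative only.

```python
from enum import Enum


class AliasRelation(Enum):
    MUST = "must-alias"
    MAY = "may-alias"
    MUST_NOT = "must-not-alias"


def classify_alias(same_location_per_execution):
    """Classify two variables at a program point.

    `same_location_per_execution` is a hypothetical list of booleans, one per
    execution reaching the program point: True if both variables refer to the
    same location on that execution.
    """
    observations = list(same_location_per_execution)
    if not observations:
        raise ValueError("at least one execution must reach the program point")
    if all(observations):        # Definition 2.1: same location on all executions
        return AliasRelation.MUST
    if not any(observations):    # Definition 2.2: same location on no execution
        return AliasRelation.MUST_NOT
    return AliasRelation.MAY     # Definition 2.3: some but not all executions
```

Because the may-alias case is reached only when the other two checks fail, the three classes are mutually exclusive, which mirrors the extra constraint added to the may-alias definition.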

Equivalence Prediction
We derive equivalence prediction from equivalence checking, a long-standing problem in computer science. It is an important building block for many client applications. For example, in the setting of compiler verification, equivalence checking is employed to verify the correctness of transformations performed by compilers [Necula 2000; Sewell et al. 2013; Tate et al. 2009]; superoptimizers also rely on equivalence checkers to ensure that the optimized programs maintain the same semantics as the original programs [Churchill et al. 2017, 2019; Dahiya and Bansal 2017]. In addition, equivalence checking has been applied to problem domains like program synthesis [Schkufza et al. 2013] and code refactoring [Ramos and Engler 2011]. A central issue in equivalence checking is the definition of semantic equivalence. In this work, we follow Churchill et al. [2019]'s formalization.
Definition 2.4. (Semantic Equivalence) Two programs $P_1$ and $P_2$ are semantically equivalent if, when $P_1$ and $P_2$ start in identical machine states (e.g., registers, stack, heap), (1) they have identical output registers and heap states; or (2) they encounter the same run-time error (or loop forever).
The output registers and heap states together reflect the state in which the machine ends after executing a program. We ignore the stack because it is used for allocating local, temporary variables.

TRAINING APPROACH
In this section, we first introduce the code models selected for this study, followed by a detailed presentation of our training workflow.

Models of Code
CuBERT. Built upon BERT [Devlin et al. 2019] (Bidirectional Encoder Representations from Transformers), the core of CuBERT [Kanade et al. 2020] is a multi-layer bidirectional Transformer [Vaswani et al. 2017]. CuBERT is a natural application of BERT to programming language tasks. Similar to BERT, CuBERT is first pre-trained on a larger dataset to learn general code embeddings and then fine-tuned on a smaller dataset for specific downstream tasks such as variable-misuse classification, wrong binary operator prediction, and exception type prediction.
CodeBERT. CodeBERT [Feng et al. 2020] is among the latest models of code based on BERT. Unlike CuBERT, CodeBERT is a bimodal code model for programming language and natural language. A main point of novelty is its capability to pre-train with both bimodal data (i.e., code-text pairs) and unimodal data (i.e., code or text alone). CodeBERT achieves state-of-the-art results on several cross-lingual prediction tasks like code search and code document generation.
GGNN. GGNN (Gated Graph Neural Network) [Allamanis et al. 2018], a variant of graph neural network, has been widely used in machine learning models of source code. Central to GGNN is the message-passing mechanism: at each message-passing step, every node sends messages to its neighbors and summarizes the messages received from its neighbors, which it uses together with its prior state to compute its new state with a GRU cell [Cho et al. 2014].
Graph Sandwiches. Graph Sandwiches [Hellendoorn et al. 2019] aims to replicate the strength of sequence models in learning global data properties for a base graph model. Specifically, it incorporates sequential layers into the message-passing steps of graph neural networks such that the node embeddings are learned not solely from the message-passing layers, as node representations are learned in a typical GNN, but also from sequence models like RNN [Rumelhart and McClelland 1987] or Transformer.

Training Workflow
Our training workflow consists of three stages: generalized pre-training, specialized pre-training, and fine-tuning. Below, we give the details of each training stage.
3.2.1 Generalized Pre-Training. The generalized pre-training stage corresponds to the pre-training phase in the widely used, two-stage learning approach, namely generative pre-training followed by discriminative fine-tuning. The goal of this stage is to learn general, foundational properties of source code independent of downstream prediction tasks. To design the tasks at the generalized pre-training stage, we can simply adopt the pre-training tasks used in sequence models. Specifically, CuBERT is pre-trained with masked sequence token prediction and next sentence prediction concurrently [Devlin et al. 2019; Kanade et al. 2020]. The former masks some percentage of tokens in an input program at random for the model to predict. The latter asks the model to predict whether two given sentences follow each other. A sentence here refers to a logical code line, which is the shortest sequence of contiguous lines that makes up a legal statement. For CodeBERT, we use replaced token detection [Clark et al. 2020], in which models learn to detect whether a token is original or a counterfeit generated by language models. We note that replaced token detection is the only pre-training task of CodeBERT that can work with unimodal data (i.e., only code in our setting). The other pre-training task, which deals with bimodal data (i.e., text and code), is not applicable to our setting.
As GGNN and Graph Sandwiches lack well-established pre-training tasks, we leverage the pre-training tasks commonly used in sequence models. To this end, we introduce masked graph node prediction, based on CuBERT's masked sequence token prediction task. This task involves randomly masking some nodes in a program graph, which the models must then predict. For simplicity, we only mask nodes that directly correspond to tokens of a program, such as terminal nodes in an abstract syntax tree. Furthermore, we only consider nodes that have previously appeared in the program prior to the masking locations. To replicate the setup of masked sequence token prediction, we mask 15% of the nodes in each program graph for masked graph node prediction. Additionally, we draw inspiration from CuBERT's next sentence prediction to develop adjacent edge prediction for GGNN and Graph Sandwiches. Specifically, we train these models to predict the type of edge connecting two randomly selected nodes from a graph. We also note that in this task, the absence of an edge is considered a prediction class.
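The masking step of masked graph node prediction can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation used in the study: the graph is reduced to a dictionary from node id to label, the 15% rate is applied to the eligible token nodes, and the restriction to nodes that appeared earlier in the program is omitted. The names `MASK` and `mask_token_nodes` are hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder label for a masked node


def mask_token_nodes(node_labels, token_node_ids, mask_rate=0.15, seed=0):
    """Mask a fraction of the token-bearing nodes of a program graph.

    node_labels:    dict mapping node id -> label (e.g., AST node type or token)
    token_node_ids: ids of nodes that correspond directly to program tokens
    Returns (masked_labels, targets), where targets maps each masked node id
    to the original label that the model has to predict.
    """
    eligible = sorted(token_node_ids)
    masked = dict(node_labels)  # leave the input graph untouched
    targets = {}
    if not eligible:
        return masked, targets
    rng = random.Random(seed)
    num_to_mask = max(1, round(mask_rate * len(eligible)))
    for node_id in rng.sample(eligible, num_to_mask):
        targets[node_id] = masked[node_id]
        masked[node_id] = MASK
    return masked, targets
```

Non-token nodes (for example, interior AST nodes) are never selected, which matches the simplification described above of masking only nodes that correspond to tokens.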
3.2.2 Specialized Pre-Training. Specialized pre-training initiates the process of adapting model parameters learned in the generalized pre-training stage towards the specific prediction tasks. However, unlike fine-tuning, which relies on clean, ground-truth data to facilitate knowledge transfer, specialized pre-training merely aims to learn a coarse decision boundary for the specific prediction task with data that contains an acceptable amount of noise. The advantage of this approach is that it allows models to exploit potentially unlimited amounts of labeled data produced by the corresponding program analysis methods during the specialized pre-training stage. This, in turn, can help to more effectively adapt the learned model to downstream tasks during the fine-tuning stage.
For Alias Prediction, we propose a task called noisy alias prediction, where models learn from the results of a sound may-alias analysis [Zheng and Rugina 2008] and a sound must-alias analysis [Kastrinis et al. 2018]. As explained in Section 1, this task involves a two-step process in which models initially (1) learn a precise decision boundary between must-alias and must-not-alias, given that data in these two categories is ground-truth. Subsequently, (2) models learn a rough decision boundary between must-alias and may-alias, and between must-not-alias and may-alias, given that may-alias data is noisy. These decision boundaries, those learned in Step (2) in particular, will be refined in the later fine-tuning stage. The rationale is that separating must-alias and must-not-alias (the main objective of the specialized pre-training stage) is an easier sub-task in alias prediction, since they are the two extremes of aliasing relations, whereas differentiating between must-/must-not-alias and may-alias (the main objective of the fine-tuning stage) poses a greater challenge. Specific to GGNN and Graph Sandwiches, our pre-training workflow (generalized pre-training followed by specialized pre-training) aligns with Weihua et al. [2020]'s approach for pre-training graph models in general: pre-training tasks should enable graph models to capture both local (e.g., nodes and edges) and global (e.g., graph-level) properties. Clearly, the former is the objective of generalized pre-training while the latter is the objective of specialized pre-training in our training approach.
For Equivalence Prediction, we design a noisy functional equivalence prediction task, where models learn to predict whether or not two programs have the same input-output pairs. In this task, models learn from the results of a testing method for functional equivalence [Jiang and Su 2009]. Since this testing method can only refute the equivalence of two programs, it yields ground-truth data for inequivalent programs and noisy data for equivalent programs. Using this labeled data, models aim to learn a coarse separation of programs w.r.t. (a noisy version of) functional equivalence, which will later be refined w.r.t. (the exact version of) semantic equivalence.
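The asymmetry of testing-based labels (a single differing input refutes equivalence, whereas agreement on all sampled inputs proves nothing) can be illustrated with a small Python sketch. It is a toy stand-in, not the testing method of Jiang and Su [2009]: programs are modeled as two-argument Python callables, inputs are drawn from an arbitrary small integer range, and the names `run` and `noisy_equivalence_label` are hypothetical.

```python
import random


def run(program, args):
    """Run a program and capture either its output or the kind of error raised."""
    try:
        return ("ok", program(*args))
    except Exception as exc:  # a run-time error is also an observable outcome
        return ("error", type(exc).__name__)


def noisy_equivalence_label(prog_a, prog_b, num_tests=100, seed=0):
    """Label a program pair by random testing.

    "inequivalent" is ground truth: a concrete input distinguishes the programs.
    "equivalent (noisy)" only means that no sampled input distinguished them.
    """
    rng = random.Random(seed)
    for _ in range(num_tests):
        args = (rng.randint(-50, 50), rng.randint(-50, 50))
        if run(prog_a, args) != run(prog_b, args):
            return "inequivalent"
    return "equivalent (noisy)"


def add(a, b):
    return a + b


def add_with_rare_bug(a, b):
    # Differs from add only on an input the sampler never produces.
    return 0 if a == 1000 else a + b
```

Here `noisy_equivalence_label(add, add_with_rare_bug)` yields the noisy "equivalent" label even though the two functions differ at `a == 1000`, which is the kind of label noise that the later fine-tuning stage is meant to correct.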

3.2.3 Fine-Tuning Tasks. After the generalized and specialized pre-training stages, models undergo fine-tuning tasks that directly correspond to the prediction tasks in this evaluation: alias prediction or equivalence prediction. During the fine-tuning stage, only ground-truth data is used to complete the adaptation of model parameters learned from the two pre-training stages towards the downstream prediction tasks. Figure 1 [...] noisy functional equivalence prediction during the specialized pre-training stage before being fine-tuned on each prediction task. GGNN and Graph Sandwiches follow the same training workflow, starting with masked graph node prediction followed by adjacent edge prediction. In practice, this order yields minor advantages over models trained concurrently or in reverse order. Afterward, both models are trained with noisy alias prediction or noisy functional equivalence prediction before being fine-tuned on the alias or equivalence prediction task.

MODEL ARCHITECTURES FOR ALIAS AND EQUIVALENCE PREDICTION
In this section, we explain how we utilize the existing architecture of each code model to solve the two prediction tasks.
Model Architecture for Alias Prediction. Let $F$ be a single function or a set of functions in a call chain, and let $v_1$ and $v_2$ denote two variables for which an aliasing relation is predicted at program point $\ell$ within $F$. We design models to take a 3-tuple input of the form <$F$, $Def_1$, $Def_2$>, where $Def_1$ and $Def_2$ are the sets of definitions of $v_1$ and $v_2$ that are live at $\ell$. We only consider live definitions of $v_1$ and $v_2$ because they are the instructions that exclusively determine the aliasing relation of $v_1$ and $v_2$. Since distinct definitions of $v_1$ or $v_2$ can be live at $\ell$ in the context of $F$'s control flow (e.g., both line 4 and line 6 can be a live definition of q at line 9 in Figure 2a), we take all of them into account. Since the way variables are used may also provide useful information, we introduce another design that incorporates uses of variable definitions from the def-use chain [Stoltz et al. 1994]. In this design, models take a 5-tuple input of the form <$F$, $Def_1$, $Use_1$, $Def_2$, $Use_2$>, where $Use_1$ (resp., $Use_2$) is the set of uses from the def-use chain of $Def_1$ (resp., $Def_2$). Figure 2b illustrates the architecture of sequence models for alias prediction using the code in Figure 2a as an example. In both designs, the first element of the input tuple, $F$, is represented by a token sequence, which is fed into sequence models to generate the final hidden vector of each token in $F$. For the embeddings of $Def_1$ and $Def_2$, we extract the final hidden vector of the token on the left-hand side (LHS) of each definition (e.g., $p_1$ in Figure 2b for the definition of p at line 3 in Figure 2a). Our rationale is that the LHS can represent the whole definition because of the way tokens communicate through the self-attention mechanism. If $Def_1$ or $Def_2$ contains multiple definitions, we perform mean pooling over the embeddings of every definition's LHS (e.g., mean-pooling the embeddings of $q_1$ and $q_2$ as depicted in Figure 2b). If variable uses are not considered, we simply feed a softmax regression layer with the concatenation of the embeddings of $Def_1$ and $Def_2$ to predict the aliasing relation between $v_1$ and $v_2$. Otherwise, the following steps are taken.
First, we compute the embedding of $Use_1$ or $Use_2$ by mean-pooling the hidden state of every use of $Def_1$ or $Def_2$, where each use is represented by a token that indicates the usage point of $v_1$ or $v_2$. For instance, $p_1$ and $p_2$ are the two use points of the definition of p at line 3 in Figure 2a, and their embeddings are mean-pooled to produce the embedding of the uses of the definition of p. Next, we concatenate the embedding of $Use_1$ or $Use_2$ to that of $Def_1$ or $Def_2$ to produce the final embedding of $v_1$ (denoted by $e_1$) or $v_2$ (denoted by $e_2$). For example, concatenating the embedding resulting from the mean-pooling of $p_1$ and $p_2$ with $p_1$ produces the final embedding of p, as depicted in Figure 2b. Finally, we feed a softmax regression layer with the concatenation of $e_1$ and $e_2$ to predict the aliasing relation between $v_1$ and $v_2$. It is worth mentioning that we have experimented with alternatives to mean pooling in both cases (for computing embeddings of definitions and uses), such as max pooling and weighted sum (i.e., soft attention), none of which displays a notable improvement in either case. For graph models, after completion of the message-passing process, we extract the embeddings of the nodes that correspond to the LHS of every definition in $Def_1$ and $Def_2$ and to every use point of $Def_1$ and $Def_2$; we then follow the same procedure described above for sequence models to predict the aliasing relation between $v_1$ and $v_2$.
Fig. 2. The architecture of sequence models for alias prediction. (a) Example code: ... int *p = &x; int *q = NULL; ... q = p; ++*p; ... return *q; ... (b) The architecture of sequence models (CuBERT or CodeBERT) for predicting the aliasing relation between p and q.
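The prediction head described above (mean pooling over definition embeddings, concatenation, then softmax regression over the three aliasing classes) can be sketched in plain Python. This is an illustrative sketch only: the hidden vectors, the weight matrix, and the bias are made-up toy values standing in for what a trained CuBERT, CodeBERT, or graph model would produce, and the function names are hypothetical. It covers the 3-tuple design without variable uses.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_alias(def1_vectors, def2_vectors, weights, bias):
    """Return probabilities for (must-alias, may-alias, must-not-alias).

    def1_vectors / def2_vectors: hidden vectors of the LHS tokens of the live
    definitions of the two variables (one vector per definition).
    """
    features = mean_pool(def1_vectors) + mean_pool(def2_vectors)  # concatenation
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy example: one live definition for the first variable, two for the second.
def_p = [[1.0, 0.0]]
def_q = [[0.0, 1.0], [2.0, 1.0]]
toy_weights = [[0.5, -0.2, 0.1, 0.3],   # one row per aliasing class
               [0.0, 0.4, -0.1, 0.2],
               [-0.3, 0.1, 0.2, -0.4]]
toy_bias = [0.0, 0.1, -0.1]
probabilities = predict_alias(def_p, def_q, toy_weights, toy_bias)
```

The 5-tuple design would extend `features` by also concatenating the mean-pooled use embeddings of each variable before the softmax layer.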
Model Architecture for Equivalence Prediction. Given a tuple <$F_1$, $F_2$> representing the two input programs (here $F$ has the same meaning as in alias prediction), all models take the same approach: computing the hidden vector of each token (resp., node) in the sequence (resp., graph), and then performing mean pooling over the hidden vectors of all tokens (resp., nodes) to obtain a single embedding for $F_1$ or $F_2$. Again, other methods such as max pooling or weighted sum were also experimented with, but did not show notable improvement. Similar to alias prediction, we add a one-layer softmax regression to predict the equivalence of $F_1$ and $F_2$ using the concatenation of their embeddings.

THE DATASET
We compile three distinct datasets, namely a generalized pre-training dataset, a specialized pre-training dataset, and a fine-tuning dataset, to suit our training approach. Regarding the programming languages in which we build our datasets, we take into account the following factors: (1) the analysis of source code, both dynamic and static, in the language to be chosen should be well-supported by existing tool-chains, infrastructure, etc., such that the workload that we undertake in conducting such a large-scale, extensive evaluation can be reduced to a minimum; (2) the source code in the language to be chosen should present sufficient challenges in both prediction tasks. For example, in the case of alias prediction, we require the chosen languages to feature complicated pointer operations (e.g., address-of assignment, load and store operations, pointer arithmetic). Because C and C++ are likely the only languages that satisfy our criteria, and they are also foundational, widely used programming languages, we build our datasets with C/C++ code.
We select fourteen open-source software projects for our study: Linux Kernel, GCC, MySQL, Git, tmux, Redis, curl, LevelDB, H2O, libgit2, The Silver Searcher, Protocol Buffers, aria2, and fish. They range from mid-scale (with tens of thousands of lines of code) to large-scale programs (with hundreds of thousands or even millions of lines of code), and all of them are well-established (with decades-long [...]). [...] 1 in the supplemental material provides the details of each project. The source code of these projects implements a broad range of functionalities, such as data transmission, memory management, and cross-compilation, which makes our dataset diverse. All projects contribute to the generalized pre-training, the specialized pre-training, and the fine-tuning datasets. In addition, the three datasets are extracted from different portions of the program of each project to avoid duplicates. From each project, we collect roughly the same number of samples for each prediction class for all three datasets. Table 1 gives an overview of CodeSem. In the subsequent sections, we explain in detail how the three datasets are created.

Dataset of Generalized Pre-Training
The training data for all generalized pre-training tasks can be easily generated from any valid code. Specifically, we extract all files that can be parsed by Clang [Lattner 2008] from the codebase of each software project mentioned earlier, which resulted in 120,814 files in total. From them, we randomly pick 200,013 functions. Every function is used to generate data for all generalized pre-training tasks for all evaluated models. The first row in Table 2 gives the length (resp., size) of token sequences (resp., graphs) of the data used in the generalized pre-training tasks.
5.2 Dataset of Specialized Pre-Training and Fine-Tuning
5.2.1 Alias Prediction.
We adopt a sound may-alias analysis and a sound must-alias analysis to obtain data points for the specialized pre-training dataset. A sound may-alias analysis over-approximates aliasing relations, ensuring that all potential alias pairs are found [Sridharan and Bodík 2006; Yong et al. 1999; Zheng and Rugina 2008]. As a result, the output of a sound may-alias analysis consists of ground-truth must-not-alias pairs and noisy may-alias pairs, which are almost certain to contain must-not-alias pairs. In contrast, a sound must-alias analysis computes an under-approximation of the aliasing relations that are guaranteed to hold [Kastrinis et al. 2018]. In other words, the must-alias pairs produced by the sound must-alias analysis are a subset of all ground-truth must-alias pairs. Therefore, combining the may- and must-alias analyses yields three types of alias pairs: ground-truth must-alias pairs (found directly by the must-alias analysis), ground-truth must-not-alias pairs (found directly by the may-alias analysis), and noisy may-alias pairs. The noisy may-alias pairs are obtained by subtracting the must-alias pairs (found by the must-alias analysis) from the may-alias pairs (found by the may-alias analysis).
As explained above, the may-alias pairs found by the may-alias analysis also include must-not-alias pairs, and the must-alias pairs found by the must-alias analysis account for only a subset of all real must-alias pairs within the may-alias pairs. Therefore, subtracting the must-alias pairs from the may-alias pairs results in a mixture of all three types of alias pairs. For this reason, we term these may-alias pairs "noisy". The ground-truth must-alias pairs, the ground-truth must-not-alias pairs, and the noisy may-alias pairs directly constitute the specialized pre-training dataset. To obtain ground-truth data for the fine-tuning dataset, we begin by running the same set of analyses used to generate the specialized pre-training dataset on a different portion of each selected project's code. From the results, we select the must- and must-not-alias pairs that are guaranteed to be correct, and then involve human experts to label the may-alias pairs that the analyses cannot precisely determine.
Specialized Pre-training Dataset. For the may-alias analysis, we adopt the method proposed by Zheng and Rugina [2008], which is implemented in LLVM [Lattner and Adve 2004] as an inter-procedural, context-, flow-, and field-insensitive analysis. This approach strikes a balance between precision and scalability and is considered state-of-the-art in alias analysis. For the must-alias analysis, we refer to the work of Kastrinis et al. [2018], who propose a novel data structure based on equivalence classes as the backbone of their must-alias analysis. We implement their approach as an inter-procedural, context-, flow-, and field-sensitive must-alias analysis. Below, we describe in detail how we create the specialized pre-training dataset based on these two analyses.
First, we run the must-alias analysis to identify must-alias relations. This step generates alias classes, each of which contains variables that must alias with each other. Therefore, to create must-alias pairs, we simply combine two variables from the same alias class. Second, we run the may-alias analysis integrated in LLVM to obtain may-alias and must-not-alias information. Similarly, LLVM generates alias sets, where variables within each set may point to the same memory location. Thus, combining two variables from different alias sets results in must-not-alias pairs. To construct may-alias pairs, we ensure that we do not reuse variables that are already identified as being in a must-alias relation. Specifically, we filter LLVM's alias sets so that each set contains only one variable from the same alias class generated by our must-alias analysis. Among the remaining variables in each of LLVM's alias sets, we combine two variables from the same set to create may-alias pairs.
For the must-alias analysis, we analyzed 54,421 functions out of the 120,814 files. These functions do not overlap with the 200,013 functions used to generate the generalized pre-training dataset. As a result, we identified 32,196 equivalence classes. On the same set of functions, LLVM produced 69,809 alias sets for the may-alias analysis. To maintain the diversity of our alias pairs, we make each variable appear exactly once among the three classes of alias pairs. In addition, to avoid duplication between the specialized pre-training and fine-tuning datasets, we randomly selected 26,899 functions (out of 54,421) to construct the specialized pre-training dataset and held out the rest for the fine-tuning dataset. In the end, we collected 10,204 must-alias pairs, 10,030 may-alias pairs, and 10,613 must-not-alias pairs in the specialized pre-training dataset for alias prediction.
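The pair-construction steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_alias_pairs`, the input shapes (lists of sets of variable names), and the assumption that LLVM's alias sets are disjoint are all ours, and the particular way variables are paired within a class or set is an arbitrary choice made for determinism.

```python
def build_alias_pairs(alias_classes, alias_sets):
    """Sketch: derive must-, may-, and must-not-alias pairs.

    alias_classes: must-alias classes (hypothetical input: list of sets).
    alias_sets:    may-alias sets, assumed disjoint (list of sets).
    Each variable is used in at most one pair overall.
    """
    used = set()
    must, may, must_not = [], [], []

    # Step 1: two variables from the same must-alias class.
    for cls in alias_classes:
        free = [v for v in sorted(cls) if v not in used]
        if len(free) >= 2:
            must.append((free[0], free[1]))
            used.update(free[:2])

    # Step 2: filter each alias set so that it keeps at most one
    # variable per must-alias class.
    class_of = {v: i for i, cls in enumerate(alias_classes) for v in cls}
    filtered = []
    for aset in alias_sets:
        seen_classes, kept = set(), []
        for v in sorted(aset):
            c = class_of.get(v)
            if c is not None:
                if c in seen_classes:
                    continue
                seen_classes.add(c)
            kept.append(v)
        filtered.append(kept)

    # Step 3: two unused variables from the same filtered set.
    for kept in filtered:
        free = [v for v in kept if v not in used]
        if len(free) >= 2:
            may.append((free[0], free[1]))
            used.update(free[:2])

    # Step 4: unused variables taken from different sets.
    heads = []
    for kept in filtered:
        free = [v for v in kept if v not in used]
        if free:
            heads.append(free[0])
    for i in range(0, len(heads) - 1, 2):
        must_not.append((heads[i], heads[i + 1]))

    return must, may, must_not
```

Tracking a single `used` set is one simple way to enforce the constraint that each variable appears exactly once across the three classes of pairs.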
Fine-tuning Dataset. To obtain ground-truth data for the fine-tuning dataset, we use the results of the sound may- and must-alias analyses on the holdout set of functions (27,522 in total), and then manually label the data points that the two analyses cannot precisely determine. Because the must-alias and must-not-alias pairs are guaranteed to be correct, we focus on verifying the labels of may-alias pairs. We engaged the assistance of 32 PhD students for the labeling task, each of whom is familiar with alias analysis. Below, we give the details of the manual labeling process.
We assigned every may-alias pair to two PhD students. When labeling a may-alias pair, human raters see the entire codebase of the corresponding project for the sake of labeling precision. For difficult alias pairs (e.g., those involving global information) where human raters disagree, we design a separate procedure to resolve their labels. When raters disagree on whether an alias pair is may-alias or must-alias, we first instrument all program paths on which one rater (i.e., the one who gives the may-alias label) thinks the two pointers do not alias; in particular, we log the memory locations that the two pointers point to on those paths. Then, we use AFL [Zalewski 2016] to fuzz test the program. After the fuzzer finishes, we check via the log whether the two pointers ever point to different memory locations on any of these instrumented paths. If so, they are may-alias. Otherwise, we check whether the fuzzer indeed covered all instrumented paths: if so, we label them must-alias; if not, we simply discard this alias pair because its label cannot be determined. Similarly, when human raters disagree on whether an alias pair is may-alias or must-not-alias, we check whether the two pointers ever point to the same memory location on the paths on which one rater (the one who gives the may-alias label) thinks they alias. If we find that the two pointers point to the same memory location on at least one instrumented path, they are may-alias. Otherwise, we label them must-not-alias or discard them, depending on whether the fuzzer has covered all instrumented paths.
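The disagreement-resolution procedure described above reduces to a small decision rule once the fuzzing logs have been summarized. The sketch below assumes two hypothetical boolean summaries per alias pair (whether the disputed behavior was observed on any instrumented path, and whether the fuzzer covered all instrumented paths); the function names are illustrative and do not come from the paper's tooling.

```python
def resolve_may_vs_must(saw_different_targets, all_paths_covered):
    """Raters disagree between may-alias and must-alias.

    saw_different_targets: the two pointers pointed to different memory
        locations on at least one instrumented path (hypothetical summary).
    all_paths_covered: the fuzzer exercised every instrumented path.
    Returns a label, or None when the pair is discarded.
    """
    if saw_different_targets:
        return "may-alias"
    if all_paths_covered:
        return "must-alias"
    return None  # label cannot be determined


def resolve_may_vs_must_not(saw_same_target, all_paths_covered):
    """Raters disagree between may-alias and must-not-alias."""
    if saw_same_target:
        return "may-alias"
    if all_paths_covered:
        return "must-not-alias"
    return None  # label cannot be determined
```

In both cases a single observation from the fuzzer is enough to settle on may-alias, while the stronger labels require full coverage of the instrumented paths.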
Based on the above approach for determining the labels of alias data, we now validate whether the construction of ground truth matches the prediction samples that models work with. We address each of the three classes of alias pairs separately. (1) For must-not-alias pairs identified by the sound may-alias analysis, we configure LLVM to produce all intermediate analysis steps. In particular, whatever code is used by the analysis is included in the construction of the data points; in other words, models consume the same information that the analysis does for determining the must-not-alias relations. (2) Similarly, for must-alias pairs identified by the sound must-alias analysis, we also include all code consumed by the analysis to construct the data points. This again means that models observe the same information in a data point that the analysis uses to determine the must-alias relations. (3) For each noisy may-alias pair, human raters, assisted by fuzzers, identify the scope of code from which the ground truth can be determined. All code falling within this scope is included in the construction of the data point. Consequently, in all cases, we ensure there is no mismatch between model predictions and the construction of ground truth.
The entire labeling process took approximately five months. Overall, the average inter-rater reliability is good, with a kappa score of 0.86. In the end, we collected 11,647 must-alias pairs, 11,739 may-alias pairs, and 11,806 must-not-alias pairs for the fine-tuning dataset. The second and third rows of Table 2 present the length (resp., size) of token sequences (resp., graphs) for alias prediction. Clearly, these data points are sufficiently complex to ensure the difficulty of the alias prediction task. As shown in Table 1, the generalized pre-training dataset contains substantially more data points than the specialized pre-training and fine-tuning datasets because small programs that do not contain any aliases are included in the generalized pre-training dataset but are ineligible for the specialized pre-training and fine-tuning datasets.

Equivalence Prediction.
We build an automated data pipeline to create the specialized pre-training dataset. The pipeline consists of two steps: (1) extracting code fragments from the code of the selected software; (2) testing their functional equivalence dynamically. For the fine-tuning dataset, we first run the same data pipeline on a different portion of each selected project's code, and then manually confirm whether the code tested to be functionally equivalent is semantically equivalent, given that functionally inequivalent code is known to be semantically inequivalent.
Specialized Pre-training Dataset. To create a specialized pre-training dataset for equivalence prediction, we seek out code fragments that are functionally equivalent, meaning that they have the same input-output pairs. To achieve this, we rely on EqMiner [Jiang and Su 2009], which adopts a random-testing-based approach to extract code fragments that are functionally equivalent. For all selected software, EqMiner generates 3,680 equivalent sets from 15,067 functions in the 120,814 C/C++ files. Again, these functions do not overlap with any of the 200,013 functions reserved for the generalized pre-training dataset. Code fragments within an equivalent set are likely equivalent to each other, while code fragments from different equivalent sets are known to be inequivalent. Based on the output of EqMiner, collecting equivalent code pairs is straightforward: we combine any two elements in each equivalent set. To form inequivalent code pairs, we combine elements from different equivalent sets. Following our approach to creating the dataset for alias prediction, we make each code fragment appear only once in either equivalent or inequivalent code pairs to ensure the diversity of our dataset. To avoid duplicates between the specialized pre-training and fine-tuning datasets for equivalence prediction, we adopt an approach similar to the one used for alias prediction. Specifically, we randomly selected half of the equivalent sets generated by EqMiner to construct the specialized pre-training dataset, while holding out the other half for the fine-tuning dataset. In total, we collected 66,403 equivalent and 66,538 inequivalent code pairs in the specialized pre-training dataset.
Fine-tuning Dataset. For the remaining equivalent sets (1,840 in total), we check whether the code fragments in each set are indeed semantically equivalent, given that code fragments from different sets are certain to be semantically inequivalent. We enlisted the help of 130 undergraduate students from the computer science department at our university for this task. Since code fragments produced by EqMiner are always self-contained, meaning that they include all the information needed to determine the label, raters only need to inspect the code fragments as presented. As in alias prediction, there is no mismatch between the construction of ground truth and the presentation of prediction samples, because models deal with the very same code fragments that EqMiner and the human raters do when confirming their labels. Each code pair was inspected by two students, and the labeling process took just over two months to complete. On average, the inter-rater reliability, as measured by the kappa score, is 0.72. In the end, we obtained 10,655 equivalent code pairs and 10,861 inequivalent code pairs. Inequivalent code pairs are collected in the same way as in the specialized pre-training stage. The last two rows in Table 2 give the length (resp., size) of token sequences (resp., graphs) for equivalence prediction. As with alias prediction, the specialized pre-training and fine-tuning datasets contain less data than the generalized pre-training dataset due to the exclusion of small functions.

EXPERIMENTS
In this section, we provide an overview of our experimental setup and describe the program representations used by each code model in our evaluation. We then present the results of all models in both alias and equivalence prediction tasks. Finally, we conduct an in-depth analysis of the models' results in these two prediction tasks and discuss our findings.

Experimental Setup
Vocabulary. Since CodeSem does not have a particularly large vocabulary or frequent occurrences of rare words, we do not adopt BPE [Sennrich et al. 2016] to address the out-of-vocabulary issue. Instead, we take a simpler approach by constructing the vocabulary at the word level while excluding rare words. Specifically, we keep tokens that appear at least ten times among all data points in CodeSem. Tokens that appear fewer than ten times are replaced with [ ] in both the training and test datasets. The sequence model has a vocabulary size of 190,029, while the graph model has a vocabulary size of 277,302.
Hyperparameters. To ensure the optimal performance of each code model, we tune their hyperparameters using Bayesian Optimization [Martinez-Cantin 2014], a common method for hyperparameter tuning. We normalize all evaluated models w.r.t. the number of model parameters, which is around 3M per model. Table 3 shows the values of some of the most important hyperparameters for each code model.
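The frequency-thresholded vocabulary described above can be sketched as follows. The function names and the placeholder string `"<rare>"` are assumptions made for illustration; the placeholder symbol used in the study is not recoverable from the text.

```python
from collections import Counter


def build_vocab(token_sequences, min_count=10):
    """Keep tokens that occur at least `min_count` times across all data points."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    return {tok for tok, n in counts.items() if n >= min_count}


def replace_rare(tokens, vocab, placeholder="<rare>"):
    """Map out-of-vocabulary tokens to a placeholder (hypothetical symbol)."""
    return [tok if tok in vocab else placeholder for tok in tokens]
```

Applying `replace_rare` identically to training and test data keeps the two consistent, which is the point of substituting rare tokens rather than dropping them.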

Validity of Models.
We validate all code models on their original tasks. The details are provided in Section 2 of the supplemental material.
Prediction Setting. Given the high degree of similarity that can exist between code from the same project, we evaluate the models in a cross-project prediction setting. Specifically, we perform 14 rounds of cross-validation, with each project held out as the test set in turn and the models trained on the remaining projects. When a project is reserved for the test set, we first remove all its code from the generalized and specialized pre-training datasets, and then use only its data points in the fine-tuning dataset to construct the test set. This approach ensures that models do not observe any code from the project reserved for the test set during training. To report our evaluation results, we choose the train-test split on which a model achieves the median accuracy among all splits (hereinafter referred to as the median model). Table 4 presents the details of the train-test split for the median model of CuBERT. Those for the other code models are left to the supplemental material (Section 3).
Metric. First, we adopt accuracy, a standard metric, to evaluate the performance of models. To provide a holistic view of models' performance, we also report their accuracy for each prediction class. In alias prediction, we report the accuracy of (median) models for must-, may-, and must-not-alias, respectively (represented by the three numbers in parentheses in each cell of Table 5). In equivalence prediction, we report the accuracy of (median) models for equivalent and inequivalent code pairs (represented by the two numbers in parentheses in each cell of Table 6). Additionally, we compute confidence intervals at the common confidence level of 95% [Cao et al. 2022; Pantiuchina et al. 2021; Sun et al. 2022]. This provides a more accurate representation of the performance of each (median) model.
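The cross-project protocol can be sketched as a leave-one-project-out loop followed by the selection of the median split. The names below are illustrative. Because 14 splits have no single middle element, the sketch assumes the lower of the two middle splits is taken; the text does not state which one is used.

```python
def leave_one_project_out(projects):
    """Yield (training_projects, held_out_project) for each project in turn."""
    for held_out in projects:
        yield [p for p in projects if p != held_out], held_out


def median_split(accuracy_by_project):
    """Return (held_out_project, accuracy) of the median split.

    Assumption: with an even number of splits, the lower middle one is chosen.
    """
    ranked = sorted(accuracy_by_project.items(), key=lambda kv: kv[1])
    return ranked[(len(ranked) - 1) // 2]
```

Reporting the median split rather than the best one avoids presenting an unusually favorable train-test partition as representative.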
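As an illustration of how a 95% confidence interval for an accuracy figure can be computed, the sketch below uses the normal-approximation (Wald) interval for a binomial proportion. This is an assumption: the text does not say which interval method the study uses, and the function name is ours.

```python
import math


def wald_half_width(accuracy, n_samples, z=1.96):
    """Half-width of the normal-approximation confidence interval.

    accuracy:  observed accuracy in [0, 1].
    n_samples: number of test data points.
    z:         1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
```

Under this approximation the half-width shrinks with the square root of the test-set size, so a few thousand test points are enough to bring it to around one percentage point for accuracies in the 80% to 90% range.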

Program Representations
We investigate the impact of different code representations on graph models only, since CuBERT and CodeBERT simply take code as token sequences.
6.2.1 AST with Additional Edges.
This is the original code representation designed for GGNN. To represent both the syntactic and semantic structure of code, Allamanis et al. [2018] propose a program graph that incorporates extra edges into ASTs (e.g., connecting each terminal node to its successor, or connecting a variable to others on which it is data or control dependent).
A second representation, proposed by Hellendoorn et al. [2019] for Graph Sandwiches, consists exclusively of the terminal nodes from ASTs. Specifically, after removing the standard AST edges (i.e., parent-child edges), it moves the other edges (inherited from the work of Allamanis et al. [2018]) down from non-terminal nodes, which typically represent a span of multiple tokens, to the terminal nodes that represent the starting token of that span. For example, an edge that connects the AST nodes of two ForStatements is moved down to connect the corresponding for tokens. This representation has two advantages: (1) it is substantially more compressed than AST-based graphs, often using fewer nodes while retaining most edges; and (2) it accommodates sequence models better because all edges connect tokens directly.

Control Flow Graph.
We adopt the Control Flow Graph (CFG)-based representation presented by Wang et al. [2020]. Specifically, given a standard CFG, we first split each graph node representing a basic block into multiple nodes, each of which is a single statement. Subsequently, we add edges to connect every statement with its immediate successor within the same basic block. For edges in the original control flow graph, we change their start (resp., end) nodes from a basic block to its last (resp., first) statement after the split. Then, we design two statement representation methods: (1) we replace each statement node with a sequence of nodes, each of which represents a token in the statement. Every token node is connected to its immediate successor, and the first token node becomes the new start or end node of the edges that previously connected the statement nodes (hereinafter, this whole program representation scheme is abbreviated to CFG+token seq); (2) we replace each statement node with its abstract syntax tree, following the method proposed by Si et al. [2018]. The root node of each AST is used to connect statements in the CFG. In the remainder of this paper, we refer to this program representation as CFG+AST.
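The basic-block splitting step described above can be sketched as follows. The input encoding (a dictionary from block identifiers to statement lists, plus a list of block-level edges) and the function name are assumptions for illustration; the sketch also assumes every basic block contains at least one statement.

```python
def split_basic_blocks(blocks, edges):
    """Turn a block-level CFG into a statement-level graph.

    blocks: dict mapping a block id to its non-empty list of statements.
    edges:  list of (source_block, target_block) control-flow edges.
    Statement nodes are identified as (block_id, index).
    """
    nodes, stmt_edges = [], []
    first, last = {}, {}
    for block_id, statements in blocks.items():
        ids = [(block_id, i) for i in range(len(statements))]
        nodes.extend(ids)
        # Connect each statement to its immediate successor in the block.
        stmt_edges.extend(zip(ids, ids[1:]))
        first[block_id], last[block_id] = ids[0], ids[-1]
    # Re-anchor each original edge: last statement of the source block
    # to the first statement of the target block.
    for src, dst in edges:
        stmt_edges.append((last[src], first[dst]))
    return nodes, stmt_edges
```

Each statement node can then be expanded into a token chain or an AST to obtain the two variants named in the text.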

Program Dependence Graph.
The last program representation is based on the program dependence graph (PDG), which makes explicit both the data and control dependences for each operation in a program [Ferrante et al. 1987]. As in the CFG-based graph representation, we represent every node in a PDG, which is a single statement, by its token sequence (hereinafter denoted by PDG+token seq) or its abstract syntax tree (hereinafter denoted by PDG+AST). Section 4 in the supplemental material gives an example of this graph representation.
Inter-Procedure Code Representations. For sequence models, we concatenate the token sequence of each function in the order in which it is called. For example, when Fun1 calls Fun2, which then calls Fun3, we concatenate the token sequences of Fun1, Fun2, and Fun3 in turn. For graph models, we first construct the graph for each function as described above, and then connect the node representing the call site in the caller to the entry (or root) node in the CFG/PDG (or AST) of the callee.
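For sequence models, the concatenation in call order can be sketched as a pre-order traversal of the call graph. The text only gives the linear Fun1, Fun2, Fun3 example, so the handling of branching or recursive calls below (depth-first, each function included once) is an assumption, as are the function and parameter names.

```python
def concat_in_call_order(entry, calls, tokens):
    """Concatenate per-function token sequences in call order.

    entry:  name of the top-level function.
    calls:  dict mapping a function to the functions it calls, in order.
    tokens: dict mapping a function to its token sequence.
    """
    order, seen = [], set()

    def visit(fn):
        if fn in seen:  # include each function once; also guards recursion
            return
        seen.add(fn)
        order.append(fn)
        for callee in calls.get(fn, []):
            visit(callee)

    visit(entry)
    return [tok for fn in order for tok in tokens[fn]]
```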

Results of Alias Prediction
Fig. 3. For each code model, the curves correspond to the median model that achieves the highest test accuracy among those adopting different def-use configurations and consuming different program representations. These median models (with the highest accuracies) are precisely the ones that learned from the train-test split presented in Table 4 and Section 3 of the supplemental material.
During training, we adopt a common approach, early stopping [Prechelt 1998], to prevent models from overfitting. As displayed in Table 3, the patience is set to 20 (i.e., we wait 20 epochs before stopping early if there is no progress on the validation set). Figure 3 plots the training and validation accuracy of median models over training epochs for all code models; in particular, at each training stage, models are saved at the point when their accuracies on the validation set reach the apex. We compare the performance of models trained with our approach against those trained with pre-training (using the generalized pre-training dataset) and fine-tuning (using the fine-tuning dataset). Table 5 presents the results, from which we summarize important findings below:
• With a confidence level of 95%, the confidence intervals of all models fall in the range of (-1.0%, +1.0%), strongly indicating that our results provide an accurate representation of models' performance in alias prediction. In terms of accuracy, sequence models perform notably better than graph models: CuBERT and CodeBERT achieve the highest accuracies of 89% and 84%, while GGNN and Graph Sandwiches achieve the highest accuracies of 78% and 81%. The choice of program representation is crucial to the performance of graph models. The accuracy of the same model can vary drastically with different program representations. For example, when GGNN switches from PDG+token seq to CFG+AST while keeping the other factors unchanged (trained with our approach and adopting the configuration of DEF+USE for variable embedding), it experiences a 29% accuracy drop.
• In most cases, embedding variables with uses leads to a higher overall model accuracy than without. Occasionally, incorporating uses can also hurt model accuracy. This may be caused by complicated usage patterns of variables that negatively affect the precision of their embeddings. After all, the way that variables are used does not fundamentally determine whether they alias or not. Regarding the comparison between our training approach and pre-training and fine-tuning, models trained with our approach are almost always more accurate than those undergoing pre-training and fine-tuning. In fact, with the same def-use configuration for variable [...]
[...] all deep learning models are far superior to the two alias analyses; however, alias analyses provide theoretical guarantees on the correctness of certain alias pairs that deep learning models cannot.
Table 5. Results of alias prediction. All numbers in the table are percentages of accuracy, and those in parentheses represent the accuracy of median models for must-alias, may-alias, and must-not-alias, respectively. Numbers in bold are the highest accuracy of each model.

Results of Equivalence Prediction
Figure 4 plots the training and validation accuracy of median models over training epochs for all code models. Similar to the approach taken in alias prediction, we save models at the point when their accuracies on the validation set reach the apex at each training stage to prevent overfitting. We present the results for all median models in Table 6. The key findings are summarized below.
• Similar to the results in the alias prediction task, all models have a confidence interval between -1.4% and +1.4% at a 95% confidence level. Again, this strongly indicates that our results are an accurate reflection of models' performance in equivalence prediction. All models achieve comparably high accuracy in this task. Specifically, CuBERT achieves the highest accuracy at 84%. CodeBERT, GGNN, and Graph Sandwiches are close behind, with accuracies of 81%, 81%, and 79%, respectively. In general, all models achieve balanced accuracy.
• Compared to alias prediction, the role of program representation is less significant in the performance of graph models in equivalence prediction. Nevertheless, both GGNN and Graph Sandwiches show a swing in accuracy of 5% to 10% depending on the program representation used. Interestingly, both models achieve their highest accuracy with PDG+AST. Together with the results of alias prediction, this demonstrates the power of PDG as a principled program representation for learning code models for foundational program analysis tasks.
• In almost all cases, models undergoing our training workflow are more accurate than those trained with pre-training and fine-tuning. These results and those of the alias prediction task confirm the role of the specialized pre-training stage in training models towards solving foundational program analysis tasks.
[...] et al. 2016] pertaining to NL-level information (as verified by human raters in the labeling process); thus, models cannot pick up NL artifacts in code examples to link with their labels.
Exploring model performance for alias pairs w.r.t. locality. In a systematic evaluation, we examine how the accuracies of code models vary with the locality of alias pairs. Specifically, we investigate whether code models are only capable of handling local aliases or whether they can also handle aliases that span several statements. To answer this question, we first define a metric called distance ($d$) that measures the number of statements between the definitions that make two variables alias. Formally, let $\pi$ be a program path along which $[s_1, s_2, \ldots, s_n]$ are definitions (executed in order) that make two variables alias. The distance metric counts the number of statements between the execution of the first definition $s_1$ and the last definition $s_n$. If two variables become aliases due to one definition, such as int *p, *q; p = q;, then their distance is 0. Next, we classify must- and may-alias pairs in the test set of each code model into five categories based on the distance metric: $d = 0$, $1 \le d < 4$, $4 \le d < 7$, $7 \le d < 10$, and $10 \le d$. We record the accuracy of each code model in each category and report the results in Table 7. Although in most cases all models become less accurate as the distance of alias pairs increases, they still achieve acceptable accuracy in general. For example, no model has a decrease of more than 8% accuracy when $d < 10$ for either must- or may-alias pairs, demonstrating a level of robustness across the locality spectrum. Last but not least, we also observe that all code models cope with different types of assignments that make variables alias, such as address-of ($p = \&a$), load ($p = *q$), and store ($*p = q$), indicating that all models have learned a comprehensive set of semantic features for predicting aliasing relations. Overall, our analysis shows that all models have adequately solved the alias prediction task.
6.5.2 Analyzing Model Results in Equivalence Prediction.
We skip the analysis of the impact of NL-level information on model performance because (1) as in alias prediction, models do not take into account comments in the code either during training or testing; and (2) code examples in CodeSem, which are generated by EqMiner, are already anonymized.
Confirming the Validity of CodeSem. First and foremost, we validate CodeSem for equivalence prediction. As we explained at the very beginning of this paper, the equivalence prediction task requires models to predict whether or not two programs are semantically equivalent. This means that data points in CodeSem should not be simple, syntax-level code clones that can be easily detected by syntactic similarity. To confirm this, we run a well-established, highly impactful clone detection tool, Deckard [Jiang et al. 2007], on the test set of each code model. Our results show that in the best case (on CuBERT's test set) Deckard has 61% (resp., 63%) accuracy on pairs of equivalent (resp., inequivalent) programs, which is only marginally higher than chance-level accuracy. We experimented with a wide range of Deckard's hyperparameters, and the results above are the optimal ones in balancing precision (aiming to avoid reporting false clones) and recall (aiming to capture all real clones). The experiments confirm the validity of CodeSem for equivalence prediction.
Since semantic clone detection [Roy et al. 2009] shares some similarities with the equivalence prediction task, we choose Tailor [Liu et al. 2023] to evaluate on CodeSem. Tailor exhibits outstanding performance on BigCloneBench [Svajlenko et al. 2014] and OJClone [Mou et al. 2016], achieving close to 100% accuracy on both benchmarks. Like our models, Tailor is deployed to predict in a cross-project setting: we evaluate Tailor on each train-test split of CodeSem and report the median accuracy that Tailor achieves. We note that we tuned Tailor's hyperparameters using Bayesian Optimization to ensure that it achieves its optimal performance on CodeSem. We find that Tailor attains a median accuracy of 68% (47% and 88% on pairs of equivalent and inequivalent programs, respectively), which is significantly lower than its accuracy on BigCloneBench and OJClone. The degradation of Tailor's performance highlights a substantial disparity between equivalence prediction and semantic clone detection. In particular, Tailor's struggle with equivalent programs (which are the main subject of equivalence prediction) strongly indicates the advantages of CodeSem over existing benchmarks. Since CodeSem's data is extracted from well-established, real-world programs, it poses a bigger challenge to code models than the simpler programs often found on coding platforms.
Refuting a Template-Matching Approach. Next, we investigate whether code models have relied solely on a simple, template-matching approach for recognizing equivalent programs. That is, given a pair of code examples $(a, b)$ in the test set, do models merely attempt to find another pair $(a', b')$ that they memorized from the training set such that $a'$ and $b'$ are syntactically similar to $a$ and $b$ (or $b$ and $a$), respectively? To answer this question, we aim to quantify the number of code pairs in the test set that models would have predicted correctly if they had successfully memorized all code pairs from the training set. Again, we use Deckard, set up with the optimal hyperparameters, to conduct this experiment. Results show that in the best case (i.e., on CodeBERT's test set), less than 11% of equivalent code pairs in the test set have a syntactically similar counterpart in the training set. This suggests that this simple, template-matching approach is insufficient for recognizing equivalent programs in CodeSem.
Comparing Models with Equivalence Checkers. Moving on, we now seek to understand what models have learned in the equivalence prediction task. For this purpose, we evaluate state-of-the-art equivalence checkers on CodeSem as baselines for comparison with code models. This experiment can reveal challenges posed by CodeSem that are beyond state-of-the-art equivalence checkers. Thus, by analyzing how well models cope with those challenges, we can gain a deep understanding of their capability in the equivalence prediction task. We pick trace alignment [Churchill et al. 2019], ARDIFF [Badihi et al. 2020], and Rêve [Felsing et al. 2014], which are prominent equivalence checking tools in the literature, to analyze all code pairs in the fine-tuning dataset. As shown in Table 8, none of the tools perform adequately on CodeSem. In fact, the most accurate tool, trace alignment, achieves an accuracy of below 25%. Therefore, we conclude that CodeSem presents significant challenges that current equivalence checkers are not equipped to handle. Next, we discuss the specific challenges that the three equivalence checkers face, and how code models have successfully addressed these challenges, respectively.
Trace alignment checks the equivalence of two programs based on the alignment of their concrete execution traces. It starts by generating test cases to execute the two programs and then aligns their execution traces using a linear function: $c_1 v_1 - c_2 v_2 = k$, where $c_1, c_2 \in \{1, 2, 4, 8, 16\}$ and $k \in \mathbb{Z}$ are parameters, and $v_1, v_2$ are registers or stack-allocated locations in the two programs. Next, it constructs an automaton using the aligned traces to simulate the behavior of the combination of the two programs. Finally, the method determines the equivalence between the two programs by verifying the satisfiability of the synthesized constraints from the automaton. Of the shortcomings of trace alignment (e.g., a simplistic alignment predicate and insufficient coverage of the generated test cases), the simplistic alignment predicate is the major weakness that accounts for most of its wrong results. Specifically, when aligning the execution traces of the two programs requires reasoning about memory allocations in the heap or using non-linear functions, trace alignment fails. Figure 5 shows an example of equivalent programs in CodeSem that trace alignment considers inequivalent. To establish the alignment of the execution traces of the two programs, trace alignment must consider variable tmp in the left-hand side of Figure 5 and buf in the right-hand side, which are both allocated on the heap. In addition, trace alignment also needs to deal with the non-linear function pow(). Since trace alignment can only reason with local variables (allocated on the stack) and linear functions, the tool fails to recognize that the two programs are semantically equivalent. In fact, we find 4,591 pairs of equivalent programs in the fine-tuning set of CodeSem that trace alignment fails to recognize precisely due to its overly simplistic alignment predicates. In contrast, for those (among the 4,591 pairs of equivalent programs) that are included in the test sets of models, CuBERT, CodeBERT, GGNN, and Graph Sandwiches achieve 85.6%, 83.1%, 74.5%, and 77.9% accuracy, respectively, indicating that all models are capable of recognizing the equivalence between programs even if their execution traces denote rather different semantics that trace alignment cannot align.

Fig. 5. Two equivalent programs that trace alignment considers inequivalent. In contrast, all four models correctly predict them to be equivalent. Code omitted by • • • is not related to the weakness of the tool.

Badihi et al. [2020] propose ARDIFF for enhancing the scalability of equivalence checking techniques based on symbolic execution. At the core of ARDIFF is a series of heuristics for identifying the parts of a program that can be pruned out to simplify the analysis. Despite being a notable step forward, ARDIFF does not solve a fundamental issue with symbolic execution in handling large programs: the path constraints can be too complex (e.g., in size or nonlinearity) for the underlying SMT solver to solve. In total, we find 5,113 pairs of equivalent programs in the fine-tuning set of CodeSem on which ARDIFF timed out due to this limitation. An example is provided in Figure 4 in the supplemental material. In contrast, CuBERT, CodeBERT, GGNN, and Graph Sandwiches achieve 83.7%, 86.2%, 80.1%, and 75.9% accuracy on those (among the 5,113 pairs) that are in their respective test sets. These findings suggest that models are significantly more effective in handling larger programs with more complicated path constraints.
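To make the limitation of the linear alignment predicate used by trace alignment concrete, the sketch below searches for parameters $c_1, c_2, k$ that satisfy $c_1 v_1 - c_2 v_2 = k$ at every aligned point of two value sequences. It is a deliberately simplified illustration, not the tool's algorithm: the function name and the toy traces are hypothetical, and real traces range over registers and stack locations at many program points.

```python
from itertools import product

COEFFICIENTS = (1, 2, 4, 8, 16)  # admissible values for c1 and c2


def find_linear_alignment(values1, values2):
    """Return (c1, c2, k) with c1*v1 - c2*v2 == k at every point, or None."""
    if not values1 or len(values1) != len(values2):
        return None
    for c1, c2 in product(COEFFICIENTS, repeat=2):
        # The first observation fixes the only possible k for this (c1, c2).
        k = c1 * values1[0] - c2 * values2[0]
        if all(c1 * v1 - c2 * v2 == k for v1, v2 in zip(values1, values2)):
            return c1, c2, k
    return None


# A byte offset and an element index are linearly related: offset = 4 * index.
offsets = [0, 4, 8, 12]
indices = [0, 1, 2, 3]
print(find_linear_alignment(offsets, indices))

# A value computed with pow(2, i) has no linear relation to the counter i.
powers = [2 ** i for i in range(5)]
counters = list(range(5))
print(find_linear_alignment(powers, counters))
```

The second query fails for every admissible coefficient pair, which mirrors why a program pair whose correspondence involves pow() cannot be aligned by a predicate of this form.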
Rêve converts two programs into logical verification conditions (VCs) and employs an SMT solver to determine their equivalence. Like ARDIFF, Rêve suffers from significant scalability issues. In addition, Rêve is limited to integer programs and does not support arrays. All of these are significant contributors to Rêve's poor performance. We find that 5,385 wrong results that Rêve produced on equivalent programs in the fine-tuning set of CodeSem are due to the aforementioned weaknesses.

Fig. 6. Aliases that all models fail to recognize. url (highlighted in shadow box) and colon_ptr (underlined) are aliases when exiting from the while loop (lines 14-16), in which case url equals colon_ptr.
6.5.3 The Weaknesses of Models. We also thoroughly analyze the mispredictions made by all models to identify their weaknesses. In alias prediction, we find that models often struggle with pointers that have many definition and use points, possibly due to a lack of precision in identifying the exact point at which aliasing occurs. Figure 6 shows an example where url is the pointer that has many definition and use points, and all models fail to identify it as an alias of colon_ptr when the execution exits from the second while loop (lines 14 to 16).
For equivalence prediction, we discover a specialized class of equivalent programs in CodeSem that pose significant challenges to all models, as their equivalence relies on important assumptions about the structure of input data. Consider the programs in Figure 7, which are equivalent only if the input string (represented by the parameter char *line) conforms to a specific format that involves two-level delimiters: semicolons as the first and commas as the second (e.g., three semicolon-terminated elements, each consisting of two comma-separated sub-elements). If the input string fails the format check (i.e., the regex_match() function), both programs return NULL; otherwise, they extract the $j$-th sub-element (with commas as the separator) within the $i$-th element of the string (with semicolons as the separator). The program on the left takes a natural approach of first splitting the string with semicolons to obtain the $i$-th element, and then splitting the $i$-th element with commas to obtain the $j$-th sub-element. The program on the right uses the two delimiters in reversed order: commas first and then semicolons. As a necessary processing step, the original index $(i, j)$ is adjusted in the following manner: when $j$ is 1, the index $(i, j)$ becomes $(i, j + 1)$ (i.e., $(i, 2)$); when $j$ is 2, the index becomes $(i + 1, 1)$ (lines 8 and 9). After the index adjustment, the (new) $j$-th sub-element within the (new) $i$-th element, obtained by splitting the string with commas and then semicolons, refers to the same character as the output of the program on the left. However, there is an exception to this rule when the character to be found is located at the very beginning of the input string, in which case the input string is split exactly once (with commas), and the first element is directly returned (lines 5 to 7). Overall, the two programs are indeed semantically equivalent, but recognizing their equivalence requires a complex reasoning procedure that models may not be capable of.

6.5.4 Outlook for Future Research. In light of the weaknesses of models, we suggest some directions for future research. First, to mitigate the decrease in model accuracy caused by variables with a high number of definition and use points in alias prediction, more fine-grained embedding methods could be explored. For instance, the precision of variable embeddings could benefit from the interaction between the two variables aliased with each other. The enhanced precision of variable embeddings could ultimately help to improve model accuracy in the alias prediction task. For equivalence prediction, training models to formally reason about program behavior can be a pathway forward. For example, pre-training models towards objectives pertaining to pre- or post-conditions (e.g., predicting post-conditions given the pre-condition and the statement to be executed) could help models capture the semantics of program statements at a deeper level. This, in turn, could improve their accuracy in the equivalence prediction task.

THREATS TO VALIDITY
Threats to External Validity. Our work is subject to certain external threats that may impact its validity. First, due to limitations in the available tool-chain and infrastructure, our study focuses exclusively on programs written in C/C++. It is therefore reasonable to question the generalizability of our findings to other programming languages. However, given that C/C++ remain widely used languages and are often the subject of programming language research, we believe that our findings are still significant and relevant. As for the choice of models, first, the models used in our evaluation are not the latest ones built upon large language models; however, our primary goal is to compare different neural architectures and traditional static analysis methods in alias and equivalence prediction. Also, we believe that our findings are likely to hold for the latest code models given their higher capability. Second, Graph Sandwiches has a particularly large design space, determined by the type of the sequence model's layers and how they interleave with GGNN layers. In our evaluation, we use RNN Sandwich, which wraps every message-passing layer in GGNN with an RNN, since it is one of the most accurate models according to the evaluation reported in [Hellendoorn et al. 2019].
Threats to Internal Validity. Human errors represent a threat to the internal validity of our study since the labeling process involves humans in the loop. Specifically, validating the results of LLVM and EqMiner is a rather tedious and error-prone task that could potentially affect the correctness of our findings. Despite these challenges, we have taken great care to minimize the impact of human errors by paying close attention to details. Given the practical limitations of our study, we believe that the potential risk of human errors is acceptable and should be tolerated.

RELATED WORK
In this section, we discuss three strands of related work: alias analysis, equivalence checking, and benchmarks for code models.

Thiessen and Lhoták [2017] introduce a context-sensitive analysis by combining CFL-reachability and $k$-limited context strings, so that it obtains the advantages of both methods. Phulia et al. [2020] design a sound must-not alias analysis to explore the optimization opportunity enabled by nondeterministic expression evaluation semantics. Wilson and Lam [1995] describe a flow- and context-sensitive pointer analysis algorithm for C programs that summarizes the behavior of procedures to increase its efficiency. Hardekopf and Lin [2009] present an inter-procedural, flow-sensitive pointer analysis that combines the idea of partial static single assignment and a heavy-analysis. Zhang et al. [2013] present two fast algorithms for Dyck-CFL-reachability on bidirected trees and graphs, and apply the algorithms to a context-insensitive alias analysis for Java [Yan et al. 2011]. Guo et al. [2019] propose a neural architecture to enhance the capability of Value Set Analysis to perform alias analysis at the binary level.

Churchill et al. [2019] introduce a method that builds a trace alignment for two given functions from a set of user-provided test cases and constructs a product program for equivalence checking. Sharma et al. [2013] present a data-driven algorithm for checking the equivalence of loops written in x86 assembly; in particular, it solves an over-approximated relationship of input states to output states of the two loops. Dahiya and Bansal [2017] present a black-box equivalence checker to verify transformations produced by modern compilers. Gupta et al. [2018] propose an equivalence checking algorithm that allows the inference of the required invariants through the generation of counter-examples using SMT solvers. On the machine learning side, Kommrusch et al. [2023] aim to find semantics-preserving rewrite rules to convert one program to another. Regarding compiler optimization, which is also related to equivalence checking, Trofin et al. [2021] propose a framework called MLGO to integrate machine learning techniques, including Policy Gradient and Evolution Strategies, into industrial compilers.

Benchmarks for Code Models
The work closest to ours is CodeXGLUE [Lu et al. 2021], which presents a benchmark of 10 tasks for model evaluation. CodeSem differs from CodeXGLUE in three ways. First, CodeSem is extracted from real-world programs while CodeXGLUE is a collection of programming solutions to algorithmic problems. Second, the prediction tasks CodeSem uses to evaluate models correspond to foundational program analysis tasks compared to those in CodeXGLUE (e.g., clone detection, code translation). Third, CodeXGLUE features only sequential models whereas CodeSem also considers graph models. Another recent work, CodeNet [Puri et al. 2021], presents a large-scale dataset, CodeNet. Like Lu et al. [2021], Puri et al. [2021] collect their data from online programming platforms while we assemble CodeSem from large-scale real-world programs. In addition, we propose two new tasks: alias prediction and equivalence prediction. Another related dataset is CodeSearchNet [Husain et al. 2019], which is used for the semantic code search task. Wang and Christodorescu [2019] propose COSET, a benchmark for evaluating machine learning models in learning the semantics rather than syntax of code.

CONCLUSION
In this paper, we present CodeSem, a first-of-its-kind large-scale, real-world, and high-quality dataset designed to evaluate deep learning models in two foundational tasks in program analysis: alias prediction and equivalence prediction. We also propose a general, novel learning approach that makes it possible for models to leverage results of static analysis methods. With this learning approach, we train four influential code models (CuBERT, CodeBERT, GGNN, and Graph Sandwiches) towards the two prediction tasks. Our evaluation shows that, in general, all models display satisfactory performance in both tasks. However, we also identify the specific weaknesses of each model that should be addressed in future work. We release all the code and evaluation data for public access, and hope that the scale, diversity, and authenticity of CodeSem will offer unprecedented opportunities in this interdisciplinary area of research.
Proc. ACM Program. Lang., Vol. 8, No. OOPSLA1, Article 112. Publication date: April 2024.

Fig. 3. The trend of models' accuracy on training and validation sets w.r.t. the number of epochs at each training stage in alias prediction. For each code model, the curves correspond to the median model that achieves the highest test accuracy among those adopting different def-use configurations and consuming different program representations. These median models (with the highest accuracies) are precisely the ones that learned from the train-test split presented in Table 4 and Section 3 of the supplemental material.
Fig. 4. The trend of models' accuracy on training and validation sets w.r.t. the number of epochs at each training stage in equivalence prediction. For each code model, the curves correspond to the median model that achieves the highest test accuracy among those consuming different program representations.

Table 2. The complexity of CodeSem evidenced by the length (resp., size) of token sequences (resp., graphs) rendered by programs in the dataset. Overall, CodeSem is sufficiently complex to ensure the difficulty of the alias and equivalence prediction tasks.

Table 3. Important hyperparameters for each code model.

Table 4. The train-test split of CodeSem on which CuBERT achieves the median accuracy.

Table 7. Model accuracy w.r.t. the distance of aliasing relations. The first/second number in every cell is the accuracy of a code model for must-alias/may-alias pairs.

Table 8. Accuracy of state-of-the-art equivalence checkers on CodeSem. Values in parentheses represent the accuracy of each tool for equivalent and inequivalent program pairs in the fine-tuning set of CodeSem, respectively. We count the result of a tool as incorrect if it timed out. Increasing the timeout parameter to around ten times its value does not help improve the performance of any tool.
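The accounting convention stated in this caption can be sketched as follows. The label strings, the function name, and the sample outcomes are hypothetical and serve only to show that a timeout is scored as a wrong answer in both the overall and the per-class accuracy.

```python
def checker_accuracy(outcomes):
    """Score (label, verdict) pairs; a 'timeout' verdict never matches a label.

    label is 'equivalent' or 'inequivalent'; verdict is one of those two
    strings or 'timeout'. Returns (overall accuracy, per-class accuracy).
    """
    totals = {"equivalent": 0, "inequivalent": 0}
    correct = {"equivalent": 0, "inequivalent": 0}
    for label, verdict in outcomes:
        totals[label] += 1
        if verdict == label:  # a timeout falls through and counts as incorrect
            correct[label] += 1
    evaluated = sum(totals.values())
    overall = sum(correct.values()) / evaluated if evaluated else 0.0
    per_class = {
        label: correct[label] / totals[label]
        for label in totals
        if totals[label]
    }
    return overall, per_class


# Hypothetical outcomes of one tool on five program pairs.
sample = [
    ("equivalent", "equivalent"),
    ("equivalent", "timeout"),
    ("equivalent", "inequivalent"),
    ("inequivalent", "inequivalent"),
    ("inequivalent", "timeout"),
]
overall, per_class = checker_accuracy(sample)
print(overall, per_class)
```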
In contrast to Rêve, CuBERT, CodeBERT, GGNN, and Graph Sandwiches achieved 85.1%, 81.3%, 77.4%, and 79.8% accuracy, respectively, on the subset of the 5,385 program pairs mishandled by Rêve that are in their respective test sets. This demonstrates that the models are not restricted to certain types of programs, such as integer or floating-point, with or without arrays.
Fig. 7. Two semantically equivalent programs that all four models incorrectly predict as inequivalent.
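The index adjustment discussed for Figure 7 in Section 6.5.3 can be illustrated with the following sketch. It is not the code of Figure 7: it is written in Python rather than C, and the function names, the regular expression standing in for the format check, and the bounds handling are assumptions made so that the example is self-contained.

```python
import re

# Assumed format: one or more "x,y;" elements whose parts contain no delimiter.
FORMAT = re.compile(r"(?:[^,;]+,[^,;]+;)+")


def extract_semicolons_first(line, i, j):
    """Split on ';' to get the i-th element, then on ',' for the j-th part."""
    if not FORMAT.fullmatch(line):
        return None
    elements = line.split(";")[:-1]  # drop the empty piece after the last ';'
    if not (1 <= i <= len(elements)) or j not in (1, 2):
        return None
    return elements[i - 1].split(",")[j - 1]


def extract_commas_first(line, i, j):
    """Split on ',' first, after adjusting the index as described in the text."""
    if not FORMAT.fullmatch(line):
        return None
    element_count = line.count(",")  # every element holds exactly one comma
    if not (1 <= i <= element_count) or j not in (1, 2):
        return None
    if i == 1 and j == 1:
        # Exception: the target starts the string, so split once and return.
        return line.split(",", 1)[0]
    if j == 1:
        i, j = i, 2       # (i, 1) becomes (i, 2)
    else:
        i, j = i + 1, 1   # (i, 2) becomes (i + 1, 1)
    return line.split(",")[i - 1].split(";")[j - 1]


line = "a,b;c,d;e,f;"
for i in range(1, 4):
    for j in (1, 2):
        print(i, j, extract_semicolons_first(line, i, j),
              extract_commas_first(line, i, j))
```

The two functions agree only because the format check guarantees exactly two sub-elements per element; without that assumption about the input, the adjusted index would point at a different piece, which is the kind of input-dependent reasoning that the models appear to miss.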