Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code?

This paper discusses the limitations of evaluating Masked Language Models (MLMs) on code completion tasks. We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities by neglecting the syntax rules of programming languages. To address these issues, we introduce SyntaxEval, a technique in which Syntactic Capabilities are used to enhance the evaluation of MLMs. SyntaxEval automates the process of masking elements in the model input based on their Abstract Syntax Trees (ASTs). We conducted a case study on two popular MLMs using data from GitHub repositories. Our results show negative causal effects between node types and MLMs' accuracy. We conclude that the MLMs under study fail to predict some syntactic capabilities.


INTRODUCTION
Large language models have demonstrated convincing performance across a range of software engineering (SE) tasks [5, 7, 23, 35, 36, 39-41]. In particular, code generation has been an important area of research for SE tasks such as code completion [8]. Code completion is a disciplined technique for generating the missing syntactic features of an incomplete snippet based on its semantic and structural context [4]. These syntactic features usually take the form of identifiers, function names, conditionals, or parameters, depending on the granularity of the snippet. Software researchers are particularly interested in improving code completion to optimize time spent during the development and maintenance cycles [12, 13]. Numerous studies have investigated code completion automation using machine learning [4, 15, 28, 42]. Current research has focused on exploiting deep learning representations using LSTMs [33], GPT [32], RoBERTa [20], and T5 [6, 9].
Masked Language Models (MLMs) have recently been used for code completion tasks, demonstrating promising results (an avg. accuracy of 38.7% in perfect predictions) at different masking levels (i.e., Token, Construct, and Block) [6]. Some studies suggest that MLMs statistically learn the underlying structure of Abstract Syntax Trees (ASTs) to a certain degree [19, 24, 37]. Yet, given the high accuracy achieved by MLMs [6], few attempts have been made to investigate the role of Syntactic Capabilities in evaluating code completion. Syntactic Capabilities are interpretable prediction estimates for terminal (Γ) and non-terminal (Σ) nodes of ASTs that are ruled by the Context Free Grammar (CFG) of a Programming Language (PL) [31].
To date, the primary focus in evaluating MLMs has been on accuracy as the principal metric, which may lead to erroneous and/or incomplete interpretations of the syntactic features embedded in neural architectures [25, 27, 37]. Relatively little is understood about incorporating these interpretable prediction estimates into the evaluation of MLMs; hence, current evaluation methods do not help practitioners decide whether MLMs are confidently generating code at AST node granularity, or to what extent these syntactic features affect general prediction performance. That is, these methods do not reveal information about syntactic capabilities and their causal effects on overall MLM performance.
Our study attempts to establish a causal connection between syntactic features, in the form of AST node types, and MLMs' performance. Under this premise, we introduce SyntaxEval, an approach that leverages syntactic capabilities to evaluate how well MLMs infer terminal (Γ) and non-terminal (Σ) AST nodes of a given PL. When evaluating the performance of an MLM, SyntaxEval selectively masks tokens according to the AST node types defined by the CFG. Subsequently, the MLM predicts the masked tokens. Finally, SyntaxEval measures the causal effect of AST node types on code completion performance.
Our results suggest that although MLMs homogeneously predict individual AST node types with high accuracy, we observed no evidence of effects from syntactic features on MLMs' predictions after controlling for confounding factors. Hence, no causal evidence supports the claim that MLMs statistically learn syntactic structures with acceptable confidence, contradicting recent studies in the explainability field [24, 37]. We hope that the results of our work will shed more light on the syntactic capabilities of current MLMs to enable a more systematic and rigorous evaluation of code completion tasks. The contributions of this paper are as follows: 1) a technique for evaluating the extent to which MLMs predict AST structures; 2) a case study that leverages causal analysis to understand how different AST node types influence code completion; 3) the experimental data, curated datasets, source code, and complementary statistical analyses used in this research, published in an open-source repository [1].

BACKGROUND & RELATED WORK
The accurate identification and generation of code tokens is a widely studied field at the intersection of SE and DL [38]. State-of-the-art code generators estimate token predictions using a probabilistic distribution (i.e., a Large Language Model (LLM)) obtained by training on large amounts of code corpora. Put simply, code completion models should statistically approximate the production rules defined by the CFG. These production rules are recursively applied to terminal (Γ) and non-terminal (Σ) nodes to formally define the structure of a PL. Recent explainability studies have claimed that the syntactic structures of code are encapsulated in the internal layers of LLMs across software tasks, implying a foundational statistical comprehension of code semantics [14, 37]. In this section, we introduce the concept of MLMs and their current evaluation methods.
Masked Language Models for Code. Considerable research attention has been directed toward the use of Bidirectional Encoder Representations from Transformers (BERT) for code completion as an attempt to push the predictability boundaries beyond next-token prediction. BERT allows higher-granularity syntax structures (i.e., entire code statements) to be generated using self-attention layers trained to restore a masked subset of tokens in the input [6, 10]. This form of training is known as denoising autoencoding, or Masked Language Modeling (MLM), which we formalize as follows: a masking rate r (usually 15%) is applied to an original sequence x of a training corpus D, and the model attempts to predict the set of masked tokens given the corrupted context x̂ (the masked version of x) [18]. MLMs for code completion are mostly evaluated using metrics such as CodeBLEU, EM, F1, and Pass@k [16].
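The corruption step of this training objective can be illustrated in a few lines of Python. The sketch below is a minimal illustration assuming whitespace tokenization and a literal `<mask>` label; real MLMs for code use subword tokenizers, so this is not the paper's implementation.

```python
import random

MASK = "<mask>"

def corrupt(tokens, rate=0.15, seed=0):
    """BERT-style corruption: mask a random `rate` fraction of a sequence.

    Returns the corrupted sequence x_hat and the gold labels for the
    masked positions, which the model is trained to recover.
    """
    rng = random.Random(seed)
    n = max(1, round(rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n)
    x_hat = list(tokens)
    labels = {}
    for p in positions:
        labels[p] = x_hat[p]  # remember the original token
        x_hat[p] = MASK       # replace it in the corrupted context
    return x_hat, labels

# 15% of a 12-token snippet -> 2 masked positions
tokens = "def add ( a , b ) : return a + b".split()
x_hat, labels = corrupt(tokens)
```

Training then maximizes the probability of each `labels[p]` given `x_hat`; at evaluation time, SyntaxEval replaces the random position choice with syntax-guided positions.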
Syntax-Based Evaluation of MLMs. Due to the unpredictable behavior of MLMs while generating tokens, explainability techniques serve as complementary evaluative methods for understanding the decision-making process by reducing model uncertainty. Such uncertainty can be controlled by exploring the inner layers of the neural net or by performing guided perturbations on the model's input [3]. Recent studies have explored the use of structural information as an interpretability tool for pre-trained models of code [25]. For instance, Wan et al. [37] conducted an explainability analysis focusing on three aspects: 1) how the self-attention weights align with the syntax structure, 2) whether the syntax structure is encoded in the hidden layers, and 3) how pre-trained models induce syntax structures. Similarly, Mohammadkhani et al. [24] propose an eXplainable AI method (attention mechanism) for three downstream code tasks: 1) code generation, 2) refinement, and 3) translation. These findings imply that encoder-based models can effectively extract detailed syntactic information using self-attention mechanisms. We used these prior observations about the encoded information of ASTs to formulate an evaluative approach based on measuring the prediction performance of syntactic capabilities directly on (non-)terminal nodes.

SYNTACTIC CAUSAL EVALUATION
SyntaxEval is an evaluative approach organized into two distinct parts. The first part estimates fine-grained performance, grouped by AST node types, for a given MLM (RQ1). The second part adopts causal interpretability theory to quantify the influence of the previously estimated AST node types on the accuracy of the model (RQ2).
Evaluating Syntactic Capabilities. Fig. 1 depicts the process of evaluating syntactic capabilities for code completion using MLMs. This evaluative process comprises five steps. First, we define a set of AST node types θ to be analyzed. This set is ruled by the Python CFG and adopts the form θ = Γ ∪ Σ. We then search for the positions of these node types θ in the code sequence, iterating over each sample s_i (i.e., snippet) of a given ground truth S (Fig. 1-1). Second, the detected tokens corresponding to the previously defined set θ are masked with the label <mask> (Fig. 1-2). Third, we use an MLM to infer the masked tokens for each sample s_i, obtaining a set Ŝ of predicted samples ŝ_i (Fig. 1-3). Fourth, we parse the ASTs of s_i and ŝ_i to generate lists of extracted nodes for the ground truth and predicted samples using the in-order traversal algorithm (Fig. 1-4). Finally, we compare the ground truth s_i and predicted ŝ_i lists of extracted nodes by computing three similarity metrics for each sample (i.e., Jaccard, Levenshtein & Sorensen-Dice) (Fig. 1-5).
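The node extraction and comparison steps (four and five) can be sketched as follows. This is a minimal sketch, assuming Python's builtin `ast` module as a stand-in for the paper's tree-sitter parser (and `ast.walk`'s breadth-first order as a stand-in for a strict in-order traversal), with straightforward implementations of the three normalized similarity metrics.

```python
import ast

def node_types(src):
    """List of AST node type names for a Python snippet."""
    return [type(n).__name__ for n in ast.walk(ast.parse(src))]

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def sorensen_dice(a, b):
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 1.0

def levenshtein_sim(a, b):
    """Normalized similarity: 1 - edit_distance / max(len)."""
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return 1 - prev[n] / max(m, n)

# compare the node lists of a ground-truth and a predicted snippet
gt, pred = node_types("x = 1"), node_types("x = f(1)")
score = jaccard(gt, pred)  # 1.0 would mean identical node-type sets
```

All three metrics are normalized to [0, 1], so per-sample scores can be aggregated directly into the global and local results described later.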
Computing Causal Interpretability. Causal inference has been adopted to complement the assessment of LLMs by controlling for confounding factors in code data. Palacio et al. [25] introduce a post hoc interpretability methodology that explains model predictions by providing causal explanations. These explanations are generated by estimating the effect of binary interventions T, such as masking random tokens (T^0) versus masking AST node types (T^1), on MLMs' performance. Specifically, in SyntaxEval, the treatment T^1 refers to samples that are masked on AST node tokens θ, while the control T^0 refers to samples that are randomly masked at any position. The control T^0 preserves the same number of masked tokens as T^1.
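The two interventions can be sketched as follows. This is an illustration only: the stdlib `tokenize` module's token types (e.g., `NAME` for identifiers) stand in for tree-sitter AST node types, and the control masks exactly as many positions as the treatment, as the text requires.

```python
import io
import random
import tokenize

MASK = "<mask>"

def lex(src):
    """Lex a snippet, dropping layout-only tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER}
    return [t for t in tokenize.generate_tokens(io.StringIO(src).readline)
            if t.type not in skip]

def mask_at(words, positions):
    out = list(words)
    for p in positions:
        out[p] = MASK
    return out

def treatment_and_control(src, tok_type=tokenize.NAME, seed=0):
    """T^1: mask every token of the chosen type.
    T^0: mask the same number of randomly chosen positions."""
    toks = lex(src)
    words = [t.string for t in toks]
    t1_pos = [i for i, t in enumerate(toks) if t.type == tok_type]
    t0_pos = random.Random(seed).sample(range(len(words)), len(t1_pos))
    return mask_at(words, t1_pos), mask_at(words, t0_pos)

t1, t0 = treatment_and_control("def add(a, b): return a + b")
```

Note that `tokenize` labels keywords such as `def` and `return` as `NAME` tokens, so a real implementation on AST node types would distinguish them; the sketch only demonstrates the paired-intervention design.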
SyntaxEval formulates a Structural Causal Model (SCM), a graphical model composed of outcomes, treatments, and confounders [26], to explain a set of potential outcomes Y (e.g., Jaccard, Levenshtein, Sorensen-Dice) in terms of treatments T (i.e., masked AST node types) while controlling for a set of code confounders Z to avoid spurious correlations. These code confounders consist of seven variables: the # of parsing errors, the height of the AST, the # of nodes, the # of whitespaces, the # of lines of code, the cyclomatic complexity, and the token counts. Finally, SyntaxEval computes the Average Treatment Effect (ATE) that a treatment T has on the outcomes Y after controlling for the confounders Z. In other words, we want to estimate the expected value ATE = E[Y^1 − Y^0], where Y^1, Y^0 refer to the potential outcomes observed under the treatments T^1, T^0. For the sake of brevity, we do not discuss the details of treatment effect computations; these effects are approximated using propensity score methods after applying the back-door criterion [30].
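To make the estimand concrete, the sketch below computes the ATE by back-door adjustment over a single discrete confounder on toy data. The paper approximates the effect with propensity score methods over seven code confounders, so this is only an illustration of the expectation above, not the paper's estimator.

```python
from collections import defaultdict

def ate_backdoor(rows):
    """ATE via back-door adjustment over one discrete confounder Z:
    ATE = sum_z P(Z=z) * (E[Y | T=1, Z=z] - E[Y | T=0, Z=z])."""
    by_z = defaultdict(lambda: {0: [], 1: []})
    for t, z, y in rows:
        by_z[z][t].append(y)
    n = len(rows)
    ate = 0.0
    for z, g in by_z.items():
        p_z = (len(g[0]) + len(g[1])) / n          # P(Z=z)
        diff = sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0])
        ate += p_z * diff
    return ate

# toy rows: (treatment T, confounder level Z, Jaccard outcome Y)
rows = [(1, 0, 0.8), (0, 0, 0.9), (1, 1, 0.6), (0, 1, 0.7)]
effect = ate_backdoor(rows)  # negative: AST masking hurts performance
```

A negative value, as in the paper's findings, means the syntax-masked treatment yields lower similarity scores than random masking after adjustment.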

CASE STUDY DESIGN
This section outlines the methodology employed to consider the potential influence of syntactic capabilities on the evaluation of MLMs. We conducted a case study on two popular architectures to explore the following RQs: RQ1 [Performance]: How good are MLMs at predicting AST nodes? RQ2 [Causality]: How do node types impact MLMs' performance? Data Collection: To mitigate the risk of data snooping, we curated our testbed with 50 Python snippets. This testbed exclusively comprises commits executed between January 01, 2022 and January 01, 2023. We collected the snippets from newly added or updated Python GitHub repositories with over 1k stars. Additionally, we discarded duplicated samples by referring to the commit history. The testbed also contains complementary code features (e.g., LoC, CYCLO, and # of nodes), which were extracted using the Galeras pipeline [29]. Masked Language Models: We evaluated two encoder-based transformers trained on CodeSearchNet [17] with different hyperparameters (see Tab. 1). These encoders have been assessed in prior studies, which found that they capture structural information [37] and [24]. Node Types: The Tree-sitter CFG defines 196 AST node types for Python. For the sake of simplicity, we selected a subset of the terminal and non-terminal nodes defined in Python's CFG, as depicted in the first column of Tab. 2. The subset entails the most basic syntactic structures for control, iteration, operators, and functional programming. This study showcases the nodes that exhibited the most interesting behavior. We chose Python for our code completion experiments due to its extended use in recent studies.
Evaluation Methodology. To address RQ1, we estimated the syntactic capabilities of the M1 and M2 encoders using 8 randomly selected samples from the collected testbed. SyntaxEval masks the associated tokens for each chosen node type (T^1) and subsequently uses the MLM to infer the missing elements. Then, we compute normalized similarity distances (i.e., Jaccard, Levenshtein, and Sorensen-Dice) between the AST in-order traversals of the predictions and the ground truth. Global results indicate the average prediction accuracy (i.e., normalized distance) for all node types within θ. In contrast, local results detail the prediction accuracy for individual node types.
To address RQ2, SyntaxEval computes the Average Causal Effect between syntactic capabilities and MLMs' performance. This method consists of estimating the ATE using treatments T^1 and T^0 (i.e., randomly masked tokens) while controlling for the confounders Z (the code features in Data Collection) to mitigate the presence of spurious correlations. The removal of confounding bias can be formally achieved using both an SCM and the do-operator introduced by Pearl et al. [26]. To verify the robustness of our SCM, we computed placebo refutations, a method that fakes an unrelated treatment and re-estimates the causal effects; that is, we assessed whether the causal effects of the fake treatment on the outcome were close to zero. Moreover, to ensure a balanced distribution of randomly masked tokens within T^0, we created 20 distinct variations for each sample and computed the average of the resulting similarity scores. Finally, to ensure statistical significance, we bootstrapped the similarity scores with 500 resamples per node type.
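The bootstrap and placebo-refutation steps can be sketched as follows, on toy similarity scores rather than the paper's data; the 500-resample count follows the text above, and the placebo here is a simple label shuffle, which is one common way such refuters are implemented.

```python
import random
import statistics

def bootstrap(scores, n_boot=500, seed=0):
    """Bootstrap the mean similarity score over n_boot resamples."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.choices(scores, k=len(scores)))
             for _ in range(n_boot)]
    return statistics.mean(means), statistics.stdev(means)

def placebo_effect(treated, control, seed=0):
    """Placebo refutation: reassign outcomes to fake treatment groups
    at random; a sound causal model should re-estimate an effect
    close to zero under this fake treatment."""
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    rng.shuffle(pooled)
    fake_t = pooled[:len(treated)]
    fake_c = pooled[len(treated):]
    return statistics.mean(fake_t) - statistics.mean(fake_c)

m, s = bootstrap([0.81, 0.92, 0.74, 0.88, 0.69])
fake = placebo_effect([0.5] * 10, [0.9] * 10)
```

If `fake` were as large as the real estimated effect, the SCM would be suspect; in the paper's setting the placebo effects were close to zero.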

RESULTS & DISCUSSION
The aim of this study is to determine the effect of Syntactic Capabilities, in the form of interpretable prediction estimates for node types, on the prediction performance of MLMs.We concentrated on evaluating Encoder-based Transformers beyond accuracy.

RQ 1 Syntactic Capabilities Performance
Global Results. A cursory glance at Tab. 2 reveals that the control groups T^0 of each performance metric are not significantly different from the treatments T^1 for either encoder. For example, control median values greater than 0.8 fall within the interquartile range (IQR) 0.78 ± 0.22 of the corresponding treatment. Furthermore, the standard deviation (σ) values of the performance are predominantly more dispersed in the treatments than in the control. For example, the T^1 σ of M1's Jaccard is 0.21, while the T^0 σ is 0.17. Notably, all average performance values are above 0.5, indicating that the M1 and M2 models predict masked tokens with high confidence regardless of the treatment group. Although the median global performance is consistently high across the metrics (> 0.8), the separation between the treatment groups is not significant, with an average median distance of 0.096 and 0.06 for M1 and M2, respectively. However, a preliminary analysis of the node type estimations suggests that M1 and M2 tend not to statistically learn syntactic-oriented masked tokens T^1. Our findings reveal a subtle inclination toward predicting randomly masked tokens over syntactic-oriented ones.

Local Results. Fig. 2 shows the statistical behavior of the Jaccard performance across selected node types for M1. Due to the non-overlapping IQRs between T^0 and T^1, we observed a significant difference between the treatment groups in the performance distribution for the nodes comparison_operator and string, revealing that M1 struggles to predict tokens associated with such types in contrast to randomly masked tokens. We found that identifier was the only node type that performed better in the treatment than in the control group. Fig. 3 presents the Empirical Cumulative Distribution (ECD) plots of M1's Jaccard distance across selected node types. We observed that if_clause achieved the highest prediction score (0.9) at the lowest percentage of the population (42% of the samples in the testbed). Conversely, for_statement was the most difficult node to predict across the population. We believe that MLMs struggle to predict these nodes due to their complexity: a node is complex when its block incorporates other node types.

RQ1: MLMs tend to complete missing AST-masked tokens with acceptable accuracy (> 0.5). However, the reported performance suffers from high variability (±0.21), making the prediction process less confident compared to completing randomly masked tokens.

RQ 2 Causal Evaluation Effect
This study used a quantitative causal technique to analyze the influence of the binary masking treatments (i.e., AST and random) on the performance of both the M1 and M2 transformers after defining the Structural Causal Model of the problem. To draw a causal link between syntactic features (i.e., AST nodes) and the performance metrics (i.e., Jaccard, Levenshtein, and Sorensen-Dice), we expect to observe a positive causal effect. A positive effect would indicate that syntactic features affect the models' performance and that AST nodes are statistically learned by MLMs. Conversely, a negative causal effect would imply that randomly masked tokens have more influence on performance; that is, tokens without any particular syntactic order are being predicted accurately.
Contrary to prior assumptions, it can be inferred from Tab. 2 that the control group (i.e., the random masking treatment) has more impact on MLMs' performance than the actual syntactic features. For instance, a set of samples masked on for_statement tokens underperforms (i.e., negative effects) compared to the same set with randomly masked tokens. This suggests that although transformers predict AST node types with confidence (see Fig. 3), these syntactic features are not particularly relevant compared to predicting any other set of unstructured tokens in the snippet (see the gray areas in Tab. 2). These findings corroborate the research of Karmakar et al. [19], in which MLMs do not fully grasp the syntactic and structural aspects of code. Our findings offer an alternative perspective to claims made by other probing approaches [22, 34]. For example, Hernandez Lopez et al. [14] argue for the presence of a syntax subspace within the hidden layers that encodes the structures of PLs. Similarly, Toufique et al. [2] outline that pre-trained language models learn robust representations of code semantics, which implies a deep understanding of syntax elements in source code.

CONCLUSION & FUTURE PLANS
Our negative causal effect results corroborate recent findings that expose flaws in the claim that MLMs understand the syntax rules of PLs. Such effects amplify the disparities between natural and programming languages, underscoring the need for tailored representations in deep learning architectures. These findings pave the way for future research evaluating semantic capabilities in the form of recursion, dead code, or code smells. We also highlight the necessity of delving deeper into why MLMs are more adept at predicting randomly masked tokens than syntax-based tokens. This tendency may be linked to the models' pre-training objectives, which frequently involve masking random tokens at a certain rate [10].

Figure 1: SyntaxEval Process for the identifier AST Node.

RQ2: The performance of MLMs is negatively impacted by AST-masked tokens (ATE < −0.1). Our causal analysis yielded no signs of the Transformers' performance being affected or guided by syntactic features, contradicting SOTA explainability findings.

Table 2: Global Performance and Causal Effects for M1 and M2. Medians are > 0.8. The biggest causal effect (ATE) for each node type is in gray.