Beyond Accuracy: Evaluating Source Code Capabilities in Large Language Models for Software Engineering

This dissertation aims to introduce interpretability techniques that evaluate the performance of Large Language Models (LLMs) in software engineering tasks beyond canonical metrics. In software engineering, Deep Learning techniques are widely employed across various domains, automating tasks such as code comprehension, bug fixing, code summarization, machine translation, and code generation. However, the prevalent use of accuracy-based metrics for evaluating Language Models trained on code often leads to an overestimation of their performance. Our work seeks to propose novel and comprehensive interpretability techniques to evaluate source code capabilities and provide a more nuanced understanding of LLM performance across downstream tasks.


PROBLEM AND RESEARCH STATEMENT
With the advent of Large Language Models (LLMs), code generation has been addressed as a machine learning problem in downstream tasks such as code completion [11], program repair [10], and test case generation [26]. Moreover, industry interest in leveraging LLMs for code completion has also grown, as evidenced by tools such as Microsoft's IntelliCode [18], Tabnine [2], OpenAI's Codex [28], and GitHub's Copilot [12]. Adopting LLMs for code in practical settings, however, brings potential risks associated with the efficacy and trustworthiness of these models [17]. LLMs, which are often described in terms of probabilistic distributions, can sometimes generate correct outputs with high confidence even when the input features do not contain semantically meaningful information (e.g., predicting identifiers from indentation tokens in the prompt). This erroneous condition is formally known as overinterpretation [15]. The intricate nature of LLMs, combined with the way information is learned and encoded in their hidden layers, makes the evaluation process particularly complex.
Although scaling up LLMs has been shown to significantly enhance their performance [27] by giving rise to emergent abilities, LLMs for code have not been rigorously evaluated on those abilities. In fact, such emergent abilities have not yet been fully explored by software engineering researchers. A key unresolved problem in the field of Deep Learning for Software Engineering (DL4SE) is understanding the specific conditions under which an LLM learns complex semantics from mere observational data. This lack of understanding represents a significant gap in the evaluation of neural code models (NCMs), particularly in explaining these emergent behaviors. Despite the significant progress in the development of techniques for assessing LLMs, the evaluation of code generation models in software engineering remains limited. In other words, current canonical metrics fall short of offering insights into the extent to which the evaluated models learn meaningful semantic information from source code.
Current evaluation approaches omit coarse-grained semantic properties such as Reliability, Maintainability, Correctness, Security, and Robustness. These semantic properties not only reflect the true quality of code predictions but also represent a crucial distinguishing factor between the capabilities of human programmers and those of trained models. By considering these enriched semantic aspects in the evaluation of proposed architectures, we can improve future research in deep learning for software engineering.
In this dissertation, we aim to extend the boundaries of DL4SE evaluation by developing techniques to assess the behavior of LLMs beyond accuracy. We also seek to bridge the gap between canonical accuracy metrics (e.g., BLEU [22], ROUGE [16], METEOR [4], CodeBLEU [1]) and meaningful semantic information. Our approach adopts elements from control- and data-flow analysis and incorporates mathematical frameworks such as Causal Inference (CI) and Category Theory.
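To make this gap concrete, consider a minimal sketch that scores two hypothetical one-line completions against a reference using NLTK's sentence-level BLEU (only one of several BLEU variants used in practice): one candidate is lexically close but semantically wrong, the other is semantically equivalent but lexically different. The token lists below are illustrative assumptions, not drawn from our datasets.

# Minimal sketch: BLEU rewards surface overlap, not semantics.
# Requires NLTK (pip install nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["return", "a", "+", "b"]           # ground-truth completion
semantically_wrong = ["return", "a", "-", "b"]  # high token overlap, wrong behavior
semantically_equiv = ["return", "b", "+", "a"]  # different order, same behavior (for ints)

smooth = SmoothingFunction().method1  # avoid zero scores on short sequences
for name, candidate in [("wrong", semantically_wrong), ("equivalent", semantically_equiv)]:
    score = sentence_bleu([reference], candidate, smoothing_function=smooth)
    print(f"{name}: BLEU = {score:.3f}")

On this toy example, the semantically wrong candidate typically scores at least as high as the semantically equivalent one, which is precisely the kind of mismatch between lexical overlap and meaning that our evaluation framework targets.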

PROPOSED RESEARCH
Considering the recent literature on interpretability and the inherent challenges in comprehensively evaluating the decisions made by LLMs in software engineering tasks, our approach is threefold. First, we aim to gain a comprehensive understanding of the nature of the semantic information these models are expected to capture. Second, we will evaluate how well current architectures learn such knowledge and the underlying reasons behind their decision-making processes. Finally, we propose developing error detection and evaluation tools to provide software practitioners and researchers with effective means to assess the reliability of selected models for specific downstream tasks. The following subsections explain the rationale for each point.

Modeling Semantics
In simpler terms, Language Models are probability distributions $P(w_t \mid w_1, w_2, \ldots, w_{t-1})$, where the output $w_t$ at time step $t$, given the input sequence $(w_1, w_2, \ldots, w_{t-1})$, is inferred through the conditional probability $P(w_t \mid h_t)$. The hidden state $h_t$ encapsulates the properties of the preceding context. LLMs are trained on huge amounts of data, and the text used in training is completely unstructured. In other words, no grammatical or semantic rules are explicitly given to the model, yet such complex architectures seem to produce coherent outputs. Producing coherent outputs implies that some semantic information of the programming language (PL) must also be learned (e.g., a return statement should always be at the end of a function). LLMs learn this complex information just by seeing samples of coherent text.
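As a minimal, self-contained illustration of this formulation, the distribution $P(w_t \mid h_t)$ can be read as a softmax over logits produced from the hidden state. The vocabulary, hidden state, and output weights below are made-up toy values, not taken from any trained model.

import numpy as np

# Toy vocabulary and a made-up hidden state h_t summarizing the prefix (w_1, ..., w_{t-1}).
vocab = ["return", "if", "x", "+", "1", ";"]
h_t = np.array([0.2, -1.3, 0.7, 0.05])  # hidden state (illustrative values)
W_out = np.random.default_rng(0).normal(size=(len(vocab), h_t.size))  # output projection

logits = W_out @ h_t                   # unnormalized score for each vocabulary token
probs = np.exp(logits - logits.max())  # numerically stable softmax
probs /= probs.sum()                   # P(w_t | h_t) over the toy vocabulary

for token, p in sorted(zip(vocab, probs), key=lambda tp: -tp[1]):
    print(f"P({token!r} | h_t) = {p:.3f}")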
PLs are not only algebraic (i.e., elements are combined to form coherent text) but also statistical [6]. For example, the expression 'x++' may occur more frequently than 'x += 1', and the meaning of a token such as a variable is defined by the totality of expressions in which it appears. With this premise, we will incorporate elements from category theory [14] to represent PLs as categories. Given the functor $\mathrm{Hom}(x, -)$ for a token $x$, we can obtain the totality of expressions in the dataset that define the meaning of $x$. We plan to model the semantics of every expression in the datasets used to train Language Models by obtaining functors through the Yoneda Embedding [8]. This approach has been explored by Bradley et al. [7] to model the semantics of natural language (English).
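A drastically simplified, set-level sketch of this idea (ignoring the enriched-category machinery of Bradley et al. [7]; the tiny corpus below is an illustrative assumption) treats the "meaning" of a token as the collection of corpus expressions in which it occurs, a crude analogue of evaluating $\mathrm{Hom}(x, -)$ over a dataset.

from collections import defaultdict

# Tiny illustrative corpus of expressions (stand-in for a real training dataset).
corpus = [
    "x = x + 1",
    "x += 1",
    "for i in range(n): total += i",
    "return total",
    "if x > 0: return x",
]

def contexts(token: str, expressions: list[str]) -> set[str]:
    """Set-level stand-in for Hom(token, -): every expression the token occurs in."""
    return {expr for expr in expressions if token in expr.split()}

# Two tokens are (crudely) similar in meaning if they occur in overlapping contexts.
meaning = defaultdict(set)
for tok in {"x", "total", "i"}:
    meaning[tok] = contexts(tok, corpus)

for tok, exprs in meaning.items():
    print(tok, "->", sorted(exprs))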

Evaluating Capabilities
In recent years, a growing interest has emerged in not only evaluating LLMs for code but also providing interpretations of how these models arrive at their predictions [5, 13, 19-21, 24, 25]. Unfortunately, the current evaluation process overly relies on accuracy, leaving no consensus as to what other properties or SE settings impact the code generation process.
We begin by adapting to our approach the concept of capability introduced by Ribeiro et al. [23]. A source code capability is a feature inherent in a programming language that represents a concept easily understood by a software engineer. Based on this premise, we define three types of source code capabilities: Linguistic, Semantic, and Quality.
Linguistic capabilities are evaluated at a fine-grained level, focusing on the model's ability to capture the latent syntactic and grammatical features in the training data. For instance, we seek to understand which types of grammar elements most influence code predictions.
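One possible way to probe a linguistic capability, sketched below under strong simplifying assumptions, is a behavioral perturbation test in the spirit of Ribeiro et al. [23]: alter a single grammar element (here, indentation width, which should not change program meaning when applied consistently) and check whether the model's next-token prediction changes. The predict_next callable is a hypothetical stand-in for any LLM inference API, and the dummy model exists only to make the sketch runnable.

from typing import Callable

def indentation_sensitivity(prompt: str,
                            predict_next: Callable[[str], str]) -> bool:
    """Return True if the prediction changes when only indentation is perturbed.

    A change suggests the model leans on semantically irrelevant tokens
    (a symptom of overinterpretation), at least for this prompt.
    """
    original = predict_next(prompt)
    perturbed_prompt = prompt.replace("    ", "  ")  # shrink 4-space indents to 2
    perturbed = predict_next(perturbed_prompt)
    return original != perturbed

# Usage with a dummy model that (purely for illustration) keys off indentation width.
dummy_model = lambda p: "counter" if "    " in p else "total"
prompt = "def increment(counter):\n    return "
print(indentation_sensitivity(prompt, dummy_model))  # True for this dummy model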

Very Busy Expressions
An expression e is very busy at a program point p if, no matter which path is taken from p, e is evaluated before any of its operands are defined.

Dead Code
A statement x = e can be considered dead code if the value of x is not used later in the program.

Redundant Expressions
An expression is redundant at a point if the value it computes is already available on some or all paths through the program to that point.

Vulnerabilities
Source code vulnerabilities as defined in the Common Weakness Enumeration (CWE) [3].

Table 1: Examples of Quality Capabilities to be addressed in this dissertation.
Semantic capabilities are related to the domain knowledge encoded in the generated output. We propose to incorporate sophisticated frameworks such as Category Theory [14] and Mechanistic Interpretability [9] to model the semantics of PLs and to reverse-engineer the internals of an LLM, in order to explain and evaluate the efficacy of these models beyond accuracy.
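One concrete, simplified instrument in this direction (not the full mechanistic-interpretability pipeline) is a linear probe: a small classifier trained to recover a semantic property, such as a token's AST node type, from a model's hidden states. High probe accuracy would suggest the property is linearly encoded in the representation. The sketch below uses synthetic random data purely as a placeholder for real hidden states and labels.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Placeholder: hidden states (one 64-d vector per token) and a semantic label per
# token (e.g., 0 = identifier, 1 = keyword, 2 = operator). In practice these would
# come from a real LLM and a parser, not from random synthetic data.
hidden_states = rng.normal(size=(600, 64))
labels = rng.integers(0, 3, size=600)

X_train, X_test, y_train, y_test = train_test_split(hidden_states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# With purely random features, accuracy should hover near chance (~0.33); substantially
# higher accuracy on real hidden states would indicate the semantic property is
# (linearly) recoverable from the representation.
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")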
Finally, Quality capabilities aim to assess the ability of LLMs to produce non-buggy, high-quality, and optimized code (e.g., with respect to code smells, vulnerabilities, and compiler optimizations). Table 1 summarizes the most relevant sub-capabilities in this category.
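As a simple, hypothetical illustration of the first three sub-capabilities in Table 1, the following Python fragment contains dead code, a very busy expression, and a redundant expression; the function is contrived purely for demonstration.

def demo(a: int, b: int, flag: bool) -> int:
    unused = a * 10           # Dead code: 'unused' is never read after this assignment.
    if flag:
        result = a + b        # 'a + b' is evaluated on this path ...
    else:
        result = (a + b) * 2  # ... and on this one, so it is very busy at the branch point.
    again = a + b             # Redundant expression: 'a + b' is already available
                              # on every path reaching this statement.
    return result + again

print(demo(1, 2, True))       # 6: (1 + 2) + (1 + 2)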

Building Evaluation Tools
Our final goal is to design, develop, and publish usable tools that incorporate the previous elements for error detection and evaluation of LLMs for code. A key feature of our tools will be their ability to translate complex model decisions into human-understandable knowledge in terms of source code capabilities (i.e., Linguistic, Semantic, and Quality).
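As a rough sketch of the kind of output such a tool could surface (the capability names follow the taxonomy above, but the data structure, field names, and scores are purely hypothetical), a per-prediction report might aggregate evidence by capability:

from dataclasses import dataclass, field

@dataclass
class CapabilityReport:
    """Hypothetical per-prediction report grouping evidence by capability."""
    prediction_id: str
    linguistic: dict[str, float] = field(default_factory=dict)  # e.g., grammar-element influence
    semantic: dict[str, float] = field(default_factory=dict)    # e.g., probe or causal-effect scores
    quality: dict[str, bool] = field(default_factory=dict)      # e.g., detected quality issues

    def summary(self) -> str:
        issues = [name for name, flagged in self.quality.items() if flagged]
        return (f"prediction {self.prediction_id}: "
                f"{len(issues)} quality issue(s) flagged: {', '.join(issues) or 'none'}")

# Hypothetical usage with made-up values.
report = CapabilityReport(
    prediction_id="sample-001",
    linguistic={"identifier_influence": 0.42},
    semantic={"ast_node_probe_accuracy": 0.71},
    quality={"dead_code": True, "redundant_expression": False},
)
print(report.summary())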

ANTICIPATED CONTRIBUTIONS
Through the proposed research, we aim to provide a comprehensive framework for understanding why Large Language Models (LLMs) make specific predictions, framed in terms of source code capabilities for software engineering downstream tasks. Our goal is to design a novel evaluation framework grounded in mathematical theory, incorporating elements from category theory and static syntax analysis. We anticipate that our contribution will empower software practitioners with a deeper understanding of the dynamics of LLMs, thereby enhancing their ability to assess and ensure reliability in practical scenarios. All datasets, software tools, and results from our study will be made publicly available to ensure verifiability.