Beyond Accuracy and Robustness Metrics for Large Language Models for Code

In recent years, Large Language Models for code (LLMc) have transformed the landscape of software engineering (SE), demonstrating significant efficacy in tasks such as code completion, summarization, review, tracing, translation, test case generation, clone detection, and bug fixing. Notably, GitHub Copilot [31] and Google's CodeBot [21] exemplify how LLMc contributes to substantial time and effort savings in software development. However, despite their widespread use, there is a growing need to thoroughly assess LLMc, as current evaluation processes heavily rely on accuracy and robustness metrics, lacking consensus on additional influential factors in code generation. This gap hinders a holistic understanding of LLMc performance, impacting interpretability, efficiency, bias, fairness, and robustness. The challenges in benchmarking and data maintenance compound this issue, underscoring the necessity for a comprehensive evaluation approach. To address these issues, this dissertation proposes the development of a benchmarking infrastructure, named HolBench, aimed at overcoming gaps in evaluating LLMc quality. The goal is to standardize testing scenarios, facilitate meaningful comparisons across LLMc, and provide multi-metric measurements beyond a sole focus on accuracy. This approach aims to decrease the costs associated with advancing LLMc research, enhancing their reliability for adoption in academia and industry.

Therefore, there is an increasing interest in thoroughly assessing these LLMc [4,12,13,36] to establish standardized criteria for evaluating the quality of generated code. However, the current evaluation procedure heavily depends on accuracy metrics [2,14,37] and robustness metrics [29], without a unanimous agreement on which additional features or properties influence the code generation process. In other words, we are currently not able to holistically evaluate some of the factors that influence the quality of LLMc across different scenarios [11]. Hence, the problem remains that, when attempting to understand the prediction performance of LLMc, no benchmarks are available to articulate a broad set of desiderata beyond accuracy, leading to concerning issues related to interpretability, efficiency, bias, fairness, and robustness. The current state of LLMc evaluation is not unexpected, given the inherent challenges in benchmarking, in collecting and cleaning data, and in maintaining the necessary tools, metrics, and datasets. Addressing these challenges is crucial for supporting the swift progress of research in the field of LLMc.
This research aims to develop a benchmarking infrastructure that overcomes the challenges of conducting high-impact research on evaluating the quality of Large Language Models for Code. We will conduct a range of research activities to better understand the current barriers to holistically benchmarking LLMc and the ways in which a future community infrastructure can help researchers address existing key challenges. We envision creating an infrastructure for a Holistic Benchmark Evaluation for LLMc (HolBench) that standardizes testing scenarios to meaningfully compare different LLMc and provides multi-metric measurement beyond a single accuracy-centered view. We posit that by holistically benchmarking LLMc, the cost of advancing research topics related to LLMc will decrease substantially, making models more reliable for adoption by academia and industry.

EXPECTED CONTRIBUTIONS
HolBench is aimed at closing three key open gaps that researchers and practitioners face when evaluating LLMc: (1) Gap 1: Collecting and benchmarking SE-based data to keep pace with new and enhanced LLMc. While initial efforts have been made to benchmark the evaluation of LLMc [9,20], nearly all publicly available datasets and benchmarks are not tailored for automatically curating testbeds. This issue persists because testbed curation requires a platform to centralize and standardize datasets that keeps pace with ever-growing and ever-improving LLMc. To address Gap 1, we propose an automated and ongoing pipeline for curating testbeds, constituting the first element of our proposed infrastructure. This component can generate organized and validated artifacts and testbeds that meet the requirements for comprehensive evaluations.
(2) Gap 2: Encompassing the extensive range of LLMc capabilities. Existing benchmarks and datasets have not comprehensively addressed the software scenarios (i.e., anticipated use cases for LLMc) essential for ensuring a baseline of quality assurance for the models' potential capabilities. Acknowledging the impracticality of testing LLMc across all conceivable combinations of software properties (e.g., tasks, programming languages, granularity of artifacts), we introduce a second component of our infrastructure. This component aims to enhance the coverage of LLMc capabilities by delineating critical software scenarios. Effective and valuable LLMc should exhibit the diverse range of capabilities that developers and researchers expect from these models.
(3) Gap 3: Holistically evaluating LLMc with metrics beyond accuracy. Current evaluation practices for LLMc have predominantly emphasized accuracy-centered metrics such as F1, Recall, AUC, and BLEU [9]. Using existing benchmarks typically results in reporting a percentage score or a distance metric that only partially assesses the performance of an LLMc. Consequently, we advocate extending the evaluation framework to a multi-metric approach, aligned with a spectrum of societal considerations and requirements identified through preliminary interviews and surveys with LLMc experts. The third component of our infrastructure will initially encompass multi-metric measurements, including interpretability, efficiency, bias, fairness, and robustness.
Our proposed solution aims to address these gaps by offering practical and actionable measures that enhance interpretability in a way that is accessible to both practitioners and researchers. We intend to achieve this by providing detailed metrics for evaluating LLMc and by automatically generating datasets that ensure a fair and comprehensive assessment of emerging models.

PROPOSED PLAN AND EVALUATION
We will develop a novel benchmarking infrastructure that can holistically evaluate LLMc using a multi-metric approach under different software scenarios. The proposed infrastructure comprises three main components. The first component is a pipeline to structure and collect the testbeds required to holistically evaluate LLMc. The second component is a combination of software properties that assembles the scenarios required to test an LLMc. The third component is a multi-metric approach that consolidates the holistic evaluation.
(1) Curation Pipeline: The first component of HolBench constitutes a benchmarking strategy involving a software architecture solution for the curation process. This component addresses the primary problem, Gap 1. In the initial phase, we will filter the most popular repositories on GitHub. We aim to capture the most recent commit changes in order to avoid examples already seen by the most recent models. For example, because our prototype is based on the latest report on ChatGPT [18], we assume that ChatGPT and the other analyzed LLMc were not trained on commits between Jan 2, 2022, and Jan 1, 2023. This assumption ensures that our pipeline effectively prevents data snooping, i.e., the inappropriate use of data points from training samples to assess statistical hypotheses. Subsequently, we will compile a set of novel methods for each commit. The collection of relevant data points will also include their corresponding documentation, excluding inline comments. The entire data collection process is automated, requiring only fine-tuning of the query that selects pertinent samples.
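The cutoff-based filtering step of the curation pipeline can be sketched as follows. This is a minimal illustration only: `CommitSample`, `filter_unseen`, and the repository and commit values are hypothetical stand-ins, not part of the actual HolBench implementation, and the cutoff date is taken from the assumption stated above.

```python
from dataclasses import dataclass
from datetime import date

# Assumed training-data cutoff from the text: commits on or after this date
# are treated as unseen by the evaluated LLMc.
CUTOFF = date(2022, 1, 2)

@dataclass
class CommitSample:
    repo: str
    sha: str
    committed_on: date
    method_source: str   # newly added method extracted from the commit
    documentation: str   # accompanying docs, with inline comments stripped

def filter_unseen(samples):
    """Keep only commits on or after the cutoff to prevent data snooping."""
    return [s for s in samples if s.committed_on >= CUTOFF]

# Illustrative usage: two mined commits, one pre- and one post-cutoff.
mined = [
    CommitSample("octo/app", "a1b2c3", date(2021, 6, 1), "def f(): ...", "pre-cutoff"),
    CommitSample("octo/app", "d4e5f6", date(2022, 3, 9), "def g(): ...", "post-cutoff"),
]
testbed = filter_unseen(mined)
print([s.sha for s in testbed])  # ['d4e5f6']
```

Only the post-cutoff commit survives, which is the property the pipeline relies on to avoid evaluating models on data they may have seen during training.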
(2) Software Scenarios: The second component of HolBench is a capability-coverage strategy that integrates various attributes of code datasets to generate software scenarios. This component addresses the second problem, Gap 2. Since the focus of the evaluation is the LLMc itself and not a scenario-specific configuration, it is essential to manage the generation of software scenarios so as to standardize and impartially compare different models. Ideally, each LLMc should be evaluated on identical software scenarios. Unlike previous benchmarks, which are primarily collections of datasets with a single metric (i.e., accuracy), HolBench takes a top-down approach: we will explicitly define the software properties required for evaluation based on real use cases. To articulate a scenario, HolBench dissects it into the following properties: task, programming language (PL), input and output granularity, and type of input/output. These properties can be expanded as new research strategies and LLMc characteristics emerge. Ultimately, the software scenarios serve as inputs for capability testing, allowing researchers and practitioners to evaluate LLMc under controlled settings.
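The decomposition of a scenario into properties can be sketched as a simple data structure whose combinations enumerate candidate scenarios. The property values below are illustrative assumptions; the real taxonomy and the curated subset of combinations would be defined by HolBench.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Scenario:
    task: str
    language: str     # programming language (PL)
    granularity: str  # input/output granularity, e.g. method vs. file
    io_type: str      # type of input/output

# Illustrative property values only (hypothetical, not the HolBench taxonomy).
TASKS = ["completion", "summarization"]
LANGUAGES = ["Python", "Java"]
GRANULARITIES = ["method", "file"]
IO_TYPES = ["code->code", "code->text"]

def enumerate_scenarios():
    """Cartesian product of properties; a curated subset is used in practice,
    since testing all conceivable combinations is impractical."""
    return [Scenario(t, pl, g, io)
            for t, pl, g, io in product(TASKS, LANGUAGES, GRANULARITIES, IO_TYPES)]

print(len(enumerate_scenarios()))  # 2 * 2 * 2 * 2 = 16 candidate scenarios
```

Because the properties combine multiplicatively, even this toy taxonomy yields 16 scenarios, which is why delineating critical scenarios (rather than exhaustive enumeration) matters.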
(3) Multi-Metric Approach: The third facet of HolBench is designed to assess LLMc against a comprehensive set of requirements for each software scenario outlined earlier. This component addresses the third problem, Gap 3. These requirements encompass a set of evaluation metrics that go beyond accuracy, including interpretability, efficiency, bias, fairness, and robustness. The categories of metrics that we define will cover various academic demands and societal considerations. While these metrics are quantitative, some of them are challenging to measure due to their recent emergence or limited exploration in the software engineering field (e.g., fairness, bias, or efficiency).
Unlike the prevailing evaluation system, in which benchmarks typically measure a specific metric without considering how well a model performs across other domains independent of the specifics of each scenario, HolBench takes a holistic approach: practitioners can assess multiple metrics per scenario, considering a selected subspace of software scenarios × metrics. A comprehensive evaluation should capture the diverse subspaces represented by these combinations.