Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models

Test suites are a key ingredient in various software automation tasks. Recently, various studies [4] have demonstrated that they are paramount to the adoption of the latest innovations in software engineering, such as automated program repair (APR) [3]. Test suites are unfortunately often too scarce in software development projects. Generally, they are provided for regression testing, while new bugs are discovered by users who then describe them informally in bug reports. In recent literature, a new trend of research in APR has attempted to leverage bug reports in generate-and-validate pipelines for program repair. Even in such cases, when an APR tool generates a patch candidate and test cases are unavailable, developers must manually validate the patch, leading to a threat to validity.


INTRODUCTION
Test suites are a key ingredient in various software automation tasks. Recently, several studies [15,25,39] have demonstrated that they are paramount to the adoption of the latest innovations in software engineering, such as automated program repair (APR) [9]. APR is nowadays a well-researched field, where various techniques and approaches are proposed to automatically generate bug-fixing patches towards reducing debugging and fixing time. Prominent approaches in the literature systematically require precise specifications of correct/incorrect behaviour, such as test cases, to drive the localization as well as the validation of generated patches. This requirement is further exacerbated as recent studies [13,14,31] have shown that the correctness of generated patches depends on the quality of the test suites. Extensive test suites with high coverage are indeed required to assess whether a generated patch actually fixes the bug without introducing new ones.
Test suites are unfortunately often too scarce in software development projects [13,37]. Generally, they are provided for regression testing, while new bugs are discovered by users who then describe them informally in bug reports. In recent literature, a new trend of research in APR has attempted to leverage bug reports in generate-and-validate pipelines for program repair. Approaches such as iFixR [12] thus target the recommendation of patches instead of their systematic application on the buggy code. Yet, when an APR tool generates a patch candidate and test cases are unavailable, developers must manually validate the patch, leading to a threat to validity, as recognized in the evaluation of CapGen [34].
Test suites are therefore essential in APR [38]. On the one hand, automatic test generation approaches in the literature [6,22,23] unfortunately either target unit test cases, and thus do not cater to the need for revealing the complex bugs that users actually face during software execution, or require formally-defined inputs such as function signatures, or even the test oracle. On the other hand, bug reports are pervasive but remain under-explored. There is thus a need to investigate the feasibility of test case generation by leveraging bug reports. Our ultimate objective is indeed to address a challenge in the adoption of program repair by practitioners, towards ensuring that patches can be automatically generated and validated for bugs that are reported by users. By filling the gap between test case generation and bug reports, we expect to establish a game-changing setting for the adoption of program repair in industry.
Concretely, we observe that, while bug reports can quickly become overwhelming (in terms of high quantity and/or low quality) for developers, they are still recognized to contain a wealth of information. Unfortunately, such information, hidden behind natural language informality, can be difficult to extract, contextualize and leverage for specifying program executions. Nevertheless, recent advances in Natural Language Processing (NLP) have opened up new possibilities in software engineering. In particular, with the advent of large language models (LLMs), a wide range of tasks have seen machine learning achieve, or even exceed, human performance. Machine translation [18,28] in particular has been a very active field where several case studies have been explored beyond language translation. For example, in software engineering, several research directions have investigated the feasibility of leveraging natural language inputs for producing programming artefacts and vice-versa. Some milestones have been recorded in the literature in code summarization [1,11], program repair [9,21], and even program synthesis [10]. Nevertheless, bug reports have scarcely been explored. Yet, automating bug reproduction via the analysis of bug reports holds tremendous value.
This paper. In this work, we propose to study the feasibility of exploiting LLMs towards producing executable test cases based on informal bug reports. Our experiments build on ChatGPT [3] and CodeGPT [19]. The former has recently received much attention and presents the advantage that its model has been trained on a large corpus of natural language text as well as source code of software programs. The latter is a pre-trained model targeting software engineering, which can be fine-tuned for various tasks.
Contributions. The main contributions of this study are:
• Exploration: We present a pioneering research discussion on the feasibility of automatically generating executable test cases based on user-written bug reports. The main contribution is the assessment of the capabilities of current LLMs for this newly defined task.
• Findings: We conduct a comprehensive empirical study based on the Defects4J dataset and explore different experimental settings of ChatGPT and CodeGPT. The experimental results yield various findings on the executability, validity and relevance ratios of the generated test cases. We also provide a preliminary analysis of the quality requirements of bug reports, as well as of the complexity of the generated test cases, to support our initial assumption that bug reports can be relevant inputs for producing realistic test cases. Finally, we experimentally show that the generated test cases are indeed instrumental in an APR pipeline: they enable accurate fault localization and support the validation of patch correctness.
• Dataset: Our study produces artefacts that we share with the community to enable future research in this axis. The linked bug reports, the fine-tuned CodeGPT model, the timestamped test cases generated by ChatGPT (v3.5), as well as the labeled results, are made publicly available.
In the remainder of this paper we present our experimental setup (Section 2), discussing the benchmarks, dataset and evaluation metrics, and present the research questions. Section 3 presents the results from our empirical study, and Section 4 overviews the threats to validity and limitations, followed by the related work (Section 5) before the conclusion in Section 6.

EXPERIMENTAL SETUP
Our experimental setup is framed around the need to address bugs that are discovered by users after the software has been shipped. Such new bugs are reported daily on code repositories, such as GitHub, slowing down users and incurring investigation costs for the project owners.
To train a model that is able to generate the required test case, our pipeline (see Figure 1) includes the following steps: (1) we start by identifying the Java projects included in our benchmark and collecting their bug reports; (2) then, we use LLMs (either an online service or a pre-trained model that we fine-tune) for the purpose of generating test cases; (3) once the test cases are generated, they are appended to the existing test suite of the project to assess their executability and validity, as well as their relevance for the associated reported bug; (4) finally, once we have generated a relevant test case for the given bug, APR tools can be applied towards generating and validating the bug-fixing patch.

Benchmark and Dataset
The performance of test case generation with LLMs is assessed based on the Defects4J repository [27], which includes real-world faults from various Java software development projects. For every bug, we collect the bug reports as well as the failing test cases associated with these faults. Defects4J additionally provides the buggy and the fixed project versions. This benchmark is well suited for this study since it has been widely used to evaluate state-of-the-art APR tools as well as in the software testing community.
To answer RQ1, all available bug reports were used to generate test cases. Note that some bugs referred to the same bug report; in that case, the report was only considered once to avoid biasing the results with duplicates. For RQs that require fine-tuning the LLM for the task of test case generation, pairs of bug reports and failing test cases are required. Eventually, the constructed dataset contains 972 pairs of bug reports and failing test cases, as detailed in Table 1.
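The de-duplication step above can be sketched as follows. This is a minimal illustration with hypothetical field names, not the paper's actual tooling: bug entries are keyed by their report URL so that a report shared by several bugs is only kept once.

```python
def deduplicate(bug_entries):
    """bug_entries: (bug_id, report_url, report_text) tuples; keeps one
    entry per distinct report URL, preserving input order."""
    seen = set()
    unique = []
    for bug_id, url, text in bug_entries:
        if url in seen:
            continue  # this report was already kept for another bug
        seen.add(url)
        unique.append((bug_id, url, text))
    return unique
```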

Baselines
For this study we consider different baselines from the literature and from the latest open releases. We focused on models that have already been used for the task of natural language to code translation: we used ChatGPT directly as a baseline to generate test cases, while we opted to fine-tune CodeGPT [19] specifically for our task. On the one hand, the ChatGPT API facilitates its integration into the APR pipeline. On the other hand, CodeGPT has been successfully applied to various NLP tasks, including code-to-code, code-to-NL and NL-to-code tasks. CodeGPT further provides a pre-trained model adapted to Java code.

Prompt Design for ChatGPT
To query ChatGPT for generating a test case based on a bug report, we used the following simple prompt design: we concatenate two pieces of information, the instruction and the bug report. The instruction is identical for all queries to ChatGPT and reads as follows: "write a Java test case for the following bug report: ...". For the bug report, our feasibility study considers that no pre-processing should be applied to it. However, the submitted information should not include follow-up comments or attachments.
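As a minimal sketch, the prompt described above can be assembled as follows (the function name is ours; the instruction string is the one quoted in this section):

```python
# Fixed instruction used for every query, as quoted above.
INSTRUCTION = "write a Java test case for the following bug report: "

def build_prompt(bug_report: str) -> str:
    # bug_report is the raw report text, with follow-up comments and
    # attachments already removed; no other pre-processing is applied.
    return INSTRUCTION + bug_report
```

The resulting string is then submitted as-is to the ChatGPT API.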

Fine-Tuning CodeGPT
CodeGPT [19] is a generative pre-trained model that was fine-tuned on Java source code to be applied to NL-to-code tasks. Various model versions of CodeGPT are available in the literature repositories. To fine-tune CodeGPT to translate bug reports into test cases (NL-to-code), we used the dataset that we prepared from the Defects4J faults, as mentioned in Section 2.2. After fine-tuning, CodeGPT was used to translate the unseen bug reports into Java test cases.
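The exact fine-tuning input format is not detailed here; one plausible serialization for an NL-to-code causal model, shown purely as our own assumption, pairs each bug report with its failing test case behind a boundary token:

```python
SEP = " <CODE> "  # hypothetical boundary token between report and test case

def to_training_example(bug_report: str, failing_test: str) -> str:
    # Flatten newlines so each pair becomes a single training line; the
    # model learns to continue the report text with the Java test case.
    return bug_report.replace("\n", " ") + SEP + failing_test.replace("\n", " ")
```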

Metrics
2.6.1 Test case evaluation. To evaluate the quality of the generated test cases, we rely on the following metrics.
• Executability is a binary metric which describes whether the test case is directly executable on the corresponding project version without any manual changes.
To compare groups of bug reports, we rely on the Mann-Whitney-Wilcoxon (MWW) [20] test. In order to apply the MWW test and calculate the p-value, the size of each bug report was represented by its number of characters. The p-value (between 0 and 1) of the MWW test measures the statistical significance of the observed difference between two groups (in our case, bug reports leading to executable/relevant versus non-executable/irrelevant test cases). A small p-value provides evidence against the null hypothesis, meaning that there is a significant difference between the two groups, while a large p-value indicates that there is no significant evidence of a difference.
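For intuition, the statistic underlying this test can be computed by pair counting. The pure-Python sketch below (ours, for illustration) returns the U statistic, from which libraries such as `scipy.stats.mannwhitneyu` derive the p-value:

```python
def mann_whitney_u(xs, ys):
    """U statistic of sample xs versus ys: counts, over all pairs,
    how often an x exceeds a y (ties contribute 0.5)."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```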

Patch Validation.
To evaluate the generated patches, we follow the evaluation of the TBar [16] paper. Thus, we define a patch as plausible if the project version with the generated patch passes all the tests from the test suite.

EXPERIMENTAL RESULTS
This section presents, for each research question, the experimental objective, the experimental design and implementation details, as well as the yielded results.

[RQ1]: LLM baseline performance on test case generation with ChatGPT
[Experiment Goal]: The goal is to assess whether test cases produced by human developers could have been generated with an off-the-shelf LLM using reports written in natural language for informally describing the bugs. To answer this first research question, we explore two sub-RQs considering ChatGPT (version 3.5) as a baseline model. We perform two distinct experiments: a first experiment considers the generation of a single test case to demonstrate the feasibility of our approach; a second experiment performs multiple generation attempts per bug report.
[Experiment Design] (RQ1.1 - single generation): For our experiments we rely on the ChatGPT API (version 3.5) with its default parameters. To use ChatGPT to generate test cases, the prompt given to the API consists of the two parts introduced in Section 2.4. The ChatGPT API was then used to generate the test cases for all the bugs in the study subjects. In practice, before running the generated test cases, the ChatGPT outputs are parsed to clean them of natural language text (e.g., descriptive details) that would lead to compilation failures. Afterwards, the test cases are systematically included in the test suite, which is fully executed by the Defects4J test pipeline. Execution results are then logged, allowing us to compute the metrics on executability, validity and relevance.
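The parsing step can be approximated as follows; this is a hedged sketch, since the exact cleaning logic is not specified here. Replies typically wrap the Java test in a Markdown code fence surrounded by prose that would break compilation:

```python
import re

# Match a ``` or ```java fenced block and capture its body.
FENCE = re.compile(r"```(?:java)?\s*\n(.*?)```", re.DOTALL)

def extract_test_case(reply: str) -> str:
    m = FENCE.search(reply)
    # Fall back to the raw reply when no fence is present.
    return m.group(1).strip() if m else reply.strip()
```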
[Experiment Results] (RQ1.1 - single generation): Figure 2 provides an illustrative example of a bug report (from the CLI project), the associated ground-truth test case from Defects4J (Listing 2), and the test case generated by ChatGPT (Listing 1). As this example shows, ChatGPT is able to generate, from a bug report, an executable test case that can be used to reproduce the bug. This can enable various software automation tasks, such as spectrum-based fault localization, patch validation in program repair, and, more generally, automated software testing. As we can see in Figure 2 and Listing 1, ChatGPT was able to extract important semantic parts from the bug report, such as relevant inputs and function names.
On the Defects4J dataset, we compute the proportion of bug reports for which ChatGPT is able to successfully generate test cases. We evaluate the proportion of generated test cases that are directly executable without the need for further developer changes. Then we evaluate the percentage of valid and relevant test cases. Note that in Table 2, the number of bug reports is the number of distinct bug reports, which does not necessarily match the number of bugs identified in the Defects4J dataset for a given project. Indeed, we found that in most projects several bugs were associated with the same bug report.
As depicted in Table 2, with a single generation attempt, we reached the highest percentage (37%) of executable test cases for the Cli project. The validity of the generated test cases varies greatly from one project to another, which could imply an influence of the quality of the bug reports. As mentioned in Section 2.2, every project had a different source and format of user-written bug reports. Therefore, their quality may differ significantly across projects. We explore the influence of quality in RQ4.
On average, we reached an executability of 15% when generating a single test case per bug report. This result confirms that ChatGPT is indeed able to generate test cases using user-written bug reports as prompts. Only 6% of the total number of generated test cases in this experiment were, however, relevant. While this proportion appears limited, it confirms the feasibility of using LLMs for the task of test generation. Moreover, we noted that 38% of the executable test cases are actually relevant, which is promising, since executability is a binary metric that is easy to infer automatically. While, overall, only 7% of the generated test cases were valid, it is noteworthy that 47% of the executable ones are valid. It is further important to mention that, due to the randomness of ChatGPT generation, a single generation might not be representative of the full potential of the model.
Experimental challenges: an issue we encountered with some Math project versions is that the test suite compiles but its execution never terminates. As a result, there were 10 Math bug reports for which we could not determine the relevance of the generated test cases.
Additionally, the low relevance results might stem from the fact that ChatGPT was directly leveraged to generate test cases without any fine-tuning on the task at hand; it therefore potentially had no specific knowledge about the projects' context (in terms of what the test target is), but only some general knowledge of Java syntax. The fact that we gain +35 percentage points of executable, +23 points of valid and +3 points of relevant test cases, as described in Table 3, shows that once the initial executability challenge is passed, the tests generated by ChatGPT are actually valid, highlighting the model's capability to capture the semantics of bug reports and translate them into bug-triggering test cases. These feasibility study results strongly motivate further research in the area.
We provide in Figure 3 the distributions of the cyclomatic complexity [8] values of the generated valid test cases, compared to those of the original (ground truth) bug-triggering test cases provided by developers. The difference in median values suggests that test cases generated from bug reports tend to be more complex. Our postulate is that such test cases attempt to fit the detailed inputs for reproducing the reported bug, while developers write test cases that focus on the actual key bug-triggering input.
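As a reminder of the metric, cyclomatic complexity is one plus the number of decision points in the code. The keyword-counting approximation below is ours, for illustration only (a faithful measurement parses the AST), but it conveys the idea for a Java snippet:

```python
import re

# Decision points: branching keywords plus short-circuit and ternary operators.
DECISIONS = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")

def cyclomatic_complexity(java_source: str) -> int:
    return 1 + len(DECISIONS.findall(java_source))
```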
Answer to RQ1: Our experimental results show that ChatGPT can be prompted with bug reports to generate executable test cases for 50% of the input samples. Beyond executability, about 30% of the bugs could be reproduced with valid test cases, and about 9% of all generated test cases were actually relevant. Nonetheless, we note that over half (59%) of the executable test cases were valid. Among the executable test cases, 20% were also relevant for the given reported bug. These results, which are based on an off-the-shelf LLM-as-a-service, show promise for automated test case generation leveraging complex information from user-reported bugs. This performance clearly motivates the fine-tuning of GPT models, given that they would gain project-specific knowledge, which is essential for generating a large proportion of not only executable but also relevant test cases. Since fine-tuning currently relies on the GPT version 2.0, we expect the results to be substantially better once the GPT 4.0 models become available for fine-tuning.

[RQ2]: Test case generation with a fine-tuned CodeGPT
[Experiment Design] (RQ2.2 - multiple generations): For this experiment, we used the previously fine-tuned CodeGPT model (Section 3.2) and applied it to perform five generations on the testing dataset, changing the random seed from one run to the next to obtain different results. The generated test cases were again evaluated on executability, validity and relevance.
[Experiment Results] (RQ2.2 - multiple generations): As shown in Table 3, performing multiple generations with a fine-tuned model only slightly increases the number of executable test cases. Fine-tuned models are by nature less random in their generations than pre-trained models such as ChatGPT. Additionally, the training dataset was quite small, which might have led to overfitting. This explains the modest increase (+10 percentage points) in executable test cases when generating multiple test cases per bug report with CodeGPT, compared to an increase of +35 points when performing multiple generations with ChatGPT.
Answer to RQ2: Fine-tuning CodeGPT yielded an LLM that generates executable test cases for 24% of the bug reports with a single generation attempt. This rate is substantially larger than the one achieved by the ChatGPT single-generation baseline (15%). However, when performing several generation attempts, ChatGPT achieves a significantly higher rate of success. These results suggest that a fine-tuned LLM could be beneficial in an automated pipeline where a single shot is adequate, whereas ChatGPT would be more useful in recommendation scenarios. The results further suggest that more powerful models should be investigated in future work.

[RQ3]: Performance on newly reported bugs
[Experiment Design]: We collect all available Defects4J bug reports that were created after the 1st of October 2021, to be sure that they were not included in the training data of ChatGPT 3.5. New bugs were only reported for the Cli, Lang and Math projects. We therefore considered additional projects, which are still maintained.
[Experiment Results]: Table 4 indicates the number of new bugs that were assessed, as well as the performance of ChatGPT in terms of executability. Within 3 projects from the previous study and 2 additional projects, we collected 38 new bug reports: for 55% of those bug reports, the generated test cases were executable. This performance is on par with the performance achieved with the old bug reports of Defects4J (cf. RQ2), suggesting that the previously obtained performance is likely unbiased. Furthermore, we note that a large proportion (50% for Lang and 100% for JacksonDatabind) of the executable generated test cases are valid. We manually analyse relevance and confirm the feasibility of generating relevant test cases. In the absence of an oracle (a fixed version with which to assess the relevance of the test case), we make our dataset available for the community to build on. Listing 3 provides an example of a newly reported bug in Jsoup that we use to prompt ChatGPT for generating the bug-triggering test case (i.e., for reproducing the associated bug). Note that this bug report is still unresolved and that no associated test case has been provided by the reporter or by other users (when it was submitted again on Stack Overflow). With ChatGPT, we generated an executable test case, as shown in Listing 4. A manual assessment further confirmed that the inputs extracted by ChatGPT to build the test case are relevant and help to reproduce the reported bug.
Answer to RQ3: ChatGPT, an LLM-as-a-service, was proven capable of generating executable test cases for newly reported bugs. Overall, in over 55% of cases, the new bug reports could effectively serve as prompts to generate an executable test case. Through manual analysis, we confirmed that the generated test cases reflect the described behaviour.

[RQ4]: Impact of Bug Report Quality
[Experiment Goal]:
In this experiment, we approximate bug report quality by its size as well as by the presence of code artefacts. This approximation is based on prior findings in the literature: very short bug reports are often of low quality (i.e., they do not contain sufficient information for reproducibility), while bug reports that include code excerpts are often written by developers who provide enough details for reproduction. We therefore investigate the impact of bug report size and content on the performance of test case generation. For the content, we distinguish between bug reports containing code snippets and bug reports that include only natural language text.
[Experiment Results] (RQ4.1 - bug report size): The results (see Figure 4) suggest that bug report size has little influence over the executability and validity of the generated test cases. In contrast, relevant test cases are associated with larger bug reports. The statistical significance of the differences between sets was evaluated based on the Mann-Whitney U test. With a p-value that is largely > 0.5 (0.99), we conclude that the bug report size has no impact on the performance of test generation in terms of executability.
This observation is potentially related to the fact that ChatGPT has been extensively trained on publicly available Java source code, enabling it to generate test cases with minimal syntax errors. Consequently, the executability and validity of the generated test cases remain largely independent of the bug report size.
The influence of the bug report size on the performance in terms of relevance appears to be a more important aspect to assess, since syntactic correctness of the generated test cases is not enough. The MWW test indeed confirms that the bug report size has a slight impact: p-value < 0.5 (0.2).
[Experiment Design] (RQ4.2 - bug report content): We also consider the presence of code snippets within the bug reports and assess its impact on the performance of test case generation. Code presence is considered confirmed if at least one parsable code statement is included. Thus, the simple mention of a function name in a sentence does not count as code in this experiment.
Figure 4: Bug report size impact on generated test cases
[Experiment Results] (RQ4.2 - bug report content): Our initial observations of the data from RQ1 hint that the quality of the bug reports matters when generating test cases using ChatGPT with bug reports as prompts. Figure 5 represents the status of the bug reports in terms of code presence, as well as the evaluation of the generated test cases, for datasets involving the Cli, Closure, Lang, Math and Time project samples. On average, 63% of all the bug reports in our dataset contained code. Therefore, as shown in Figure 5, the number of bug reports which include code is always higher than the number of bug reports which do not. Nevertheless, the data clearly reveal that the better the test case is (i.e., executable, then valid, and finally relevant), the higher the probability that the associated input bug report contains code. If we compare the percentages of bug reports leading to the generation of non-executable versus executable test cases, we already observe an increase of 6 percentage points in terms of code presence. Going from executable to valid test cases, there is only a slight increase in code presence (2 points). However, for relevance, 90% of the bug reports leading to relevant test cases contained code. This clearly highlights the impact of code snippets within the bug report on the generation task by LLMs.
Our assumption is that such code snippets provide much-needed context about the target of the test case.
Answer to RQ4: The experimental results suggest that the bug report size has little impact on the executability and validity of the generated test cases. However, it has a slight influence on the probability of the test case being relevant. With respect to the content (and the presence of code in the bug report), the experiments revealed that 66% of the bug reports leading to the generation of executable test cases contained code, while 90% of relevant test cases are associated with bug reports which contain code. These observations confirm that the code present in the natural language bug report serves as important guidance for the LLM in generating test cases that are executable and relevant.

[RQ5]: Usages in software engineering
[Experiment Goal]: In this experiment, we investigate how relevant the generated test cases are for software engineering tasks such as fault localization and patch validation in automated program repair. The following experiments demonstrate the feasibility of using test cases generated from user-written bug reports to run a full generate-and-validate repair pipeline in production. As shown in Figure 6, the classical pipeline requires test cases to successfully localize bugs (RQ5.1 - Fault Localization) and validate the generated repair patches (RQ5.2 - Patch Validation).
[Experiment Design] (RQ5.1 - Fault Localization): In a first experiment, we evaluate whether fault localization performance can be ensured with the generated test cases. Following the classical steps in the APR literature, we rely on spectrum-based fault localization (SBFL) [15] and evaluate the results obtained by running GZoltar (version 1.7.4) with Ochiai, for every project in our study where we were able to generate a valid test case. Running GZoltar on every buggy project version provides a list of all suspicious source code lines, ranked by the Ochiai suspiciousness score. We consider a bug as localized if at least one of the actual buggy source code lines is in the top ranking of this list. For this study, we distinguish between top-1 and top-5 rankings. In a first step, we run GZoltar on the original test suite (i.e., with the ground truth test cases), which provides our reference performance. In a second step, we run GZoltar on the test suite where we removed the original bug-triggering test case and replaced it with our valid generated test case.
[Experiment Results] (RQ5.1 - Fault Localization): Interestingly, when running SBFL on a test suite including the generated test cases (from bug reports) instead of the ground truth test cases, the localization performance is slightly improved: the reference results are improved for 3 of the study projects, in terms of localization both @top-1 and @top-5. Overall, the bug-report-based generated test cases were able to help precisely localize 4 bugs that could not be localized even with the ground truth test cases: 2 additional bugs at top-1 and 2 additional bugs at top-5. Given that the generated test cases are more complex than the ground truth ones (cf. RQ1), it is possible that this facilitates better discrimination of relevant code lines with spectrum-based fault localization.
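The Ochiai score used above has a simple closed form; the sketch below implements the standard formula, where `ef` and `ep` are the numbers of failing and passing tests covering a code element, and `tf` is the total number of failing tests:

```python
import math

def ochiai(ef: int, ep: int, tf: int) -> float:
    """Ochiai suspiciousness: ef / sqrt(tf * (ef + ep)); 0 when undefined."""
    denom = math.sqrt(tf * (ef + ep))
    return ef / denom if denom else 0.0
```

Elements covered only by failing tests score highest, which is why a bug-triggering test case is a prerequisite for this kind of localization.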
[Experiment Design] (RQ5.2 - Patch Validation): In a second experiment, we evaluate to what extent the generated test cases can help validate automatically-generated patches. We consider patches generated by the TBar [16] template-based APR baseline tool. TBar was selected because it merges fix pattern templates from various works, and it has achieved high performance on the Defects4J benchmark. This tool systematically attempts to fix each bug by iteratively applying its fix patterns and validating the patched program using the project test suite. Despite this repair technique being simple, it remains among the best performing in terms of the number of bugs fixed [17]. We then investigate how many TBar-generated patches that are labeled as correct (resp. plausible) are validated by the generated test cases (i.e., they fail on the unpatched program and pass on the patched one). Table 6 summarizes the results. However, we note that several patches that were manually labeled as correct in the TBar dataset were not validated by the generated test cases. Since we validated each generated test case by ensuring that it does not fail on the patched version of the program, it is possible that the manual labeling (based on a judgement of semantic equivalence of patches) led to some mistakes, that the actual developer test cases were incomplete in the scoping of the reported bug, or that the generated test case is overfitting the bug report. In any case, our experimental results trigger relevant research questions for future work on bug-report-driven test case generation. Note that TBar did not generate any patches for the Cli project, and that for the Chart project there were no intersecting faults for which we had both correctly generated patches and valid generated test cases. Overall, the results show a promising research direction in the feasibility of using test cases generated from user bug reports to validate generated patches, enabling the adoption of generate-and-validate APR tools in real-world software development cycles.

Answer to RQ5:
The experiments with spectrum-based fault localization have shown that test cases generated by LLMs from bug reports can lead to even better localization performance than ground truth test cases. Additionally, we have shown that the generated test cases can indeed be leveraged to validate patch correctness.

DISCUSSION
Due to the randomness of ChatGPT, it is essential to verify that ChatGPT is correctly replying to the prompt before starting the experiments: one prompt might lead to the generation of test cases in Java source code one day, but to explanations of test cases in natural language another day.
Overall, our empirical study validates the feasibility of using LLMs for automatic test case generation based solely on informal bug reports. However, our proposed pipeline bears some limitations, while our empirical results carry some threats to validity. We enumerate these in this section and discuss some research directions for future work to increase the number of executable and valid test cases among the LLM-generated ones.

Threats to validity
Our dataset, inferred from Defects4J data and used to fine-tune CodeGPT, is relatively small, which can easily lead to overfitting. It would be beneficial to investigate additional projects to collect more bug report and failing test case pairs. Additionally, the dataset only contains Java test cases, which was convenient for a first study on the feasibility of generating test cases from bug reports. However, this focus affects the generalisation of the results. In future work, it would be relevant to investigate the feasibility of generating test cases for programs written in other programming languages.
In RQ5.2, the patch validation is carried out for patches generated by a single APR tool, namely TBar. Since the intersection of valid generated test cases and correctly (or at least plausibly) fixed bugs for a single APR tool is quite small, it would be most relevant, in future work, to assess the performance of the LLM-generated test cases on patches generated by different APR tools.

Limitations & Future Work
In this study, executability only reflects whether a generated test case is directly executable or not. It does not reflect the amount of effort required for a human to make it executable. After manually reviewing the generated test cases, we saw that most can be made executable through the modification of one or two lines of code. The most common issues are missing imports, duplicate function names, or the use of a deprecated function. These limitations could be systematically fixed in future work (e.g., with prompt engineering), significantly increasing the number of executable test cases.
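To illustrate how mechanical such repairs are, the two most common issues (duplicate test-method names and a missing import) can be fixed with a few lines of post-processing over the generated Java source. This is a sketch of our own, not part of the studied pipeline:

```python
import re

# Post-processing sketch for generated Java test cases: rename duplicate
# test method names and prepend a missing import for a referenced class.

def rename_duplicate_tests(java_src: str) -> str:
    """Append a numeric suffix to repeated test method names."""
    seen = {}
    def rename(match):
        name = match.group(2)
        seen[name] = seen.get(name, 0) + 1
        if seen[name] > 1:
            name = f"{name}{seen[name]}"
        return f"{match.group(1)}{name}("
    return re.sub(r"(public\s+void\s+)(\w+)\(", rename, java_src)

def add_missing_import(java_src: str, class_name: str, import_line: str) -> str:
    """Prepend an import if the class is used but never imported."""
    if class_name in java_src and import_line not in java_src:
        return import_line + "\n" + java_src
    return java_src

generated = """public void testParse() { Assert.assertEquals(1, 1); }
public void testParse() { Assert.assertTrue(true); }"""
fixed = rename_duplicate_tests(generated)
fixed = add_missing_import(fixed, "Assert", "import org.junit.Assert;")
```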
In our experiments, bug reports were collected and directly used as prompts to demonstrate the feasibility and applicability of our idea on real software faults reported by users. Nevertheless, pre-processing the textual data might be beneficial to keep the LLM focused on the main context of the bug report, thereby improving the validity and relevance of the generated test cases.
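Such pre-processing could be as simple as stripping markup and truncating long stack traces before prompting. A minimal sketch, under our own assumptions about what counts as noise (the paper itself uses the raw report):

```python
import re

# Pre-processing sketch: strip HTML tags and URLs from a raw bug report
# and keep only the top frames of any Java stack trace, so the prompt
# stays focused on the description of the faulty behaviour.

def preprocess_bug_report(report: str, max_trace_lines: int = 5) -> str:
    lines = []
    trace_count = 0
    for line in report.splitlines():
        line = re.sub(r"<[^>]+>", "", line)       # drop HTML tags
        line = re.sub(r"https?://\S+", "", line)  # drop URLs
        if re.match(r"\s*at\s+[\w.$]+\(", line):  # Java stack frame
            trace_count += 1
            if trace_count > max_trace_lines:     # keep only top frames
                continue
        lines.append(line.strip())
    return "\n".join(l for l in lines if l)
```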
In this study, we used the default parameters when querying ChatGPT as well as when fine-tuning CodeGPT. Future work should consider tuning parameters such as the temperature in order to potentially reach a higher number of executable and relevant test cases.
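Such a tuning step could take the form of a simple sweep that scores each temperature by the number of executable test cases it yields. The helpers `generate_test_case` and `is_executable` below are hypothetical placeholders, not functions from the paper:

```python
# Sketch of a temperature sweep for future parameter tuning.
# `generate_test_case(report, temperature=t)` would query the model;
# `is_executable(tc)` would compile/run the produced test. Both are
# hypothetical helpers used only for illustration.

def best_temperature(bug_reports, temperatures, generate_test_case, is_executable):
    """Return the temperature yielding the most executable test cases."""
    scores = {}
    for t in temperatures:
        scores[t] = sum(
            1 for report in bug_reports
            if is_executable(generate_test_case(report, temperature=t))
        )
    return max(scores, key=scores.get), scores
```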
Many powerful pre-trained language models were recently released which are not yet available for fine-tuning; these should also be considered in future work. Further experiments with other models such as CodeBERT, PLBART, RoBERTa or the LLaMA [33] model could also be conducted to determine which model is the most suited for the task. The experiment in Section 3.2 with CodeGPT yielded 24% executable test cases. A manual investigation gave insights into some of the failure reasons, one simply being that the test case was not generated until the end and was thus syntactically incorrect. This issue never happened with ChatGPT. Therefore, we strongly believe that fine-tuning more powerful models will result in much higher executability and relevance rates, since such a model would combine project knowledge acquired through the fine-tuning process with a solid base knowledge that prevents syntactic errors.

RELATED WORK
Writing extensive test suites, i.e., test suites with high code coverage, is extremely costly in terms of time and developer expertise. Several techniques [2,7,29,30] have therefore been proposed in the literature to help developers automatically generate unit tests for new software units. To address the lack of formal specifications for generating test cases, Fischbach et al. [5] proposed an NLP-based approach for processing informal requirements to the software. This study shows that first experiments have been performed on exploiting informal sources for test case generation, even though their work still requires a strict natural-language structure containing conditionals for the requirements. In contrast, in our study, we aim at using the unprocessed human-written bug report as direct input for test case generation. Recently, some work on test case generation using large language models (LLMs) was done by [24], but their TestPilot still requires the function's signature and implementation as prompt. In contrast, we consider that no formal input from the source code should be required: only the informal bug report is needed, towards implementing an approach that is generalizable and software independent. With the recent success of ChatGPT, many researchers are investigating its potential in software engineering tasks. A recent study [40] has demonstrated ChatGPT's capability of generating unit tests, but that study considered the classes' source code as prompt. Feng et al. [4] have investigated ChatGPT's ability to help developers reproduce a bug by extracting important steps to reproduce it from the bug report. However, they did not perform any test case generation. The existing test suite of a project has a significant impact on the quality of the patches generated by an APR tool, as investigated by Liu et al. [17]. This highlights the importance of being able to generate additional test cases to guide software repair tools. APR tools do not always generate correct and optimal patches for a given bug [14,31]; therefore, augmenting the test suite is essential to help reduce the overfitting issue. The use of ChatGPT to directly enhance APR techniques [26,35,36] further highlights the potential of this new LLM.

CONCLUSION
In this work, we confirmed the feasibility of automating test case generation based on bug reports using large language models. Basic experimental results using ChatGPT have indeed suggested that this LLM-as-a-service, despite not being fine-tuned for the specific task of test case generation, is able to successfully use bug reports (written in natural language) as prompts to generate executable and relevant bug-triggering test cases.
Subsequent experiments showed that while ChatGPT achieves high executability with multiple generations, fine-tuned LLMs (e.g., CodeGPT) achieve a higher number of executable test cases than ChatGPT in a single generation attempt. These findings suggest that, while promising, the research agenda on generating test cases with LLMs still has a performance gap to close. Nevertheless, additional experiments on newly reported bugs have suggested that the achieved baseline performance is genuine: it is not biased by training data leakage.
Finally, investigations on the relevance of the LLM-generated test cases in software engineering have demonstrated their potential for fault localization as well as patch validation, two key steps in automated program repair. Overall, we showed that despite the simplicity of the general translation idea, LLMs offer substantial concrete opportunities for an open problem in software engineering research: how to automatically construct relevant test suites. These test suites are key ingredients not only in traditional testing research, but also towards the practical adoption of generate-and-validate techniques in automated program repair.

Figure 1 :
Figure 1: Pipeline for LLM-based generation of test cases from bug reports

Figure 3 :
Figure 3: Cyclomatic complexity of ground-truth bug-triggering test cases (Original TC) vs LLM-generated valid test cases (Generated TC)

Figure 5 :
Figure 5: Impact of Code presence in bug report on the generated test cases

Table 2 :
Performance of ChatGPT on the task of test case generation based on bug report.
Legend: exec. → executability; valid. → validity; relev. → relevance.
[Experiment Design] (RQ1.2 - multiple generations): RQ1.1 demonstrated the feasibility of generating executable and relevant test cases with ChatGPT, using the bug report as prompt. Due to the randomness of ChatGPT, the quality of the generated test cases might differ significantly from one generation to the next. Therefore, an additional experiment was conducted to investigate whether at least one out of five generated test cases could be executable and relevant. In this experiment, for every prompt, we query the ChatGPT API five (5) times and assess the different generated test cases.
[Experiment Results] (RQ1.2 - multiple generations): In this section we discuss and compare the results obtained when generating one versus five test cases, following the experimental procedure of a recent work [32]. According to the data in Table 2, on average, executable test cases were obtained for 50% of the bug reports across all projects, while we obtained at least one valid test case for 30% of bug reports. Furthermore, among the executable test cases, we observed that 59% were valid. Further manual investigation highlighted that executability, validity and relevance can, in a large number of cases, be fixed with minor modifications (e.g., adding relevant package imports or changing duplicate function names). Overall, the obtained results confirm that ChatGPT is a promising tool for the generation of test cases based on bug reports.
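The multiple-generation strategy amounts to best-of-k sampling: query the model up to k times per bug report and keep the first candidate that is both executable and valid. A minimal sketch, with hypothetical helpers `generate`, `is_executable`, and `is_valid` standing in for the actual pipeline stages:

```python
# Sketch of the best-of-k sampling strategy behind RQ1.2 (k = 5 in the
# experiment): stop at the first candidate that is executable and valid.
# The three helper callables are hypothetical placeholders.

def best_of_k(bug_report, k, generate, is_executable, is_valid):
    """Return the first executable and valid candidate, or None."""
    for attempt in range(k):
        candidate = generate(bug_report, seed=attempt)
        if is_executable(candidate) and is_valid(candidate):
            return candidate
    return None
```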

Table 3 :
Performance of fine-tuned CodeGPT vs ChatGPT on the task of bug report driven test case generation. [Experiment Results] (RQ2.1 - single generation): Table 3 summarizes the experimental results for this RQ. In a single generation attempt, CodeGPT outperforms ChatGPT: the fine-tuned CodeGPT generated 24% executable test cases. The results suggest that fine-tuning a generative pre-trained model on the specific task of translating natural language bug reports into test cases results in a better performance than simply leveraging ChatGPT in a single generation attempt. Since the fine-tuned CodeGPT model is based

Table 4
We leveraged, for our experiments, ChatGPT and CodeGPT, two LLMs that are known to have been trained on public data (up to October 2021). Since we rely on the Defects4J dataset, a widely used benchmark in testing and program repair, it is likely that some samples from this dataset were included in the models' training data. This constitutes a threat to the validity of our results, as it reflects a potential data leakage problem. To address this concern, and to reliably confirm the feasibility of the idea, we evaluate the proposed generation pipeline on newly reported bugs (after October 2021), for which we can guarantee that no test cases reproducing the bugs were part of the models' original training data.
[Experiment Design]: Our experiment focuses on ChatGPT (multiple generations) and evaluates executability automatically. Since a fixed version does not yet exist for all new bugs, a manual investigation of the relevance of the generated test cases is necessary.
Because of the '[' and '] ' in the URL ...

Table 5 :
Number of bugs localized with GZoltar/Ochiai

Table 5 summarizes the spectrum-based fault localization results. On the original test suite, our SBFL implementation was able to perfectly localize the buggy source code line (top-1 ranked) for 22 bugs in our dataset. When considering the top 5 outputs of the SBFL, 42 bugs were correctly localized in the reference setting.
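The Ochiai metric used in this SBFL setting is standard: a line's suspiciousness is ef / sqrt(total_failed * (ef + ep)), where ef and ep are the numbers of failing and passing tests covering it. A minimal sketch of the ranking (our illustration of the standard formula, not the actual GZoltar configuration):

```python
import math

# Minimal sketch of Ochiai-based spectrum-based fault localization:
# rank program lines by suspiciousness computed from per-test coverage
# and pass/fail outcomes.

def ochiai(failed_cov, passed_cov, total_failed):
    """failed_cov / passed_cov: failing / passing tests covering the line."""
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0

def rank_lines(coverage, outcomes):
    """coverage: {line: set of test ids}; outcomes: {test id: 'pass'|'fail'}."""
    failing = {t for t, o in outcomes.items() if o == "fail"}
    scores = {
        line: ochiai(len(tests & failing), len(tests - failing), len(failing))
        for line, tests in coverage.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

A line covered only by failing tests scores 1.0 and is ranked first; lines also covered by passing tests are demoted.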

Table 6 :
Intersection of bugs with correct patches and valid test cases. [Experiment Results] (RQ5.2 - Patch Validation): Table 6 summarizes the results over the intersection of projects used for this study. It shows, for example, that among the 9 patches of Closure that were known to be plausible (i.e., passing the project test suite, but being actually incorrect), none were found valid by the generated test cases. Overall, only 8 out of the 29 plausible patches were validated by the generated test cases. This means that the generated test cases could have been more useful than the existing developer-written test suites to discard 21 plausible patches.