Understandable Test Generation Through Capture/Replay and LLMs

Automatic unit test generators, particularly search-based software testing (SBST) tools such as EvoSuite, efficiently generate unit test suites with acceptable coverage. Although this relieves developers of the burden of writing unit tests, the generated tests are often hard for developers to comprehend. In my doctoral research, I aim to investigate strategies that address the comprehensibility of generated test cases and improve the effectiveness of the test suite. To achieve this, I introduce four projects leveraging Capture/Replay and Large Language Model (LLM) techniques. Capture/Replay carves information from End-to-End (E2E) tests, enabling the generation of unit tests containing meaningful test scenarios and actual test data. Moreover, the growing capabilities of LLMs in language analysis and transformation can play a significant role in improving readability. The proposed approach combines E2E test scenario extraction with an LLM-guided approach to enhance test case understandability, augment coverage, and establish comprehensive mocks and test oracles. I plan to conduct both a quantitative analysis and a user evaluation of the quality of the generated tests in terms of executability, coverage, and understandability.


PROBLEM STATEMENT
In today's software-dominated world, software reliability and accuracy hold immense importance [18]. Consequently, software quality assurance has become an indispensable asset for software engineers. Automated testing in the form of unit tests has become a crucial element in ensuring high-quality software [6]. However, despite the widely acknowledged significance of testing, writing tests is seen as a tedious and time-consuming task [3,7]. To alleviate this burden on developers and testers, the research community has devoted considerable effort to developing and evaluating automatic test generation approaches [1,5,12,16]. Among the notable test generators are Randoop [22] and EvoSuite [12]. EvoSuite, for example, is a search-based test generator that employs genetic algorithms to construct a test suite [15], and it has demonstrated convincing results in terms of coverage [14,25]. Industrial case studies, however, have revealed limitations in the quality of the generated test cases [2,4,13,17,23,24,28].
These limitations encompass challenges in (1) comprehending the generated test cases, and (2) generating tests for complex scenarios, which need complex test data or specific mock objects [4]. One significant limitation revolves around the understandability of the generated test cases, which involves various facets such as meaningful test data, proper assertions, well-defined mock objects, descriptive identifiers, lucid test names, as well as informative comments and summaries. In addition, the difficulty in following the scenario depicted in the test case and the ambiguity surrounding the test data significantly hamper understandability [2,8].
While search-based unit test generators achieve reasonable test coverage, they fall short in generating understandable tests and struggle with generating tests for complex scenarios.
In this research, my focus is on enhancing the comprehensibility of the generated unit tests and on producing effective tests that cover complex scenarios.

RESEARCH HYPOTHESIS
My hypothesis is that E2E tests can provide a basis for enhancing the test suite at the unit test level with realistic and domain-specific test scenarios, and that Large Language Models (LLMs) can make generated test cases more like human-written tests. In addition, traditional search-based approaches excel at boosting coverage and generating highly executable tests.
I aim to leverage the strengths of both approaches, Capture/Replay and LLMs, in search-based test generators to improve the understandability of the generated test cases while also achieving high coverage.

A. Capture/Replay
A test suite typically includes several types of tests besides unit tests, including End-to-End (E2E) tests [30]. The capture/replay approach captures fine-grained execution information during E2E testing, such as the order of method calls and the actual inputs, and subsequently replays them [11,33]. The capture/replay technique has been used to capture dynamic information for generating regression tests [11] or to reproduce a crash [10]. However, it has not been used for the purpose of enhancing the understandability of generated tests. It is my hypothesis that this approach holds considerable potential in the test generation process for producing meaningful test scenarios that contain real and complex test data and effective mock objects.
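To make the capture and replay steps concrete, the following is a minimal sketch in Python and not the implementation used in this research. The names `CallRecorder`, `apply_discount`, and `checkout` are hypothetical and exist only for illustration. The sketch records the order of calls, the actual inputs, and the observed results during a scripted end-to-end scenario, and then replays the recorded calls against the unit in isolation.

```python
import functools


class CallRecorder:
    """Records calls made to wrapped functions during an end-to-end run."""

    def __init__(self):
        # Each entry is (function name, args, kwargs, result), in call order.
        self.calls = []

    def capture(self, func):
        """Wrap func so that every call and its result is recorded."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            self.calls.append((func.__name__, args, kwargs, result))
            return result
        return wrapper

    def replay(self, functions):
        """Re-invoke the recorded calls in order and collect result mismatches."""
        mismatches = []
        for name, args, kwargs, expected in self.calls:
            actual = functions[name](*args, **kwargs)
            if actual != expected:
                mismatches.append((name, args, expected, actual))
        return mismatches


recorder = CallRecorder()


def apply_discount(price, percent):
    """Hypothetical unit under test."""
    return round(price * (1 - percent / 100), 2)


traced_apply_discount = recorder.capture(apply_discount)


def checkout(cart, percent):
    """Hypothetical end-to-end scenario that exercises the unit."""
    return sum(traced_apply_discount(price, percent) for price in cart)


# Capture phase: run the scenario once with realistic data.
checkout([20.0, 5.5], 10)

# Replay phase: the recorded inputs and results act as unit-level checks.
print(recorder.replay({"apply_discount": apply_discount}))
```

The recorded inputs and results play the role of carved test data and oracles: replaying them against an unchanged unit yields no mismatches, while a behavioral change in the unit surfaces as a mismatch.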

B. Large Language Models (LLMs)
The realm of Natural Language Processing (NLP) offers a variety of techniques for test generation and optimization. These include traditional NLP methods [34], Deep Learning approaches [26], and the increasingly popular use of Large Language Models (LLMs). These methods are particularly adept at handling text-based tasks, with significant success in tasks like generating identifier names and crafting informative comments and summaries [21,26]. Recent work in this domain has leaned notably towards LLMs [20,21,27,31,32]. These approaches involve fine-tuning pre-existing models, specifically tailored for test generation. Additionally, LLMs can enhance coverage when combined with search-based algorithms [19]. It is my hypothesis that combining LLMs with search-based algorithms will improve not only code coverage but also the understandability of the generated test cases.
Project 4: An empirical study on test comprehension and a comparison of different approaches from the developer's perspective.
I started with Project 1, where we introduced the MicroTestCarver approach, which generates unit tests starting from manual or scripted end-to-end tests. This led to test cases containing meaningful test scenarios and actual test data. The results of this study have been published in the following paper [9]: Amirhossein Deljouyi and Andy Zaidman. Generating Understandable Unit Tests through End-to-End Test Scenario Carving. In Proceedings of the 23rd IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM 2023), pp. 107-118.

THE EXPECTED CONTRIBUTIONS
In Project 2, our next step, I aim to expand my inquiry by integrating LLMs with a search-based test generator, EvoSuite, striving to achieve a higher level of comprehensibility and coverage. Progressing to Project 3, we plan to combine the dynamic information and the carved tests from E2E testing with the approach outlined in Project 2. My hypothesis is that providing search-based algorithms and LLMs with realistic and domain-specific test scenarios will lead to the creation of more contextually relevant and understandable tests.
Finally, in Project 4, I aim to gather insights from developers and testers about the understandability of the test cases generated by the proposed approaches. An empirical human study of this kind is currently missing and is crucial for determining how well developers comprehend the tests generated by state-of-the-art approaches.
Research Impact, Who-What-How [29]: Software engineers and the research community will benefit from this research, particularly through tools for generating understandable unit tests and through critical insights into the impact of various automated test generation approaches on test case comprehension. For instance, consider a common scenario in software development where a system is rapidly evolving and primarily relies on E2E tests. With the innovations in this research, software engineers can generate understandable unit tests, enabling faster and more precise fault localization. In addition to speeding up testing, understandable tests are easier to modify, update, and reuse.
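The intuition behind seeding a search with carved test data can be illustrated with a toy sketch. It is not EvoSuite's algorithm and not the approach that will be developed in this research: it is a plain random search with a coverage archive, and the unit under test `classify`, the coupon value `WELCOME10`, and all other names are hypothetical. The random generator only produces lowercase coupons, so the domain-specific branch is reachable only when a realistic input carved from an E2E scenario is supplied as a seed.

```python
import random


def classify(order_total, coupon):
    """Hypothetical unit under test with one domain-specific branch."""
    if coupon == "WELCOME10" and order_total >= 50:
        return "discounted"
    if order_total >= 50:
        return "free_shipping"
    return "standard"


def random_input(rng):
    """Generate an input with no domain knowledge (lowercase coupons only)."""
    length = rng.randint(0, 9)
    coupon = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(length))
    return (rng.uniform(0, 100), coupon)


def mutate(test_input, rng):
    """Perturb the numeric part of an input and keep the coupon unchanged."""
    total, coupon = test_input
    return (max(0.0, total + rng.uniform(-30, 30)), coupon)


def search(seeds=(), budget=200, rng_seed=0):
    """Random search that keeps one input per newly covered outcome."""
    rng = random.Random(rng_seed)
    covered = {}      # outcome -> first input that reached it
    population = []   # inputs that contributed new coverage

    def evaluate(candidate):
        outcome = classify(*candidate)
        if outcome not in covered:
            covered[outcome] = candidate
            population.append(candidate)

    # Seeds (for example, inputs carved from E2E tests) are evaluated first.
    for seed in seeds:
        evaluate(seed)
    if not population:
        evaluate(random_input(rng))

    for _ in range(budget):
        if rng.random() < 0.5:
            candidate = mutate(rng.choice(population), rng)
        else:
            candidate = random_input(rng)
        evaluate(candidate)
    return covered


unseeded = search()
seeded = search(seeds=[(75.0, "WELCOME10")])
print(sorted(unseeded), sorted(seeded))
```

In this sketch, the seeded run covers the `discounted` outcome using the carved input itself, which also keeps the resulting test data recognizable to a developer, whereas the unseeded run cannot reach that outcome at all.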

EVALUATION PLAN
In Project 1, we conducted an exploratory case study involving four software systems to assess the feasibility of MicroTestCarver. Additionally, a user study with 20 participants was carried out to compare the understandability of MicroTestCarver-generated tests with EvoSuite-generated and manually written test cases.
For Projects 2 and 3, our focus lies in evaluating the proposed approaches through both case studies and human studies. The case study will measure the approaches in terms of coverage, performance efficiency/cost, and mutation score. Simultaneously, the human study aims to gauge the understandability of the tests these approaches generate.
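As a reminder of how one of these metrics is commonly computed, the sketch below uses the usual definition of mutation score: the number of killed mutants divided by the number of non-equivalent mutants. The function name, its parameters, and the numbers in the usage line are hypothetical and are not results of this research.

```python
def mutation_score(killed, total, equivalent=0):
    """Fraction of non-equivalent mutants that the test suite detects (kills)."""
    scorable = total - equivalent
    if scorable <= 0:
        raise ValueError("there must be at least one non-equivalent mutant")
    if not 0 <= killed <= scorable:
        raise ValueError("killed must be between 0 and the number of non-equivalent mutants")
    return killed / scorable


# Hypothetical numbers: 60 mutants, 4 of them equivalent, 42 killed by the suite.
print(mutation_score(killed=42, total=60, equivalent=4))
```

A higher score indicates that the assertions in the generated tests detect more of the injected faults, which complements coverage as a measure of test effectiveness.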
Project 4 entails a comprehensive human study to grasp developers' perspectives regarding the generated test cases.
Project 1: Generating understandable unit tests through E2E tests.
Project 2: Generating understandable unit tests with high coverage through a combination of search-based algorithms and LLMs.
Project 3: Seeding Project 2 with E2E test scenario carving.