Autonomic Testing: Testing with Scenarios from Production

My PhD addresses the problem of detecting field failures with a new approach to testing software systems under conditions that emerge only in production. Ex-vivo approaches detect field failures by executing the software system in the testbed with data extracted from the production environment; in-vivo approaches execute the available test suites directly in the production environment. We will define autonomic testing, which detects conditions that emerge only in production, generates test cases for the new conditions, and executes the generated test cases in the new scenarios, to detect failures before they occur in production.

In-house testing, performed in a testbed, can effectively reveal many software faults [1,13]; however, it cannot exercise all faulty configurations or detect all faulty statements in complex software applications. Recent studies also discuss failures that occur in scenarios that emerge only in the production environment, and thus cannot be detected in a testbed [10,11,20]. For example, EclipseLink, an Eclipse plugin developed before Java 8, failed to generate tables for persisting objects annotated with @Entity, since it silently ignored classes containing the lambda expressions introduced in Java 8 [12]. Testbed testing executed before the Java 8 release could not detect the failure. Software faults that escape testing can have severe consequences. For example, the UK air traffic control system crash of September 2023, which caused a financial loss of $126 million, was due to an unexpected flight plan that had never been encountered before [22].
Several studies detect field failures by testing software systems. As discussed in the survey of Bertolino et al. [4], ex-vivo approaches test the system in the development environment with data extracted from the field; they focus on abstracting test templates to be instantiated with new input data, ignoring environmental conditions (e.g., [9]). In-vivo testing techniques test the software system directly in production, and concentrate on the issues of executing test cases without interfering with the running system (e.g., [17]).
In this PhD, we will define autonomic testing, a new approach to detect failures that escape in-house testing and occur only in production. Autonomic testing exercises software applications with full scenarios from the production environment. It considers the whole environment that emerges in production, and thus differs both from ex-vivo approaches, which consider only some information extracted from the environment, and from in-vivo approaches, which focus on the execution rather than on the inputs.
Autonomic testing generates test cases, which are not simple abstractions of existing ones, with conditions that emerge only in production. It monitors the running system to detect emerging conditions, generates test cases to thoroughly test the system with the detected conditions, and executes the generated test cases without interfering with the production environment. We detect emerging conditions with either dynamic models of the tested system or neural networks. We use dynamic models to detect conditions that emerge at the unit level, extending Gazzola et al.'s approach, which exploits dynamic grammars to model the tested conditions and reveal strings that emerge in the execution [9]. We leverage neural networks to detect anomalies that emerge at the integration level, inspired by Denaro et al.'s Prevent approach, which detects anomalies in complex cloud systems [7]. We define autonomic tests, which substitute the mocks that characterize operational tests with probes in the execution environment. We generate test oracles for autonomic tests with natural language processing (NLP) and large language models (LLMs), inspired by Blasi et al.'s Jdoctor approach [5]. We execute autonomic tests on digital twins to avoid interference with production.
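The monitor-detect-generate-execute loop described above can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the thesis's implementation: the trigger is reduced to set membership over previously tested inputs (standing in for the dynamic-model and neural-network triggers), and `run_test` stands in for generating and executing a test on an isolated copy of the system.

```python
# Illustrative sketch of the autonomic testing loop; all names are
# hypothetical stand-ins, not the actual components of the approach.

def is_emerging(value, tested_inputs):
    """Trigger: flag inputs never exercised during in-house testing.
    A stand-in for the dynamic-grammar or neural-network triggers."""
    return value not in tested_inputs

def autonomic_testing_step(observed_input, tested_inputs, run_test):
    """Monitor one production input; if it is an emerging condition,
    generate and execute a test for it in isolation (e.g., a digital
    twin), so production is never disturbed."""
    if not is_emerging(observed_input, tested_inputs):
        return None  # condition already covered by existing tests
    verdict = run_test(observed_input)  # test the emerging condition
    tested_inputs.add(observed_input)   # remember it as covered
    return verdict

tested = {"GBP", "EUR"}  # inputs already exercised in the testbed
# A currency code never seen before triggers a new test execution.
result = autonomic_testing_step("JPY", tested, run_test=lambda v: "pass")
```

A production input already covered by the test suite (e.g., "EUR" above) is simply ignored, so the loop spends effort only on genuinely new conditions.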
In the PhD, we address the following research questions: RQ1 Can autonomic testing detect field failures effectively and efficiently before they lead to catastrophic consequences in production?
We plan to study the effectiveness and efficiency of autonomic testing by referring to the following metrics: the number of detected failures with respect to well-documented field failures, the earliness of failure detection, that is, the interval between the detection and the otherwise unavoidable failure, and each component's runtime overhead.
RQ2 To what extent can the runtime information collected from production detect emerging conditions?
We will study the correlation between different types of runtime information using common statistical methods and, after processing the information into standard formats, evaluate its usefulness in terms of its contribution to the final decision of the test trigger using popular machine learning techniques.
RQ3 To what extent can the execution of autonomic testing affect the performance of the deployed software application?
We plan to study the performance overhead of autonomic testing in terms of required resources and extra latency.
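The effectiveness metrics of RQ1 can be made concrete with a small sketch. The function and field names below are hypothetical, and timestamps are assumed to be in seconds; the sketch computes the quantity of detected failures (as recall against documented field failures) and the mean earliness (the interval between detection and the otherwise unavoidable failure).

```python
# Hypothetical sketch of the RQ1 evaluation metrics.

def detection_metrics(detections, failures):
    """detections/failures: dicts mapping failure id -> timestamp (s).
    Returns (recall over documented failures, mean earliness in s)."""
    detected = set(detections) & set(failures)
    recall = len(detected) / len(failures)
    # Earliness: how long before the failure the detection happened,
    # averaged over the failures that were actually detected.
    earliness = sum(failures[f] - detections[f] for f in detected) / len(detected)
    return recall, earliness

recall, earliness = detection_metrics(
    detections={"f1": 100.0, "f2": 250.0},
    failures={"f1": 160.0, "f2": 280.0, "f3": 300.0},
)
# recall = 2/3; mean earliness = (60 + 30) / 2 = 45 seconds
```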

RELATED WORK
There are numerous approaches to detect field failures.
Orso et al.'s Gamma approach partially instruments the code to monitor software systems in production with low impact, and gathers information for continuous evolution [19], impact analysis, and regression testing [18]. Yabandeh et al.'s CrystalBall predicts the consequences of distributed nodes' actions and prevents future inconsistencies at runtime [21].
Some approaches leverage monitored information to generate test cases for detecting failures that may occur in production. Metzger et al. [16] investigate a proactive approach that integrates the monitoring of service-based applications with online testing for failure prediction. Chaos Monkey tests the resilience of deployed services by randomly terminating virtual machines that host production services [3]. Both Metzger et al.'s approach and Chaos Monkey rely on intrusive operations with significant side effects.
Many approaches propose sandboxing techniques to execute tests in production without interfering with the running system. Murphy et al. [17] propose in vivo testing, which executes test suites with objects extracted from the runtime state that the execution reaches after the call of the methods under test, without interfering with users. The in vivo testing framework leverages unit tests and program invariants created by testers while developing the software to both test a set of methods-under-observation when they are invoked and record the test results. The framework executes the methods-under-observation by forking them, incurring significant overhead when dealing with a large number of methods-under-observation.
Gazzola et al. [9] define field-ready testing, an approach to detect faults by generating and executing test cases with scenarios that emerge in production.
As Bertolino et al. [4] summarize in their survey, the approaches proposed so far to test software in production mostly focus on the problem of executing test cases without interfering with the running systems, and rely on existing test cases and simple templates to generate test cases with data from the field.

EXPECTED CONTRIBUTIONS
In this PhD we focus on the problem of automatically generating and executing test cases that exercise conditions that emerge only in production. We define autonomic testing, an approach that detects execution scenarios that emerge only in production and generates test cases that exercise the system in the new scenarios to detect otherwise unavoidable failures.
We exploit dynamic analysis to identify scenarios that involve small clusters of classes, and machine learning to detect scenarios in large subsystems, where dynamic analysis would be impractical. Our dynamic analysis extends that of Gazzola et al. [9], whose triggers dynamically identify objects that have not been tested yet with generative grammars, and are limited to objects of type string. Our triggers will handle values of different types with various dynamic analysis techniques, and scale to large subsystems with suitably trained neural networks that detect emerging conditions.
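For the large-subsystem case, a trained anomaly detector can act as the trigger. The sketch below is a deliberately minimal stand-in, under the assumption that runtime observations can be summarized as numeric metrics: it flags an observation whose distance from every value seen during in-house testing exceeds a threshold. The real triggers would use suitably trained neural networks rather than this distance rule.

```python
# Hypothetical novelty trigger: a minimal stand-in for a trained
# anomaly detector over runtime metrics.

def novelty_trigger(observation, baseline, threshold):
    """Return True when the observation lies farther than `threshold`
    from every value observed during in-house testing."""
    distance = min(abs(observation - b) for b in baseline)
    return distance > threshold

baseline = [0.9, 1.0, 1.1]  # e.g., response-time metrics seen in the testbed
routine = novelty_trigger(1.05, baseline, threshold=0.2)   # False: known regime
emerging = novelty_trigger(5.0, baseline, threshold=0.2)   # True: emerging condition
```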
We define a new type of artifact, autonomic tests, which substitute the mocks of classic tests with probes. While mocks artificially build the conditions for executing the test calls, probes gather those conditions from the execution environment. For example, the autonomic tests for a web shopping application will not mock shopping carts and goods, but will obtain the values of shopping carts and goods that emerge in production directly from the production environment, thus exercising the application under the conditions that emerge in production.
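The mock-versus-probe distinction can be illustrated with the web-shop example. The sketch below uses hypothetical names (`checkout`, `probe_cart`): the classic test fabricates its cart, while the autonomic test receives a probe that reads the cart from the running system, so its oracle must hold for any cart that production can produce.

```python
# Illustrative sketch of mocks vs. probes; all names are hypothetical.

def checkout(cart):
    """Code under test: total price of a cart of items."""
    return sum(item["price"] * item["qty"] for item in cart)

def classic_test():
    # Classic test: the cart is built artificially by the tester (a mock).
    mocked_cart = [{"price": 10.0, "qty": 2}]
    assert checkout(mocked_cart) == 20.0

def autonomic_test(probe_cart):
    # Autonomic test: a probe gathers the cart from the execution
    # environment, so the test runs under emerging production conditions.
    cart = probe_cart()
    assert checkout(cart) >= 0  # oracle valid for any production cart

classic_test()
autonomic_test(lambda: [{"price": 5.0, "qty": 3}])  # probe stand-in
```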
We will extend the state-of-the-art techniques that generate oracles from commonly available information (e.g., comments) by means of NLP, such as TOGA [8] and CallMeMaybe [6], to automatically generate assertion oracles for conditions that emerge in the field.
We will largely rely on existing technology for sandbox execution and digital twins to execute autonomic tests without interfering with the production environment.

EVALUATION CRITERIA
We plan to evaluate our approach in terms of both effectiveness and efficiency. We measure the effectiveness of the approach in terms of both the number of detected failures and detection earliness. We measure the number of failures that the approach can detect by referring to both field failures reported in the literature and new field failures that the approach reveals. We measure the earliness of the detection as the interval between the detection and the incorrect behavior that the user would perceive if the failure occurred.
We measure the efficiency of the approach in terms of both the runtime overhead and the number of test cases that the approach generates and executes to detect a failure, that is, the ratio between the number of detected failures and the number of generated test cases. We measure the overhead of the triggers as the time needed to distinguish unexpected scenarios from common ones. We measure the overhead of the test case generator as the time needed to generate test cases with conditions that emerge in production.
We plan to experiment on several widely used open-source projects and datasets, and to conduct a comprehensive comparison with state-of-the-art approaches. We will start from the well-known Java projects used in [9] as the initial benchmark, and will then move to complex systems with complicated events and user interactions, for instance, learning management systems and online shopping applications.