Ensuring Critical Properties of Test Oracles for Effective Bug Detection

With software becoming essential in all aspects of our lives, especially in critical areas like medical and avionic systems, the need for robust and reliable software is more pressing than ever. Even seemingly insignificant software bugs can compromise system stability and security, as evidenced by a simple copy-paste error that caused Apple devices to accept invalid SSL certificates and a date-formatting issue that caused a widespread Twitter outage. These realities underscore the need for effective testing and bug detection mechanisms to ensure software reliability. At the heart of this challenge are test oracles, a fundamental component of testing that plays a crucial role in detecting software bugs. Recognizing this pivotal role, my research conducts large-scale studies to understand the impact of test oracles on bug detection effectiveness and to identify limitations in existing test adequacy metrics and automated oracle generation methods. Based on the findings, my research identifies three key properties of test oracles essential for effective bug detection, referred to as CCS (check, correct, strong). These properties ensure that test oracles thoroughly check code, are correct with respect to the specification, and are strong enough for bug detection. To enforce the CCS properties, my research introduces a set of methods, leading to the development of the OracleGuru framework, which significantly enhances the quality of test oracles.


INTRODUCTION
Software influences nearly every aspect of our daily lives, including safety-critical medical devices, autonomous vehicles, and aircraft flight control systems [20,30,33]. The reliability of such software is crucial to ensure stable, safe, and secure operations [20,25,29,30,33,36]. Software bugs, even when seemingly insignificant, can cause major system outages [25], security vulnerabilities [36], or even loss of human lives [32]. Therefore, ensuring the reliability of such systems through effective software testing for early bug detection is of great importance.
Testing is the primary method for detecting bugs early in the software development phase, prior to deployment. It requires a test suite composed of different test cases, where each case is defined by specific inputs and corresponding test oracles. The system under test (SUT) is exercised with these inputs, and its output is checked against the expected output defined by the test oracles [24]. The key value of testing lies in its ability to detect bugs. This capability is contingent upon the satisfaction of four conditions based on the PIE model [39]: C1) the buggy code must be executed, C2) execution of the buggy code must create an error state, C3) the error state must propagate to the output, and C4) the error state must be detected by applicable test oracles.
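As a minimal sketch of these four conditions (class and method names are hypothetical, not taken from any studied system), consider a seeded boundary bug and the oracle that catches it:

```java
class PriceCalculator {
    // Intended behavior: 10% discount (amount / 10 off) for orders of 100 or more.
    static int total(int amount) {
        // BUG: boundary error -- the specification says `amount >= 100`.
        if (amount > 100) {
            return amount - amount / 10;
        }
        return amount;
    }
}

class PieModelDemo {
    public static void main(String[] args) {
        // C1: the input 100 executes the buggy branch condition.
        // C2: `> 100` evaluates wrongly at the boundary, creating an error state.
        // C3: the undiscounted value propagates to the returned total.
        int actual = PriceCalculator.total(100);
        // C4: the oracle compares against the specified expected output (90).
        System.out.println("bug detected: " + (actual != 90));
    }
}
```

If any of the four conditions fails (e.g., the test never uses the boundary input, or no oracle checks the total), the seeded bug goes undetected.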
Unfortunately, most test adequacy metrics, used to assess the quality and completeness of testing efforts, predominantly focus on C1, ensuring that code structures (such as statements, branches, and conditions) are covered by test inputs. Such focus on coverage-based adequacy may inflate perceived testing quality, allowing a test suite to achieve high code coverage yet exhibit poor bug detection effectiveness, often due to inadequate and poor-quality test oracles [12,15,18].
Oracle-based coverage metrics [13] address some limitations of regular code coverage by defining code as covered only if it is both executed and checked by test oracles [21,34]. A code structure is considered checked if it has a data and/or control dependency that influences a test oracle's computation. Therefore, this metric meets conditions C1 and C3 (assuming C2 is true) and may meet C4, although this is not guaranteed. Given that such metrics are more effective than regular coverage [15,44], developing methods to enhance these measures by automatically adding new test oracles to check previously unchecked code is essential.
Meeting condition C4 requires test oracles that are both correct and strong. A test oracle is considered correct if it aligns with the expected program behavior and avoids false positives. However, correctness alone does not make an oracle strong. A strong test oracle can detect a wide range of bugs, effectively differentiating between correct and incorrect program states. Despite their cost-efficiency, automated test oracles often suffer from a high percentage of false positives and poor bug detection effectiveness. These limitations arise because automated oracle generation methods either do not leverage program specifications at all [10,27] or have a limited ability to comprehend and generalize across various specification forms [8]. To effectively detect bugs, automated test oracles must exhibit the necessary properties to fulfill the conditions of the PIE model. This includes thoroughly checking code, aligning with program specifications (correct), and having sufficient strength to detect deviations from intended behavior (strong).
My research hypothesizes that automatically ensuring the CCS (check, correct, strong) properties of test oracles, in alignment with the PIE model, can substantially improve their bug detection effectiveness. Building on this hypothesis, my research aims to achieve four main goals: 1) understanding the extent to which test oracles check code and assessing the impact of unchecked code on bug detection effectiveness, 2) developing methods to automatically check code that was previously unchecked by test oracles, 3) investigating state-of-the-art (SOTA) oracle generation methods to understand their limitations, and 4) developing an oracle generation and refinement method to reinforce the correctness and strength properties. These efforts culminate in answering critical research questions and in developing the OracleGuru framework, a set of tools and methods designed to ensure the CCS properties of test oracles. Figure 1 presents an overview of the OracleGuru framework, which takes the system under test and the initial test suite as inputs. The Checked Adequacy Analyzer in ① identifies gaps ②, i.e., code that is executed but not checked by test oracles. Subsequently, the Recommender in ③ utilizes these gaps and provides actionable recommendations to improve checked adequacy through the generation of additional test oracles. These recommendations can be consumed either by developers or by the Test Oracle Generator component shown in ④ to generate test oracles based on the recommendations. These generated oracles can be further refined based on validation feedback, as shown in ⑤, to ensure they are both correct and strong, thereby satisfying the CCS properties of test oracles. Section 3 discusses further details.

BACKGROUND AND RELATED WORK
This section discusses the relevant background information.

Test Effectiveness
Test effectiveness measures a test suite's capacity to detect bugs, serving as a benchmark for the overall quality and success of the testing process. A widely used approach to evaluate test effectiveness is mutation testing. This method introduces minor code modifications, known as mutations, to create slightly altered versions of the program, referred to as mutants. The effectiveness of a test suite is then measured based on its ability to detect these mutants [6,31]. Besides mutation testing, various real-world bug benchmarks provide standardized means to assess test effectiveness. Defects4J [19] is a notable example. My research utilized both methods to assess the quality of test oracles.
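A minimal sketch of the idea (hypothetical code, with the mutant inlined as a second method for illustration; real mutation tools rewrite the source or bytecode instead):

```java
class MathOps {
    static int sum(int a, int b) { return a + b; }
    // Mutant: arithmetic-operator replacement, `+` mutated to `-`.
    static int sumMutant(int a, int b) { return a - b; }
}

class MutationDemo {
    public static void main(String[] args) {
        // On input (0, 0) both versions agree, so a test using only this
        // input lets the mutant SURVIVE -- evidence of a weak test suite.
        boolean killedByWeakInput = MathOps.sum(0, 0) != MathOps.sumMutant(0, 0);
        // On input (2, 3) the outputs diverge (5 vs -1), so an oracle on
        // the result KILLS the mutant.
        boolean killedByStrongInput = MathOps.sum(2, 3) != MathOps.sumMutant(2, 3);
        System.out.println(killedByWeakInput + " " + killedByStrongInput);
    }
}
```

The higher the fraction of mutants a suite kills, the more effective it is considered.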

Code Coverage
Code coverage is a widespread test adequacy metric. It measures the percentage of a program's source code executed by a test suite [26,40]. Variants of structural coverage, such as statement, branch, and method coverage, are widely used to guide the testing process.
Although useful, studies suggest that coverage alone is not a good adequacy metric and that high code coverage does not guarantee high test effectiveness [5,18,44]. To address this issue, my research identifies coverage gaps (code that is executed but unchecked by test oracles) and integrates the necessary oracles to enhance test effectiveness.
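As a hypothetical illustration of such a gap, the following test fully executes both field updates (achieving 100% statement coverage of `increment`) yet checks only one of them:

```java
class Counter {
    private int count = 0;
    private int peak = 0;  // tracks the highest value reached

    void increment() {
        count++;
        if (count > peak) {
            peak = count;  // executed by the test below, but never checked
        }
    }
    int getCount() { return count; }
    int getPeak()  { return peak; }
}

class CoverageGapDemo {
    public static void main(String[] args) {
        Counter c = new Counter();
        c.increment();
        c.increment();
        // The only oracle checks `count`; the `peak` update is executed
        // (covered) but influences no assertion -- a coverage gap. A bug
        // in the peak logic would go undetected by this test.
        System.out.println("count ok: " + (c.getCount() == 2));
    }
}
```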

Oracle-based Coverage
Test oracles play a crucial role in software testing by validating a program's outcome against its expected outcome [1,35]. Their effectiveness is closely linked to the capacity of a test suite to detect bugs, a relationship supported by empirical research [15,44].
Integrating test oracles into the coverage definition results in a stronger test adequacy criterion, referred to as oracle-based coverage [13]. This criterion considers code as covered when it is both executed and checked by test oracles, indicating the existence of a data/control dependency from the executed code to at least one test oracle. Examples of oracle-based metrics include State Coverage [21,38], Checked Coverage [34], and Observable Coverage [23,41]. My research extends and employs checked coverage to detect unchecked code, then automatically incorporates new test oracles, enhancing the overall effectiveness of a test suite.

Automated Test Oracle Generation
Automatic generation of test oracles is a challenging problem. Most automated methods largely rely on implicit or regression oracles, each with inherent limitations. Regression oracles, for instance, can only detect regression bugs, missing bugs in the current implementation. Learning-based methods that employ natural language processing or pattern recognition utilize code comments and documentation to derive test oracles in line with program specifications [2,3,9,11,28,37]. However, these methods exhibit limited generalizability, produce a high rate of false positives, and demonstrate subpar bug detection effectiveness [9,17].
Generating correct and strong oracles requires a deeper understanding of the program and its specification. These specifications, often written in informal or semi-formal formats with natural language descriptions, pose challenges for automated methods to fully leverage them [9,17]. Large language models (LLMs) are advanced computational models capable of understanding and generating both natural and programming languages. Recently, LLMs have demonstrated impressive effectiveness in various software engineering tasks [4,7,22,42,43]. My research leverages LLMs to learn from informal specifications and transform semi-formal specifications into more formal formats, facilitating the generation of effective test oracles and enabling formal verification.

RESEARCH PROGRESS
Section 3.1 discusses OracleGuru's approach to assessing and ensuring that executed code is also checked by test oracles. Section 3.2 evaluates state-of-the-art automated oracle generation methods, while Section 3.3 describes the oracle generation and refinement method that focuses on the correct and strong aspects of CCS.

Assessing and Improving the Check Property
Components ①, ②, and ③ of the OracleGuru framework, as shown in Figure 1, are responsible for assessing and enhancing the check property of test oracles. ① employs Host Checked Coverage (HCC), an extension of regular code coverage, to identify code that is executed but unchecked by test oracles, i.e., the coverage gap. This gap guides the recommender in improving the check property.

Approach
This section provides formal definitions of the core concepts, followed by a brief discussion of the approach, study, and key findings.

Definition 3.1 (Program Slice). A program slice with respect to a program point p and a variable v is the subset of the program that may affect the value of v at p. This subset preserves the behavior of v at p while possibly eliminating statements unrelated to v at that point.

Definition 3.2 (Host Checked Coverage). HCC quantifies the extent to which test oracles check the executed code. Let c denote a coverage criterion (e.g., statement, object branch), D_c the coverage domain, and t a test case with test oracle o:

HCC_c(t) = map_c(slice(o, trace(t)))    (1)

where trace records the object-code execution trace of test t, slice computes a backward dynamic slice of its second argument (trace(t)) using the first argument (o) as the slicing criterion, and map_c converts a sliced trace to a set of coverage elements in the domain D_c. The criterion c can be instantiated as needed, e.g., HCC_s and HCC_ob for statement and object-branch checked coverage, respectively.
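A small hypothetical example of these definitions: with the value compared by the oracle as the slicing criterion, only the statements that value depends on are in the backward slice:

```java
class SliceDemo {
    static int compute(int x) {
        int logged = x + 10;         // NOT in the slice: the oracle below
        System.out.println(logged);  // never depends on this value
        int y = x * 3;               // in the slice: defines the checked value
        return y;
    }

    public static void main(String[] args) {
        // Slicing criterion: the value compared by this oracle. The backward
        // slice contains { x, y = x * 3 }; `logged` is excluded, so under
        // checked coverage it counts as executed-but-unchecked.
        int result = SliceDemo.compute(2);
        System.out.println("oracle holds: " + (result == 6));
    }
}
```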

Definition 3.3 (Coverage Gap). This metric identifies the portion of the executed program that remains unchecked by test oracles, signifying potential weaknesses in testing. Given the host coverage Cov_h and HCC_h over domain D_h and a test suite T, the coverage gap is defined as:

Gap_h(T) = Cov_h(T) \ HCC_h(T)    (2)

where h indicates the host criterion. For instance, Gap_s and Gap_ob represent the gaps for the statement and object-branch criteria, respectively.
HCC and Gap Computation. The test suite T is run on the program, producing a dynamic trace (TRACE) of executed instructions. From this, a dynamic program dependency graph (DPDG) is constructed, encompassing both data and control dependencies. Using test oracles as slicing criteria, dynamic backward slices (SLICE) are computed. These slices identify statements that influence test oracles within the test suite T. Then, employing Equation 1, HCC_c is calculated for the criterion c. Finally, the gap Gap_h is computed using Equation 2 and is later consumed by the recommender.
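At the set level, this pipeline ends in a plain set difference between covered and checked elements; a minimal sketch (the element names are hypothetical statement ids, and the real analysis derives both sets from the DPDG and dynamic slices):

```java
import java.util.HashSet;
import java.util.Set;

class GapComputation {
    // The gap is the set of covered elements minus the checked ones.
    static Set<String> gap(Set<String> covered, Set<String> checked) {
        Set<String> g = new HashSet<>(covered);
        g.removeAll(checked);
        return g;
    }

    public static void main(String[] args) {
        Set<String> covered = Set.of("s1", "s2", "s3", "s4"); // executed
        Set<String> checked = Set.of("s1", "s3");             // on some slice
        // s2 and s4 were executed but influence no oracle.
        System.out.println(gap(covered, checked));
    }
}
```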
Recommendation Generation. The recommender in ③ takes the coverage gap and the system under test as inputs. It employs static analysis to examine these inputs and suggests new test oracles to improve the check property. It identifies and recommends non-void methods for inclusion in the test suite based on specific criteria: 1) the method reads a field and Gap_h includes a write to that field; 2) it reads a field, and Gap_h involves a call to a method writing to that field; or 3) Gap_h contains fields that the method reads or writes. By suggesting these candidate methods and integrating them into the test suite, with their outcomes validated by test oracles, the recommender systematically ensures that executed code is also checked by test oracles.
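A hypothetical example of criterion 1: the gap contains a write to a field, and a non-void method reading that field is recommended as a new checkpoint:

```java
class BankAccount {
    private int balance = 0;

    void deposit(int amount) {
        balance += amount;  // this field write lies in the gap: executed
    }                       // by the test, but checked by no oracle

    // Non-void method that reads `balance` -- it matches criterion 1, so
    // the recommender suggests adding it (with an oracle) to the test.
    int getBalance() { return balance; }
}

class RecommenderDemo {
    public static void main(String[] args) {
        BankAccount acc = new BankAccount();
        acc.deposit(50);
        // After following the recommendation, the field write is on the
        // backward slice of this oracle, closing the gap.
        System.out.println("checked: " + (acc.getBalance() == 50));
    }
}
```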

Study and Findings
The study consists of 13 large-scale Java systems, 248k source lines of code, 237k lines of test code, 16k test cases, and over 51k assertions. The experiments are designed to investigate the average size of the gaps, the impact of gaps, and the effectiveness of the recommender. The findings, published at ICSE'23 [15], are presented below:
Finding 1: Around 34% of code elements, though executed, remain unchecked by test oracles, highlighting significant coverage gaps.
Finding 2: A strong negative correlation exists between gap size and fault-detection effectiveness, indicating faults hidden within these gaps can easily go undetected.
Finding 3: When removing test oracles from the developer-written test suites, the recommender achieved a high success rate in recommending the exact methods checked by those oracles. On average, 67% of the removed methods were in the top five recommendations, with nearly half in the top one.
Finding 4: In a real-use case scenario, employing the recommender and adding assertions reduced the coverage gap and improved bug detection by an average of 13 percentage points (pp), with up to 58 pp in some cases.
In conclusion, OracleGuru components ①, ②, and ③ not only effectively detect unchecked code but also check it with new test oracles, significantly improving bug detection effectiveness.

Investigating Automated Test Oracle Generation Methods
The OracleGuru recommender necessitates the generation of automated test oracles. To thoroughly investigate the practical usefulness and bug detection effectiveness of state-of-the-art automated test oracles, this research conducted a large-scale investigation. The findings and insights derived from this study greatly motivated the proposed research in Section 3.3.

Approach
This study evaluates two state-of-the-art test oracle generation methods: EvoSuite and TOGA. EvoSuite is a search-based method for regression oracle generation [10], and TOGA is a state-of-the-art deep learning-based method [8]. To assess their effectiveness in real-world scenarios, a comprehensive dataset of 223,557 samples from 25 open-source Java applications was compiled. The generated test oracles were validated to evaluate their false positive rate (FPR), while their bug detection effectiveness was measured through mutation testing.

Study and Findings
The study answers critical research questions characterizing the precision (i.e., the frequency of generating correct oracles) and the fault-detection power of the generated correct oracles. The findings presented below were accepted at ESEC/FSE'23 [17].
Finding 1: The SOTA learning-based method exhibits significant accuracy issues, with a high rate of false positives. For instance, the FPR is 18% when an assertion oracle is expected and 81% when an exception oracle is expected.
Finding 2: In scenarios requiring assertion oracles, the method fails to generate any assertion for 62% of the inputs. Of the generated assertions, 47% are false positives. This is particularly evident with assertEquals oracles, which exhibit a high FPR of 74%.
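To make the failure mode concrete, a false positive here is an oracle that fails on a correct implementation; a hypothetical instance (not drawn from the study's dataset):

```java
class StringUtil {
    // Correct per its (informal) specification: returns the input with
    // surrounding whitespace removed.
    static String normalize(String s) {
        return s.trim();
    }
}

class FalsePositiveDemo {
    public static void main(String[] args) {
        String out = StringUtil.normalize("  hi  ");
        // A generated assertEquals-style oracle expecting the *unnormalized*
        // string fails on this correct code -- a false positive.
        boolean generatedOracleFails = !out.equals("  hi  ");
        // An oracle aligned with the specification passes.
        boolean specAlignedOraclePasses = out.equals("hi");
        System.out.println(generatedOracleFails + " " + specAlignedOraclePasses);
    }
}
```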
Though regression-based oracles outperform learning-based oracles, their inability to detect bugs in the current program version is a notable drawback. In contrast, learning-based methods, despite their promise, exhibit high false positive rates and low bug detection effectiveness. This highlights the need for a more effective approach to test oracle generation, one that ensures both correctness and strength.

Ensuring Correct and Strong Properties of Test Oracles (In Progress)
To advance my research hypothesis, this part of my research develops methods that generate and refine test oracles, ensuring they align with the intended program behavior and are strong enough for effective bug detection. A high-level overview is shown in Figure 2.

Ensuring Correctness Property
As learned from my previous research [17], existing methods, due to their inability to utilize program specifications, often generate too many false positives or fail to detect bugs in the current program version. To address this limitation, I plan to employ methods capable of comprehending both code and specifications written in either formal or natural language. Additionally, I plan to perform differential oracle-based testing to detect discrepancies between specification and implementation.

Ensuring Strength Property
Ensuring the strength property of test oracles is as crucial as their correctness, especially as [17] showed that the SOTA method tends to generate generic oracles with limited test effectiveness. Addressing this issue can involve performing fault-based testing and using a feedback loop to continuously refine and strengthen the test oracles for more efficient bug detection.

Proposed Method, Study and Evaluation
As I plan to employ learning-based methods, the first step is to collect a large-scale dataset that reflects the complexity of real-world code bases. The dataset can be collected from GitHub and categorized based on the quality of the documentation to understand its impact on the generated oracles. The oracle generation module will leverage large language models (LLMs) proficient in understanding both natural and programming languages. Various LLMs, such as CodeGPT and CodeParrot, will be explored for their effectiveness in generating test oracles.
Validation involves integrating the generated oracles with test inputs and executing them to assess their correctness and strength in detecting bugs; the resulting feedback can be further utilized to refine them.
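The generate-validate-refine loop can be sketched as plain control flow; `generateOracle` below is a hypothetical stub standing in for an LLM-backed generator (no real LLM API is assumed), and validation is reduced to a boolean check:

```java
import java.util.ArrayList;
import java.util.List;

class RefinementLoop {
    // Hypothetical stand-in for an LLM-backed generator: each call would
    // normally include the previous validation feedback in the prompt.
    // Stubbed so the first attempt is wrong and the refined one is right.
    static String generateOracle(String methodUnderTest, String feedback) {
        return feedback.isEmpty()
            ? "assertEquals(4, abs(-3))"   // incorrect: fails on correct code
            : "assertEquals(3, abs(-3))";  // refined after feedback
    }

    // Stubbed validation: in the framework this executes the oracle against
    // the SUT and checks for false positives and mutant-killing strength.
    static boolean validates(String oracle) {
        return oracle.startsWith("assertEquals(3,");
    }

    static List<String> refine(String methodUnderTest, int maxRounds) {
        List<String> attempts = new ArrayList<>();
        String feedback = "";
        for (int i = 0; i < maxRounds; i++) {
            String oracle = generateOracle(methodUnderTest, feedback);
            attempts.add(oracle);
            if (validates(oracle)) break;              // correct oracle: keep it
            feedback = "oracle failed on correct SUT"; // feed back and retry
        }
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println(RefinementLoop.refine("abs", 3));
    }
}
```

The bounded loop is a deliberate design choice: refinement stops as soon as a validated oracle is produced, or after a fixed budget is exhausted.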
My research plan is to complete the OracleGuru framework and design studies to answer critical research questions characterizing the effectiveness of specifications in generating correct oracles, the effectiveness of validation feedback in generating strong oracles, and the impact of different LLM sizes and prompts on the oracle generation task.

CONTRIBUTIONS
In summary, my research makes the following contributions:
• Develops the OracleGuru framework, which ensures three critical properties of test oracles (check, correct, and strong) for effective bug detection.
• To ensure the check property, develops the HCC metric to measure checked adequacy and identify coverage gaps, along with a recommender system to mitigate these gaps.
• Investigates SOTA automated test oracle generation methods through a large-scale study, revealing critical limitations.
• Proposes a test oracle generation and refinement method to ensure the correct and strong properties of oracles for effective bug detection.
• Contributes to the field by publishing research findings in top software engineering venues [15,17] and sharing public artifacts [14,16], in adherence to the ACM's open science policy to ensure reproducibility, openness, and high-quality research standards.

TIMELINE FOR COMPLETION
I plan to finish the first checkpoint of my test oracle generation and refinement method by April 2024 and the second checkpoint by September 2024. I plan to finish writing the dissertation and defend my doctoral thesis by December 2024.

Figure 1: Overview of the OracleGuru Framework

Figure 2: An Overview of the Proposed Test Oracle Generation and Refinement Method