Selecting and Constraining Metamorphic Relations

Software testing is a critical aspect of ensuring the reliability and quality of software systems. However, it often poses challenges, particularly in determining the expected output of a System Under Test (SUT) for a given set of inputs, a problem commonly referred to as the test oracle problem. Metamorphic Testing (MT) offers a promising solution to the test oracle problem by examining the relations between input-output pairs in consecutive executions of the SUT. These relations, referred to as Metamorphic Relations (MRs), define the expected changes in the output when specific changes are made to the input. Our research is focused on developing methods and tools to assist testers in selecting MRs, defining constraints, and explaining MR outcomes. The research is divided into three parts. The first part focuses on MR collection and description, entailing the creation of a comprehensive repository of MRs from various sources. A standardised MR representation is devised to promote machine-readability and wide-ranging applicability. The second part introduces MetaTrimmer, a test-data-driven approach for systematically selecting and constraining MRs. This approach acknowledges that MRs may not be applicable across the entire test data space. The final part, evaluation and validation, encompasses empirical studies aimed at assessing the effectiveness of the developed methods and validating their suitability for real-world regression testing scenarios. Through this research, we aim to advance the automation of MR generation, enhance the understanding of MR violations, and facilitate their effective application in regression testing.


INTRODUCTION
Software testing is a fundamental quality assurance activity as it ensures the correct operation and quality of software. One of the significant challenges in software testing is known as the test oracle problem. A test oracle determines the correct output of the System Under Test (SUT) for a given input. The test oracle problem arises when the SUT lacks an oracle or when developing one to verify computed outputs is practically impossible [13]. To address the test oracle problem, Chen et al. [6] introduced Metamorphic Testing (MT). MT tackles the test oracle problem by delving into the internal properties of the SUT and evaluating how outputs should vary with specific input changes. The essence of MT lies in analysing the relations between inputs and outputs during multiple SUT executions; these relations are referred to as Metamorphic Relations (MRs). Each MR consists of two parts: the input transformation statement, which outlines how to modify a given input to create a new related input, and the expected output changes, which define the relation that the outputs must exhibit when subjected to these input modifications [12].
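For instance, the classic MR for the sine function, sin(π − x) = sin(x), exhibits exactly these two parts; a minimal Python sketch (the function names and tolerance are illustrative choices, not part of any particular MT framework):

```python
import math

def transform_input(x):
    # Input transformation: derive the follow-up input pi - x
    return math.pi - x

def output_relation(source_output, followup_output, tol=1e-9):
    # Expected output relation: sin(pi - x) must equal sin(x)
    return abs(source_output - followup_output) <= tol

# Source and follow-up executions of the SUT (here: math.sin)
x = 0.7
assert output_relation(math.sin(x), math.sin(transform_input(x)))
```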
To assess the correctness of the SUT using MT, it is necessary to compare the output relations defined by the MRs with the outputs produced by both the transformed input and the non-transformed input. If these relations do not hold for a specific MR, it signifies a violation of that MR, indicating a substantial likelihood of a fault in the SUT. Nevertheless, the absence of MR violations does not guarantee a fault-free SUT. MT has been demonstrated to be an effective testing technique for systems from a variety of application domains, including autonomous driving [33,37], optimisation processes [7,31], cloud and networking systems [4,36], bioinformatics software [25,28], web systems [17,24], cyber-physical systems [1,2], and scientific software [20,22]. However, it is well known that the effectiveness of MT depends strongly on how "good" the MRs used are [6].
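As a minimal sketch of this comparison (the permutation MR and the seeded fault are illustrative), a deliberately faulty summation shows how a violated relation flags a likely fault:

```python
def buggy_sum(xs):
    # Faulty SUT: drops the last element (bug seeded for illustration)
    return sum(xs[:-1])

def check_permutation_mr(sut, xs):
    # MR: permuting the input list must not change the output
    followup = list(reversed(xs))
    return sut(xs) == sut(followup)

print(check_permutation_mr(buggy_sum, [1, 2, 3]))  # False -> MR violated, likely fault
print(check_permutation_mr(sum, [1, 2, 3]))        # True  -> no violation (but no proof of correctness)
```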
Generating "good" MRs is a complex task that requires a deep understanding of the SUT and its application domain. Consequently, MR generation typically requires manual effort, relying on the expertise and knowledge of testers or developers. To automate MR generation, a fundamental question to address is: where do MRs come from? Various approaches have been proposed for the automatic generation of MRs, drawing from different sources, including system specifications or documentation [3,8], source code [32], test code [30], and even interactions with large language models like ChatGPT [35]. However, these approaches have limitations in terms of their applicability, often being tailored to specific domains, such as numerical programs [32], or to specific types of relations, such as equivalence relations [3,8] and polynomial equality and inequality relations [32].
Other approaches, instead of generating MRs from scratch, offer methods for selecting the most suitable MRs from a predefined set for a specific SUT. These approaches leverage previously established MRs that have been used in similar scenarios, allowing for the efficient reuse of MRs and simplifying the overall MT process. An early pioneering work that demonstrated the feasibility of this approach was conducted by Kanewala et al. [15], who introduced Predicting Metamorphic Relations (PMR), which employs Machine Learning (ML) techniques to determine which MRs from a predefined set are applicable to a specific method based on the method's control flow graph (CFG). Several works have followed the PMR approach; for instance, Hardin and Kanewala [14] extended the initial PMR study using semi-supervised learning techniques on a set of CFG-based features tagged with six predefined MRs. Rahman and Kanewala [22] applied the PMR approach to predict three predefined MRs for matrix-based programs. Zhang et al. [34] introduced RBF-MLMR, a multi-label method for predicting MRs using radial basis function neural networks. Unlike PMR, which employs several binary classifiers, RBF-MLMR predicts all possible MRs for a given method; while it uses a different learning technique, it follows the same PMR methodology and the same feature extraction approach. Duque-Torres et al. [12] and Rahman et al. [21] also follow the PMR approach, but instead of CFG-based features, Duque-Torres et al. used software metrics extracted from the method's source code, and Rahman et al. used a Bag of Words (BoW) model over Javadocs as the feature representation.
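As a rough, purely illustrative analogue of the PMR idea (the real PMR approach trains ML classifiers on CFG-based features; the feature vectors and the 1-nearest-neighbour rule below are invented for this sketch):

```python
def predict_mr_applicable(train, query):
    # Toy 1-nearest-neighbour stand-in for PMR's per-MR binary classifiers;
    # train is a list of (feature_vector, mr_applies) pairs
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda pair: sq_dist(pair[0], query))[1]

# Hypothetical features, e.g. (num_loops, num_branches, num_numeric_ops)
train = [((2, 1, 5), True), ((0, 3, 0), False)]
print(predict_mr_applicable(train, (1, 1, 4)))  # True
```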
While the aforementioned approaches have demonstrated promising results, they come with notable limitations: (1) Dependency on Labelled Datasets: Many of these approaches rely on binary classifiers, which require labelled datasets for effective learning. Creating such labelled datasets can be a time-intensive and resource-demanding process, making it impractical in scenarios where obtaining labelled data is challenging or costly. (2) Limitations in Feature Extraction: Another limitation is encountered during the feature extraction process for model training, especially when relying on CFGs or source code metrics. These methods often overlook the possibility of code refactoring. This oversight is significant because code refactoring can alter the structure of the code, potentially making previous ML models irrelevant or less accurate. (3) Assumption of Universal Applicability: The logic behind the selection process using binary classifiers assumes that a chosen MR must universally apply to the entire valid input data space. This assumption does not always hold: an MR that applies to specific input data might not apply to other data within the same valid input data space. Relying on the belief that an MR must consistently apply across the entire input data space can lead to false positives, where an MR violation incorrectly suggests a fault when there is none.
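Limitation (3) is easy to reproduce on a toy SUT. Assuming the (illustrative) MR "incrementing the input increases the output" and `abs` as the SUT, the relation holds only on the non-negative subset of the input space:

```python
def mr_increment_increases(sut, x):
    # MR under test: increasing the input by 1 should increase the output
    return sut(x + 1) > sut(x)

# The MR holds for x >= 0 but is violated for every negative input, so a
# blanket "applies everywhere" assumption would yield false positives
violations = [x for x in range(-5, 6) if not mr_increment_increases(abs, x)]
print(violations)  # [-5, -4, -3, -2, -1]
```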
In addition to the previous limitations, in current MT practice, interpreting MT outcomes, i.e., whether an MR is violated or not, is largely a manual effort. This manual interpretation can be time-consuming and resource-intensive. Furthermore, the cost of MT is directly influenced by the number of MRs used. As the number of MRs increases, the number of test cases may grow exponentially, leading to longer execution times and a greater need for manual inspection of MT outcomes [26,27].

RESEARCH STATEMENTS
Achieving full automation in selecting MRs and understanding the reasons behind MR outcomes is challenging for two main reasons. Firstly, as mentioned in Section 1, while some MRs are easy to identify, others demand complex reasoning or extensive domain knowledge. Secondly, MR applicability can vary across input data, with certain MRs only applying to specific subsets of the input data. Driven by these challenges, the goal of my PhD research is to support the selection of MRs from a predefined set, refine the selected MRs by setting constraints based on the test data, and provide clarity on the factors influencing MR outcomes. Ultimately, the aim is to equip testers with tools and methodologies that enhance MT implementation and usage.
The research is organised into three parts: (1) MRs collection and description, (2) MRs selection and constraint definition, and (3) evaluation and validation.

METHODOLOGY & PROGRESS
In the following subsections, we delve deeper into each research goal (RG), supported by research questions, and provide an overview of the current status.
Additionally, an outline of planned work for each part is presented.

MRs Collection and Description
To address RG 1, it is imperative to identify and comprehend the potential sources from which MRs can be extracted. By exploring various potential sources, including system specifications, domain knowledge, and even the source code, one can pinpoint and extract MRs. Another crucial aspect of this phase is to develop methods for describing MRs in a machine-readable format, making them applicable to different SUTs and transferable to various domains.

Research questions.
To achieve RG 1, we need to address the following research questions:
• RQ 1.1 : What are the sources and where to search for MRs?
• RQ 1.2 : How to formalise MRs in a machine-readable format?
In addressing RQ 1.1, the focus is on the exploration of various potential sources, encompassing system specifications, domain knowledge, and existing works on MT, with the aim of gathering MRs. The method for describing MRs should be versatile, capable of applying to various types of SUTs, and conducive to automated test code generation. Additionally, a prototype for MR description will be designed, building upon existing proposals with more specific objectives [5,18,19]. Second, the MRs in METWiki lack uniform descriptions, introducing inconsistencies that not only hinder user comprehension but also impede the generation of automatic machine-readable representations. Third, there is a lack of correlations between application domains: although the MRs are categorised into eight domains based on their application, the absence of connections between these domains can restrict broader comprehension and cross-domain utilisation of MRs. This updated database will serve as a valuable reference for researchers and practitioners in the field of MT, facilitating the discovery and application of relevant MRs in various software testing contexts. By doing so, RQ 1.1 will be addressed.

Planned work.
Once the database is complete, the next step involves designing an MR representation mechanism that acts as a bridge between the MR database and a machine-readable format.Valuable insights and guidelines for developing this representation format can be found in existing proposals outlined in [5,18,19].These proposals provide a foundation for creating a format that is flexible, adaptable, and compatible with a wide range of SUTs, thereby addressing RQ 1.2 .
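To make the idea concrete, a machine-readable MR description might take a shape like the following (this structure is purely illustrative; it is not the format proposed in [5,18,19] nor the one under development here, and all field names are invented):

```python
# Hypothetical machine-readable MR record (all field names are invented)
mr_permutation = {
    "name": "permutation_invariance",
    "applicable_input_types": ["list of numbers"],
    "input_transformation": "followup_input = reverse(source_input)",
    "output_relation": "output(followup_input) == output(source_input)",
    "constraints": [],  # to be filled in later, e.g. from test-data analysis
}

# A tool consuming such records could first validate their shape
REQUIRED_FIELDS = {"name", "input_transformation", "output_relation", "constraints"}
assert REQUIRED_FIELDS <= mr_permutation.keys()
```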

MRs Selection and Constraint Definition
One of the significant contributions of this research lies in challenging the notion that an MR must universally apply to the entire valid input data space. Instead, it highlights the relevance of MRs within specific subsets of this space. The rigid belief that an MR must encompass the entire input data space can limit its effectiveness. Real-world scenarios often feature distinct behaviours and characteristics across various subsets of the input data space. Recognising these subsets and introducing constraints based on test input data serves as a means to enhance the effectiveness of MRs. Additionally, emphasising the provision of explanations for MR violations emerges as a crucial aspect: it plays a pivotal role in distinguishing whether a violation stems from a code fault or an MR-imposed constraint. This process involves the identification of patterns, trends, and underlying causes of violations.

Research questions.
To achieve RG 2 , we need to address the following research questions:

Achievements.
The baseline approach for selecting MRs from a predefined set is PMR, proposed by Kanewala and Bieman [15]. As previously discussed in Section 1, the idea behind PMR is to develop a model capable of predicting whether a specific MR can be used to test a method in a newly developed SUT. To evaluate the generalisability of PMR across various programming languages, we conducted a replication study on PMR [9]. Our replication study involved reconstructing the preprocessing and training pipeline.
The results of our replication study validated the reported findings and laid the groundwork for subsequent experiments.
In addition, we explored the potential reusability of the PMR model initially trained on Java methods by assessing its suitability for functionally identical methods implemented in Python and C++. While the PMR model demonstrated strong performance with Java methods, its prediction accuracy notably declined when applied to Python and C++ methods. Nevertheless, we found that retraining the classifiers using CFGs specific to Python and C++ methods led to improved performance. Furthermore, we conducted an evaluation of the PMR approach considering source code metrics as an alternative to CFGs for building the models [12]. Our results also led to the conclusion that a generalisation of PMR beyond unit testing, such as its application to system-level testing, does not appear to be feasible. For comprehensive details regarding our replication study and the extension using source code metrics, please refer to our publications [9] and [12].
Motivated by the findings of our previous work, we began to explore the possibility of selecting MRs based on test data and the information gathered during the MT process itself. This involves transforming test data, executing the test data and the transformed test data, and comparing their corresponding outputs. We then analyse the outcomes of this process and investigate all potential factors influencing them.
We initially validated this concept in [10] using a toy example; for a comprehensive understanding of our initial findings, please refer to our publication in [10]. Through this evaluation, we confirmed that MRs may not universally apply to the entire input space. We have expanded upon these initial findings by formalising the approach, known as MetaTrimmer, as described in [11]. MetaTrimmer is a test-data-driven method for selecting and constraining MRs. Similar to PMR, we assume the existence of a predefined list of MRs. However, MetaTrimmer does not depend on labelled datasets and acknowledges that an MR may only be applicable to test data with specific characteristics.
MetaTrimmer consists of three primary steps: i) Test Data Generation (TD Generation), which is responsible for generating random test data for the SUT; ii) MT Process, which performs the necessary test data transformations based on the MRs and generates logs recording information about inputs, outputs, and any MR violations during the execution of the test data and the transformed test data against the SUT; iii) MR Analysis, which involves manual inspection of violation and non-violation results, aiming to identify specific test data or ranges where the MR is applicable and to derive constraints based on this analysis. In this formalisation of MetaTrimmer, we evaluated it on 25 Python methods and six predefined MRs. A replication package with the full set of data generated during our experiments, as well as all scripts, can be found in our GitHub repository (https://tinyurl.com/MetaTrimmer). The preliminary results demonstrate a promising potential for MetaTrimmer.
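The three steps can be sketched as a small pipeline (the SUT, the MR, and the data ranges below are illustrative, not taken from the actual tool):

```python
import random

def td_generation(n, lo=-10, hi=10):
    # Step i) TD Generation: random test data for the SUT
    return [random.randint(lo, hi) for _ in range(n)]

def mt_process(sut, transform, relation, test_data):
    # Step ii) MT Process: transform, execute, and log inputs/outputs/violations
    log = []
    for x in test_data:
        out, followup_out = sut(x), sut(transform(x))
        log.append({"input": x, "output": out, "followup_output": followup_out,
                    "violated": not relation(out, followup_out)})
    return log

# Step iii) MR Analysis: inspect where the MR holds to derive constraints
log = mt_process(abs, lambda x: x + 1, lambda a, b: b > a, td_generation(100))
holds_for = sorted(entry["input"] for entry in log if not entry["violated"])
# For this toy MR the analysis would suggest the constraint "input >= 0"
```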

Ongoing work.
We are currently focusing on enhancing the MR Analysis step in MetaTrimmer by developing data mining techniques to identify relevant features and patterns in the data. The goal is to automate the derivation of constraints, which will increase the coverage of our analysis and potentially uncover constraints that may have been overlooked in manual inspections. Furthermore, we are conducting an empirical evaluation by extending the experiment to a different SUT with distinct characteristics, such as a varying number of inputs, outputs, and domains. This expanded evaluation will provide a broader perspective on the applicability of MetaTrimmer.
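One minimal instance of such constraint mining, assuming a single numeric input and a threshold-shaped constraint (only a sketch; the techniques under development are more general):

```python
def derive_threshold_constraint(log):
    # Search for a single threshold t such that the rule "the MR is
    # violated iff input < t" best matches the MT log
    points = sorted((entry["input"], entry["violated"]) for entry in log)
    best = None
    for i in range(1, len(points)):
        t = (points[i - 1][0] + points[i][0]) / 2
        # entries misclassified by the rule "violated iff input < t"
        errors = sum(violated != (x < t) for x, violated in points)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best  # (threshold, misclassified_entries)

# Synthetic log: the MR is violated exactly for negative inputs
log = [{"input": x, "violated": x < 0} for x in range(-5, 6)]
print(derive_threshold_constraint(log))  # (-0.5, 0) -> constraint: input >= -0.5
```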
Additionally, alongside MetaTrimmer, we are developing the first prototype of MetaExploreX, a tool that offers visualisation and exploration capabilities to support the MR Analysis step of MetaTrimmer; the tool is available online. By formalising and evaluating MetaTrimmer, we aim to answer both RQ 2.1 and RQ 2.2.

EVALUATION AND VALIDATION
The evaluation and validation part (RG 3) of this research aims to assess the effectiveness and efficiency of the developed methods and tools through empirical studies. One of the key objectives is to compare the approach with manual MR selection methods and fully automated methods to validate its ability to identify relevant constraints for the selected MRs and enhance fault detection capabilities.

Research questions.
To achieve RG 3, we focus on answering the following research questions:
• RQ 3.1 : How effective are the proposed methods in finding faults?
• RQ 3.2 : How well do the proposed methods perform compared to existing approaches?

Ongoing work.
We are conducting an empirical evaluation by extending the experiment to a different SUT with distinct characteristics and domains.

Planned work.
For RQ 3.1, we plan to conduct a mutation testing analysis. To address RQ 3.2, which focuses on comparing the performance of our proposed methods to existing approaches, we will conduct a comprehensive comparison at every step of our research. Throughout the development and evaluation of our methods and tools, we will consistently compare their performance against baseline approaches, which may include existing manual MR selection methods or other fully automated approaches commonly used in the field.
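For the planned mutation analysis of RQ 3.1, a minimal sketch of the intended measurement is the fraction of seeded mutants killed by at least one MR violation (the mutants and the MR below are illustrative, not the actual experimental subjects):

```python
def mutation_score(mutants, mrs, test_data):
    # A mutant is "killed" when some MR is violated on some test datum;
    # mrs is a list of (input_transformation, output_relation) pairs
    killed = sum(
        1 for mutant in mutants
        if any(not relation(mutant(x), mutant(transform(x)))
               for transform, relation in mrs for x in test_data)
    )
    return killed / len(mutants)

# MR for abs: abs(-x) == abs(x); one mutant violates it, one survives
mrs = [(lambda x: -x, lambda a, b: a == b)]
mutants = [lambda x: x, lambda x: x * x]
print(mutation_score(mutants, mrs, [1, 2, 3]))  # 0.5
```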

I am currently in the fourth year of my Ph.D. program. During the first half of the fourth year (fall semester 23/24), I have been working on the formalisation of MetaTrimmer and the empirical evaluation extending the initial experiments. In the second half (spring semester 23/24), I plan to continue working on my research, focusing mainly on RG 1 while polishing the general evaluation of MetaTrimmer. It is worth noting that while the detailed comparison with other approaches is discussed in the "planned work" subsections, we are constantly conducting comparisons with other works. I plan to defend my thesis during the fall semester of 2024.
Achievements.
Exploring alternative sources for extracting MRs led to a deeper investigation of Blasi et al.'s MeMo work [3], with a specific focus on the MR-Finder module. This module deduces MRs by analysing sentences in Javadoc comments that describe equivalent behaviours among different methods within the same class. The MR-Finder comprises three key components: i) a predefined set of 10 words denoting equivalence (S10W); ii) a mechanism employing Word Mover's Distance (WMD) to assess semantic similarity between sentences; iii) a binary classifier to detect sentences indicating MRs. Our study on MeMo aimed to enhance the MR-Finder module by reconstructing it and using the original dataset provided by the MeMo authors to replicate their reported results, establishing a baseline for future experiments. Two strategies (STRTG) were explored to improve MR-Finder: i) STRTG No. 1 expanded the S10W set by introducing additional equivalent words; ii) STRTG No. 2 introduced a second template sentence to the MR-Finder module while maintaining the original S10W set. The re-implementation of the MR-Finder module yielded results comparable to those obtained with the original S10W set. For further details on our study and findings, please refer to our publication in [8].
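The S10W matching idea can be illustrated with a toy keyword filter (the word list below is an invented subset, not the actual S10W set, and the real MR-Finder additionally uses WMD similarity and a classifier):

```python
# Invented subset of equivalence-denoting words (illustrative only)
EQUIVALENCE_WORDS = {"equivalent", "same", "identical", "equal"}

def suggests_equivalence_mr(javadoc_sentence):
    # Flag sentences that may describe equivalent behaviour between methods
    tokens = set(javadoc_sentence.lower().split())
    return bool(tokens & EQUIVALENCE_WORDS)

print(suggests_equivalence_mr("This method is equivalent to calling foo(x, 0)."))  # True
print(suggests_equivalence_mr("Returns the maximum value of the array."))          # False
```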
Ongoing work.
Taking inspiration from open bug repositories, METWiki, an MR repository, was developed by Xie et al. [29]. METWiki compiles MRs from approximately 110 applications of MT across diverse domains. These MRs were extracted through an extensive literature review on MT performed by Segura et al. [23], which encompassed a wide array of publications to identify and gather MRs employed in various real-world scenarios. However, despite the considerable efforts invested in creating METWiki, it faces several limitations. First, the lack of updates since its initial publication in 2016 raises concerns regarding the currency and relevance of the MRs it contains.
• RQ 2.1 : [Selection] How to select MRs based on test data?
• RQ 2.2 : [Constraints] How to correctly define constraints on MRs based on test data?
RQ 2.1 explores the feasibility of utilising test data to identify the relevant MRs for a specific SUT. The hypothesis is that an MR that is violated 100% of the time does not apply to the tested SUT, while an MR that is never violated aligns with the tested SUT. RQ 2.2 delves into the use of test data to define constraints for the refinement of MRs, especially in scenarios where violations and non-violations do not reach 100%, which we call mixed cases. The hypothesis is that mixed cases can offer valuable insights into MR behaviour and facilitate the formulation of constraints for regression testing purposes.
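These hypotheses amount to a simple decision rule over the per-test-datum violation flags collected during the MT process (a sketch of the RQ 2.1/RQ 2.2 hypotheses, not the actual MetaTrimmer implementation):

```python
def classify_mr(violation_flags):
    # violation_flags: one boolean per executed test datum
    rate = sum(violation_flags) / len(violation_flags)
    if rate == 1.0:
        return "not applicable"         # always violated -> discard the MR
    if rate == 0.0:
        return "applicable"             # never violated -> keep the MR as-is
    return "mixed: derive constraints"  # mixed case -> constrain via test data

print(classify_mr([True, True, True]))    # not applicable
print(classify_mr([False, False]))        # applicable
print(classify_mr([True, False, False]))  # mixed: derive constraints
```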