Keeping Mutation Test Suites Consistent and Relevant with Long-Standing Mutants

Mutation testing has been demonstrated to be one of the most powerful fault-revealing tools in the tester's tool kit. Much previous work implicitly assumed it to be sufficient to re-compute mutant suites per release. Sadly, this makes mutation results inconsistent; mutant scores from each release cannot be directly compared, making it harder to measure test improvement. Furthermore, regular code change means that a mutant suite's relevance will naturally degrade over time. We measure this degradation in relevance for 143,500 mutants in 4 non-trivial systems finding that, on overage, 52% degrade. We introduce a mutant brittleness measure and use it to audit software systems and their mutation suites. We also demonstrate how consistent-by-construction long-standing mutant suites can be identified with a 10x improvement in mutant relevance over an arbitrary test suite. Our results indicate that the research community should avoid the re-computation of mutant suites and focus, instead, on long-standing mutants, thereby improving the consistency and relevance of mutation testing.


I. INTRODUCTION
Mutation Testing has been demonstrated to be one of the Software Testing community's most effective software testing techniques [1].It seeds artificial faults -also known as mutants.When a test distinguishes the behaviour of the mutant from the original, the mutant is said to be 'killed'.Test effectiveness is captured by the mutation score (the proportion of mutants killed) [2], [3], [4].
We argue that there is a remaining barrier to uptake: mutant consistency.We need a consistent set of mutants for a project so that test effectiveness can be consistently tracked against a common baseline over a series of project releases.Sadly, almost all existing research on mutation testing assumes that a fresh set of mutants is created for each release of the system [2], [3].An alternative would be to fix a set of mutants as a baseline and use this to measure ongoing test effectiveness evolution.However, as the results of this paper show, such a fixed mutation set will quickly degrade in its relevance.
We introduce a mutation test brittleness metric, which can be used to assess a mutation suite, and software project, in terms of the rate at which mutant relevance decays over a series of releases.Our results demonstrate that mutants have diverse life spans across program versions.We show that identifying a high-quality suite of long-standing mutants allows us to maintain mutant relevance over a series of releases: a longstanding mutant suite provides test effectiveness relevance for at least 10x longer than a randomly selected suite.In order to increase consistency and relevance of mutation testing, we conclude that the research community should focus on longstanding mutants, their applications, the opportunities they open, and the remaining open questions.
Specifically, this paper's primary contributions are: 1) The introduction of long-standing mutants, as an important category warranting further study.
2) The introduction of metrics for assessing mutant brittleness, and visualisations of how this metric varies for a given project over time.3) An empirical study of long-standing mutants, based on four non-trivial systems and 143,500 mutants.4) The key motivating finding that long-standing mutant suites enjoy an order of magnitude longer relevance than a randomly selected suite, over the four systems studied.5) An important 'special relationship' between longstanding and subsuming mutants: mutants that are subsuming in one version have a high probability of subsuming in the following versions.This relationship is important because it opens optimistic prospects for mutants' ability to maintain consistency, relevance, effectiveness and subsumption over a series of releases.

II. BACKGROUND AND RELATED WORK
Commit-Relevant Mutants: Applying traditional mutation testing in CI processes is unpractical due to its cost.Meanwhile, Commit-Aware Mutation Testing scales and defines commit-relevant mutants as a set of mutants affected by the changed program behaviour that serve as commit-relevant test requirements to guide test assessment by aiming at the changed program functionality [10], [14].More recently, learning-based approaches [15] emerged capable of learning the commitrelevant mutants defined by a regression change in commit time, thus showing potential and opening a direction towards learning mutant's behaviour in evolving context.Howeverit is necessary to realise that to improve the testing process continuously and thus quantify overall testing quality -it Fig. 1: Example of mutants standing through 3 chronological sequences of code versions.The example code snippet comes from Apache commons-io project, while method read() is excerpted from the BoundedReader.java(versions around 81210eb).The green and red rectangles represent associated commit changes.While java comments (//) describe the set of mutants M i,j , where i is the observed program version, and j is a mutant ID.
would require applying the technique after each program change cycle, to not indebt and lose test requirements.The merit of reappearing mature mutants is that they preserve test requirements and thus complement commit-relevant mutants.In particular, the long-standing mutants promise to keep overlooked testing requirements from oblivion and provide test assessment for a prolonged time.Subsuming Mutants: In an attempt to further scale and make test assessment affordable, many recent studies (consult the survey by Papadakis et al. [2], [16]) consider subsuming mutants to reduce the number of mutants required to measure test adequacy [2].Indeed traditional mutation opperators introduce many trivial, duplicated, equivalent and redundant mutants [17], [16].Specifically, a subsumption relationship between mutants emerges from mutant behaviours, thus suggesting that the majority of the mutants fall into the redundancy basket since distinguishing those subsuming mutants will lead to the identification of all other mutants [18].
More formally, given a finite set of mutants M and a finite set of tests T, mutant m i is said to dynamically subsume mutant m j if some test in T kills m i and every test in T that kills m i also kills m j [13].Calculating test effectiveness over subsuming mutants offers a much better test effectiveness indicator than the traditional mutation score since subsuming mutants have an almost linear relationship between number of test requ Although the evidence is strong and the benefits are multifold, calculating subsumption in real time requires knowledge of mutant's behaviour, usually represented through test execution, which is unpractical in real-time.Recently few approaches have used learning-based methods to target subsuming mutants with a certain level of confidence, considering their location and properties [11].In this paper we reuse the guarantee of the quality of test assessment when the mutants are also long-standing.

A. Motivating Example
A key challenge in the current state of mutation regression testing is a sequential version to version execution.Suppose we keep a record of generated mutants to the same element on which the mutant is generated and version them from one version to another.Figure 1, depicts a chronological sequence of 3 different versions of method read() extracted from Apache Commons-io project.Following the evolution of the code we can observe the evolution of the mutant set.We notice that most mutants reside in the same place across the versions.When a mutant location is unchanged from version to version, we consider that it stands through time.While if a mutant does not occur in next version, we consider it to stop standing, e.g., M 2,4 does not exist as M 3,4 due to deletion changes.From the example, we can observe when a system reaches maturity, as in the case of commons-io, the majority of the changes do not touch the core logic and most mutants M 1,n are long-standing (still exist as M 3,n ) precisely six out of eight.In particular, this indicates potential reuse of past subsumption knowledge concerning selection priority (among other things), avoiding redundancy, addressing technical testing debt and aspiring towards test completeness of mature code components.

B. Implementation Details -Mapping
To reuse mutants and follow their standing (w.r.t.how long a mutant 'stands' in one location without being altered), in this study, we map mutants considering changed lines and the context of change from git diff tool [19].Our scripts take two program versions and map code statements from version to version. Figure 1 indicates information of line numbers shift from version to version.Note that for the purpose of this study, the history length of a mutant is computed from the first studied version till the last chronologically observed version.
We start our analysis by extracting files with the most extended history of change from the open-source mature Apache Commons projects.Then, we use the state-of-theart PIT mutation testing tool [20] to generate mutants per each changed (committed) file, followed by the execution of tests and generation of the killing matrix.Next to the killing matrix, for each changed file, we keep metadata (info.about hunks, timestamps, mutants bytecode index, location etc.) Using extracted information, we create a regression history for each file, making a file-specific historical timeline.In the timeline of each file, a time-point represents a commit that introduces changes to the file.While for each time-point, we calculate subsuming mutants (reminder: the set of mutants, when distinguished, distinguish all other mutants).
Besides the set of mutants, each time-point contains information about mapping changes to the consecutive timepoint.Hence, long-standing mutation metric is a function F (M t ,change map) = M t .Where M t is a mutant from time t, and change map is a map containing information about code transition from time t →t', while M t is the mutant at time t'.Therefore long-standing mutants definition is: Definition.Given n and m as timepoints of the first and last versions under the study of program P, where n > m.A mutant M is said to be long-standing if it exists on the same code element E throughout consecutive versions P n , P n+1 , ... ,P m until the point when the mutant M due to a committed program change does not appear on the code element E of a program version P m+1 .
Given that mutants can 'stand' for several versions or just a few, we argue that the rate at which a mutation suite and mutant relevance decays over a series of releases suggests a mutation test brittleness.Knowing the degree to which mutants hold high-quality tests for a series of releases helps provide more prolonged test effectiveness.Accordingly, we introduce the mutation brittleness metric that measures mutants' longevity and overflow between versions.

IV. INITIAL EVALUATION AND EARLY RESULTS
Evaluation data: Table I shows our subject data.From 4 different well maintained Apache Commons projects, we extracted nine files with the longest history of change and their corresponding commits.It is important to emphasize that due to technical reasons (e.g., PIT mutation testing tool requires green test suite to run), the time gap between commits is rather "longer" than one commit.Nevertheless, this didn't stop us from mapping and observing long-standing mutants through the history of evolving systems.
Mutation Brittleness: Figure 2 depicts brittleness of mutants over time.In particular, it tells us how many mutants of a specific file exist through time, from their inception, on the same initial code elements, w.r.t., a code change has not altered mutants.From the figure, we can observe the diversity of the longevity distribution.Moreover, for the file Lexer, the ratio of long-standing mutants is significant, over 80% over observed time-points.On the contrary, we can see that the CsvPrinter file contains significant changes, and the ratio of the mutants degrade below 50% after the first half of observed points and below 20% in the second half of the observed history points.For other observed files, we see changes do not impact over 70% of mutants in the first quartile of the timeline and between 40-60 % for the rest of the time-points.These results demonstrate that mutants have a diverse lifetime over different evolution timelines, which suggests further investigation of whether mutants keep subsuming dynamic relationships over time and how mutant selection can affect test assessment.
Long-Standing Subsuming Mutants: Figure 3 demonstrates to what extent subsuming mutants convey their dynamic behaviour and how mutation selection can affect the test assessment capability of mutation testing over time.In particular, the figure depicts the scenario in which we select subsuming mutants at a certain point in time and observe the capacity in which they exist over time together with how well they can perform test assessment w.r.t., measuring mutation score.We randomly sample 10-30% of mutants (100 times to remove the threat of randomness; we choose these selection intervals as obviously selecting all mutants leads to traditional mutation testing) from each observed file and consider the file history length -the figures show aggregated results since each subject has different history length.Figure 3a illustrates the need for intelligent mutant selection as we can significantly distinguish between two sets, a) sets of mutants more optimal as they stand longer throughout observed history; hence longer enjoying mutant suites relevance and b) other sub-optimal sets that suffer relevance degradation, w.r.t., represent obsolete test requirements.Interestingly, optimal mutant selection promises continuous tracking of test effectiveness as the margin of degradation is ≈10% on the ratio of selected mutants.In comparison, we observe a worst-case sets of mutants which indicate obsolete test requirements and typical arbitrary sets as if there was no other way to select, then we would end up with a random.To observe how capable those mutants are of affecting test assessment over time -keeping their (a) Long-Standing Subsuming mutants for different observed files throughout their studied history.An optimal set of mutants -when selected -shows a long-standing prospect.In contrast, a worst-case set of mutants when selected shows brittleness -indicating obsolete test requirements.A 'special relationship' between long-standing and subsuming mutants exists and indicates that subsuming mutants in one version are probably subsuming in the following versions.
(b) Mean Square Error of Mutation Score of initially selected and long-standing subsuming mutants.The optimal set of long-standing subsuming mutants demonstrates a capability to perform test assessments over time -preserving subsumption relationships -unlike the worst-case or typical sets which show higher MSE.An optimal set of longstanding mutant suites enjoy an order of magnitude longer relevance.
Fig. 3 subsumption relationships -we calculate the mean square error (MSE) of mutation score (MS) between the initially selected set and those long-standing mutant sets.In Figure 3b we assess the difference in the MSE of MS between the optimal and suboptimal sets of long-standing mutants.We observe that the optimal set of long-standing subsuming mutants keeps MS high over time (low MSE ≈ 0.01%-0.04%),indicating a gradual loss in MS as the mutants stand longer in time, thus preserving mutant suite relevance.Accordingly, it is important to realize the potential in conveying knowledge of previously calculated dynamic relationships of mutants for at least 10x longer than a random selection.In particular, by selecting the sub-optimal sets of mutants, the threat of not preserving the knowledge appears, w.r.t., mutants less capable of test assessment over time, suggesting their low priority (higher MSE ≈0.01%-0.20%).

V. FUTURE PLANS
We believe that long-standing mutants are an interesting category in their own right, and worthy of further research.They have implications not only for mutation testing, but also beyond mutation testing.In this section we set out future plans for further evaluation and investigation of the properties of long-standing mutants and their applications.
Implications regarding subsuming long-standing mutants: Despite showing that subsuming relationships can be preserved from version to version and that mutants' utility can be reused, we do not yet fully understand why subsuming mutants tend to last longer than subsumed mutants.A detailed study is needed to fully understand the subsumption and longevity drivers.
Implications for mutation testing tools: Our results also have implications for the development of future mutation testing tools.In particular, our results suggest the development of a robust mutant versioning system.Existing tools [20], [21], [22] focus on the generation of mutants, but not sophisticated mutant versioning.In future work, we need to investigate mutation testing tools that allow logging mutants' maturity, execution history, and fluctuation over time, supporting approaches that learn mutant behaviour and relating this to code changes.Previous work on flaky mutant detection [23], predictive modelling [24] and hyper-heuristics [25] (in particular that focused on mutation testing [26]) may form a good starting point for this research agenda.
Maximising long-standing mutant fault revelation: By focusing on long-standing mutants, we favour mutants that reside in relatively unchanging parts of the code.There is a natural concern that this may, in turn, lead to us favouring test suites that do not tend to reveal faults in changing parts of the code.Fortunately, the fact that a mutant lies in code region A does not render it insensitive to bugs that lie in (lexically separate) code region B. If there are transitive dependencies between A on B then we can expect high degrees of mutant coupling, and even subsumption between the two regions.This suggests future work on identifying mutants that have high 'transitive dependence reach' through their transitive dependencies, using techniques such as slicing [27] and chopping [28].
Implications of long-standing mutants beyond mutation testing research: The findings reported in this paper have implications beyond mutation testing to automated program repair [29], [30] and genetic improvement [31], [32].It is often been argued that program repair is the inverse of mutation testing.Instead of inserting faults, repair seeks to remove them.Long-standing mutants are therefore also likely to find applications and implications in the field of program repair and genetic improvement research.For example, it would be interesting to explore 'long-standing' repairs as a counterpoint to long-standing mutants.One might reasonably conjecture that such repairs would remain relevant for longer than repairs in areas of code subject to high degrees of churn.However, the empirical assessment of this phenomenon remains an open problem for future work.

Fig. 2 :
Fig. 2: Mutant sets for different observed files and their corresponding studied history (timeline).Each cell in the heat map represents a normalised proportion of a subject history length, and the corresponding colour represents a percentage of mutants from the initial set over time through versions.The results show that mutants have diverse longevity, with brittleness, on average, of 52%.

TABLE I :
Observed files through projects evolution