Dynamic Test Case Prioritization in Industrial Test Result Datasets

Regression testing in software development checks whether new software features affect existing ones. It is a key task in continuous development and integration, where software is built in small increments and new features are integrated as soon as possible. It is therefore important that developers are notified about possible faults quickly. In this article, we propose a test case prioritization schema that combines a static and a dynamic prioritization algorithm. The dynamic prioritization algorithm rearranges the order of execution of tests on the fly, while the tests are being executed. We propose a conditional probability dynamic algorithm for this. We evaluate our solution on three industrial datasets using the Average Percentage of Fault Detection (APFD) metric. The main findings are that our dynamic prioritization algorithm can: a) be applied with any static algorithm that assigns a priority score to each test case, b) improve the performance of the static algorithm when there are failure correlations between test cases, and c) reduce the performance of the static algorithm, but only when the static scheduling is already performed at a near optimal level.


Introduction
Regression testing is used in Continuous Integration (CI) [9] to ensure that new code changes do not cause problems with existing functionality [4]. Since each test suite may take a long time to run, it is desirable that failing tests are executed as early as possible, so that developers can be notified quickly. This means that for each CI cycle (an iteration performed when the software is modified), there should be a technique that prioritizes the test cases that are most likely to fail.
The techniques applied to regression testing can be divided into three types: test suite minimization, test case selection, and test case prioritization [17]. We focus on test case prioritization and will refer to this technique as TCP. To prioritize test cases, feature sets such as code complexity, textual data, coverage information, user input, and history can be used [9]. In our approach, we use historical results from the execution of test cases. The techniques that prioritize tests based on historical data can be divided into static and dynamic ones [10]. The former create a fixed, or in other words static, schedule which is then executed [10], and the results of the current CI cycle are considered only for the next one. The latter adjust the order of test cases during execution [10], so that the verdicts of the already executed test cases are utilized to rearrange the pending ones. For our approach, we focus on the ability of test cases to reveal faults. By fault we mean a defect in the software that is revealed by a failing test case. We utilize the verdicts of test cases and the pairwise probability of tests failing or passing together. Ideally, we want to achieve an optimal schedule where all failing tests are scheduled before the passing ones.
Our solution is based on conditional probability, and we evaluate our algorithm on three industrial datasets. In this paper, we propose an approach that can be applied with any static scheduler that assigns a priority score to each test case. Our approach calculates the conditional probability of failure or success for correlated test cases and focuses on adjusting or rearranging the schedule of test cases based on the results of the current CI cycle. The goal of our study is to improve a schedule created by a static algorithm. The main contributions of this paper are: a) a schema for test case prioritization using a static and a dynamic TCP algorithm, b) a dynamic TCP algorithm based on the conditional probability of failure, and c) the evaluation of the dynamic conditional probability algorithm on three industrial datasets.
This article is organized as follows. Section 2 describes previous work in the field. Section 3 provides a general overview of dynamic prioritization, along with our approach, while Section 4 provides a more detailed description of our solution. Sections 5 and 6 present the evaluation results and the discussion along with the conclusions, respectively.

Previous Work on Test Case Prioritization
There is a large variety of techniques that can be applied to prioritize test cases. Among the static ones, some possible approaches are machine learning [9], probabilistic inference [6,7], simulated annealing [19], the analytical hierarchy process [8], metaheuristic search techniques such as Hill Climbing and genetic algorithms, as well as greedy algorithms [3].
Besides static prioritization, there are dynamic techniques that reprioritize tests during execution. These techniques apply features and metrics such as textual data, similarity of test cases, and test coverage [12,18,5]. For our approach, the most relevant ones are test verdicts and relations between test cases. One of the proposed techniques is called AFSAC [1]. The main differences of our approach are that by correlated tests we mean tests that failed or passed together in the past, and we apply conditional probability to compute the value of this correlation. Another approach is based on the failure history data (FHD) prioritization technique [2] and has similarities to AFSAC. Unlike [2], we do not utilize information concerning the inner structure of a test case. A further example is an approach called REMAP [10,11]. The difference with REMAP is that our dynamic algorithm is designed as a separate component, not tied to a particular static one. Additionally, when we analyze the history of test cases, we apply conditional probability to compute the relations between these test cases in terms of failure and success, and then utilize this information during execution to adjust the individual scores of the pending correlated test cases. A final example is an approach named CoDynaQ [20]. Our approach differs from CoDynaQ in the nature of the datasets, because we prioritize a given set of tests that should be executed in a given software revision, and new sets of tests are not constantly added. Furthermore, we apply conditional probability in a different way, since we do not utilize the past co-failure distribution of test cases to change the scores of the related pending tests.

Combined Static and Dynamic Test Prioritization
In this section, we describe a general framework for combined static and dynamic test prioritization. Our problem setting is regression testing in a CI development process. We assume that the software under test is developed in small incremental iterations and that a CI server builds and tests the system under development automatically as soon as there is a new revision, or at certain time intervals, for example every night. The goal is to schedule the tests to be executed according to a given criterion. This criterion is usually that tests that fail should be executed first.
We can formalize our framework as follows. A sequence of software revisions can be defined as (S_1, T_1), ..., (S_N, T_N), where S_i is a software revision build and T_i is its associated test suite as a set of tests. Each test in a test suite can have the verdict p (pass) or f (fail). A schedule S for a test suite T is a sequence (t_1, ..., t_m) that is a permutation of the elements in T. We should note that test suites may evolve over time, and it is possible to add or remove tests in different software revisions. A schedule fitness metric is a function f : S → R that produces a score for a given schedule.
In this article, we will focus on the Average Percentage of Fault Detection (APFD). A test case prioritization procedure is any function that, given a set of tests T, produces a test suite as a sequence (t_1, ..., t_m). Ideally, the schedule should be optimal with respect to a given fitness function. With this objective, TCP algorithms often use some kind of test case failure estimator, since most fitness functions favor the execution of failing tests first. A failure estimator is a function that, given a test case t and historic information about the software revisions S as well as its development, provides an estimate of the probability that the outcome of the test is a failure for the current software revision.
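Under the assumption used later in this article that each failing test reveals exactly one unique fault, APFD reduces to a simple function of the positions of the failing tests in the schedule. A minimal sketch (function and parameter names are ours, not from any library used in the paper):

```python
def apfd(schedule, failing):
    """Average Percentage of Fault Detection for a schedule, assuming
    each failing test reveals exactly one unique fault.

    schedule: ordered list of test names
    failing:  set of tests that fail in this cycle
    """
    n = len(schedule)
    # 1-based positions at which the failing tests appear
    positions = [i + 1 for i, t in enumerate(schedule) if t in failing]
    m = len(positions)
    return 1.0 - sum(positions) / (n * m) + 1.0 / (2 * n)
```

For example, with four tests and one failure, scheduling the failing test first yields 0.875, while scheduling it last yields 0.125, which matches the intuition that earlier fault detection scores higher.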
Our solution to the TCP problem is implemented by combining the application of a static scheduler and a dynamic one. First, all tests are scheduled with the static scheduler. After this, the test with the highest score as ranked by the static scheduler is executed. The verdict of this test is then used by the dynamic scheduler to reschedule the pending tests. The next test is again selected for execution, and once more, its verdict is used to reschedule the pending tests. This process is repeated until all tests are processed.
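The loop just described can be sketched as follows; `execute` and `dynamic_update` are hypothetical callbacks standing in for the test runner and the dynamic scheduler, and the names are illustrative rather than taken from the paper's implementation:

```python
def run_combined(tests, static_scores, execute, dynamic_update):
    """Repeatedly pick the highest-scored pending test, execute it,
    and let the dynamic scheduler adjust the remaining scores after
    each observed verdict."""
    scores = dict(static_scores)    # initial scores from the static scheduler
    pending = set(tests)
    order = []
    while pending:
        t = max(pending, key=lambda x: scores[x])    # best pending test
        verdict = execute(t)                         # observe pass/fail
        order.append((t, verdict))
        pending.remove(t)
        dynamic_update(t, verdict, scores, pending)  # reschedule the rest
    return order
```

With a `dynamic_update` that boosts a test correlated with an observed failure, that test jumps ahead of statically higher-ranked ones, which is exactly the rescheduling effect described above.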
The advantage of this approach is that different schedulers can be combined in a single TCP algorithm. As the static scheduler we can use any method that assigns priority scores. We propose a novel algorithm for dynamic scheduling in Section 4. This algorithm is based on the conditional probability of pairs of tests having the same verdict in a test cycle, and is independent of the static scheduler.

Dynamic Scheduling using Conditional Probability
This section presents the dynamic scheduler we use. The main idea of the algorithm is that it not only uses a test case failure estimator based on previous revisions S to create a schedule T, but also applies knowledge gained from the current execution. First, each test case t receives an individual score based on the ranking assigned by the static scheduler, according to the initial order created by this scheduler. Second, the algorithm analyzes which test cases tend to fail or pass together, again based on previous revisions. We achieve this by searching for pairwise correlations between tests. By pairwise correlations we mean that two test cases fail or pass together in a given software revision S. Cho et al. [1] used a related approach in which flips, i.e., correlations between changes in verdict, were investigated. In our algorithm, the individual score of each test is updated based on the correlations and the verdicts of the executed tests.
The individual score of a test is inversely proportional to its scheduling rank, that is, the test scheduled at position n is assigned the score 1/n. Pairwise correlations between tests are computed as conditional probabilities in the following ways: P(t_j fails | t_i fails) and P(t_j passes | t_i passes). We remind the reader that the conditional probability P(B | A) is the likelihood of event B happening, given that we know that A has happened. To compute conditional probabilities, we chose a certain number of revisions, which we will refer to as the history length (see subsection 5.2 for more details). The Dynamic Test Case Prioritization algorithm with conditional probability is shown as Algorithm 1. Depending on the verdict of the currently executed test, the individual scores of the correlated tests are either increased or decreased. As correlated tests we consider tests which tend to fail or pass together, and the conditional probabilities of these events are computed before the actual scheduling (corresponding to t_corr_fail and t_corr_pass in Algorithm 1), as shown earlier in this section. The individual scores of these correlated tests (the variable t_scores.t_corr) are either increased or decreased by the constant k multiplied by the conditional probability of failure or success, respectively. When the individual scores have been updated based on the outcome, the dynamic scheduling algorithm selects the pending test with the highest current individual score; once this test has been executed, the individual scores of its correlated tests are again either increased or decreased based on its verdict. The algorithm terminates when all tests have been executed.
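A sketch of this correlation estimate and score update, under our reading of Algorithm 1 (all names are hypothetical; the paper's actual implementation may differ in detail):

```python
def cond_prob(history, ti, tj, outcome):
    """Estimate P(tj has `outcome` | ti has `outcome`) over the history
    window. `history` maps a test name to its list of verdicts
    ("pass"/"fail"), one per revision, aligned across tests."""
    both = sum(1 for a, b in zip(history[ti], history[tj])
               if a == outcome and b == outcome)
    given = sum(1 for a in history[ti] if a == outcome)
    return both / given if given else 0.0

def update_scores(executed, verdict, scores, pending, history, k=0.8):
    """On a failure, raise the scores of tests that co-failed with the
    executed test; on a pass, lower the scores of tests that tend to
    pass together with it, by k times the conditional probability."""
    for t in pending:
        if verdict == "fail":
            scores[t] += k * cond_prob(history, executed, t, "fail")
        else:
            scores[t] -= k * cond_prob(history, executed, t, "pass")
```

In practice the conditional probabilities would be precomputed once per cycle from the history window rather than recomputed on every update, as the text above notes.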
We initially evaluated our algorithm with the Westermo dataset [15] and learned that k = 0.8 demonstrated good performance, so we applied the same constant to Paint Control [14] and IOF/ROL [14]. Ideally, this value should be selected individually for each test system; an expected improvement of the algorithm is therefore to adjust the value of k dynamically.

Evaluation
To effectively evaluate the Dynamic Test Case Prioritization approach, we compare our solution to the Optimal, Worst, and Random algorithms. The Optimal algorithm creates an optimal schedule, i.e., it always schedules all failing tests before the passing ones; thus the dynamic algorithm can only reduce or match the performance of the Optimal algorithm. The Worst algorithm acts in the opposite way, and the dynamic scheduler can only improve or match the performance of the Worst algorithm. The Random algorithm, as the name reveals, creates a schedule in a completely random manner. Due to the non-deterministic nature of the Random algorithm, we execute it 30 times for each evaluated cycle. The dynamic approach can reduce, improve, or match the performance of the Random algorithm.
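The three baselines can be sketched as follows; note that the Optimal and Worst schedulers need the true verdicts up front, which is only possible in a retrospective evaluation like this one (function names are illustrative):

```python
import random

def optimal_schedule(tests, failing):
    # All failing tests first; stable sort preserves input order within groups.
    return sorted(tests, key=lambda t: t not in failing)

def worst_schedule(tests, failing):
    # The opposite: all passing tests first.
    return sorted(tests, key=lambda t: t in failing)

def random_schedule(tests, rng=random):
    # A uniformly random permutation of the test suite.
    order = list(tests)
    rng.shuffle(order)
    return order
```

Because the Random baseline is non-deterministic, its APFD is averaged over repeated runs, matching the 30 executions per cycle described above.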
The solution is developed in Python with the help of libraries such as numpy and pandas. In the evaluation, we apply the APFD metric after each software revision. In this article, we assume that each individual test case reveals one unique fault, because we use only test verdicts in our solution. The Python library dnn-tip (version 0.1.1) was utilized to compute APFD. This library was developed for [16] and then released.

General description of the datasets
Our code was originally developed for an AIDOaRt Hackathon, and we evaluated our solution on a dataset provided by Westermo. Then, to evaluate our solution on more industrial datasets, we modified our code to adjust it to different data formats. As additional datasets we chose Paint Control and IOF/ROL, which were created by ABB Robotics Norway [14]. A general overview of these datasets is shown in Table 1.
As shown in Table 1, the Westermo dataset consists of nine separate test systems, while Paint Control and IOF/ROL include one test system each. For each dataset or test system, we computed the number of unique test cases, the number of test cycles, the number of verdicts (the total number of tests executed in all test cycles), as well as the percentage of failed test cases. From [10] and [11], as well as through our discussions with Westermo, we learned that sometimes the same test case can be executed several times in a CI cycle, and the failure of this test case may not be caused by an error in the system under test. Because of this, we pre-processed the data before executing our algorithm and kept only the last verdict of each test case that was executed several times in a test cycle. In Table 1, the values in parentheses are statistics from the original datasets, while the values without parentheses are statistics from the modified datasets. If only one value is shown in a cell, that value stayed unchanged.
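This pre-processing step, keeping only the last verdict of a test case repeated within a cycle, can be expressed with pandas, which the paper uses; the column names below are illustrative, not the datasets' actual schema:

```python
import pandas as pd

# Illustrative data: test t1 runs twice in cycle 1; only its last
# verdict should be kept, as described above.
df = pd.DataFrame({
    "cycle":   [1, 1, 1, 2],
    "test":    ["t1", "t1", "t2", "t1"],
    "verdict": ["fail", "pass", "pass", "fail"],
})

# Keep the final verdict per (cycle, test) pair, relying on row order
# matching execution order within each cycle.
deduped = df.drop_duplicates(subset=["cycle", "test"], keep="last")
```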

The Westermo dataset
Among the historical data provided in the Westermo dataset [15], the features utilized for the current algorithms are the verdicts of test cases. We treat each of the nine test systems (see Table 1) separately, as each of these creates a unique environment. As each test system in the dataset includes a large number of revisions, we chose the latest 300 revisions for the evaluation. For systems that contain fewer than 300 cycles, we selected fewer cycles for the evaluation, considering the chosen history length. By history length we mean how many previous revisions are considered for the prediction of correlations. In several publications, for example [13] and [14], the authors used a history length of four, as it was empirically shown to lead to better performance of reinforcement learning-based techniques. For our algorithms, we observed that too short a history window led to worse performance. The required history length differs between systems, so an individual parameter should be chosen for each of them. In the current experiments, to apply the same parameter to all systems, the history length was set to 15, since we observed that this history length provided better performance for the majority of test systems. An expected improvement for the algorithm is that the value of the history length is adjusted for each test system dynamically. Since our goal is to evaluate how the dynamic approach can influence the performance of a static one, we consider only test cycles where there are both failing and passing tests. As faults we consider the test verdicts fail (1) and invalid (2), while 0 corresponds to a passed test. In addition, the dataset contains test verdicts equal to 3, which means that some resources were not available when the test should have been executed; we consider these verdicts neither for correlation scores nor for the evaluation of our algorithm.
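The verdict handling described above can be sketched as a small mapping function (a hypothetical helper of ours, not part of the dataset tooling):

```python
# Westermo verdict codes as described above: 0 = pass, 1 = fail,
# 2 = invalid (treated as a fault), 3 = resource unavailable.
def to_outcome(code):
    """Map a raw verdict code to an outcome, or None if the verdict
    should be excluded from correlation scores and evaluation."""
    if code == 0:
        return "pass"
    if code in (1, 2):
        return "fail"
    return None  # code 3: resources unavailable, excluded
```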

The Paint Control and the IOF/ROL datasets
Paint Control and IOF/ROL are industrial datasets created by ABB Robotics Norway and contain data from the daily testing of robots [14]. For both datasets, the approach chosen for the range of cycles and the history length is the same as for the Westermo dataset. Unlike Westermo, the verdicts of test cases for Paint Control and IOF/ROL are divided into only two groups: 0 if the test passed and 1 if it failed. Unfortunately, we do not have as much information about these datasets as about the Westermo dataset, so we can rely only on the information we can derive from the datasets themselves.

Results
We compare the three mentioned static algorithms to the following dynamic ones: Optimal + Conditional Probability, Random + Conditional Probability, and Worst + Conditional Probability. Figure 1 shows the evaluation of performance in terms of APFD for the Westermo, Paint Control, and IOF/ROL datasets. The static approach for the Optimal, Random, and Worst algorithms is compared with the dynamic one. As the random algorithms were executed 30 times for each cycle, the mean APFD of each cycle is considered. The APFD metric shows a slight decrease in the performance of the dynamic approach when applied to the Optimal algorithm; in practice, however, this is not a big drawback, since practitioners rarely have optimal test suites. In terms of mean values, the dynamic approach improves the performance of both the Random and the Worst algorithms, with the most visible change for the Worst algorithm, where in the majority of cases the median values increase by more than 0.8 compared to the static approach. The dynamic approach applied to the Random and Worst algorithms showed the lowest improvement for the IOF/ROL dataset. This may mean that the tests in this dataset are not correlated.

Discussion and Conclusions
In this article, we presented a novel dynamic test case prioritization algorithm that can be applied with any static prioritizer that assigns a priority score to each test case. We use the Worst, Random, and Optimal static algorithms to evaluate how the dynamic prioritization algorithm can improve the schedules created by them. Our algorithm relies on the scores assigned by the chosen static algorithm, as well as on correlations between test cases derived from the history of the chosen time interval. These correlations help to change the individual scores of related pending tests and thus to reschedule them. We evaluated our approach with the Westermo, Paint Control, and IOF/ROL datasets. If there are correlations between test cases, the dynamic approach helps to improve the schedule so that faults are revealed faster, both for the Random and for the Worst static algorithm; thus conditional test failure probability helps to improve a schedule created by a static algorithm. If the static schedule is optimal, the dynamic schedule may decrease the performance, which is expected. In summary, dynamic test case prioritization improves suboptimal test schedules, and in most cases the improvement is substantial.
For industry, our approach helps to dynamically adjust the schedule of test cases and to execute possibly failing test cases earlier, thus decreasing the waiting time for developers. When evaluated with the APFD metric, the dynamic algorithm demonstrated a noticeable increase in performance when applied to the Random and Worst algorithms, which means that it succeeded in scheduling failing test cases earlier. The limitations of our approach are that constant values are used to choose the history length as well as to adjust the scores of correlated test cases. As a possible direction for future work, these values should be adjusted dynamically for each test system. In addition, the dynamic approach should be applied to existing static TCP baselines.

Figure 1 :
Figure 1: Comparison of APFD for static and dynamic prioritization algorithms (higher values are better). The first nine systems are from the Westermo dataset, the last two are Paint Control and IOF/ROL.

Table 1 :
Characteristics of datasets.