Single and Multi-objective Test Cases Prioritization for Self-driving Cars in Virtual Environments

Testing with simulation environments helps to identify critical failing scenarios for self-driving cars (SDCs). Simulation-based tests are safer than in-field operational tests and allow detecting software defects before deployment. However, these tests are very expensive and are too many to be run frequently within limited time constraints. In this article, we investigate test case prioritization techniques to increase the ability to detect SDC regression faults with virtual tests earlier. Our approach, called SDC-Prioritizer, prioritizes virtual tests for SDCs according to static features of the roads we designed to be used within the driving scenarios. These features can be collected without running the tests, which means that they do not require past execution results. We introduce two evolutionary approaches that prioritize the test cases using diversity metrics (black-box heuristics) computed on these static features. These two approaches, called SO-SDC-Prioritizer and MO-SDC-Prioritizer, use single-objective and multi-objective genetic algorithms (GAs), respectively, to find trade-offs between executing the less expensive tests and the most diverse test cases earlier. Our empirical study conducted in the SDC domain shows that MO-SDC-Prioritizer significantly (p-value ≤ 0.1 × 10⁻¹⁰) improves the ability to detect safety-critical failures at the same level of execution time compared to baselines: random and greedy-based test case orderings. Besides, our study indicates that multi-objective meta-heuristics outperform single-objective approaches when prioritizing simulation-based tests for SDCs. MO-SDC-Prioritizer prioritizes test cases with a large improvement in fault detection, while its overhead (up to 0.45% of the test execution cost) is negligible.


INTRODUCTION
Self-driving cars (SDCs) are autonomous systems that collect, analyze, and leverage sensor data from the surrounding environment to control physical actuators at run-time [3,13]. Testing automation for SDCs is vital to ensure their safety and reliability [49,50], but it presents several limitations and drawbacks: (i) the limited ability to repeat tests under the same conditions due to ever-changing environmental factors [50]; (ii) the difficulty of testing the systems in safety-critical scenarios (to avoid irreversible damages caused by dreadful outcomes) [43,47,80]; (iii) the inability to guarantee the system's reliability in its operational design domain due to a lack of testing under a wide range of execution conditions [49].
The usage of virtual simulation environments addresses several of the challenges above for SDC testing practices [1,14,16,30]. Hence, simulation environments are used in industry in multiple development stages of Cyber-physical Systems (CPSs) [76], including model (MiL), software (SiL), and hardware in the loop (HiL). As a consequence, multiple open-source and commercial simulation environments have been developed for SDCs, which can be more effective and safer than traditional in-field testing methods [4].
Adequate testing for SDCs requires writing (either manually or assisted by generation tools [2,37]) a very large number of driving scenarios (test cases) to assess that the system behaves correctly in many possible critical and corner cases. The large running time of simulation-based tests and the large size of the test suites make regression testing particularly challenging for SDCs [35,84]. In particular, regression testing requires running the test suite before new software releases to assess that the applied software changes do not impact the behavior of the unchanged parts [64,86].
The goal of this paper is to investigate and propose black-box test case prioritization (TCP) techniques for SDCs. TCP methods sort (prioritize) the test cases with the aim to run the fault-revealing tests as early as possible [86]. While various black-box heuristics have been proposed for traditional systems and CPSs, they cannot be applied to SDCs as is. Black-box approaches for "traditional" systems sort the tests based on their diversity, computed on the values of the input parameters [52] and the sequence of method calls [19]. However, SDC simulation scenarios (e.g., with road shape, weather conditions) do not consist of sequences of method calls as in traditional tests [2,37]. Approaches targeting CPSs measure test distance based on signals [9] and fault-detection capability [12]. However, this data is unknown up-front without running the SDC tests.
The main challenges to address when designing black-box TCP methods for SDCs concern (i) the definition of features that can characterize SDC safety-critical scenarios in virtual tests; and (ii) the design of optimization algorithms that successfully prioritize the test cases based on the selected features. Therefore, to address these challenges, we formulated the following research questions: • RQ1: To what extent is it possible to prioritize safety-critical tests in SDCs in virtual environments prior to their execution?
We designed and computed 16 static features for driving scenarios in SDC virtual tests, such as the length of the road, the number of left and right turns, etc. These features are extracted from the test scenarios prior to their execution, and we investigated which of them are non-collinear (see Section 4.2.1) according to Principal Component Analysis (PCA). Hence, we introduce SDC-Prioritizer, a TCP approach based on Genetic Algorithms (GAs) that prioritizes test cases of SDCs by leveraging these features. This paper introduces two variants of SDC-Prioritizer, namely SO-SDC-Prioritizer and MO-SDC-Prioritizer. The former utilizes a single-objective genetic algorithm for test prioritization. The latter leverages a well-known and commonly used multi-objective genetic algorithm, called NSGA-II [27], to achieve this goal. Any search-based technique needs to balance exploitation and exploration [25]. Exploitation refers to the ability of the search process to visit regions of the search space within the neighborhood of previously generated solutions (here, test execution orders). Exploration refers to the ability to generate entirely new solutions that are different from the current ones. Poor exploration ability leads to low diversity between the generated solutions, and thereby the search process may easily be trapped in local optima [25]. The rationale behind introducing MO-SDC-Prioritizer alongside SO-SDC-Prioritizer is to compensate for the limited exploration ability of the single-objective variant. The NSGA-II algorithm, utilized in MO-SDC-Prioritizer, provides well-distributed Pareto fronts and thereby brings sufficient diversity into the generated solutions.
• RQ 2 : What is the cost-effectiveness of SDC-Prioritizer compared to baseline approaches?
To answer RQ2, we conducted an empirical study with three different datasets composed of test scenarios that target the lane-keeping features of SDCs. In this context, fault-revealing tests are virtual test scenarios in which a self-driving car would not respect the lane tracking safety requirement [38]. We targeted BeamNG by BeamNG.research [14] (detailed in Section 2) as a reference simulation environment, which has been recently used in the Search-Based Software Testing (SBST) tool competition [66]. The test scenarios for this environment have been produced with SDC-Scissor [15] (which also integrates AsFault [37]), an open-source project that generates test cases to assess SDC behavior (detailed in Section 2).
By comparing SO-SDC-Prioritizer and MO-SDC-Prioritizer with two baselines (namely, random search and the greedy algorithm) on these three benchmarks, we analyze the performance of our techniques in terms of their ability to detect more faults while incurring a lower test execution cost.
Finally, we assess whether the SDC-Prioritizer techniques can be used in practical settings, i.e., whether they add only a small computational overhead to the regression testing process: • RQ3: What is the overhead introduced by SDC-Prioritizer?
The results of our empirical study show that MO-SDC-Prioritizer is the best-performing technique in terms of identifying more safety-critical scenarios in less time. On average, this technique reduces the time required to identify safety-critical scenarios by 6%, 25.5%, and 3% compared to SO-SDC-Prioritizer, random test case orders ("default" baselines for search-based approaches [76,86]), and the greedy algorithm for TCP, respectively. It also shows that MO-SDC-Prioritizer leads to an increase in detected faults (about 63 more) in the first 20% of the test execution time compared to the greedy test prioritization (i.e., the second-best technique according to our assessments). Furthermore, the SDC-Prioritizer approaches do not introduce significant computational overhead in the SDC simulation process, which is of critical importance to SDC development in industrial settings.
The contributions of this paper are summarized as follows: (1) We designed static features that can be used to characterize safe and unsafe test scenarios prior to their execution in the SDC domain. (2) We introduce SO-SDC-Prioritizer and MO-SDC-Prioritizer, two black-box TCP approaches that leverage single- and multi-objective genetic algorithms, respectively, to achieve cost-effective regression testing with SDC tests in virtual environments. (3) We provide a comprehensive and publicly available replication package on Zenodo [21], including all data used to run the experiments as well as the prototype of SDC-Prioritizer, to help other researchers reproduce the study results.
Paper Structure. In Section 2, we summarize the related work, while in Section 3, we outline the approach we have designed and implemented to answer our research questions. In Section 4, we present our methodology and the empirical studies performed to answer our research questions. In Section 5, we report the study results, while in Section 6, we detail the threats to validity of our study. Finally, Section 7 concludes our study, outlining directions for future work.
RELATED WORK AND BACKGROUND
This section discusses the literature concerning (i) test prioritization approaches in traditional systems; and (ii) studies closely related to test prioritization practices in the context of Cyber-physical Systems (CPSs). Finally, the section describes the background on the SDC virtual environment adopted in this study.

Test Prioritization
Approaches aiming at reducing the cost of regression testing can be classified into three main categories [87]: test suite minimization [70], test case selection [20], and test case prioritization [71]. Test suite minimization approaches tackle the regression problem by removing test cases that are redundant according to selected testing criteria (e.g., branch coverage). Test case selection aims to select a subset of the test suite according to the software changes, coverage criteria, and execution cost. Test case prioritization, which is the main focus of our paper, sorts the test cases to maximize some desired properties (e.g., code coverage, requirement coverage) that lead to detecting regression faults as early as possible. A complete overview of regression testing approaches can be found in the survey by Yoo and Harman [87].

Prioritization heuristics.
Approaches proposed in the literature to guide the prioritization of the test cases can be grouped into white-box and black-box heuristics [87].White-box test case prioritization uses past coverage data (e.g., branch, line, and function coverage) and iteratively selects the test cases that contribute to maximizing the chosen code coverage metrics.
Black-box prioritization techniques rely on diversity metrics and prioritize the most diverse test cases within the test suites (e.g., [5,34,52]). Widely-used diversity metrics include the input and output set diameter [34], or the Levenshtein distance computed on the input data [52] and method sequences [19]. Further heuristics include topic modeling [81] or models of the system [44]. Miranda et al. [62] proposed fast methods to speed up the pair-wise distance computation, namely shingling and locality-sensitive hashing. Recently, Henard et al. [45] empirically compared many white-box and black-box prioritization techniques. Their results showed a large overlap between the regression faults that can be detected by the two categories of techniques and that black-box techniques are highly recommended when the source code is not available [45], e.g., in the case of third-party components. Cyber-physical systems (including SDCs) are typical instances of systems with many third-party components [76].
Prioritization heuristics for CPSs differ from those used for traditional software [8].We elaborate more in detail on the related work on test case prioritization for CPSs in Section 2.2.

Optimization algorithms.
Given a set of heuristics (either white-box or black-box), optimization algorithms are applied to find a test case order that optimizes the chosen heuristics. As shown by Yoo and Harman [87], test case prioritization (and regression testing in general) is inherently a multi-objective problem because test quality (e.g., code coverage, input diversity) and execution resources are conflicting in nature. The challenge is choosing balanced trade-offs between lower execution cost and higher code coverage or test diversity, depending on the time constraints and resource availability (e.g., in continuous delivery or integration servers).
Cost-cognizant greedy algorithms are well-known deterministic algorithms introduced for the set-cover problem and adapted to regression testing [20]. The greedy algorithm first selects the test case with the highest code coverage (white-box) or the most diverse one (black-box). Then, the algorithm iteratively selects the test case that increases coverage the most or that is the most diverse w.r.t. previously selected test cases [87].
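The iterative selection above can be sketched in a few lines. The sketch below is a minimal illustration of a cost-cognizant, diversity-based greedy ordering (the feature vectors, the per-unit-cost scoring, and the min-distance criterion are illustrative assumptions, not the exact algorithm evaluated in this paper):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_prioritize(features, costs):
    """Cost-cognizant greedy ordering of tests 0..n-1.

    features: list of feature vectors (one per test).
    costs:    parallel list of estimated execution costs.
    """
    n = len(features)
    remaining = set(range(n))
    # Seed with the test that is, on average, most distant from all
    # other tests, per unit of execution cost.
    first = max(remaining,
                key=lambda i: sum(euclidean(features[i], features[j])
                                  for j in range(n) if j != i) / costs[i])
    order = [first]
    remaining.remove(first)
    while remaining:
        # Iteratively pick the test that is most diverse w.r.t. the
        # already-ordered tests (min distance), again per unit of cost.
        nxt = max(remaining,
                  key=lambda i: min(euclidean(features[i], features[j])
                                    for j in order) / costs[i])
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

With three toy tests, the outlier feature vector is scheduled first, and the remaining tests follow in decreasing diversity-per-cost order.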
Meta-heuristics have been shown to be very competitive, sometimes outperforming greedy algorithms [54,57,64,81]. Marchetto et al. [57] used multi-objective genetic algorithms to optimize trade-offs between cumulative code coverage, cumulative requirement coverage, and execution cost. Besides, genetic algorithms have been widely used to optimize test case diversity [81] for black-box TCP. This paper uses a greedy algorithm, a single-objective genetic algorithm, and a multi-objective genetic algorithm to prioritize simulation-based test cases for self-driving cars, because each type of algorithm has been shown to outperform its counterparts in different domains and programs [12,54].

Regression Testing for CPSs
Regression testing is particularly critical for CPSs, which are characterized by interactions with simulation and hardware environments. Testing with simulation environments is a de facto standard for CPSs, and it is typically performed at three different levels [59]: MiL, SiL, and HiL. During model in the loop (MiL), the controller (e.g., the car) and the environment (e.g., the roads) are both represented by models, and testing aims to assess the correctness of the control algorithms. During software in the loop (SiL), the controller model is replaced by its actual code (software), and this testing phase aims to assess the correctness of the software and its conformance to the model used in the MiL. Finally, during hardware in the loop (HiL), the controller is fully deployed, while the simulation is performed with real-time computers that simulate the physical signals. The testing phase for the HiL aims to assess the integration of hardware and software in more realistic environments [59].
Regression testing for CPSs is more challenging as the execution time of the test cases is much longer due to the simulation [12]. Hence, researchers have proposed different regression testing techniques that are specific to CPSs. Shin et al. [77] proposed a bi-objective approach based on genetic algorithms to prioritize acceptance tests for a satellite system. Their approach prioritizes the test cases according to the hardware damage risks they can expose (first objective) and maximizes the number of test cases that can be executed within a given time budget (second objective). Arrieta et al. [12] used both greedy algorithms and meta-heuristics to prioritize test cases for CPS product lines at different test levels. In further studies, Arrieta et al. [10] focused on multiple objectives to optimize both test case generation and test case prioritization for CPSs. The objectives include requirement coverage, test case similarity, and test execution times. While test similarity for non-CPS systems is computed based on the lexicographic similarity of the method calls and test inputs, Arrieta et al. measured the similarity between the test cases based on the signal values for all the states in the simulation-based test case. Test case similarity computed at the signal level has also been investigated in the context of test case selection for CPSs [9,11].
Our paper differs from the papers above w.r.t. the application domain and the optimization objectives. In particular, we focus on prioritizing simulation-based test cases to assess the lane-keeping features of self-driving cars. Instead, prior work focused on different domains, such as satellites [76], electric windows [9], industrial tanks [10,12], and cruise controllers [10]. In our context, test cases consist of driving test scenarios with virtual roads (e.g., see Figure 1) and aim at assessing whether the simulated cars violate the lane-keeping requirements.
Another important difference is related to the objectives (or heuristics) to optimize for regression testing. Prior works for CPSs prioritize the test cases based on fault-detection capabilities [12] and diversity measured on simulation signals [9-11]. However, the fault-detection capability of the test cases is unknown a priori (i.e., without running the tests). Signal analysis requires knowing the states of the simulated objects in each simulated time step, which is also unknown before the actual simulation. Furthermore, a driving scenario (in our context) is not characterized by signals but only by the initial state of the car and the actual characteristics (e.g., shape) of the roads. Hence, we define features and diversity metrics that consider only the (static) characteristics of the roads that are used for the simulation. Unlike fault-detection capability and signals, our features can be derived from the driving scenario before the actual test execution.

Simulation Environments. Simulation environments for SDCs can be grouped into basic simulation models [41,78], rigid-body [55,88], and soft-body simulations [36,69].
Basic simulation models, such as MATLAB/Simulink models [41,78], implement fundamental signals but target mostly non-real-time executions and generally lack photo-realism. Consequently, while they are utilized for model-in-the-loop simulations and hardware/software co-design, they are rarely used for integration and system-level software testing.
Rigid-body simulations approximate the physics of static bodies (or entities), i.e., by modeling them as undeformable bodies.Basic simulation bodies consist of three-dimensional objects such as cylinders, boxes, and convex meshes [2].
Soft-body simulations can simulate deformable and breakable objects and fluids; hence, they can be used to model a wide range of simulation scenarios.Specifically, the finite element method (FEM) is the main approach for solid body simulations, while the finite volume method (FVM) and finite difference method (FDM) are the main strategies for simulating fluids [60].
Rigid-body vs. soft-body simulations. Both rigid- and soft-body simulations can be effectively combined with powerful rendering engines to implement photo-realistic simulations [14,16,30,83]. However, soft-body simulations can simulate a wider variety of physical phenomena compared to rigid-body simulations. Soft-body simulations are a better fit for implementing safety-critical scenarios (e.g., car incidents [36]), in which high simulation accuracy is of key importance. In the following, we describe the soft-body environment we used in our research investigation, i.e., BeamNG [14].

BeamNG & AsFault.
Creating adequate test scenario suites for SDCs is a hard and laborious task. To tackle this issue, Gambi et al. [38] developed a tool called AsFault [37] to automatically generate driving scenarios for testing SDCs. From a high-level point of view, AsFault combines procedural content generation and search-based testing in order to automatically create virtual scenarios for testing the lane-keeping behavior of SDC software. Specifically, AsFault leverages a genetic algorithm to iteratively refine virtual road networks towards those which cause the ego-car (the simulated car controlled by the SDC software under test) to move away from the center of the lane. The virtual roads are generated inside a driving simulator called BeamNG [14], which can generate photo-realistic, but synthetic, images of roads. Given such characteristics, BeamNG [14] has also been used as the main simulation platform in the 2021 edition of the SBST tool competition [66]. Lane-keeping systems (described in the next sections) continuously track the striped and solid lane markings of the road ahead using advanced image processing, deep learning, or machine learning techniques, and trigger the needed control mechanisms (e.g., steering, braking, and accelerating) to keep the car properly positioned with respect to the road structure.
To evaluate the criticality of generated test cases, the road networks are instantiated in a driving simulation, during which the ego-car is instructed to reach a target location following a navigation path selected by AsFault. During the simulation, AsFault traces the position of the ego-car at regular intervals such that it can identify Out of Bound Episodes (OBEs), i.e., lane departures. An out-of-bound incident is defined as "the case when the car went more than two meters out of the lane center". In our experiments, we use this information to label test scenarios as safe (causing no OBEs) or unsafe (causing at least one OBE).
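The safe/unsafe labeling described above can be sketched as follows. This is a minimal illustration, where `lane_center_offsets` is a hypothetical simplification of AsFault's position traces (signed distance of the ego-car from the lane center at each sampled interval):

```python
def label_scenario(lane_center_offsets, threshold=2.0):
    """Label a scenario 'unsafe' if any traced offset from the lane
    center exceeds the OBE threshold (two meters, per AsFault's
    out-of-bound definition); otherwise label it 'safe'."""
    if any(abs(d) > threshold for d in lane_center_offsets):
        return "unsafe"
    return "safe"
```

A trace that never strays more than two meters from the lane center yields a safe label; a single excursion beyond the threshold marks the whole scenario as unsafe.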
Figure 1 illustrates a sample test scenario generated and executed by AsFault [38]. It includes the start and target points for the ego-car on the map, the whole road network, the selected driving path (colored in yellow), and the OBE locations detected during the execution of the scenario by the ego-car. Hence, each test scenario generated by AsFault consists of a JSON file, which reports multiple nodes and their connections that form a road network, together with the start and destination points and the driving path of the ego-car [38].
2.3.3 SDC Software Use-cases. AsFault supports two AI engines as test subjects while generating test cases, which we use to generate our test suites. These two test subjects drive the ego-car by computing an ideal driving trajectory, which places the ego-car in the center of the lane while driving within a configurable speed limit:
• BeamNG.AI. BeamNG.research ships with a driving AI that we refer to as BeamNG.AI. BeamNG.AI can be parameterized with an "aggression" factor, which controls the amount of risk the driver takes in order to reach the destination faster. According to the BeamNG.research developers, low aggression factors (e.g., 0.7) result in smooth driving, whereas high aggression factors (e.g., 1.2 and above) lead to edgy driving and might cut corners [38].
• Driver.AI. Driver.AI is a trajectory planner shipped with AsFault [38]. AsFault leverages an extension of Driver.AI, which monitors the quality of its predictions at run-time. Hence, differently from BeamNG.AI, Driver.AI analyzes the road geometry and plans the trajectory of the car by computing, for each turn, the maximum safe driving speed (v_max) using the reference formula for centripetal force on flat roads with static friction (μ) [22]: v_max = √(μ · g · r), where r is the turn radius and g is the free-fall acceleration.
It is important to note that we use BeamNG since (i) it can be easily used by developers via Python APIs for creating scenarios, and (ii) it provides access to sensor data (e.g., camera, LiDAR, IMU).
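The per-turn safe speed used by Driver.AI can be illustrated directly from the centripetal-force formula for flat roads with static friction. The sketch below is illustrative only; the friction coefficient value is an assumption, not a value taken from the paper:

```python
import math

def max_safe_speed(turn_radius_m, friction_coeff=0.8, g=9.81):
    """v_max = sqrt(mu * g * r): the speed (m/s) at which the available
    static friction exactly supplies the centripetal force needed to
    follow a flat turn of radius r without sliding."""
    return math.sqrt(friction_coeff * g * turn_radius_m)
```

For a 50 m radius turn with μ = 0.8, this gives roughly 19.8 m/s (about 71 km/h); tighter turns yield proportionally lower safe speeds.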

SDC Road Features
In the context of SDCs, we target the definition of features (or metrics) that characterize SDC tests in virtual environments according to the following requirements: the features (1) can be extracted before the actual execution of the virtual tests; and (2) can characterize (or identify) safe and unsafe scenarios without executing them. In the following, we describe how the SDC features have been designed and measured considering BeamNG as the targeted SDC virtual environment.
In the context of BeamNG, it is possible to compute static features concerning the actual road characteristics of SDC virtual tests. Specifically, as illustrated in Figure 1, each virtual test scenario generated by AsFault (virtual roads) consists of multiple nodes and their connections (i.e., road segments) forming a so-called road network, along with the start and destination points and the driving path of the ego-car. This allows us to compute what we call Road Features, i.e., features or characteristics of the road that will be used during the simulation within the BeamNG virtual environment.
Fig. 1. Sample driving scenarios generated by SDC-Scissor [15] (which also integrates AsFault [38]).
From the road data reported by AsFault, we extract various features for each test scenario (as described in the following paragraph), and we investigate ways to leverage these features to determine the criticality of the test scenarios (as described in Section 4).
Road Features extraction. To extract the features corresponding to each of the generated test scenarios, we leverage the JSON file generated as output by AsFault. These files, as explained before, consist of multiple nodes and their connections that form a road network, with the start and destination points and the driving path of the ego-car. Hence, we extract two sets of road features: the general road characteristics and the road segment statistics. The general road characteristics are attributes that refer to the road as a whole, e.g., the direct distance and road length between the start and destination points, or the total number of turns to the left or right. For each road segment (see Figure 1), we can extract individual metrics such as the road angle and pivot radius. For the segment statistics features, we apply aggregation functions (e.g., minimum, maximum, average) on these individual segment metrics over all road segments in the scenario path. Table 1 reports the features extracted from the original fields in the AsFault JSON (i.e., features F1-F16), specifying the description, type, and expected range of values for each feature. In the next sections, we describe how the designed features are used as inputs to the test case prioritization strategies.
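To illustrate how such road features can be derived from a node-based road description, the following sketch computes a few of them from a simplified list of (x, y) path nodes. The node format and the 5° turn threshold are illustrative assumptions, not AsFault's actual JSON schema or the paper's exact definitions:

```python
import math

def road_length(nodes):
    """Total length of the path through consecutive (x, y) nodes."""
    return sum(math.dist(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1))

def direct_distance(nodes):
    """Straight-line distance between start and destination."""
    return math.dist(nodes[0], nodes[-1])

def turn_angles(nodes):
    """Signed heading change (degrees) at each interior node:
    positive = left turn, negative = right turn."""
    angles = []
    for a, b, c in zip(nodes, nodes[1:], nodes[2:]):
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - b[0], c[1] - b[1])
        cross = v1[0] * v2[1] - v1[1] * v2[0]
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        angles.append(math.degrees(math.atan2(cross, dot)))
    return angles

def extract_features(nodes):
    """A handful of general road characteristics and segment statistics."""
    angles = turn_angles(nodes)
    return {
        "direct_distance": direct_distance(nodes),
        "road_length": road_length(nodes),
        "num_left_turns": sum(1 for a in angles if a > 5),    # 5° threshold: illustrative
        "num_right_turns": sum(1 for a in angles if a < -5),
        "max_abs_angle": max(map(abs, angles), default=0.0),
    }
```

For example, a path that runs 10 units east and then 10 units north has road length 20, one 90° left turn, and no right turns.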

Single-Objective Genetic Algorithm
Several prior studies have utilized evolutionary algorithms (particularly genetic algorithms) for test prioritization to reduce regression testing costs in different types of systems [54]. A typical genetic algorithm (GA) starts with generating a population of randomly generated individuals (box 1 in Figure 2). Each individual can be described as a sequence of parameters, called the chromosome, which encodes a potential solution to a given problem. This encoding can be performed in many forms (such as string, binary, etc.). After generating the first population, the algorithm determines the "fitness" of the individuals according to a fitness function (box 2 in Figure 2). Then, in the selection phase (box 3 in Figure 2), a subset of individuals is selected according to their fitness values to be used as parents for mating. Next, two genetic operators are applied to generate the next population using the selected parents: crossover and mutation. The former (box 4 in Figure 2) combines two parents to produce new individuals (called offspring). The latter (box 5 in Figure 2) alters one or more elements in the offspring to explore nearby solutions in the search space. Finally, the newly generated individuals are saved in a new population (box 6 in Figure 2). The process of generating a new population of individuals from

Fig. 2. An overview of Genetic Algorithm
the previous one continues until either the search objective is fulfilled or the algorithm reaches the maximum number of generations (iterations). This section introduces a single-objective genetic algorithm, called SO-SDC-Prioritizer, that prioritizes the most diverse tests (according to their corresponding feature vectors) per unit of cost for self-driving cars. The following subsections describe detailed information regarding the encoding, operators, and fitness function used in SDC-Prioritizer.

Encoding.
Since the solution for test prioritization is an ordered sequence of tests, SDC-Prioritizer uses a permutation encoding. Assuming that, in our problem, we seek to order the execution of N tests, our approach encodes each chromosome as an N-sized array containing integers that denote the position of a test in the order. For example, let C = ⟨t1, t2, t3⟩ be a chromosome for a test suite with three test cases; then, test case t1 will be executed first, followed by t2 and t3 during regression testing.
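The permutation encoding can be sketched as follows (a minimal illustration; the function names are ours, not from the paper's prototype):

```python
import random

def random_chromosome(n):
    """A chromosome is a permutation of the n test indices;
    position i holds the index of the test executed i-th."""
    perm = list(range(n))
    random.shuffle(perm)
    return perm

def decode(chromosome, tests):
    """Map a chromosome back to the concrete test execution order."""
    return [tests[idx] for idx in chromosome]
```

For instance, decoding the chromosome [2, 0, 1] over tests ["t1", "t2", "t3"] schedules "t3" first, then "t1", then "t2".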

Partially-Mapped Crossover (PMX).
In the crossover, an offspring O is formed from two selected parents P1 and P2, each of size N, as follows: (i) select a random position k in P1 as the cut point; (ii) the first k elements of P1 become the first k elements of O; (iii) extract the N − k elements of P2 that are not yet in O and append them as the last N − k elements of O.
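The three steps above can be sketched directly (a minimal cut-and-fill implementation of the described operator; the optional `cut` parameter is an addition of ours for reproducibility):

```python
import random

def pmx_like_crossover(p1, p2, cut=None):
    """Offspring from two permutation parents, as described:
    (i) pick a random cut point k in p1;
    (ii) the first k genes of p1 open the offspring;
    (iii) fill the remaining N - k slots with p2's genes that are not
         yet in the offspring, preserving their relative order in p2."""
    n = len(p1)
    k = cut if cut is not None else random.randint(1, n - 1)
    offspring = p1[:k]
    taken = set(offspring)
    offspring += [g for g in p2 if g not in taken]
    return offspring
```

For example, crossing [0, 1, 2, 3] with [3, 2, 1, 0] at cut point 2 keeps [0, 1] from the first parent and fills the tail with 3 and 2 in the order they appear in the second parent.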

Mutation operators.
A chromosome C can be mutated one or more times according to the given mutation probability. In each round of mutation, one of the three following mutation operators [75] is selected randomly, each with an equal chance of 1/3, to perform the mutation:
• SWAP mutation: This operator randomly selects two positions in a chromosome C and swaps the two genes (test case indexes in the order) to generate a new offspring.
• INVERT mutation: This operator randomly selects a segment (with a random size) of the given chromosome C. Then, it reverses the selected segment end to end and reattaches it to generate a new offspring.
• INSERT mutation: This operator randomly selects a gene in the chromosome C and moves it to another index in the solution to generate a new offspring.
We consider the three operators above since prior studies [75] showed that using multiple mutation operators for permutation-based optimization problems increases the likelihood of escaping from solutions that are locally optimal under one mutation operator. This mutation procedure is the same in both of the SDC-Prioritizer variants introduced in this paper.
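The three operators can be sketched as follows (a minimal illustration; each helper returns a new permutation and leaves the input chromosome untouched):

```python
import random

def swap_mutation(chrom):
    """Swap the genes at two random positions."""
    c = chrom[:]
    i, j = random.sample(range(len(c)), 2)
    c[i], c[j] = c[j], c[i]
    return c

def invert_mutation(chrom):
    """Reverse a randomly chosen segment of the chromosome."""
    c = chrom[:]
    i, j = sorted(random.sample(range(len(c)), 2))
    c[i:j + 1] = reversed(c[i:j + 1])
    return c

def insert_mutation(chrom):
    """Remove a random gene and re-insert it at another index."""
    c = chrom[:]
    i, j = random.sample(range(len(c)), 2)
    gene = c.pop(i)
    c.insert(j, gene)
    return c

def mutate(chrom):
    """Apply one of the three operators, each with probability 1/3."""
    op = random.choice([swap_mutation, invert_mutation, insert_mutation])
    return op(chrom)
```

Whatever operator is drawn, the result is still a valid permutation of the test indices, which is the key invariant of permutation-based mutation.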

3.2.4 Fitness function in SO-SDC-Prioritizer. Our goal is to (1) promote the diversity of the selected test cases and (2) minimize the execution cost. Hence, the ultimate goal is to run the most diverse tests within a given time constraint. Therefore, we define a fitness function that incorporates both test diversity and execution cost. This is in line with current practice in the literature, which combines surrogate metrics for test effectiveness (e.g., code coverage) with execution cost [53,64,85]. More specifically, let τ = ⟨t1, . . . , tn⟩ be a given test case ordering; its "fitness" (quality) is measured using the following equation:

fitness(τ) = Σ_{i=2}^{n} dist(t_i, t_{i−1}) / (cost(t_i) × i)     (2)

where n is the number of test cases; t_i is the i-th test in the ordering τ; cost(t_i) is the execution cost (simulation time) of the test case t_i; and dist(t_i, t_{i−1}) measures the Euclidean distance between the test cases t_i and t_{i−1}. In other words, each test case in position i positively contributes to the overall fitness (to be maximized) based on its distance to the prior test t_{i−1} in the order τ. Since we want to have as many diverse tests as possible in the same amount of time, the diversity score of each test t_i is divided by its execution cost (to be minimized) and its position i in τ. The factor i in the denominator of Equation 2 promotes solutions where test cases with the best diversity-cost ratio are prioritized early, i.e., they appear early within the order τ.
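The fitness computation described here, where each test contributes its distance to the preceding test discounted by its cost and its position, can be sketched as follows (a minimal illustration under the stated definitions; feature normalization is omitted):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fitness(order, features, costs):
    """Fitness of a test ordering (to be maximized).

    order:    permutation of test indices (execution order).
    features: feature vector per test id.
    costs:    estimated execution cost per test id.
    Each test adds dist(t_i, t_{i-1}) / (cost(t_i) * i), where i is
    its 1-based position, so diverse, cheap tests are rewarded for
    appearing early in the order.
    """
    total = 0.0
    for pos in range(1, len(order)):
        t, prev = order[pos], order[pos - 1]
        i = pos + 1  # 1-based position of t in the order
        total += euclidean(features[t], features[prev]) / (costs[t] * i)
    return total
```

With two tests whose feature vectors are distance 5 apart, and the second test costing 2 time units at position 2, the ordering scores 5 / (2 × 2) = 1.25.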
The distance between two tests t_i and t_j is measured using the Euclidean distance and computed on the feature vectors described in Table 1. It is important to highlight that the different features have different ranges and scales, as reported in Table 1. Hence, the distance values computed using the Euclidean distance might be biased toward the features with larger ranges. To remove this potential bias, we normalized the features using z-score normalization, which is a well-known method to address outliers and to re-scale a set of features with different ranges and scales [39]. The z-score normalization re-scales each feature using the formula (x − μ)/s, where x is the feature value to re-scale, μ is its arithmetic mean, and s is the corresponding standard deviation [39].
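A minimal sketch of z-score normalization for a single feature column (the function name is ours):

```python
import statistics

def z_score(values):
    # Re-scale a list of raw feature values to mean 0 and unit variance,
    # so no feature dominates the Euclidean distance merely by its range.
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]
```

In practice this is applied column-wise to every road feature before any pairwise distance is computed.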
The execution cost of each test case t_i is estimated based on past execution costs gathered from previous test runs, as recommended in the literature [32,86]. This estimation is accurate for SDCs since the cost of running simulation-based tests is proportional to the length of the road and the cost of rendering the simulation, which are fixed simulation elements.

3.2.5
Selection in SO-SDC-Prioritizer. The fitness function defined in Section 3.2.4 allows GAs to determine the fittest individuals (permutations in our case), which should have higher chances of being selected for mating. The selection is made using roulette wheel selection [40], which assigns a selection probability to each individual according to its fitness value. Since our problem is a maximization problem, the selection probability of an individual x_i is calculated as follows:

p(x_i) = f(x_i) / Σ_{j=1}^{N} f(x_j)

where N is the number of individuals in the population and f(x_i) is the fitness value of x_i.
After allocating selection probabilities to the individuals, the algorithm randomly selects individuals according to their selection chance. An individual with a lower fitness value has a lower allocated selection probability and thereby has a lower chance of transferring its genetic material to the next generation.
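Roulette wheel selection can be sketched as follows (illustrative Python; the names are ours, and the final return guards against floating-point round-off):

```python
import random

def roulette_wheel_select(population, fitnesses, rng=random):
    # Each individual's selection probability is its fitness divided by
    # the sum of all fitness values (maximization problem): we spin a
    # "wheel" whose slices are proportional to the fitness values.
    total = sum(fitnesses)
    pick = rng.uniform(0.0, total)
    acc = 0.0
    for individual, fit in zip(population, fitnesses):
        acc += fit
        if pick <= acc:
            return individual
    return population[-1]  # guard against floating-point round-off
```

Calling this repeatedly yields a mating pool in which fitter permutations appear more often, while weaker ones still retain a non-zero chance.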

Multi-objective Genetic Algorithm
This paper also proposes MO-SDC-Prioritizer, a multi-objective variant of SDC-Prioritizer that considers the execution cost and test case diversity as two different objectives to optimize simultaneously. Assume that σ = ⟨t_1, ..., t_n⟩ is a solution (i.e., a test execution order) generated by the search process. The first objective is computed using the following equation:

f_1(σ) = Σ_{i=2}^{n} d(t_i, t_{i−1}) / i    (4)

where d(t_i, t_{i−1}) denotes the distance between a test t_i and its predecessor t_{i−1} in the ordering. The contribution of each test case t_i to the cumulative diversity is divided by its position i in the ordering σ. In other words, this objective promotes solutions where the most diverse test cases are executed earlier.
The second objective in MO-SDC-Prioritizer (f_2(σ) in Equation 5) measures how steadily the cumulative cost increases when executing the tests in a given order σ, aggregating the execution costs c(t_i) of the test cases t_i in σ. Different from SO-SDC-Prioritizer, finding optimal solutions for problems with multiple criteria requires trade-off analysis. Given the conflicting nature of our two objectives, it is not possible to obtain one single solution that optimizes both objectives at the same time [24]. Hence, we are interested in finding the set of solutions that are optimal compromises between the two objectives. For multi-objective problems, the concept of optimality is based on Pareto dominance and Pareto optimality [24]. In particular, a solution x_i dominates another solution x_j (x_i ≺ x_j) if and only if, at the same level of diversity, x_i has a lower cost than x_j. Alternatively, x_i dominates x_j if and only if, at the same level of cost, x_i has a larger diversity than x_j. Among all possible solutions, we are interested in finding those that are not dominated by any other possible solution (Pareto optimality). Pareto optimal solutions form the so-called Pareto optimal set, while the corresponding objective values form the Pareto front.
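For illustration, the dominance relation over our two objectives (diversity to be maximized, cost to be minimized) can be sketched as follows (helper names are ours):

```python
def dominates(a, b):
    # a and b are (diversity, cost) objective tuples: diversity is
    # maximized, cost is minimized.  a dominates b if it is no worse in
    # both objectives and strictly better in at least one of them.
    div_a, cost_a = a
    div_b, cost_b = b
    no_worse = div_a >= div_b and cost_a <= cost_b
    strictly_better = div_a > div_b or cost_a < cost_b
    return no_worse and strictly_better

def pareto_front(points):
    # Keep only the points not dominated by any other point.
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Two mutually non-dominating points (e.g., one more diverse but more expensive) both survive in the front, which is exactly the trade-off set the search returns.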
Figure 4 provides a graphical example of Pareto optimality and non-dominance. All solutions in the grey rectangle dominate the solution at its corner, since they achieve both lower cost and higher diversity. Conversely, all solutions in the blue rectangle are dominated by the solution at its corner, which achieves higher diversity with lower execution cost. Finally, the solutions lying on the front do not dominate one another, while the remaining solutions are dominated.

NSGA-II.
To find Pareto optimal solutions, MO-SDC-Prioritizer uses NSGA-II [27]. This genetic algorithm provides well-distributed Pareto fronts and performs best when dealing with two or three search objectives [27]. NSGA-II shares the main loop of the genetic algorithm depicted in Figure 2; thus, it uses the same encoding schema as well as the mutation and crossover operators discussed in Section 3.2. However, it differs in how parents are selected for reproduction and how the new population is formed for the next generation. Parents are selected using binary tournament selection, which compares pairs of solutions in tournaments and selects the "fittest" solution of each pair for reproduction. Finally, the population for the next generation is obtained by selecting the "fittest" solutions among parent and offspring solutions (elitism).
In NSGA-II, the "fitness" of the solutions is determined using the fast non-dominated sorting algorithm and the concept of crowding distance [26]. The former ranks the solutions according to their dominance relations: all non-dominated solutions within a given population are inserted in the first front F_1 (rank r = 1); the subsequent front F_2 (rank r = 2) contains all solutions that are dominated only by the solutions in F_1; and so on. Hence, solutions in fronts with a lower rank are "fitter" according to Pareto optimality.
Instead, the crowding distance aims at promoting more diverse (isolated) solutions within each dominance rank. For a given solution, it is computed, objective by objective, from the distance between the solution's two neighbors in the same front, summed across the objectives. This heuristic is put in place to avoid selecting individuals that are too similar to each other.
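The ranking step can be illustrated with a naive sketch that repeatedly peels off the current non-dominated front (this conveys the idea only; it is not the optimized fast non-dominated sorting of NSGA-II, and the names are ours):

```python
def fronts_by_rank(points):
    # Rank (diversity, cost) solutions: front 1 holds the non-dominated
    # points, front 2 the points dominated only by front 1, and so on.
    # Diversity is maximized, cost is minimized.
    def dominates(a, b):
        return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

    remaining = list(points)
    fronts = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts
```

Solutions in earlier fronts are preferred during elitist survival; the crowding distance then breaks ties within a front.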

3.3.2
Choosing a Pareto optimal solution. As explained in Section 3.3.1, NSGA-II returns a set of non-dominated solutions at the end of the search process. Hence, the next step is to decide which Pareto optimal solution (best trade-off) to choose among the many different alternatives. The need for such a decision-making approach also arises in other optimization methods for various engineering problems [58]. Researchers have suggested considering various points of interest in the Pareto front, such as the knee points [17], mid points [63], or the extremes of the Pareto front [65].
One common technique to select solutions from the Pareto front is to identify knee points [17,61], i.e., the solutions that minimize the distance to a point in the objective space called the Utopia Point [58]. The utopia point is a (usually unreachable) point with the best observed value for each objective function. Assume that MO-SDC-Prioritizer returns a set of solutions A = {s_1, s_2, ..., s_k} as the final answer. These solutions are non-dominated according to the two search objectives diversity (f_1() in Equation 4) and test execution cost (f_2() in Equation 5). In this case, the Utopia Point U is the following point in the two-dimensional objective space:

U = ( max_{s ∈ A} f_1(s), min_{s ∈ A} f_2(s) )

Since the utopia point usually does not exist among the returned solutions, we select the closest non-dominated solution to this point as the trade-off to use for regression testing.
One common way to measure the distance between two points is the Euclidean distance d(s), which is defined as:

d(s) = sqrt( Σ_i ( f_i(s) − U_i )² )

where f_i(s) is the value of the Pareto optimal solution s for the i-th objective (here, MO-SDC-Prioritizer has f_1 and f_2, as explained in Section 3.3) and U_i is the value of the utopia point for the i-th objective fitness function.
Notably, if the fitness functions have different units, the Euclidean norm becomes insufficient to represent closeness [58]. This is the case in MO-SDC-Prioritizer, as the execution cost and the test diversity have different units. To tackle this issue, we need to normalize the values to make them dimensionless. The most robust technique to perform this normalization is [51,58,68]:

f̄_i(s) = f_i(s) / max(f_i)

where f_i(s) is the actual fitness value of solution s according to the search objective fitness function f_i, and max(f_i) is the maximum fitness value of the generated solutions for f_i.
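Putting the utopia point, the normalization, and the distance together, the knee-point selection can be sketched as follows (illustrative Python; `f1` is maximized, `f2` is minimized, and all names are ours):

```python
import math

def knee_point(solutions, f1, f2):
    # solutions: candidate test orders; f1 to maximize (diversity),
    # f2 to minimize (cost).  Objective values are divided by their
    # maxima to make them dimensionless before measuring the distance
    # to the utopia point (best observed value of each objective).
    v1 = [f1(s) for s in solutions]
    v2 = [f2(s) for s in solutions]
    n1 = [v / max(v1) for v in v1]
    n2 = [v / max(v2) for v in v2]
    utopia = (max(n1), min(n2))

    def dist(i):
        return math.hypot(n1[i] - utopia[0], n2[i] - utopia[1])

    best = min(range(len(solutions)), key=dist)
    return solutions[best]
```

The selected solution is a middle-of-the-front compromise rather than either extreme of the Pareto front.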

Black-box Greedy Algorithm
Greedy algorithms are well-known deterministic algorithms that iteratively build a solution (a test ordering) through greedy steps. Greedy algorithms have been widely used in regression testing for both white- and black-box test case prioritization [81,86]. Hence, we adapt the greedy algorithm to our context and use the set of features we designed for SDCs (see Section 3.1).
The greedy algorithm first computes the pairwise distances among all test cases in the given test suite. Similarly to the GAs, the distance between two test cases t_i and t_j is computed as the Euclidean distance between the corresponding feature vectors. These features are normalized up-front using z-score normalization, as done for the GAs. Then, the greedy algorithm computes the diversity per unit cost of each test t_i using the following equation:

g(t_i) = d(t_i, S) / c(t_i)

where d(t_i, S) measures the average distance between t_i and the tests t_j ∈ S selected in the previous iterations of the algorithm. In this equation, a higher score means that a test is highly dissimilar to the previously selected tests and has a low execution cost. The greedy algorithm initializes the test order S by selecting the test with the largest ratio between (1) its average distance to all other tests in the suite and (2) its execution cost. Then, the algorithm iteratively finds the test case (among the non-selected ones) with the largest average (mean) score with respect to the already selected test cases in S. This selection step corresponds to the greedy heuristic. If multiple tests have the same average score to S, the tie is broken by randomly choosing one of the equally distant test cases. This process is repeated until all test cases are prioritized.
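A sketch of this greedy procedure (illustrative Python; the names are ours, and ties are broken deterministically here rather than randomly as in the paper):

```python
import math

def greedy_prioritize(features, cost):
    # features[t]: normalized feature vector of test t;
    # cost[t]: its estimated execution time.
    tests = list(features)

    def d(a, b):
        return math.sqrt(sum((x - y) ** 2
                             for x, y in zip(features[a], features[b])))

    # Seed: the test with the largest average distance to all other
    # tests per unit of execution cost.
    def seed_score(t):
        others = [u for u in tests if u != t]
        return (sum(d(t, u) for u in others) / len(others)) / cost[t]

    order = [max(tests, key=seed_score)]
    remaining = [t for t in tests if t != order[0]]
    while remaining:
        # Greedy step: pick the test with the largest mean distance to
        # the already selected tests, divided by its cost.
        def score(t):
            return (sum(d(t, s) for s in order) / len(order)) / cost[t]
        best = max(remaining, key=score)
        order.append(best)
        remaining.remove(best)
    return order
```

Unlike the GAs, this procedure makes a single deterministic pass, so its output depends only on the features and costs, not on a random seed.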

STUDY DESIGN
Study design overview. Our empirical study is steered by the following research questions:
• RQ 1: To what extent is it possible to prioritize safety-critical tests in SDCs in virtual environments prior to their execution?
• RQ 2: What is the cost-effectiveness of SDC-Prioritizer compared to baseline approaches?
• RQ 3: What is the overhead introduced by SDC-Prioritizer?
In Section 3.1, we introduced multiple static features of virtual driving scenarios (see Table 1), some of which might be collinear or not useful for prioritizing test cases in a cost-effective way. Hence, our first research question (RQ 1) aims to determine which features to consider, by leveraging statistical methods based on collinearity analysis [29,89]. Our second research question (RQ 2) aims to assess the extent to which the test case orders produced by the SDC-Prioritizer techniques (SO-SDC-Prioritizer and MO-SDC-Prioritizer) can detect more faults (effectiveness) at a lower execution cost (efficiency) with respect to a naive random search. Specifically, as elaborated in detail later, random search is a critical baseline for search-based solutions since it is a "sanity check" to assess whether more "sophisticated" techniques are needed for a given domain [76]. In RQ 2, we also compare the internal search algorithms discussed in Section 3, namely the greedy algorithm and the single-objective and multi-objective genetic algorithms. With our last research question (RQ 3), we measure the overhead required to prioritize SDC test cases in virtual environments with the SDC-Prioritizer techniques. This is an important aspect to investigate since a critical constraint in regression testing is that the cost of prioritizing test cases should be smaller than the time needed to run the test suite [86]. Therefore, fast approaches are fundamental from a practical point of view to enable rapid and continuous test iterations during SDC development [62].

Benchmark Datasets
The benchmark used in our study consists of three experiments performed on corresponding datasets. For each experiment, virtual test scenarios are generated and labeled as safe or unsafe by SDC-Scissor [15] (which also integrates AsFault). As described in Table 2, the first experiment leverages a dataset (referred to as BeamNG.AI.AF1) that includes 1,178 virtual test scenarios generated for BeamNG.AI with an aggression factor set to 1. Since this is a cautious driving setup for BeamNG.AI, this dataset includes mostly safe scenarios, with about 26% of the scenarios being unsafe (causing OBEs). For the second experiment, we created a new dataset (referred to as BeamNG.AI.AF1.5) where we configured BeamNG.AI to drive in a more aggressive driving style. This resulted in 5,638 test scenarios, among which 45% are unsafe.
To increase the reliability and applicability of our results, we used another SDC driving AI, namely Driver.AI, to generate the dataset of our last experiment. This last experiment was needed because using test scenarios with Driver.AI allows drawing a direct comparison with BeamNG.AI and investigating whether the features we study are limited to BeamNG.AI or can be applied to other driving AIs. Thus, we used SDC-Scissor [15] (which also integrates AsFault) to re-run the test scenarios in BeamNG.AI.AF1.5 with Driver.AI, resulting in more cautious driving with only 19% of the scenarios being unsafe.

Analysis Method
4.2.1 RQ 1: Feature Analysis. In a real scenario, we cannot determine the tests' safety without executing all of them. Hence, we do not include the feature that indicates whether a test is safe or unsafe in this research question. To answer our first research question, we analyze the orthogonality of the other 16 features introduced in Section 3.1. In particular, we use principal component analysis (PCA) to statistically assess whether all features are useful for test case prioritization or whether certain features are multicollinear. A group of features is said to be collinear if they are linearly related and thus implicit measures of the same phenomenon (road characteristics in our case). Addressing data collinearity is vital to avoid distance measurements being skewed toward the collinear features [29]. Besides, distance metrics (including the Euclidean distance) might not truly represent the extent to which the data points (test cases) are diverse when using a large number of features [31].
PCA is a well-founded, analytical, and established technique that identifies the orthogonal dimensions (principal components) in the data and measures the contributions of the different features to such components. In particular, the PCA decomposes each dataset D (e.g., BeamNG.AI.AF1) into two matrices:

D_{n×m} = T_{n×c} × W_{c×m}

In this equation, n is the number of test cases; m is the number of original features; c is the number of principal components; T contains the coordinates of the test cases in the component space; and W denotes the features-to-component score matrix. More specifically, W contains each feature's scores (contributions) to the latent components identified by the PCA. In an ideal dataset with zero collinearity, the features would exclusively contribute to different principal components.
PCA can be used not only to detect but also to alleviate collinearity via dimensionality reduction [31]. In particular, a lower-dimensional matrix can be obtained by choosing the top h < c principal components and keeping only the corresponding columns of the score matrix:

D′_{n×h} = T_{n×h}    (10)

Notice that D′ will contain new (non-collinear) features that are built as combinations of the old ones. This process is widely known in machine learning as feature extraction [39].
To answer RQ 1, we use PCA to detect (eventual) multi-collinearity among the different road features. If multi-collinearity is detected, we use PCA for dimensionality reduction and feature extraction by selecting the top h principal components corresponding to 98% of the original data variance, as recommended in the literature [39]. The relevant features selected in RQ 1 (discussed in Section 5.1) are then used to investigate RQ 2 and RQ 3 and applied to all search algorithms, i.e., both the greedy and the evolutionary algorithms.
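The PCA-based feature extraction can be sketched via an SVD (illustrative Python using NumPy; the function name and the exact handling of the variance threshold are ours):

```python
import numpy as np

def pca_reduce(D, variance_kept=0.98):
    # Center the (tests x features) matrix, compute its principal
    # components via SVD, and keep the top-h components that cumulatively
    # cover `variance_kept` of the data variance.
    X = D - D.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()
    h = int(np.searchsorted(np.cumsum(explained), variance_kept)) + 1
    # New, non-collinear meta-features: projection onto the top-h components.
    return X @ Vt[:h].T
```

Collinear feature pairs collapse into a single component, so the returned matrix has fewer, orthogonal columns.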

RQ 2 :
Cost-effectiveness of SDC-Prioritizer Compared to Baseline Approaches. To assess the effectiveness of the test case prioritization techniques introduced in this study, we look at the rate of fault detection (i.e., how fast faults are detected during the test execution process). Hence, a better technique provides a test execution order that detects more faults while executing fewer tests. To measure the rate of fault detection in our evaluation, we use a well-known metric in test case prioritization, called Cost-cognizant Average Percentage of Fault Detection (APFD_c) [32,33,48,56,72]; a higher APFD_c means a higher fault detection rate. Since no technique has been introduced for measuring fault severity in the SDC domain, we consider the same severity for all faults. Hence, in our case, APFD_c can be formally defined as follows:

APFD_c = ( Σ_{i=1}^{m} ( Σ_{j=TF_i}^{n} t_j − (1/2) t_{TF_i} ) ) / ( Σ_{j=1}^{n} t_j × m )

where σ is the list of tests that need to be sorted for execution; t_j is the execution time required to run the test positioned as the j-th test; n and m are the number of tests and faults, respectively; and TF_i is the position, in the given test permutation, of the first test that detects fault i. We also assessed whether there is significant variation in the execution time (simulation time) of the simulation-based tests by executing them multiple times. In particular, we randomly selected 50 tests from our dataset and ran each of them ten times. As a result, the average standard deviation of test execution time is 1.67s (less than 1% variation) and the average coefficient of variation is 0.01.
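With equal fault severities, APFD_c can be computed as follows (illustrative Python with 0-based positions; the names are ours):

```python
def apfd_c(costs, fault_positions):
    # costs[j]: execution time of the j-th test in the order (0-based);
    # fault_positions[i]: 0-based position of the first test that
    # detects fault i.  Equal severities are assumed, as in the paper.
    n, m = len(costs), len(fault_positions)
    total = sum(costs)
    num = sum(sum(costs[tf:]) - 0.5 * costs[tf] for tf in fault_positions)
    return num / (total * m)
```

Orders that expose faults earlier (i.e., smaller fault positions) leave more residual cost in the inner sum and therefore score higher.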
To draw a statistical comparison between SO-SDC-Prioritizer, MO-SDC-Prioritizer, random search, and the greedy algorithm, we use the Vargha-Delaney Â12 statistic [82] to assess the effect size of the differences between the APFD_c values achieved by these approaches. A value Â12 > 0.5 for a pair of factors (A, B) indicates that A achieves higher APFD_c values than B in the majority of the runs.

4.2.4
Parameter setting. We used the default parameter values of the genetic algorithms, as used in previous studies on TCP (e.g., [33,64,81]). In particular, we use the following parameter values:
• Population size: we used a pool of 100 test permutations.
• Crossover operator: we used the partially-mapped crossover (PMX) for permutation problems (see Section 3.2) with a crossover probability p_c = 0.80. This corresponds to the default value in Matlab and is in line with the recommended range 0.45 ≤ p_c ≤ 0.95 [18,23].
• Mutation operator: we used the hybrid mutation operator, introduced in Section 3.2.3, with a mutation probability p_m = 1/n, where n is the number of test cases to prioritize. This choice is in line with recommendations from previous studies [18,74], which showed that p_m values proportional to the chromosome length produce better results.
• Stopping criterion: the search ends after 4,000 generations (or, equivalently, 400K fitness evaluations). We opted for a larger number of generations compared to prior studies in test case prioritization (e.g., [12,28,54]) since the test suites in our benchmark are much larger than those used in prior TCP studies for traditional software (e.g., the programs in the SIR dataset [46]).

RESULTS
This section reports, for each research question, the obtained results and main findings.

RQ 1 : SDC Features Analysis
Tables 3, 4, and 5 show the results of the PCA for the datasets BeamNG.AI.AF1, BeamNG.AI.AF1.5, and Driver.AI, respectively. As we can observe, for each dataset, PCA identifies 16 (independent) principal components, whose relative importance is reported in the last (bottom) row of the corresponding table. As these rows indicate, the importance of the components is similar across all tables (i.e., datasets): the first component (C1) covers about 30% of the variance in the data (importance), followed by the second component (C2) with about 25%, and so on. Moreover, in all datasets, the last six principal components are negligible as they contribute less than 1% of the total variance. Looking at the scores achieved by the different features, we can observe that they contribute to different (orthogonal) latent components. Hence, the features capture different characteristics of the road segments in the test scenarios. Individual features exclusively capture certain components. For example, in Table 3, C2 (which accounts for 26% of the variance) is largely captured by feature F6 (i.e., number of turns) with a score greater than 79%. Similar observations can be made for other components: C3 (14% of importance) is captured by F1 (direct end-to-end distance) with an 87% score; C5 (6% of importance) is exclusively related to F5 (number of straight segments) with a 96% score; and so on. Similar results can also be observed in Tables 4 and 5.
Looking closely at C1, C9, and C10 in Table 3 (or C1 and C4 in Table 4, and C8 and C10 in Table 5), we can observe that at least two features contribute equally to them. In other words, some road features show some degree of collinearity. For instance, features F3 (number of left turns) and F7 (median angle of turns in the road) both contribute about 40% to the first component (C1), which is the most important component according to the PCA.
Therefore, we can conclude that the designed road features show some level of multi-collinearity, which is limited to a few features and a few latent components. Hence, we use PCA for dimensionality reduction and feature extraction as described in Section 4.2.1. In particular, we select the top h = 10 principal components as they (cumulatively) correspond to 98% of the original data variance. According to the PCA tables, the last six components are negligible as together they account for less than 2% of the data variance in all datasets.
Given the results above, we used the lower-dimensional D′ matrix produced by the PCA with h = 10 to compute the Euclidean distances and the fitness functions used by SDC-Prioritizer and the greedy-based test prioritization in RQ 2 and RQ 3. In particular, we use the new set of (non-collinear) features obtained with Equation 10.
Finding 1. The designed road features show some level of multi-collinearity. The first ten principal components produced by PCA allowed the identification of ten meta-features, representing 98% of the original datasets' variance, to consider when experimenting with prioritization strategies (i.e., RQ 2).

RQ 2 : Cost-effectiveness of SDC-Prioritizer Compared to Baseline Approaches
This section compares SO-SDC-Prioritizer, MO-SDC-Prioritizer, and the random and greedy-based test prioritizations in terms of APFD_c. For both SDC-Prioritizer and the greedy-based approach, we use the first 10 principal components produced by PCA (detailed in Section 5.1). This allows us to perform an unbiased evaluation. We do not use these features nor the PCA for random search since (unlike SDC-Prioritizer and greedy) it does not require features to measure the distance between two tests. Figure 5 depicts the APFD_c values achieved by the SO-SDC-Prioritizer, MO-SDC-Prioritizer, greedy-based, and random test prioritization approaches. As we can see in this figure, the best-performing test prioritization in all datasets is MO-SDC-Prioritizer. In each dataset, the minimum APFD_c achieved by MO-SDC-Prioritizer is higher than the maximum APFD_c achieved by the other test prioritization configurations. In the three datasets, the minimum APFD_c achieved by MO-SDC-Prioritizer is at least 2%, 4%, and 30% higher than the highest APFD_c produced by greedy, SO-SDC-Prioritizer, and random test prioritization, respectively. On average, MO-SDC-Prioritizer reaches about 3%, 6%, and 25.5% higher APFD_c than greedy, SO-SDC-Prioritizer, and random test prioritization, respectively. The second-best test prioritization technique is the greedy search (achieving an average APFD_c of 79.5%), followed by SO-SDC-Prioritizer (with an average APFD_c of 76.5%) and random test prioritization (with an average APFD_c of 49.9%). Moreover, as reported in Table 6, MO-SDC-Prioritizer significantly (p-values < 1.0e−10) outperforms (all Â12 values are higher than 0.5) both random and greedy test prioritization in terms of APFD_c score. The magnitude of the difference (effect size) is large in all datasets. Like MO-SDC-Prioritizer, SO-SDC-Prioritizer significantly outperforms random test prioritization. However, this technique achieves significantly lower APFD_c values in comparison with greedy-based
test prioritization in all datasets. Similarly to the pairwise comparison of the SDC-Prioritizer variants with the baselines, MO-SDC-Prioritizer achieves significantly higher APFD_c than SO-SDC-Prioritizer in all datasets (p-values < 1.0e−10, Â12 = 1, and a large magnitude of effect sizes).
To provide more insights into these results, we graphically compare the cumulative number of faults detected by the different approaches when running the test cases incrementally according to the test prioritizations they produced. For each dataset, we took a more detailed look at the permutations generated by each SDC-Prioritizer variant that achieve an APFD_c value equal to the median of the APFD_c values delivered by all applications of that variant on the dataset. Specifically, for each of SO-SDC-Prioritizer and MO-SDC-Prioritizer, we sampled three such permutations per dataset. For each dataset, we compare the sampled permutations against the best outputs of the random (i.e., the permutation generated by random search that gains the best APFD_c) and greedy strategies. For this comparison, we analyze the rate of fault occurrences during the execution of the tests according to the generated permutations.
Figure 6 depicts this comparison for each dataset. As we can see from the figure, in all of the benchmarks, running the tests using the test case orders generated by MO-SDC-Prioritizer leads to a higher rate of fault occurrence in a shorter time. A concrete example is the Driver.AI dataset, in which, on average, 99.7% of the solutions have a higher APFD_c than the ones generated by greedy test prioritization.
To better understand the factors that lead the generated non-dominated solutions to achieve a high APFD_c, we manually analyzed the APFD_c values of the Pareto fronts generated by MO-SDC-Prioritizer in each dataset. In all cases, we observed the same trend as in the sample presented in Figure 8. This figure shows a two-dimensional plot in which each dimension indicates one of MO-SDC-Prioritizer's search objectives (diversity and execution cost). As we can see, all solutions with the lowest APFD_c (red points in the Pareto front) are the extreme points with the maximum diversity and maximum test execution cost. In addition, the solution with the highest APFD_c (the orange diamond point) is not in the extreme parts of the Pareto front (i.e., it has a good balance between diversity and execution cost). As we can observe, the knee point selected by MO-SDC-Prioritizer is among the middle points of the front with the largest APFD_c. Besides, it is very close to the best point (in terms of APFD_c) within the Pareto front. This observation empirically supports the technique we used for selecting the final test order (the yellow diamond point).
Finding 3. On average, the majority (94%) of the solutions generated by MO-SDC-Prioritizer have a higher APFD_c than Greedy (the second-best test prioritization technique for detecting faults in a shorter time). Taking a deeper look at the non-dominated solutions generated by MO-SDC-Prioritizer, we can see that the few solutions with lower APFD_c are at the extremes of the Pareto front. Moreover, the solutions with the highest APFD_c values are the ones that balance test diversity and test execution cost.

RQ 3 : Overhead of SDC-Prioritizer
Figure 9 illustrates the distribution of the time consumed by SO-SDC-Prioritizer, MO-SDC-Prioritizer, and greedy test prioritization. As this figure shows, on average, SO-SDC-Prioritizer and MO-SDC-Prioritizer require about 12.5 and 11.5 minutes, respectively, to finish the search process with 4,000 generations. Practically, this amount of time is negligible if we consider the total 16 to 106 hours needed to run the entire set of tests, and that both variants of SDC-Prioritizer do not negatively impact the performance (e.g., fault detection) of testing practices. In fact, the overall overhead accounts for a minimum of 0.38% (for Driver.AI) and a maximum of 0.45% (for BeamNG.AI.AF1.5) of the cost needed to run the entire test suites. Finally, it is worth mentioning that the SDC-Prioritizer techniques include two main parts: (i) the pairwise computation of the distances between every two tests (using the Euclidean distance), and (ii) running the genetic algorithm. The former is a one-time task (i.e., with one execution, we can run the genetic algorithm multiple times) with a time complexity of O(n²), where n is the number of tests. Since the latter part reuses the pre-computed pairwise distances for fitness evaluation, the complexity of each evaluation is O(n) (due to the search for the most diverse test). The time complexities of the mutation and crossover operators are also O(n). Hence, SDC-Prioritizer has an O(n²) one-time cost (for calculating the distances) and costs O(n × e) for the whole search process, where n is the number of tests and e is the number of fitness evaluations. Accordingly, we can confirm that SDC-Prioritizer scales to large test sets. Notably, the test suites used in our study are much larger than those reported in prior studies on regression testing [71]. Our largest test suite (Driver.AI) contains 5,630 tests; on average, the SDC-Prioritizer approaches prioritized this test suite in less than 25 minutes.

THREATS TO VALIDITY
Threats to construct validity concern the relationship between theory and observation. In this case, threats can be mainly due to imprecision in simulation realism as well as in the automated classification of safe and unsafe scenarios. We mitigated both threats by leveraging BeamNG (used in this year's SBST tool competition [66]) as a soft-body simulation environment (which ensures high simulation accuracy in safety-critical scenarios) and SDC-Scissor [15] (which also integrates AsFault) as a technological reference solution to generate and execute test cases, as detailed in Section 4. Furthermore, to address the potential threat of high variability in the execution time of the executed tests, we selected a sample of 50 test cases (using stratified random sampling with an equal distribution of safe and unsafe tests) and executed each of them 10 times. As mentioned in Section 4.2.2, the standard deviation of the execution time is negligible. Threats to internal validity may concern, as in previous work [38], the relationship between the technologies used to generate the scenarios and the realism of the simulation results. Specifically, we did not recreate all the elements that can be found on real roads (e.g., weather conditions, light conditions, etc.). However, to increase our internal validity, we used both BeamNG.AI and Driver.AI as test subjects. This allows us to assess the cost-effectiveness of our approach by experimenting with different driving styles and driving risk levels. Both BeamNG.AI and Driver.AI leverage a good knowledge of the roads, which means that they do not suffer from the limitations of vision-based lane-keeping systems. However, since with BeamNG.AI it is possible to adjust the driving risk level, a higher number of unsafe test scenarios can be observed. An AI implemented in a physical SDC might be much more conservative in its driving style, which is something we plan to investigate in future work.
Finally, threats to external validity concern the generalization of our findings. The number of test case scenarios experimented with in our study is larger than in previous studies [38], and we experimented with different AI engines. However, our results may not generalize to the universe of general open-source CPS simulation environments used in other domains. Therefore, further studies considering more SDC data, other CPS domains, and different safety requirements are needed. To minimize potential threats to external validity in our evaluation setting, we followed the guidelines by Arcuri et al. [6]: we compared the results of SDC-Prioritizer with randomized test generation algorithms (the baseline approaches described in Section 4) and repeated the experiment 30 times. Finally, we applied sound non-parametric statistical tests and statistics to analyze the achieved results.
CONCLUSIONS AND FUTURE WORK
Regression testing for self-driving cars (SDCs) is particularly expensive due to the cost of running many driving scenarios (test cases) that interact with simulation engines. To improve the cost-effectiveness of regression testing, we introduced two black-box test case prioritization approaches, called SO-SDC-Prioritizer and MO-SDC-Prioritizer. These approaches rely on a set of static road features and are specifically designed for SDCs. These features can be extracted from the driving scenarios prior to running the tests. Both techniques utilize genetic algorithms (GAs) to prioritize the test cases based on their distances (diversity), computed using the proposed road features, and their test execution costs. SO-SDC-Prioritizer performs a single-objective optimization to fulfill this task (i.e., both test diversity and execution costs are combined into a single fitness function), while MO-SDC-Prioritizer leverages a widely used multi-objective genetic algorithm (NSGA-II) to prioritize tests according to two search objectives (one for test diversity and the other for test execution costs).
We empirically investigated the performance of SO-SDC-Prioritizer and MO-SDC-Prioritizer and compared them with two baselines: random search and a greedy algorithm. Finally, we assessed whether the proposed techniques introduce an excessively large computational overhead into the regression testing process. Our results show that MO-SDC-Prioritizer is more cost-effective than the baseline approaches. Specifically, the single solution provided by MO-SDC-Prioritizer dominates the solutions provided by SO-SDC-Prioritizer and the baselines in terms of test execution time and fault detection capability. Moreover, both SDC-Prioritizer techniques successfully prioritize the test cases independently of which AI engine is used (i.e., Driver.AI or BeamNG.AI) and of the risk level (i.e., the driving style). Interestingly, looking at the running time, we can observe that the overhead required by SO-SDC-Prioritizer and MO-SDC-Prioritizer to prioritize the test scenarios is negligible with regard to the overall test execution cost.
As future work, we plan to replicate our study on further SDC AIs and additional SDC features. Moreover, we plan to perform new empirical studies on further CPS domains to investigate additional safety criteria concerning types of faults different from those investigated in this work. In particular, it is important to investigate approaches that are more human-oriented or that are able to integrate humans in the loop [42,67,79,80]. We also want to investigate different meta-heuristics in addition to the GA used in this paper. Complementarily, we aim to investigate different distance functions to measure the diversity of the test cases (e.g., graph-based distances instead of feature-vector-based distances). Finally, we plan to integrate the proposed solution, based on the experimented simulation environments, to prioritize device signals in industrial contexts such as the AICAS context, involved in the COSMOS H2020 project.

Fig. 3. Graphical representation of Pareto dominance for our two objectives, namely (1) test diversity (to maximize) and (2) test cost (to minimize). In the example, three of the points do not dominate one another, while a fourth point dominates two of the others.
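The dominance relation illustrated in Fig. 3 can be expressed as a small predicate. The following is a sketch under the assumption that each solution is encoded as a (diversity, cost) pair, with diversity to be maximized and cost to be minimized; the function name is illustrative, not taken from the paper's implementation.

```python
def dominates(p, q):
    """Return True if solution p Pareto-dominates solution q.

    p and q are (diversity, cost) pairs: diversity is maximized,
    cost is minimized. p dominates q when it is at least as good
    on both objectives and strictly better on at least one.
    """
    at_least_as_good = p[0] >= q[0] and p[1] <= q[1]
    strictly_better = p[0] > q[0] or p[1] < q[1]
    return at_least_as_good and strictly_better
```

A solution is non-dominated (and thus belongs to the Pareto front) when no other solution in the population dominates it.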

Fig. 4. Graphical representation of a Pareto front (in blue), the utopia (black point), and the knee point (red point).
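Picking the solution closest to the utopia point, as depicted in Fig. 4, can be sketched as follows. This is a minimal illustration that min-max normalizes both objectives before measuring the Euclidean distance to the utopia point (1, 0) (best diversity, zero cost); the normalization choice is an assumption of this sketch, not a detail stated in the text.

```python
import math

def closest_to_utopia(front):
    """Select the non-dominated (diversity, cost) solution closest
    to the utopia point, with diversity maximized and cost minimized.

    Objectives are min-max normalized to [0, 1] so neither dominates
    the distance purely because of its scale.
    """
    divs = [s[0] for s in front]
    costs = [s[1] for s in front]
    d_lo, d_hi = min(divs), max(divs)
    c_lo, c_hi = min(costs), max(costs)

    def norm(v, lo, hi):
        return 0.5 if hi == lo else (v - lo) / (hi - lo)

    def dist_to_utopia(s):
        nd = norm(s[0], d_lo, d_hi)  # normalized diversity, ideal = 1
        nc = norm(s[1], c_lo, c_hi)  # normalized cost, ideal = 0
        return math.hypot(nd - 1.0, nc)

    return min(front, key=dist_to_utopia)
```

In the figure's terms, this picks the front point nearest to the black utopia point, which tends to lie near the knee of the front.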

Fig. 6. Cost-effectiveness curves produced by the different TCP methods. Each curve depicts the cumulative number of detected faults against the cumulative test execution cost yielded by the test case prioritizations.

Fig. 8. A sample Pareto front generated by MO-SDC-Prioritizer on the BeamNG.AI.AF1.5 dataset. Each circle point represents one of the non-dominated solutions in the Pareto front. The blue points are the solutions with an APFD_c score larger than the one produced by the greedy algorithm. The orange and yellow diamond points indicate the solution with the highest APFD_c and the solution closest to the utopia point, respectively.

Finding 4. The overhead introduced by each SDC-Prioritizer variant is less than 13 minutes and is imperceptible for an SDC simulation pipeline used by developers to test SDC behavior in critical scenarios.

Figure 9 shows (right side of the figure) that the average time required by the greedy approach is about five times shorter than that needed by SO-SDC-Prioritizer or MO-SDC-Prioritizer. Even though MO-SDC-Prioritizer is slower than greedy (i.e., it needs about 10 more minutes), it performs better in terms of the APFD_c score (as shown in Section 5.2).

Finding 5. On average, MO-SDC-Prioritizer needs about 10 minutes more than the greedy test prioritization. However, this negligible extra overhead significantly increases the APFD_c values achieved by the resulting test prioritization.

Fig. 9. Running time of the different TCP approaches.

Proc. ACM Meas. Anal. Comput. Syst., Vol. 37, No. 4, Article 111. Publication date: August 2018.
Christian Birchler, Sajad Khatiri, Pouria Derakhshanfar, Sebastiano Panichella, and Annibale Panichella

2.3 Background on SDCs Simulation
2.3.1 Main simulation approaches. Simulation environments have been developed to support developers in various stages of design and validation. In the SDC domain, developers rely mainly on basic simulation models

Table 1. Road Characteristics Features

The contribution of each test case t_i to the cumulative cost is divided by its position i in the ordering, with the goal of promoting solutions where the least expensive test cases are executed earlier. Notice that this objective should be minimized.
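One plausible reading of this position-discounted cost objective is sketched below; the function name and the cost mapping are illustrative assumptions, not the paper's actual code.

```python
def cost_objective(ordering, cost):
    """Position-discounted cumulative cost of a test ordering.

    The cost of the test placed at (1-based) position i is divided
    by i, so early positions weigh more: orderings that schedule
    cheap tests first obtain lower values. To be minimized.
    `ordering` is a permutation of test ids; `cost` maps each test
    id to its execution cost.
    """
    return sum(cost[t] / i for i, t in enumerate(ordering, start=1))
```

For example, with costs {a: 10, b: 1}, the ordering [b, a] scores 1/1 + 10/2 = 6, whereas [a, b] scores 10/1 + 1/2 = 10.5, so the cheaper-first ordering is preferred.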

Table 3. Results of the Principal Component Analysis (PCA) for BeamNG.AI.AF1. Values in boldface indicate the features that contribute the most to the principal components (PCs) extracted by PCA.
fault detection rate and vice versa. Furthermore, to examine whether the differences are statistically significant, we use the non-parametric Wilcoxon Rank Sum test, with α = 0.05 for the Type I error.
4.2.3 RQ3: Overhead Introduced by SDC-Prioritizer. For RQ3, we monitor the running time needed by SO-SDC-Prioritizer, MO-SDC-Prioritizer, and the greedy algorithm to prioritize the test cases. This analysis aims to verify whether the extra overhead introduced by the SDC-Prioritizer techniques, on average, leads to a disruption of the testing process or is negligible compared to the total time needed to run the entire test suite. To obtain a more reliable estimation of the running time, we ran both SO-SDC-Prioritizer and MO-SDC-Prioritizer 30 times, using the parameter values discussed in Section 4.2.4. Then, we measured the overhead of the different algorithms as the average running time over the 30 runs.
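The effect-size statistic that accompanies the Wilcoxon test in such analyses can be computed as follows. This is a minimal pure-Python sketch of the Vargha-Delaney Â12 estimate; in practice the Wilcoxon Rank Sum test itself would typically come from a statistics library (e.g., scipy.stats.mannwhitneyu), which is an assumption about tooling, not a detail from the paper.

```python
def vargha_delaney_a12(x, y):
    """Vargha-Delaney A12 effect size.

    Estimates the probability that a value drawn from sample x is
    larger than one drawn from sample y (ties count as 0.5).
    A12 = 0.5 means no effect; values further from 0.5 indicate
    larger effects.
    """
    greater = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * ties) / (len(x) * len(y))
```

With 30 runs per algorithm, x and y would be the 30 APFD scores of two competing techniques.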

Table 4. Results of the Principal Component Analysis (PCA) for BeamNG.AI.AF1.5. Values in boldface indicate the features that contribute the most to the principal components (PCs) extracted by PCA.

Table 5. Results of the Principal Component Analysis (PCA) for Driver.AI. Values in boldface indicate the features that contribute the most to the principal components (PCs) extracted by PCA.

Table 6. Comparison of the APFD_c scores achieved by SO-SDC-Prioritizer and MO-SDC-Prioritizer against the baselines, for each of the datasets used in this study. P-values for Wilcoxon tests, Vargha-Delaney's estimates (Â12), and magnitudes are reported.

In the BeamNG.AI.AF1 dataset (Figure 6a), the permutation generated by MO-SDC-Prioritizer leads to the detection of 234 faults in the first 20% of the test execution time. This value is lower for the greedy (203), SO-SDC-Prioritizer (176), and random (87) test prioritization approaches. Similarly, in the second dataset (Figure 6b), MO-SDC-Prioritizer generates a permutation that is able to detect 469 faults in the first 20% of the test execution. Also in this case, this number is lower for the other approaches: 394, 335, and 154 faults detected by the greedy, SO-SDC-Prioritizer, and random approaches, respectively. The same trend is observed in the Driver.AI dataset (Figure 6c), in which the sampled permutation from MO-SDC-Prioritizer detects 845 faults, i.e., +85, +126, and +620 more faults compared to the greedy, SO-SDC-Prioritizer, and random algorithms, respectively.

Finding 2. MO-SDC-Prioritizer increases the APFD_c score on average compared with the random and greedy approaches. The improvement achieved by SDC-Prioritizer, in terms of fault detection rate, is statistically significant.

Unlike MO-SDC-Prioritizer, which is the best-performing test prioritization technique in terms of fault detection capability, SO-SDC-Prioritizer only achieves a higher APFD_c than the random approach. This observation stems from the lack of exploration ability in this single-objective meta-heuristic, which drives the search process into local optima.

5.2.1 Pareto fronts in MO-SDC-Prioritizer. As explained in Section 3, like any other multi-objective approach, MO-SDC-Prioritizer returns a set of non-dominated solutions as output. To answer RQ2, we selected the non-dominated solution closest to the utopia point (explained in Section 3.3.2).
Results presented in this section indicate that this solution has a higher APFD_c compared to the test execution orders generated by the other techniques. However, we performed a more in-depth analysis to understand whether other non-dominated solutions could be selected from the Pareto front. To this aim, we compared the Pareto fronts (i.e., non-dominated test orders) generated by each run of MO-SDC-Prioritizer with the APFD_c achieved by the second-best technique in terms of fault detection capability (i.e., greedy-based test prioritization). Figure 7 presents the percentage of non-dominated solutions generated by MO-SDC-Prioritizer that achieve a higher APFD_c compared to the greedy approach. On average, about 94% of the non-dominated solutions generated by MO-SDC-Prioritizer can detect more unsafe tests than greedy, and in shorter times (i.e., they have a higher APFD_c). Even in the worst scenario (the 17th execution of MO-SDC-Prioritizer on the BeamNG.AI.AF1 dataset), more than 61% of the solutions in the final Pareto front produced by MO-SDC-Prioritizer have a higher APFD_c compared to greedy. The highest performance of MO-SDC-Prioritizer can be observed when this test prioritization technique is utilized to prioritize tests for the

Fig. 7. Percentage of non-dominated solutions generated by MO-SDC-Prioritizer that achieve a higher APFD_c compared to greedy-based test prioritization.
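The APFD_c metric used throughout these comparisons can be sketched as follows. This is a minimal illustration based on the standard cost-cognizant APFD definition from the regression testing literature, under the assumption (matching the study's setting) that each unsafe/failing test reveals one distinct fault; the function and parameter names are illustrative.

```python
def apfd_c(ordering, cost, faulty):
    """Cost-cognizant APFD (APFD_c) for a prioritized test ordering.

    `ordering` is the prioritized list of test ids, `cost` maps each
    id to its execution cost, and `faulty` is the set of ids that
    detect a fault (each treated as revealing one distinct fault).
    Higher values mean faults are detected earlier per unit of cost.
    """
    total_cost = sum(cost[t] for t in ordering)
    m = len(faulty)
    score = 0.0
    for i, t in enumerate(ordering):
        if t in faulty:
            # Cost of all tests from the fault-revealing one onward,
            # minus half of the revealing test's own cost.
            remaining = sum(cost[u] for u in ordering[i:])
            score += remaining - 0.5 * cost[t]
    return score / (total_cost * m)
```

With unit costs, this reduces to the classic APFD: e.g., a single faulty test placed first in a two-test suite yields 0.75, while placing it last yields 0.25.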