Estimating Probabilistic Safe WCET Ranges of Real-Time Systems at Design Stages

Estimating worst-case execution times (WCET) is an important activity at early design stages of real-time systems. Based on WCET estimates, engineers make design and implementation decisions to ensure that task executions always complete before their specified deadlines. However, in practice, engineers often cannot provide precise point WCET estimates and prefer to provide plausible WCET ranges. Given a set of real-time tasks with such ranges, we provide an automated technique to determine for what WCET values the system is likely to meet its deadlines, and hence operate safely with a probabilistic guarantee. Our approach combines a search algorithm for generating worst-case scheduling scenarios with polynomial logistic regression for inferring probabilistic safe WCET ranges. We evaluated our approach by applying it to three industrial systems from different domains and several synthetic systems. Our approach efficiently and accurately estimates probabilistic safe WCET ranges within which deadlines are likely to be satisfied with a high degree of confidence.


INTRODUCTION
Safety-critical systems, e.g., those used in the aerospace, automotive and healthcare domains, require that their executions always complete before their specified deadlines in all execution scenarios, including the worst cases. The systems that must perform their operations in such a timely manner are known as real-time systems (RTS) [18]. To ensure that a real-time system meets its deadlines, we need an accurate estimation of the worst-case execution times (WCET) of software tasks that concurrently run in the system. For instance, the Anti-lock Braking System (ABS) of a vehicle has to activate within milliseconds after the driver brakes. However, an ABS taking more time for activation than the estimated WCET may result in a vehicle skid due to the wheels locking up.
Accurately estimating WCET values of a real-time system is particularly important at early design stages when real-time tasks are not yet fully implemented. Accurate WCET estimates greatly support engineers during development as they provide targets driving design and implementation choices. Exhaustive schedulability analysis techniques, however, do not scale as the number of software tasks and their different states increase. More recently, stress testing and simulation-based approaches [2,17] have been proposed to stress RTS and generate test scenarios where their deadline constraints are violated. Such approaches cast the schedulability test problem as an optimisation problem to find worst-case task execution scenarios exhibiting deadline misses. However, none of the existing simulation-based approaches account for uncertainties in WCET values, and therefore they do not handle WCET value ranges. Our work complements the simulation-based stress testing approach and extends it to account for uncertainties in WCET values.
Contributions. In this article, we propose a Safe WCET Analysis method For real-time task schEdulability (SAFE) to estimate WCET ranges under which tasks are likely to be schedulable with a probabilistic guarantee. Our approach is based on a stress testing approach [17] using meta-heuristic search [47] in combination with polynomial logistic regression models. Specifically, we use a genetic algorithm [47] to search for sequences of task arrivals that likely lead to deadline misses. Then, logistic regression [40], a statistical classification technique, is applied to infer a safe WCET border in the multidimensional WCET space with a probabilistic guarantee. This border aims to partition the given WCET ranges into safe and unsafe sub-ranges for a selected deadline miss probability, and thus enables engineers to investigate trade-offs among different tasks' WCET values. WCET ranges are deemed to be probabilistically safe if tasks, within such ranges, have a high probability of completing their executions before their specified deadlines. In this article, for the sake of simplicity, we refer to probabilistically safe WCET ranges as safe WCET ranges. We evaluated our approach by applying it to a complex, industrial satellite system developed by our industry partner, LuxSpace, as well as two industrial systems from different domains and several synthetic systems. Results show that our approach can efficiently and accurately compute safe WCET ranges. SAFE scales to complex industrial systems as an offline analysis method: its execution times on our industrial systems are practically acceptable, i.e., at most 27h. To our knowledge, SAFE is the first attempt to estimate safe WCET ranges within which real-time tasks are likely to meet their deadlines for a given level of confidence, while enabling engineers to explore trade-offs among tasks' WCET values. Our full evaluation package is available online [43].
Organization. The remainder of this article is structured as follows: Section 2 motivates our work. Section 3 defines our specific schedulability analysis problem in practical terms. Section 4 describes SAFE. Section 5 evaluates SAFE. Section 6 compares SAFE with related work. Section 7 concludes this article.

MOTIVATING CASE STUDY
We motivate our work with a mission-critical real-time satellite system, named Attitude Determination and Control System (ADCS), which LuxSpace, a leading system integrator for microsatellites and aerospace systems, has been developing over the years. ADCS determines the satellite's attitude and controls its movements [29]. ADCS controls a satellite in either autonomous or passive mode. In the autonomous mode, ADCS must orient a satellite in proper position on time to ensure that the satellite provides normal service correctly. In the passive mode, operators are able to not only control satellite positions but also maintain the satellite, e.g., upgrading software. Such a maintenance operation does not necessarily need to be completed within a fixed hard deadline; instead, it should be completed within a reasonable amount of time, i.e., soft deadlines. Hence, ADCS is composed of a set of tasks having real-time constraints with hard and soft deadlines.
Engineers at LuxSpace conduct real-time schedulability analysis across different development stages. At an early design stage, when task implementations and system hardware are not available, the engineers use a theoretical schedulability analysis technique [44] which determines that a set of tasks is schedulable if CPU utilisation of the task set is less than a threshold, e.g., 69%. As mentioned earlier, at an early design stage, engineers estimate task WCETs as ranges and often assign large values to the upper bounds of such ranges. To be on the safe side, engineers tend indeed to be conservative in their analysis.
Engineers, however, are still faced with the following issues: (1) An analytical schedulability analysis technique, e.g., utilisation-based schedulability analysis [44], typically indicates only whether or not tasks are schedulable. However, engineers need additional information to understand how tasks miss their deadlines. For instance, a set of tasks may not be schedulable only for a few specific sequences of task arrivals. (2) Engineers estimate WCETs without any systematic support; instead, they often rely on their experience of developing tasks providing similar functions. This practice typically results in imprecise estimates of WCET ranges, which may cause serious problems, e.g., significant changes to tasks at later development stages. For these reasons, LuxSpace is interested in SAFE as a way to address these issues when analysing schedulability.

PROBLEM DESCRIPTION
This section first formalises task, task relationship, scheduler, and schedulability concepts. We then describe the problem of identifying safe WCET ranges under which tasks likely meet their deadline constraints at a certain level of confidence; i.e., tasks are schedulable with a certain probability.
Task. A real-time system is composed of a set of tasks that should complete their executions within specified deadlines after they are activated (or arrive). We denote by τ_i a real-time task indexed by i in the range from 1 to n. Every real-time task τ_i has the following properties: priority, denoted by pr_i; deadline, denoted by dl_i; and worst-case execution time (WCET), denoted by wcet_i. Task priority determines whether the execution of a task is preempted by another task. Typically, a task τ_i preempts the execution of a task τ_j if the priority of τ_i is higher than the priority of τ_j, i.e., pr_i > pr_j. The deadline dl_i of a task τ_i is specified relative to its arrival time. A task deadline can be either hard or soft. A hard deadline of a task τ_i specifies that τ_i must complete its execution within dl_i time units after τ_i is activated. While violations of hard deadlines are not acceptable, depending on the operating context of a system, violating soft deadlines may be tolerated to some extent. Note that, for notational simplicity, we do not introduce new notations to distinguish between hard and soft deadlines. In this article, we refer to a hard deadline simply as a deadline. Section 4 further discusses how our approach manages hard and soft deadlines.
We denote by wmin_i and wmax_i, respectively, the minimum and the maximum WCET values of a task τ_i. As discussed in the introduction, at an early development stage, it is difficult to provide exact WCET values of real-time tasks. Hence, we assume that engineers specify WCETs using ranges of values, instead of single values, by indicating the estimated minimum and maximum values that they think each task's WCET can realistically take.
In this article, real-time tasks are either periodic or aperiodic. Periodic tasks, which are typically triggered by timed events, are invoked at regular intervals specified by their period. We denote by pd_i the period of a periodic task τ_i, i.e., a fixed time interval between subsequent activations (or arrivals) of τ_i. Aperiodic tasks have irregular arrival times and are activated by external stimuli which occur irregularly, and hence, in general, there is no limit on the arrival times of an aperiodic task. However, in real-time analysis, we typically specify a minimum inter-arrival time, denoted by pmin_i, and a maximum inter-arrival time, denoted by pmax_i, indicating the minimum and maximum time intervals between two consecutive arrivals of an aperiodic task τ_i. In real-time analysis, sporadic tasks are often separately defined as having irregular arrival intervals and hard deadlines [45]. In our conceptual definitions, however, we do not introduce new notations for sporadic tasks because the deadline and period concepts defined above are sufficient to characterise sporadic tasks. Note that for a periodic task τ_i, we have pmin_i = pmax_i = pd_i. Otherwise, for an aperiodic task τ_i, we have pmax_i > pmin_i.
Task relationship. The execution of a task depends not only on its own parameters described above, e.g., priority pr_i and period pd_i, but also on its relationships with other tasks. Relationships between tasks are typically determined by task interactions related to accessing shared resources [1], such as memory, files, and IO devices. Specifically, if two tasks τ_i and τ_j access a shared resource in a mutually exclusive way, τ_i may be blocked from executing for the period during which τ_j accesses the resource. We denote by dp(τ_i, τ_j) the resource-dependency relation between tasks τ_i and τ_j that holds if τ_i and τ_j have mutually exclusive access to a shared resource such that they cannot be executed in parallel or preempt each other, but one can execute only after the other has completed its access to the resource. We note that resource-dependency relations are defined at the level of tasks, following prior works [5,26,46] describing the industrial case study systems used in our experiments (see Section 5.2). The dp(τ_i, τ_j) relation is symmetric, i.e., dp(τ_i, τ_j) = dp(τ_j, τ_i).
Scheduler. Let Γ be a set of tasks to be scheduled by a real-time scheduler. A scheduler dynamically schedules executions of the tasks in Γ according to the tasks' arrivals and the scheduler's scheduling policy over the scheduling period T = [0, t]. We denote by at_{i,k} the kth arrival time of a task τ_i ∈ Γ. The first arrival of a periodic task τ_i does not always occur immediately at the system start time 0. Such offset time from the system start time 0 to the first arrival time at_{i,1} of τ_i is denoted by ofs_i. For a periodic task τ_i, the kth arrival of τ_i within T satisfies at_{i,k} ≤ t and is computed by at_{i,k} = ofs_i + (k − 1) · pd_i. For an aperiodic task τ_i, at_{i,k} is determined based on the (k−1)th arrival time of τ_i and its minimum and maximum inter-arrival times. Specifically, for k > 1, at_{i,k} ∈ [at_{i,k−1} + pmin_i, at_{i,k−1} + pmax_i], where at_{i,k} ≤ t. A scheduler reacts to a task arrival at at_{i,k} to schedule the execution of τ_i. Depending on the scheduling policy (e.g., the rate monotonic scheduling policy [44] and the single-queue multi-core scheduling policy [7]), an arrived task may not start its execution at the same time as it arrives when a higher priority task is executing. Also, task executions may be interrupted due to preemption. We denote by et_{i,k} the end execution time for the kth arrival of a task τ_i. Depending on the actual execution time wcet_i of a task τ_i within its WCET range [wmin_i, wmax_i], the et_{i,k} end execution time of τ_i satisfies the following: et_{i,k} ≥ at_{i,k} + wcet_i.
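The arrival-time computations above can be sketched as follows; the function and variable names (e.g., periodic_arrivals, ofs, pd) are our own illustrative choices, not the article's.

```python
import random

def periodic_arrivals(ofs, pd, t_end):
    """Arrival times of a periodic task: at_k = ofs + (k - 1) * pd, for at_k <= t_end."""
    arrivals, at = [], ofs
    while at <= t_end:
        arrivals.append(at)
        at += pd
    return arrivals

def aperiodic_arrivals(pmin, pmax, t_end):
    """Arrival times of an aperiodic task: each inter-arrival time is drawn from
    the [pmin, pmax] range; the first arrival is treated the same way here."""
    arrivals, at = [], random.uniform(pmin, pmax)
    while at <= t_end:
        arrivals.append(at)
        at += random.uniform(pmin, pmax)
    return arrivals

# A periodic task with offset 2 and period 8 over T = [0, 23]
print(periodic_arrivals(2, 8, 23))  # -> [2, 10, 18]
```

For aperiodic tasks, every run may yield a different arrival sequence, which is precisely the uncertainty the search in Phase 1 explores.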
During the system operation, a scheduler generates a schedule scenario which describes a sequence of task arrivals and their end time values. We define a schedule scenario as a set S of tuples (τ_i, at_{i,k}, et_{i,k}) indicating that a task τ_i has arrived at at_{i,k} and completed its execution at et_{i,k}. Due to the randomness of task execution times and aperiodic task arrivals, a scheduler may generate different schedule scenarios in different runs of a system. Figure 1 shows two schedule scenarios produced by a scheduler over the [0, 23] time period of a system run. Both Figure 1a and Figure 1b describe executions of three tasks, τ1, τ2, and τ3, arriving at the same time stamps (see at_{i,k} in the figures). In both scenarios, the aperiodic task τ1 is characterised by: pmin_1 = 5, pmax_1 = 10, dl_1 = 4, and wmin_1 = wmax_1 = 2. The periodic task τ2 is characterised by: pd_2 = 8 and dl_2 = 6. The aperiodic task τ3 is characterised by: pmin_3 = 3, pmax_3 = 20, dl_3 = 3, and wmin_3 = wmax_3 = 1. The priorities of the three tasks satisfy pr_1 > pr_2 > pr_3. In both scenarios, task executions can be preempted depending on their priorities. We note that the WCET range of task τ2 is set to wmin_2 = 1 and wmax_2 = 3 in Figure 1a, and to wmin_2 = 1 and wmax_2 = 2 in Figure 1b.

Schedulability. Given a schedule scenario S, a task τ_i is schedulable if τ_i completes its execution before its deadline, i.e., for all at_{i,k} observed in S, et_{i,k} ≤ at_{i,k} + dl_i. Let Γ be a set of tasks to be scheduled by a scheduler. A set Γ of tasks is then schedulable if, for every schedule scenario of Γ, no task τ_i ∈ Γ misses its deadline.

Figure 1. Example schedule scenarios of three tasks, τ1, τ2, and τ3, running on a single-core system. (a) τ3 is not schedulable, i.e., et_{3,2} > at_{3,2} + dl_3. (b) All three tasks are schedulable. When τ2 executes over 3 (WCET) time units, it causes a deadline miss of τ3. When the WCET is reduced to 2, the three tasks are schedulable even for the same sequence of task arrivals.

As shown in Figure 1a, a deadline miss occurs after the second arrival of τ3, i.e., et_{3,2} > at_{3,2} + dl_3. During the [at_{3,2}, at_{3,2} + dl_3] period, the τ3 task cannot execute because the other tasks τ1 and τ2, which have higher priorities, are executing. Thus, τ3 is not schedulable in the schedule scenario of Figure 1a. This scheduling problem can be solved by restricting tasks' WCET ranges as discussed below.

Problem. Uncertainty in task WCET values at an early development stage is a critical issue preventing the effective design and assessment of mission-critical real-time systems. Upper bounds of WCETs correspond to worst-case WCET values and have a direct impact on deadline misses as larger WCET values increase their probability. Lower bounds of WCETs are estimates of tasks' best-case WCET values, below which task implementations are likely not feasible. Our approach aims to determine the maximum upper bounds for WCET under which tasks are likely to be schedulable, at a given level of risk, and thus provides an objective to engineers implementing the tasks. Specifically, for every task τ_i ∈ Γ to be analysed, our approach computes a new upper bound value for the WCET range of τ_i (denoted by wmax*_i) such that wmax*_i ≤ wmax_i and, by restricting the WCET range of τ_i to wmax*_i, we should, at a certain level of confidence, no longer have deadline misses. That is, the tasks in Γ become schedulable, with a certain probability, after restricting the maximum WCET value of each τ_i to wmax*_i. For instance, as shown in Figure 1b, restricting the maximum WCET of τ2 from wmax_2 = 3 to wmax*_2 = 2 enables all three tasks to be schedulable. We note that, in our context, both arrival time ranges for aperiodic tasks and WCET ranges for all tasks are represented as continuous intervals. Since our approach works based on sampling values from these continuous ranges, it cannot be exhaustive and cannot provide a guarantee that the tasks will always be schedulable after restricting their WCET ranges. Our approach instead relies on sampling values within the WCET and arrival time ranges, simulating the scheduler behaviour using the sampled values, and observing whether, or not, a deadline miss occurs. Given this lack of exhaustiveness, we rely on statistical and machine learning techniques to provide probabilistic estimates indicating how confident we are that a given set of tasks is schedulable.
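The sample-simulate-observe loop described above can be sketched as follows. The scheduler simulation itself is out of scope here, so it is passed in as a callable; all names (label_wcet_samples, scheduler) are our illustrative assumptions.

```python
import random

def label_wcet_samples(wcet_ranges, arrivals, n_samples, scheduler):
    """Sample WCET values from their continuous ranges, simulate the scheduler
    behaviour, and label each sampled WCET assignment safe or unsafe.

    scheduler(wcets, arrivals) stands in for a scheduling simulation and must
    return True iff some task misses its deadline (a hypothetical interface)."""
    dataset = []
    for _ in range(n_samples):
        wcets = {task: random.uniform(lo, hi)
                 for task, (lo, hi) in wcet_ranges.items()}
        miss = scheduler(wcets, arrivals)
        dataset.append((wcets, "unsafe" if miss else "safe"))
    return dataset
```

Restricting a task's upper bound to wmax*_i then simply corresponds to shrinking the interval it is sampled from.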
Figure 2 shows an overview of our Safe WCET Analysis method For real-time task schEdulability (SAFE). Phase 1 of SAFE aims at searching worst-case task-arrival sequences. A task-arrival sequence is worst-case if deadline misses are maximised or, when this is not possible, tasks complete their executions as close to their deadlines as possible. Building on existing work, we identify worst-case task-arrival sequences using a search-based approach relying on genetic algorithms. Phase 2 of SAFE, which is the main contribution of this article, aims at computing safe WCET ranges under which tasks are likely to be schedulable. To do so, relying on logistic regression and an effective sampling strategy, we augment the worst-case task-arrival sequences generated in Phase 1 to compute safe WCET ranges with a certain deadline miss probability, indicating a degree of risk. We describe in detail these two phases next.

Phase 1: Worst-case task arrivals
The first phase of SAFE finds worst-case sequences in the space of possible sequences of task arrivals, defined by their inter-arrival time characteristics. As SAFE aims to provide conservative, safe WCET ranges, we optimise task arrivals to maximise task completion times and deadline misses, and indirectly minimise safe WCET ranges (see the safe area visually presented in Figure 2). We address this optimisation problem using a single-objective search algorithm. Following standard practice [30], we describe our search-based approach for identifying worst-case task arrivals by defining the solution representation, the scheduler, the fitness function, and the computational search algorithm. We then describe the dataset of sequences generated by search and then used for training our logistic regression model to compute safe WCET ranges in the second phase of SAFE. Our approach in Phase 1 is based on past work [17], where a specific genetic algorithm configuration was proposed to find worst-case task arrival sequences. One important modification though is that we account for uncertainty in WCET values through simulations for evaluating the magnitude of deadline misses.
Representation. Given a set Γ of tasks to be scheduled, a feasible solution A is a set of tuples (τ_i, at_{i,k}) where τ_i ∈ Γ and at_{i,k} is the kth arrival time of a task τ_i. Thus, a solution A represents a valid sequence of task arrivals of Γ (see the valid at_{i,k} computation in Section 3). Let T = [0, t] be the time period during which a scheduler receives task arrivals. The size of A is equal to the number of task arrivals over the T time period. Due to the varying inter-arrival times of aperiodic tasks (Section 3), the size of A will vary across different solutions.
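A minimal validity check for this representation could look as follows, with a solution encoded as a mapping from each task to its sorted arrival times; the encoding and names are our assumptions.

```python
def is_valid_solution(solution, pmin, pmax, t_end):
    """Check that every aperiodic task's consecutive arrivals respect its
    inter-arrival range [pmin, pmax] and fall within the scheduling
    period T = [0, t_end]."""
    for task, arrivals in solution.items():
        if any(not (0 <= at <= t_end) for at in arrivals):
            return False
        for prev, cur in zip(arrivals, arrivals[1:]):
            if not (pmin[task] <= cur - prev <= pmax[task]):
                return False
    return True
```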
Scheduler. SAFE uses a simulation technique for analysing the schedulability of tasks to account for the uncertainty in WCET values and for scalability issues. For instance, the inter-arrival time of a software update task in a satellite system can be as long as three months. In such cases, conducting an analysis based on an actual scheduler is prohibitively expensive. Instead, SAFE uses a real-time task scheduling simulator, named SafeScheduler, which samples WCET values from their ranges for simulating task executions and applies a scheduling policy, i.e., the single-queue multi-core scheduling policy [7], based on discrete simulation time events. Note that we chose the single-queue multi-core scheduling policy for SafeScheduler since our case study systems (described in Section 5.2) rely on this policy.
SafeScheduler takes a feasible solution A for scheduling a set Γ of tasks as input. It then outputs a schedule scenario as a set of tuples (τ_i, at_{i,k}, et_{i,k}) where at_{i,k} and et_{i,k} are the kth arrival and end time values of a task τ_i, respectively. Recall from Section 3 that SafeScheduler computes task arrivals based on periodic tasks' offsets and periods and aperiodic tasks' inter-arrival times. For each task τ_i, SafeScheduler computes et_{i,k} based on its scheduling policy and a WCET value selected for τ_i within the WCET range [wmin_i, wmax_i], while accounting for resource-dependency relationships (see Section 3). Hence, each run of SafeScheduler for the same input solution will likely produce a different schedule scenario.
SafeScheduler implements a single-queue multi-core scheduling policy [7], which schedules a task τ_i with explicit priority pr_i and deadline dl_i. When tasks arrive, SafeScheduler puts them into a single queue that contains the tasks to be scheduled. At any simulation time, if there are tasks in the queue and multiple cores are available to execute tasks, SafeScheduler first fetches the task τ_i in the queue that has the highest priority pr_i. SafeScheduler then allocates task τ_i to any available core. Note that if task τ_i shares a resource with a running task τ_j in another core, i.e., the dp(τ_i, τ_j) resource-dependency relationship holds, SafeScheduler follows standard task-blocking rules [45], i.e., τ_i will be blocked until τ_j releases the shared resource.
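The fetching rule of this policy can be sketched as follows; this is only the queue-selection step of a simulator, not SafeScheduler itself, and all names (fetch_next, ready_queue, dp) are our assumptions.

```python
import heapq

def fetch_next(ready_queue, running, dp):
    """Fetch the highest-priority ready task that is not blocked by a
    resource-dependent task currently running on another core.
    ready_queue is a heap of (-priority, task) pairs so the largest priority
    pops first; dp(a, b) models the resource-dependency relation."""
    deferred, chosen = [], None
    while ready_queue:
        entry = heapq.heappop(ready_queue)
        _, task = entry
        if any(dp(task, other) for other in running):
            deferred.append(entry)      # blocked until the resource is released
        else:
            chosen = task
            break
    for entry in deferred:              # blocked tasks remain queued
        heapq.heappush(ready_queue, entry)
    return chosen
```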
SafeScheduler works under the assumption that context switching takes no time, which is also a working assumption in many scheduling analysis methods [8,26,44]. Note that this assumption is practically valid and useful at an early development stage in the context of real-time analysis. For instance, our collaborating partner accounts for the waiting time of tasks due to context switching by adding some extra time to WCET ranges at the task design stage. Note that SAFE can be applied with any scheduling policy, including those that account for context switching time and multiple queues.
Fitness. Given a feasible solution A for a set Γ of tasks, we formulate a fitness function f(A, Γ_T, ns) to quantify the degree of deadline misses regarding a set Γ_T ⊆ Γ of target tasks, where ns is the number of SafeScheduler runs used to account for the uncertainty in WCET. SAFE provides the capability of selecting target tasks Γ_T as practitioners often need to focus on the most critical tasks. We denote by dist(τ_i, k) the distance between the end time and the deadline of the kth arrival of task τ_i and define dist(τ_i, k) = et_{i,k} − (at_{i,k} + dl_i) (see Section 3 for the end time et_{i,k}, arrival time at_{i,k}, and deadline dl_i notations).
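The distance measure, and one plausible reading of the fitness built on it (summed per run and averaged over runs, which is our assumption rather than the article's exact formula), can be sketched as:

```python
def dist(end_time, arrival_time, deadline):
    """dist(τ_i, k) = et_{i,k} - (at_{i,k} + dl_i); positive iff the kth
    arrival of the task misses its deadline."""
    return end_time - (arrival_time + deadline)

def fitness(scenarios, deadlines, targets):
    """Assumed reading of f(A, Γ_T, ns): per-arrival distances of the target
    tasks, summed per scenario and averaged over the ns scheduler runs."""
    total = 0.0
    for scenario in scenarios:          # scenario: {task: [(arrival, end), ...]}
        for task in targets:
            for arrival, end in scenario[task]:
                total += dist(end, arrival, deadlines[task])
    return total / len(scenarios)
```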
To compute the f(A, Γ_T, ns) fitness value, SAFE runs SafeScheduler ns times for A and obtains schedule scenarios S_1, S_2, ..., S_ns. For each schedule scenario S_h, we denote by dist_h(τ_i, k) the distance between the end and deadline time values corresponding to the kth arrival of the task τ_i observed in S_h. We denote by lk(τ_i) the last arrival index of a task τ_i in a schedule scenario. SAFE aims to maximise the f(A, Γ_T, ns) fitness function, which averages, over the ns schedule scenarios, the summed distances of all arrivals of the target tasks:

f(A, Γ_T, ns) = (1/ns) · Σ_{h=1}^{ns} Σ_{τ_i ∈ Γ_T} Σ_{k=1}^{lk(τ_i)} dist_h(τ_i, k)

We note that soft deadline tasks are also required to execute within reasonable execution time ranges. Hence, engineers also estimate safe WCET ranges for soft deadline tasks. As the above fitness function returns a quantified degree of deadline misses, SAFE uses it for both soft and hard deadline tasks.

Table 1. An example operation of SafeCrossover. It swaps all task arrivals of tasks τ1 and τ2 between two parent solutions A and B to produce offspring A′ and B′.

Computational search. SAFE employs a steady-state genetic algorithm [47]. The algorithm breeds a new population for the next generation after computing the fitness of a population. The breeding for generating the next population is done by using the following genetic operators: (1) Selection. SAFE selects candidate solutions using a tournament selection technique, with the tournament size equal to two, which is the most common setting [31]. (2) Crossover. Selected candidate solutions serve as parents to create offspring using a crossover operation. (3) Mutation. The offspring are then mutated. Below, we describe our crossover and mutation operators.
Crossover. A crossover operator is used to produce offspring by mixing traits of parent solutions. SAFE modifies the standard one-point crossover operator [47] because two parent solutions A and B may have different sizes, i.e., |A| ≠ |B|. Let Γ = {τ1, τ2, ..., τn} be a set of tasks to be scheduled. Our crossover operator, named SafeCrossover, first randomly selects an aperiodic task τ_l ∈ Γ. For all i ∈ [1, l] and τ_i ∈ Γ, SafeCrossover then swaps all arrivals of τ_i between the two solutions A and B. As the size of Γ is fixed for all solutions, SafeCrossover can cross over two solutions that may have different sizes. Table 1 shows an example operation of SafeCrossover using a system with three aperiodic tasks, τ1, τ2, and τ3. Let A and B be two parent solutions, e.g., B = {..., (τ2, 6), ..., (τ3, 13)}, where (τ_i, t) denotes that task τ_i arrives at time t. Given the two parents A and B, SafeCrossover randomly selects a task, i.e., τ2 in this example, and then swaps all arrivals of τ1 and τ2 between A and B. As shown in Table 1, SafeCrossover then generates the offspring A′ and B′. The shaded (resp. unshaded) cells in Table 1 indicate which task arrivals in child A′ (resp. B′) come from which parent.
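The swap can be sketched as follows, with parents encoded as mappings from task names to arrival-time lists; to keep the sketch deterministic, the randomly chosen cut task is passed in as a parameter (an assumption of ours).

```python
def safe_crossover(parent_a, parent_b, tasks, cut_task):
    """Swap the whole arrival lists of every task up to (and including) the
    chosen cut task between two parents; a parent maps each task name to its
    list of arrival times, so parents of different sizes pose no problem."""
    child_a, child_b = dict(parent_a), dict(parent_b)
    for task in tasks[: tasks.index(cut_task) + 1]:
        child_a[task], child_b[task] = parent_b[task], parent_a[task]
    return child_a, child_b
```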
Mutation. SAFE uses a heuristic mutation algorithm, named SafeMutation. For a solution A, SafeMutation mutates the kth task arrival time at_{i,k} of an aperiodic task τ_i with a mutation probability. SafeMutation chooses a new arrival time value of at_{i,k} based on the [pmin_i, pmax_i] inter-arrival time range of τ_i. If such a mutation of the kth arrival time of τ_i does not affect the validity of the (k+1)th arrival time of τ_i, the mutation operation ends. Specifically, let at*_{i,k} be the mutated value of at_{i,k}. In case at_{i,k+1} ∈ [at*_{i,k} + pmin_i, at*_{i,k} + pmax_i], SafeMutation returns the mutated solution. After mutating the kth arrival time at_{i,k} of a task τ_i in a solution A, if the (k+1)th arrival becomes invalid, SafeMutation corrects the remaining arrivals of τ_i. For all the arrivals of τ_i after at*_{i,k}, SafeMutation first updates their original arrival time values by adding the difference at*_{i,k} − at_{i,k}. Let T = [0, t] be the scheduling period. SafeMutation then removes some arrivals of τ_i if they are mutated to arrive after t, or adds new arrivals of τ_i while ensuring that all tasks arrive within T.
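The correction procedure can be sketched as follows; next_gap() abstracts the new inter-arrival times appended at the tail (e.g., random draws from [pmin, pmax]), and all names are our assumptions.

```python
def correct_arrivals(arrivals, k, new_at, pmin, t_end, next_gap):
    """After mutating the kth arrival (0-based) to new_at: shift every later
    arrival by the same delta, drop arrivals pushed beyond t_end, and append
    new arrivals while they still fit in the scheduling period [0, t_end]."""
    delta = new_at - arrivals[k]
    shifted = arrivals[:k] + [new_at] + [at + delta for at in arrivals[k + 1:]]
    shifted = [at for at in shifted if at <= t_end]
    while shifted[-1] + pmin <= t_end:  # room for at least one more arrival
        nxt = shifted[-1] + next_gap()
        if nxt > t_end:
            break
        shifted.append(nxt)
    return shifted
```

On the worked example below (arrivals 0, 7, 14; second arrival mutated to 5; range [2, 8]; T = [0, 18]), the third arrival shifts from 14 to 12, matching the correction rule 12 = 14 + (5 − 7).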
Given the offspring presented in Table 1, SafeMutation, for example, mutates a child solution as follows. Let [2, 8] be the inter-arrival time range of task τ1, let T = [0, 18] be the time period during which SafeScheduler receives task arrivals, and let SafeMutation select the second arrival of task τ1, i.e., (τ1, 7) in Table 1, to mutate. Based on the inter-arrival time range of τ1, SafeMutation randomly chooses a new arrival time, e.g., 5, for the second arrival of τ1. The third arrival (τ1, 14) of τ1 then becomes invalid due to the mutated second arrival (τ1, 5); i.e., τ1 cannot arrive at time 14 because 14 ∉ [5 + 2, 5 + 8]. According to the correction procedure described above, the third arrival of τ1 is modified to (τ1, 12) as 12 = 14 + (5 − 7), where 14, 5, and 7 are, respectively, the original third arrival time of τ1, the mutated second arrival time of τ1, and the original second arrival time of τ1. As SafeScheduler can receive new arrivals of τ1 after time 12, SafeMutation may add new arrivals of τ1 based on the inter-arrival time range of τ1.
We note that when a system is only composed of periodic tasks, SAFE will skip searching for worst-case arrival sequences as arrivals of periodic tasks are deterministic (see Section 3), but will nevertheless generate the labelled dataset described below. When needed, SAFE can be easily extended to manipulate varying offset (and period) values for periodic tasks, in a way identical to how we currently handle inter-arrival times.
Labelled dataset. SAFE infers safe WCET ranges using a supervised learning technique, namely logistic regression [62], which requires a labelled dataset. In our context, a supervised learning technique creates a model that correlates tasks' WCET values with schedulability results indicating whether these tasks meet their deadlines or not. Supervised learning is conducted based on pairs of tasks' WCET values and schedulability results, i.e., a labelled dataset. Specifically, SAFE uses logistic regression because it allows engineers to have a probabilistic interpretation of safe WCET ranges and to investigate trade-off relationships among different tasks' WCETs. Section 4.2 describes this learning process in detail.
Recall from the fitness computation described above that SAFE runs SafeScheduler ns times to obtain schedule scenarios S = {S_1, S_2, ..., S_ns}, and then computes a fitness value of a solution A based on S. We denote by W_h a set of tuples (τ_i, w_i) representing that a task τ_i has the WCET value w_i in the S_h schedule scenario. Let D be the labelled dataset to be created by the first phase of SAFE. We denote by ℓ_h a label indicating whether or not a schedule scenario S_h has any deadline miss for any of the target tasks in Γ_T, i.e., ℓ_h is either safe or unsafe, denoting, respectively, no deadline miss or a deadline miss. For each fitness computation, SAFE adds ns tuples (W_h, ℓ_h) to D. Specifically, for a schedule scenario S_h, SAFE adds (W_h, unsafe) to D if there are τ_i ∈ Γ_T and k such that dist_h(τ_i, k) > 0; otherwise, SAFE adds (W_h, safe) to D.

Phase 2: Safe ranges of WCET
In Phase 2, SAFE computes safe ranges of WCET values under which target tasks are likely to be schedulable. To do so, SAFE applies a supervised machine learning technique to the labelled dataset generated by Phase 1 (Section 4.1). Specifically, Phase 2 executes SafeRefinement (Algorithm 1), which has the following steps: complexity reduction, imbalance handling, and model refinement.
Complexity reduction. The "reduce complexity" step in Algorithm 1 reduces the dimensionality of the labelled dataset D obtained from the first phase of SAFE (line 2). It predicts initial safe WCET ranges based on the WCET variables for the tasks in Γ (line 3) that have the most significant effect on deadline misses for target tasks. The labelled dataset D obtained from the first phase of SAFE contains tuples (W, ℓ) where W is a set of WCET values for the tasks in Γ and ℓ is a label of W indicating either no deadline miss (safe) or a deadline miss (unsafe) (Section 4.1). Note that some WCET values in W may not be relevant to determine ℓ. Hence, D may contain irrelevant variables for predicting ℓ.
Algorithm 1: SafeRefinement. An algorithm for computing safe WCET ranges under which target tasks are schedulable. The algorithm consists of the following three steps: "reduce complexity", "handle imbalanced dataset", and "refine model".

The "reduce complexity" step thus selects the WCET values with a significant effect on ℓ. To that end, SafeRefinement employs a standard feature reduction technique: random forest feature reduction [16], which has been successfully applied to high-dimensional data [39,59]. Given the labelled dataset D, random forest creates a set of decision trees based on parameter values such as the number of trees and the tree depth. The decision trees obtained by random forest allow us to rank features, i.e., task WCETs, based on their importance as measured by Gini impurity [16]. Hence, by setting a particular threshold for importance, we can select a subset of the features. Note that Section 5.6 describes the parameter values for the feature reduction step in detail.
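Using scikit-learn, this ranking step could be sketched as follows; the function name and parameter defaults are illustrative assumptions, not the article's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_wcet_features(X, y, names, n_trees=100, seed=0):
    """Fit a random forest on (WCET sample, safe/unsafe label) data and return
    the WCET variables ranked by their Gini-based feature importance."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return [(names[i], float(forest.feature_importances_[i])) for i in order]
```

A subset of the WCET variables is then kept by thresholding the returned importances.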
After reducing the dimensionality of the input dataset in Algorithm 1, SafeRefinement learns an initial model to predict safe WCET ranges from the reduced dataset. SafeRefinement uses logistic regression [41] because it enables a probabilistic interpretation of safe WCET ranges and the investigation of relationships among different tasks' WCETs. For example, Figure 3 shows a safe border determined by an inferred logistic regression model together with a probability of deadline misses. Note that a safe range of a task, e.g., that of task 1 in Figure 3, is determined by a point on the safe border in a multidimensional WCET space. A safe border distinguishes safe and unsafe areas in the WCET space. After inferring a logistic regression model from the input dataset, SafeRefinement selects a probability maximising the safe area under the safe border determined by the model and that probability, while ensuring that all the data instances, i.e., sets of WCET values, classified as safe using the safe border are actually observed to be safe in the input dataset, i.e., no false positives (lines 3-4). We note that engineers can also select any adequate probability, which may yield false positives or not maximise the area under the safe border, depending on their needs. SafeRefinement uses a second-order polynomial response surface model (RSM) [56] to build the logistic regression model. RSM is known to be useful when the relationship between several explanatory variables (e.g., WCET variables) and one or more response variables (e.g., the safe or unsafe label) needs to be investigated [48,56]. An RSM contains linear terms, quadratic terms, and 2-way interactions between linear terms.
Let W be the set of WCET variables in the reduced dataset. Then, the logistic regression model of SafeRefinement is defined as follows: log(P/(1−P)) = β0 + Σi βi·wi + Σi βii·wi² + Σi<j βij·wi·wj, where the wi ∈ W are WCET variables and P is the probability of a deadline miss. The RSM equation, i.e., the right-hand side, built on the reduced dataset has a higher number of dimensions, i.e., coefficients to be inferred, than |W|, as RSM additionally accounts for quadratic terms (wi²) and 2-way interactions (wi·wj) between linear terms. Hence, SafeRefinement employs a stepwise regression technique (line 3), e.g., stepwise AIC (Akaike Information Criterion) [75], to select significant explanatory terms from the RSM equation. This allows the remaining "refine model" step of SafeRefinement to execute efficiently, as it requires running SafeScheduler and logistic regression multiple times within a time budget (line 8), both operations being computationally expensive.

Imbalance handling. Recall from Section 4.1 that SAFE searches for worst-case sequences of task arrivals and is guided by maximising the magnitude of deadline misses, when they are possible. Therefore, the major portion of the dataset produced by the first phase of SAFE is a set of task arrival sequences leading to deadline misses. Supervised machine learning techniques (including logistic regression) typically produce unsatisfactory results when faced with highly imbalanced datasets [11]. SafeRefinement addresses this problem with the "handle imbalanced dataset" step in Algorithm 1 (lines 5-6) before refining safe WCET ranges. SafeRefinement aims to identify WCET ranges under which tasks are likely to be schedulable. This entails that WCET ranges under which tasks are highly unlikely to be schedulable can be safely excluded from the remaining analysis. Specifically, SafeRefinement prunes out WCET ranges whose probability of deadline misses is above a high threshold, thus creating a more balanced dataset compared to the original imbalanced one (line 6).
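The pruning idea can be sketched as follows: drop instances whose predicted deadline-miss probability exceeds a high threshold. The one-dimensional logistic model and its coefficients below are toy assumptions for illustration, not values inferred by SAFE.

```python
import math

def miss_probability(w, b0=-10.0, b1=4.0):
    """Toy 1-D logistic model: probability of a deadline miss given WCET w.
    The coefficients are illustrative, not taken from SAFE."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * w)))

def prune(dataset, threshold=0.99):
    """Drop instances whose predicted miss probability exceeds the threshold,
    yielding a more balanced dataset over restricted WCET ranges."""
    return [(w, l) for w, l in dataset if miss_probability(w) <= threshold]

# WCET points 0.0, 0.5, ..., 5.0; points above 2.5 are labelled unsafe.
data = [(w / 2, "unsafe" if w / 2 > 2.5 else "safe") for w in range(11)]
balanced = prune(data)
```

Pruning removes only the WCET region that is almost certainly unsafe, so the remaining data concentrate around the safe border where the subsequent regression needs resolution.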
SafeRefinement automatically finds a minimum probability that leads to a safe border classifying no false unsafe (negative) instances in the dataset. SafeRefinement then updates the maximum WCET of each task based on the intercept of the logistic regression model (at that probability) on the task's WCET axis. Figure 4 shows an example dataset with a safe border characterised by a high deadline miss probability, i.e., 0.99, used to create a more balanced dataset within the restricted ranges.

Model refinement. The "refine model" step in Algorithm 1 refines an inferred logistic regression model by sampling additional schedule scenarios selected according to a strategy that is expected to improve the model. As described in Section 4.1, the SAFE search produces a set (population) of worst-case arrival sequences of the tasks in Γ which likely violate the deadline constraints of the target tasks. For each arrival sequence in the population, SafeRefinement executes SafeScheduler ns times to add ns new data instances to the dataset based on the generated schedule scenarios and their schedulability results (lines 9-19). After adding these new data instances (ns per arrival sequence), SafeRefinement runs logistic regression again to infer a refined model and computes a probability that ensures no false safe (positive) instances in the dataset and maximises the safe area under the resulting safe border (lines 20-25).
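The sampling strategy used during refinement is the distance-based sampling detailed next. A minimal sketch, assuming a toy linear safe border w1 + w2 = 1 in a two-dimensional WCET space (the real border comes from the inferred logistic model and selected probability), is:

```python
import random

def distance_to_border(point):
    """Euclidean distance from a WCET point to the toy linear safe border
    w1 + w2 = 1 (a stand-in for the inferred logistic border)."""
    w1, w2 = point
    return abs(w1 + w2 - 1.0) / (2 ** 0.5)

def distance_based_sample(n_candidates=100, seed=42):
    """Generate random WCET candidates in [0,1]^2 and keep the candidate
    nearest the safe border, as in steps (1)-(3) of the procedure."""
    rng = random.Random(seed)
    candidates = [(rng.random(), rng.random()) for _ in range(n_candidates)]
    return min(candidates, key=distance_to_border)

point = distance_based_sample()
```

Sampling close to the border concentrates new training instances where the classifier is most uncertain, which is why it improves the model faster than uniform sampling.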
In the second phase of SAFE, SafeScheduler selects WCET values for the tasks in Γ to compute a schedule scenario based on a distance-based random number generator, which extends the standard uniform random number generator. The distance-based WCET value sampling aims at minimising the Euclidean distance between the sampled WCET points and the safe border defined by the inferred model and the selected probability. SafeScheduler iteratively computes new WCET values using the following distance-based sampling procedure: (1) generating random samples in the WCET space, (2) computing their distance values from the safe border, and (3) selecting the point closest to the safe border. SafeRefinement stops model refinement either upon reaching an allotted analysis budget (line 8 of Algorithm 1) or when precision reaches an acceptable level pt, e.g., 0.99 (lines 23-25). SafeRefinement uses the standard precision metric [71] as described in Section 5.5. In our context, practitioners need to identify safe WCET ranges at a high level of precision to ensure that the identified safe WCET ranges can be trusted. To compute a precision value, SafeRefinement uses standard k-fold cross-validation [71], in which the dataset is partitioned into k equal-size splits. One split is retained as a test dataset, and the remaining k−1 splits are used as a training dataset. The cross-validation process is repeated k times to compute the precision of the inferred safe borders, which are determined by a logistic regression model and a probability (lines 21-22).

Selecting WCET ranges. A safe border defined by an inferred logistic regression model and a deadline miss probability represents a (possibly infinite) set of points, each corresponding to safe WCET ranges of tasks, e.g., Figure 3. In practice, however, engineers need to choose a specific WCET range for each task to conduct further analysis and development. How to choose optimal WCET ranges depends on the system context.
At early stages, however, such contextual information may not be available. Hence, SAFE proposes a best-size point, i.e., WCET ranges, on a safe border which maximises the volume of the hyperbox the point defines. In general, the larger the hyperbox, the greater the flexibility engineers have in selecting appropriate WCET values. Choosing the point with the largest volume is helpful when no domain-specific information is available to define other selection criteria. More generally, the inferred safe border enables engineers to investigate trade-offs among different tasks' WCET values.
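A minimal sketch of selecting a best-size point by maximising hyperbox volume, assuming a toy linear safe border in two dimensions (SAFE applies numerical optimisation, e.g., Nelder-Mead, on the inferred border rather than this naive grid search):

```python
def border_w2(w1):
    """Toy safe border: the maximum safe w2 for a given w1, a stand-in for
    the curve defined by the inferred model at the selected probability."""
    return 1.0 - w1

def best_size_point(steps=1000):
    """Grid search for the border point maximising the hyperbox volume
    w1 * w2 that the point defines with the origin."""
    best_w1 = max((i / steps for i in range(steps + 1)),
                  key=lambda w1: w1 * border_w2(w1))
    return best_w1, border_w2(best_w1)

best = best_size_point()
```

For this symmetric toy border the volume w1·(1−w1) peaks at (0.5, 0.5), i.e., the best-size point balances the two tasks' WCET budgets.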

EVALUATION
We evaluate SAFE using three industrial case studies, including one from the satellite domain, as well as several synthetic subjects. Our full evaluation package is available online [43].

RQ1 (baseline comparison):
How does SAFE perform compared with a baseline approach? With RQ1, we investigate whether SAFE can outperform WCET estimation based on random search. Note that such an RQ is an important sanity check for search-based solutions in general [6,36]. Our conjecture is that SAFE, although computationally expensive, will significantly outperform a random search solution with respect to estimating safe WCET ranges with a higher degree of confidence.

RQ2 (effectiveness of distance-based sampling):
How does SAFE, based on distance-based sampling, perform compared with random sampling? We compare the distance-based sampling procedure described in Section 4.2 and used in the second phase of SAFE with naive random sampling. Our conjecture is that distance-based sampling, although expensive, is needed to improve the quality of the training data used for logistic regression. RQ2 assesses this conjecture by comparing distance-based and random sampling.

RQ3 (usefulness):
Can SAFE identify WCET ranges within which tasks are highly likely to satisfy their deadline constraints? In RQ3, we investigate whether SAFE identifies acceptably safe WCET ranges in practical time. We further discuss our insights regarding the usefulness of SAFE based on feedback obtained from engineers at LuxSpace.

RQ4 (scalability):
Can SAFE find safe WCET ranges for large-scale systems within a practical time budget? In this RQ, we study the relationship between the execution time of SAFE and the parameters of the study subjects. We use several synthetic subjects to be able to freely control key real-time system parameters.

Table 2. Description of the three industrial subject systems: number of periodic and aperiodic tasks, resource dependencies, and platform cores. The full task descriptions are available online [43].

Industrial Study Subjects
We evaluated SAFE by applying it to our motivating case study subject, i.e., the satellite attitude determination and control system (ADCS) described in Section 2, as well as two industrial study subjects from the literature [61,65]. Table 2 summarizes the relevant attributes of these subjects, presenting the number of periodic and aperiodic tasks, resource dependencies, and processing cores. The subjects are characterized by real-time parameters, e.g., priorities, WCETs, periods and deadlines, described in Section 3. The full task descriptions of the subjects are available online [43].
The main missions of the three subjects are described as follows:
• ADCS is a satellite system that aims at orienting a satellite in a proper position on time to ensure that the satellite provides normal service correctly (see Section 2). LuxSpace, our industry partner, developed ADCS for an ESA project.
• ICS is an ignition control system that checks the status of an automotive engine and corrects any errors of the engine [61]. The system was developed by Bosch GmbH 1.
• UAV is a mini unmanned air vehicle that follows dynamically defined way-points and communicates with a ground station to receive instructions [65]. The system was developed in collaboration with the University of Poitiers, France, and ENSMA 2.
LuxSpace is a leading system integrator of micro satellites and aerospace systems. ADCS includes a set of 15 periodic and 19 aperiodic tasks. Eight tasks out of the 19 aperiodic tasks are constrained by hard deadlines, i.e., sporadic tasks. Out of the 34 tasks, engineers provided single WCET values for eight tasks. For the remaining 26 tasks, engineers estimated WCET ranges due to uncertain decisions, e.g., implementation choices and hardware specifications, made at later development stages (see Section 2). The differences between the estimated maximum and minimum WCET values across the 26 tasks vary from 0.1ms to 20000ms. Our collaboration with LuxSpace enabled us to discuss SAFE results with engineers to draw important qualitative conclusions and to assess the benefits of SAFE (see Section 5.7).
For the experiments with ICS and UAV, we used the task descriptions reported in a previous study [26] and modified their tasks' WCETs from point values to ranges. Though the problem of schedulability analysis of real-time tasks has been widely studied [3,15,26,35,69], none of the prior work addresses the same problem (see Section 3) as that addressed by SAFE. Hence, the public study subjects in the literature do not fit our study's requirements. In particular, none of the public real-time system case studies [26] contains estimated WCET ranges in their task descriptions. These ranges, however, are necessary to apply SAFE and to evaluate its effectiveness. In order to evaluate SAFE in various and realistic system contexts, we chose to apply SAFE to existing industrial subjects, i.e., ICS and UAV, described in prior work [26] and made necessary changes only to task WCETs of the subjects as described below. Compared to ADCS, ICS and UAV have different task characteristics, such as resource dependencies and number of processing cores.
We note that estimating (practically valid) WCET ranges requires significant domain expertise. For public-domain case study systems such as ICS and UAV, however, we do not have access to the engineers who developed those subjects. Hence, we chose to apply a simple and straightforward method to convert a point WCET value w of a task into a WCET range as follows: (Step 1) We first check whether the system under analysis is schedulable or not. (Step 2) If the system is evaluated to be schedulable, its tasks may be able to handle higher execution times than their estimated WCETs. Hence, we simply define the WCET range of the task as [w, α·w], where α > 1, as input for SAFE. This modification enables SAFE to find more relaxed safe WCET ranges. (Step 2′) Otherwise, if the system is evaluated to be unschedulable, we define the WCET range as [α′·w, w], where α′ < 1, as input for SAFE. This modification allows SAFE to find appropriate WCET estimates, ensuring the system is likely to be schedulable under the WCET ranges found by SAFE. As ICS and UAV are likely to be schedulable [26], we created the modified task descriptions of both subjects based on Step 2. We conducted experiments using simulations to set the value of α for ICS and UAV by configuring α to 1.1, 1.2, . . ., 1.5 incrementally until we could find deadline misses in each system, i.e., unsafe WCET values. Recall from Section 4.2 that SAFE relies on logistic regression to partition the given WCET ranges into safe and unsafe sub-ranges for a selected deadline miss probability. Hence, we modified the estimated WCET ranges of ICS and UAV to include both safe and unsafe WCET ranges. Given the experiment results, we set the WCET ranges of ICS and UAV to [w, 1.2·w] and [w, 1.5·w], respectively. The full original and modified task descriptions of ICS and UAV are available online [43].
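The conversion steps above translate into a few lines of code. The widening factors below (a factor above 1 for schedulable systems, below 1 for unschedulable ones) are illustrative defaults, not the values used in the study.

```python
def to_wcet_range(wcet, schedulable, alpha=1.2, alpha_prime=0.8):
    """Convert a point WCET into a range for SAFE's analysis.
    If the system is schedulable, widen upward ([w, alpha*w], alpha > 1);
    otherwise widen downward ([alpha_prime*w, w], alpha_prime < 1).
    The alpha values here are illustrative defaults."""
    if schedulable:
        return (wcet, alpha * wcet)
    return (alpha_prime * wcet, wcet)
```

For example, a schedulable task with a 10ms point WCET would receive the range [10ms, 12ms] under the default factor.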

Synthetic study subjects
We evaluated the scalability of SAFE using synthetic systems, following the common scalability analysis practice applied in many real-time system studies [23,25,28,32,70,78]. As shown in Algorithm 2, we synthesise a set of real-time tasks by varying key task parameters as described below. The algorithm first synthesises a set of periodic tasks (lines 2-8) and then converts some of these tasks to aperiodic tasks (lines 9-10). Last, the algorithm configures some tasks with WCET ranges (lines 11-12).
As shown on line 3 of Algorithm 2, the algorithm first creates a set U of task utilisation values by using the UUniFast-Discard algorithm [23], which is devised to give an unbiased distribution of task utilisation values. The UUniFast-Discard algorithm takes as input the number n of tasks to be synthesised and a target utilisation value u. It then outputs n utilisation values u1, . . ., un such that 0 < ui < 1 for all i and u1 + · · · + un = u. As for line 4 in Algorithm 2, the algorithm generates task periods within an input range of minimum and maximum periods; a granularity parameter determines period values as multiples of that granularity. Lines 5-7 of Algorithm 2 describe how the algorithm synthesises tasks' WCET values. Specifically, for each task i, the algorithm computes its WCET as ci = ui · pi, where pi is the task's period.
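For illustration, a compact UUniFast-Discard implementation; the discard loop rejects any utilisation vector containing a value of 1 or more, which preserves the unbiased distribution over the valid region.

```python
import random

def uunifast_discard(n, target_u, seed=0):
    """UUniFast-Discard: n unbiased task utilisations summing to target_u,
    discarding vectors that contain a utilisation >= 1."""
    rng = random.Random(seed)
    while True:
        utils, remaining = [], target_u
        for i in range(1, n):
            # Peel off one utilisation; the exponent keeps the split unbiased.
            nxt = remaining * rng.random() ** (1.0 / (n - i))
            utils.append(remaining - nxt)
            remaining = nxt
        utils.append(remaining)
        if all(u < 1.0 for u in utils):
            return utils

utils = uunifast_discard(20, 0.9)
```

With 20 tasks and a target utilisation of 0.9 (the baseline setting used later in EXP4), every generated value is automatically below 1, so no discard occurs.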
Given the task periods T and the WCET values C, line 8 of Algorithm 2 synthesises a set Γ of periodic tasks accounting for offsets, priorities, and deadlines. A periodic task is characterised by a period, a WCET, an offset, a priority, and a deadline (see Section 3). A task offset is randomly selected from an input range of offset values. The algorithm relies on the rate-monotonic scheduling policy [44] to decide task priorities and deadlines. Specifically, tasks with shorter periods are given higher priorities, and tasks' deadlines are equal to their periods.

Algorithm 2: An algorithm for creating a synthetic subject while accounting for the task characteristics described in Section 3. Its inputs are the number of tasks, the target utilisation, the minimum and maximum task periods, the granularity of task periods, the maximum offset value, the ratio of aperiodic tasks, a range factor to determine inter-arrival times, the number of WCET ranges, and a range factor to determine WCET ranges. The algorithm generates task utilisations (line 3), periods (line 4), and WCETs (lines 5-7), assembles the periodic task set Γ via generate_task_set (line 8), converts some periodic tasks to aperiodic tasks via convert_to_aperiodic_tasks (lines 9-10), converts some WCET point values to WCET ranges via convert_to_WCET_ranges (lines 11-12), and returns Γ.

Line 10 of Algorithm 2 synthesises aperiodic tasks. The algorithm converts some periodic tasks into aperiodic tasks according to an input ratio of aperiodic tasks among all tasks. The algorithm then uses a range factor to determine the minimum and maximum inter-arrival times of aperiodic tasks. Specifically, for each task to be converted, the algorithm computes a range of inter-arrival times from the task's original period and the range factor.
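A sketch of the periodic-task assembly on line 8 of Algorithm 2, applying rate-monotonic priorities and implicit deadlines. The dictionary layout and the convention that priority 0 is highest are illustrative assumptions; offsets are fixed at 0 for brevity.

```python
def generate_task_set(periods, wcets):
    """Assemble periodic tasks: shorter period = higher priority (0 highest),
    deadline equal to period (rate-monotonic, implicit deadlines)."""
    order = sorted(range(len(periods)), key=lambda i: periods[i])
    priority = {task: rank for rank, task in enumerate(order)}
    return [{"period": periods[i], "wcet": wcets[i], "offset": 0,
             "priority": priority[i], "deadline": periods[i]}
            for i in range(len(periods))]

tasks = generate_task_set([100, 10, 50], [10.0, 1.0, 5.0])
```

Here the task with period 10 receives the highest priority and every deadline equals the task's period, matching the rate-monotonic policy described above.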

Experimental Setup
To answer RQ1, RQ2, and RQ3 described in Section 5.1, we rely on case study data pertaining to ADCS, provided by LuxSpace, as well as the ICS and UAV subjects described in Section 5.2. To answer RQ4, we used 800 synthetic subjects (see Section 5.3). We conducted four experiments, EXP1, EXP2, EXP3, and EXP4, as described below.
EXP1. To answer RQ1, we developed a baseline solution that estimates task WCETs based on random search (RS). The baseline replaces the GA in Phase 1 with RS and does not infer a safe border using logistic regression. Note that the baseline uses the same fitness function (see Section 4.1) and also maintains the best population during search; however, it does not employ any genetic operators, i.e., crossover and mutation. The baseline solution also produces a labelled dataset containing tuples that pair a set of task WCETs with a label indicating either safe or unsafe (see Section 4.1). Given the labelled dataset, the baseline selects the best task WCETs that are safe and maximise the volume of the hyperbox they define. Specifically, the baseline finds a safe tuple in the dataset that maximises the volume of the hyperbox defined by its WCETs, subject to the condition that every tuple in the dataset whose hyperbox is contained in that hyperbox is also labelled safe.
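The baseline's selection rule can be sketched as follows, treating each WCET tuple as the upper corner of a hyperbox anchored at the origin; the toy dataset is illustrative.

```python
def volume(w):
    """Volume of the hyperbox [0, w1] x ... x [0, wn]."""
    v = 1.0
    for x in w:
        v *= x
    return v

def contains(outer, inner):
    """True if the hyperbox defined by outer contains the one defined by inner."""
    return all(i <= o for i, o in zip(inner, outer))

def baseline_best(dataset):
    """Pick the safe WCET tuple of maximal volume such that every observed
    tuple inside its hyperbox is also labelled safe."""
    candidates = [w for w, l in dataset if l == "safe"
                  and all(l2 == "safe" for w2, l2 in dataset
                          if contains(w, w2))]
    return max(candidates, key=volume, default=None)

data = [((1.0, 1.0), "safe"), ((2.0, 2.0), "safe"),
        ((3.0, 3.0), "unsafe"), ((4.0, 4.0), "safe"),
        ((2.5, 0.5), "safe")]
best = baseline_best(data)
```

Note that the large tuple (4.0, 4.0) is rejected despite its volume because the unsafe observation (3.0, 3.0) lies inside its hyperbox.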
EXP1 compares the results obtained from executing SAFE and the baseline. For comparison, SAFE selects a best-size point, i.e., WCET ranges, on a safe border that maximises the volume of the hyperbox the point defines (see Section 4.2). Given two solutions, i.e., estimated WCET ranges, obtained by SAFE and the baseline, EXP1 checks the schedulability of the two solutions using simulations. To do so, we ran simulations multiple times by varying task arrivals and task execution times within their estimated WCET ranges and checked whether there was a deadline miss in each simulation result.
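The simulation-based check amounts to a Monte Carlo estimate of the deadline-miss frequency. In the sketch below, `misses_deadline` is a toy stand-in for one SafeScheduler run (summed sampled execution times against an assumed deadline of 1.4), not the actual scheduler.

```python
import random

def misses_deadline(wcets, rng):
    """Toy stand-in for one simulation run: sample execution times within
    the WCET ranges and report whether the assumed deadline is missed."""
    return sum(rng.uniform(0.0, w) for w in wcets) > 1.4

def empirical_miss_probability(wcets, runs=40000, seed=0):
    """Relative frequency of simulation runs containing a deadline miss."""
    rng = random.Random(seed)
    return sum(misses_deadline(wcets, rng) for _ in range(runs)) / runs

p = empirical_miss_probability((1.0, 1.0))
```

With two unit WCET ranges and a deadline of 1.4, the exact miss probability is 0.18, and the 40000-run estimate lands close to it.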
EXP2. To answer RQ2, EXP2 compares our distance-based WCET sampling technique (described in Section 4.2) with the naive random WCET sampling technique, for the second phase of SAFE. To this end, EXP2 first creates an initial training dataset by running the first phase of SAFE. EXP2 then relies on this initial training data for model refinement (Section 4.2) using both distance-based and naive random sampling. For comparison, EXP2 creates a test dataset by randomly sampling WCET values, independently of the second phase of SAFE, and then compares the accuracy of the two sampling approaches in identifying safe WCET ranges on the test dataset.
EXP3. To answer RQ3, EXP3 computes precision values for SAFE, obtained from 10-fold cross-validation (see Section 4.2), over each model refinement for the ADCS subject. We note that EXP3 focuses on ADCS to evaluate the practical usefulness of SAFE, as we do not have access to the engineers who developed the other study subjects, i.e., ICS and UAV. In our study context, i.e., developing safety-critical systems, engineers require very high precision, ideally with no false positives (see Section 4). Hence, EXP3 measures precision over model refinements to align with such practice. EXP3 then measures whether SAFE can compute safe WCET ranges within practical execution time and at an acceptable level of precision.
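Standard k-fold cross-validation, as used to compute precision at each refinement, can be sketched with the standard library alone; the interleaved fold assignment is one simple choice among several.

```python
def k_fold_splits(data, k=10):
    """Partition data into k interleaved folds and yield (train, test) pairs,
    using each fold once as the test set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(k_fold_splits(list(range(25)), k=5))
```

Each instance appears in exactly one test fold, so precision averaged over the k rounds uses every labelled instance exactly once as test data.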
EXP4. To answer RQ4, EXP4 measures the execution time of SAFE on 800 synthetic systems. We use the task generation algorithm described in Section 5.3 (Algorithm 2) to create the synthetic systems. In order to conduct controlled experiments studying correlations between the execution time of SAFE and a particular system parameter (e.g., the number of tasks), we first create a baseline synthetic system by setting the parameters of Algorithm 2 as follows: (1) We set the number of tasks to 20, the ratio of aperiodic tasks to 0.45, and the maximum offset to 0. These parameter values are the averages of the corresponding parameters in our industrial subjects. (2) With regard to task periods, we set the minimum and maximum periods to 10ms and 1s, which are common values in many real-time subjects [10]. The granularity of task periods is set to 10ms in order to increase realism, as most of the task periods in our industrial subjects are multiples of 10ms. (3) We assign 0.25 to the range factor determining inter-arrival times of aperiodic tasks, 2 to the number of WCET ranges, 0.25 to the range factor determining WCET ranges, and 0.9 to the target utilisation per processing core. We set these parameter values based on initial experiments to ensure that the executions of the synthetic systems examined in EXP4 sometimes violate their deadlines. Recall from Section 4 that SAFE relies on logistic regression and a labelled dataset containing both safe (positive) and unsafe (negative) data instances. (4) We set the number of processing cores to 1 as a baseline. (5) For the simulation time of SafeScheduler (see Section 4.1), we assign 30s in order to ensure that any aperiodic task arrives at least once and all possible arrivals of periodic tasks are analysed during that time.
Given the baseline system, we create several synthetic systems to be examined in EXP4 by varying the parameters' values. Resource dependencies are not controlled when generating synthetic systems as they do not impact SAFE's searching and learning (see Section 4) but only the simulations, which are investigated by varying the simulation time. Due to the degree of randomness in our approach to generating synthetic systems (see Section 5.3), we create ten synthetic systems for each control parameter.

Metrics
We use the standard precision and recall metrics [71] to measure accuracy in our experiments. To compute precision and recall for EXP2, we created a synthetic test dataset for each study subject containing tuples of WCET values and a flag indicating the presence or absence of a deadline miss, obtained by running SafeScheduler. Note that creating a test dataset by running an actual study subject with varying task WCETs is prohibitively expensive. We therefore used the set of task arrival sequences obtained from the first phase of SAFE for each subject, as we aim at testing sequences of task arrivals that are more likely to violate deadlines. We then ran SafeScheduler to simulate task executions for this set of task arrival sequences with randomly sampled WCET values. We note that WCET values were sampled within the restricted WCET ranges produced by the "handle imbalanced dataset" step in Algorithm 1; parts of the WCET ranges under which tasks are unlikely to be schedulable are therefore not considered when sampling. For EXP3, we used 10-fold cross-validation based on the training dataset at each model refinement step (Phase 2).
We define the precision and recall metrics as follows: (1) precision = TP/(TP + FP) and (2) recall = TP/(TP + FN), where TP, FP, and FN denote the number of true positives, false positives, and false negatives, respectively. A true positive is a test instance (a set of WCET values) labelled as safe and correctly classified as such. A false positive is a test instance labelled as unsafe but incorrectly classified as safe. A false negative is a test instance labelled as safe but incorrectly classified as unsafe. We prioritise precision over recall as practitioners require (ideally) no false positives, i.e., unsafe instances with deadline misses incorrectly classified as safe, in the context of mission-critical, real-time satellite systems. For EXP2, precision and recall values are measured on a synthetic test dataset. For EXP3, precision values are computed using the collective sets of true positives and false positives obtained from 10-fold cross-validation at each model refinement.
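The two metrics translate directly into code; 'safe' is the positive class, and returning 1.0 for an empty denominator is an illustrative convention rather than part of the paper's definitions.

```python
def precision_recall(instances):
    """instances: (label, predicted) pairs with values 'safe'/'unsafe';
    'safe' is the positive class."""
    tp = sum(1 for l, p in instances if l == "safe" and p == "safe")
    fp = sum(1 for l, p in instances if l == "unsafe" and p == "safe")
    fn = sum(1 for l, p in instances if l == "safe" and p == "unsafe")
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

outcomes = [("safe", "safe"), ("safe", "safe"),
            ("unsafe", "safe"), ("safe", "unsafe")]
prec, rec = precision_recall(outcomes)
```

In the toy example, one unsafe instance classified as safe (the case practitioners most want to avoid) drags precision down to 2/3.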
Due to the inherent degree of randomness in SAFE, we repeat our experiments 50 times. For EXP1, we ran 40000 simulations to check the schedulability of the solutions obtained by SAFE and the baseline. To statistically compare our results, we use the non-parametric Mann-Whitney U-test [49] and Vargha and Delaney's Â12 effect size [66]. The Mann-Whitney U-test determines whether two independent samples are likely to belong to the same distribution; we set the level of significance to 0.05. Vargha and Delaney's Â12 measures probabilistic superiority, i.e., effect size, between search algorithms. Two algorithms are considered equivalent when the value of Â12 is 0.5.
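Â12 has a simple pairwise formulation: the probability that an observation from the first sample exceeds one from the second, counting ties as half. A stdlib-only sketch:

```python
def a12(xs, ys):
    """Vargha and Delaney's A-hat-12: probability that a value drawn from
    xs is larger than one drawn from ys (0.5 means equivalent samples)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))
```

Identical samples yield 0.5, while a sample that dominates the other entirely yields 1.0 (or 0.0 in the opposite direction).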

Implementation and Parameter Tuning
To implement the feature reduction step of Algorithm 1, we used the random forest feature reduction [16] as it has been successfully applied to high-dimensional data [39,59]. For the stepwise regression step of Algorithm 1, we used the stepwise AIC regression technique [75] which has been used in many applications [52,79]. Recall from Section 4.2 that our distance-based sampling and best-size region recommendation require a numerical optimisation technique to find the nearest WCET sample and a maximum safe region size based on an inferred safe border. For such optimisations, we applied a standard numerical optimisation method, i.e., the Nelder-Mead method [58].
To compute the GA fitness, we set the number of SafeScheduler runs (Section 4.1) for each solution to 20. This number was chosen based on our initial experiments: we observed that 20 runs of SafeScheduler per solution keep execution time under a reasonable threshold, i.e., less than 1.2 minutes for all the subjects, and are sufficient to compute the fitness used by SAFE. SafeScheduler schedules the 34 tasks in ADCS for 1800s, the 6 tasks in ICS for 150ms, and the 16 tasks in UAV for 1500ms, during which SafeScheduler advances its simulation clock by 0.1ms, 0.01ms, and 0.01ms, respectively, for adequate precision. We chose these time periods to ensure that all the tasks in each subject can be executed at least once.
For the GA search parameters, we set the population size to 10, the crossover rate to 0.7, and the mutation rate to 0.2, which is consistent with existing guidelines [38]. We ran the GA for 1000 iterations, after which we observed that fitness reached a plateau in our initial experiments. Note that, for the baseline comparison to be fair, we ran RS for 1500 iterations to ensure that the generated dataset contained 30000 data instances, i.e., the same number of data instances obtained by SAFE.
Regarding the feature reduction step of Algorithm 1, we set the random forest parameters as follows: (1) the tree depth parameter is set to the square root of the number of features, i.e., 26 WCET ranges in ADCS, 6 in ICS, and 16 in UAV, based on guidelines [37]. (2) The number of trees is set to 100 based on our initial experiments. We observed that learning more than 100 trees does not provide additional gains in terms of reducing the number of features.
Note that all the parameters mentioned above could probably be further tuned to improve the performance of SAFE. However, since our current setting allowed us to convincingly and clearly support our conclusions, we do not report further experiments on tuning those parameters.
We ran our experiments on the high-performance computing cluster [67] at the University of Luxembourg. To account for randomness, we repeated each run of SAFE 50 times for all the experiments. Each run of SAFE was executed on a different node of the cluster. It took around 35h to create a synthetic test dataset with 50000 instances. When we set 1000 GA iterations for the first phase of SAFE and 10000 new WCET samples (100 refinements × 100 new WCET samples per refinement) for the second phase of SAFE, each run of SAFE took at most 27.1h (phase 1: 16.36h; phase 2: 10.74h). This running time is acceptable as SAFE can be executed offline in practice.

Table 3. Comparing SAFE and our baseline method using (1) the volumes of the hyperboxes that are defined by the best-size points computed by each method, (2) the number of simulation runs that contain deadline misses out of 40000 runs in total, and (3) the execution time of each method.

Table 3 compares SAFE and our baseline method (Baseline) using the following three metrics: (1) the volume of the hyperbox that is defined by the best-size point (see Sections 4.2 and 5.4) computed by each method, (2) the number of simulation runs, out of 40000 runs, that contain any deadline misses when tasks execute within the estimated WCET ranges defined by the best-size points, and (3) the execution time of SAFE and Baseline to estimate WCET ranges. The results presented in the table are the mean values obtained from 50 runs of SAFE and Baseline for each of the three subjects. To enable accurate comparisons, we ran 40000 simulations for each execution of SAFE and Baseline, aiming at evaluating their estimated WCET ranges as described in Section 5.4. Statistical comparisons of the results obtained from the 50 runs are summarised using p-values and Â12 values as described in Section 5.5.

RQ1.
As shown in Table 3, compared to Baseline, SAFE provides more relaxed WCET ranges for ADCS and UAV. Note that the larger the hyperbox, the greater the flexibility engineers have in selecting appropriate WCET values. For ICS, however, Baseline finds a larger hyperbox than the best-size hyperbox produced by SAFE. This is likely because ICS has a small number of tasks and is therefore much simpler than the other two subjects. Further, we recall that SAFE also provides engineers with a probability of deadline misses and trade-off relations between task WCETs based on an inferred logistic regression model (see Section 4.2). We further discuss these benefits from a practitioner's perspective in RQ3.
EXP1 evaluates the estimated WCET ranges that are defined by the best-size points obtained from 50 runs of SAFE and Baseline for each subject. The estimated WCET ranges are examined through 40000 simulation runs by varying task arrivals and their execution times within the estimated WCET ranges. The "Deadline misses" row in Table 3 shows the mean number of simulation runs (out of 40000) containing deadline misses. Across all the subjects, the differences between SAFE and Baseline are statistically significant, as the p-values are less than 0.05. The Â12 values show that SAFE is probabilistically superior to Baseline with respect to minimising the number of deadline misses.
Regarding the execution times of SAFE and Baseline, SAFE took more time than Baseline for all the subjects, as shown in Table 3. Estimating safe WCET ranges for ADCS requires the largest execution time (on average, 25.14h) compared to the other subjects. We note that such execution time is acceptable as SAFE can be executed offline in practice. Figure 5 shows probability distributions obtained from 50 runs of SAFE and simulations for ADCS, ICS, and UAV. As described in Section 4.2, SAFE partitions the given WCET ranges into safe and unsafe sub-ranges using a safe WCET border that is defined by an inferred logistic regression model and a selected probability of deadline misses. In EXP1, SAFE selects a deadline miss probability that maximises the safe area under the safe border while ensuring that all the WCET points, i.e., sets of WCETs, classified as safe using the safe border are actually observed to be safe in the input dataset of logistic regression. The estimated WCET ranges, i.e., the 50 best-size WCET points obtained by SAFE, are then evaluated through 40000 simulation runs for each WCET point by varying task arrivals and their execution times within the estimated WCET ranges. The empirical probability, i.e., relative frequency, of deadline misses is computed as the ratio of the number of simulation runs containing any deadline misses to the total number of simulation runs, i.e., 40000. The probability comparison depicted in Figure 5 shows that the deadline miss probability selected by SAFE is larger than the empirical probability computed by simulation-based evaluations. SAFE infers a logistic regression model, providing a probabilistic interpretation, based on a labelled dataset evaluated under worst-case task arrivals. The inferred logistic regression model therefore likely fits the system executions when task arrivals are worst-case with respect to maximising the magnitude of deadline misses.
However, the empirical probability estimated through simulations is based on system executions in which tasks arrive randomly within their inter-arrival time ranges. Hence, a logistic regression model obtained by SAFE enables more conservative probabilistic interpretations of the estimated WCET ranges than simulation-based evaluations do. This implies that actual deadline miss probabilities tend to be lower than SAFE's probability estimates, which is in practice a desirable property.
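To illustrate the idea of a probabilistic safe border, the sketch below fits a second-order polynomial logistic regression and classifies a WCET point as safe when its predicted miss probability falls below a selected threshold. This is a sketch only, assuming scikit-learn and synthetic labels (a point is labelled a miss when the two WCETs exceed a hypothetical budget), not SAFE's simulation-based labelling:

```python
# Sketch: infer a probabilistic safe WCET border with second-order polynomial
# logistic regression, then classify a candidate WCET point against a selected
# deadline miss probability. Data and the miss rule are synthetic/hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 2))      # WCET samples for two tasks
y = (X.sum(axis=1) > 12.0).astype(int)         # 1 = deadline miss observed

model = make_pipeline(PolynomialFeatures(degree=2),
                      LogisticRegression(max_iter=1000))
model.fit(X, y)

p_threshold = 0.02                             # selected miss probability
candidate = np.array([[3.0, 4.0]])             # candidate WCET point
p_miss = model.predict_proba(candidate)[0, 1]
print("safe" if p_miss < p_threshold else "unsafe")
```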
The answer to RQ1 is that SAFE significantly outperforms the baseline method with respect to minimising the number of deadline misses when using the estimated WCET ranges. Across our experiments, SAFE takes at most 27.1h (while the baseline takes 26.4h) to estimate the best-size WCET ranges. The execution time is acceptable as SAFE can be executed offline in practice.
RQ2.
As shown in Figures 6a, 6c, and 6e, across 100 model refinements, SAFE with D achieves higher precision values than those obtained with R for the three subjects. Also, Figures 6a, 6c, and 6e show that the variance of precision with D tends to be smaller than that with R. On average, precision with D converges toward 1 with model refinements, an important property in our context; precision with R shows a markedly different trend and does not converge to 1. Based on statistical comparisons, the difference in precision values between D and R becomes statistically significant after only 10, 4, and 11 model refinements for ADCS, ICS, and UAV, respectively.
Regarding recall comparisons between D and R for ADCS, as shown in Figure 6b, D produces higher recall values over 100 model refinements than those of R. The difference in recall values between D and R becomes statistically significant after 36 model refinements. Regarding ICS ( Figure 6d) and UAV (Figure 6f), their differences in recall values for D and R are not statistically significant even after 100 model refinements. This may be explained by their much smaller number of tasks compared with ADCS. Across our experiments, for 100 model refinements, SAFE took, at most, 10.86h and 10.54h with D and R, respectively.
The answer to RQ2 is that SAFE with distance-based sampling significantly outperforms SAFE with random sampling in achieving higher precision. Only distance-based sampling achieves a precision close to 1 within practical time, an important requirement in our context. Figure 7 shows precision values obtained from 10-fold cross-validation at each model refinement for the ADCS subject. Recall from Section 4.2 that SAFE stops model refinements once a precision value reaches a desired level. As shown in Figure 7, precision values tend to increase with additional WCET samples. Hence, practitioners can stop the model refinement procedure once precision reaches an acceptable level, e.g., >0.999. After 100 model refinements, SAFE reaches, on average, a precision of 0.99986. For EXP3, SAFE took, at most, 16.36h for phase 1 and 10.74h for phase 2.
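The stopping criterion described above can be sketched as follows. The precision estimate here is a deterministic stand-in for SAFE's 10-fold cross-validation (which requires the actual labelled dataset), while the batch size mirrors the experimental setting of 100 new WCET samples per refinement:

```python
# Sketch of SAFE's refinement-stopping criterion: keep adding labelled WCET
# samples until the cross-validated precision reaches the desired level.
def refine_until_precise(target=0.999, max_refinements=100):
    dataset_size = 1000                        # hypothetical initial dataset
    precision = 0.0
    for refinement in range(1, max_refinements + 1):
        dataset_size += 100                    # 100 new WCET samples per refinement
        # Stand-in for 10-fold cross-validation: precision improves with data.
        precision = 1.0 - 10.0 / dataset_size
        if precision >= target:
            return refinement, precision
    return max_refinements, precision

refinements, precision = refine_until_precise()
print(refinements, round(precision, 5))
```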

RQ3.
As described in Section 4.2, SAFE reduces the dimensionality of the WCET space through a feature reduction technique based on random forest. The computed importance scores of each task's WCET in our dataset are as follows: 0.773 for T30, 0.093 for T33, 0.016 for T23, and ≤0.005 for the remaining 31 tasks. Based on a standard feature selection guideline [37], only the WCET values of two tasks, i.e., T30 and T33, are deemed important enough to retain, as their scores are higher than the average importance, i.e., 0.0385. Hence, SAFE computes safe WCET ranges of these two tasks in the next steps described in Algorithm 1. Figure 8 shows the inferred safe border which identifies safe WCET ranges within which all 34 tasks are schedulable with an estimated deadline miss probability of 1.97%. Given the safe border, we found a best-size point which restricts the WCET ranges of T30 and T33 as follows: T30 [0.1ms, 458.0ms] and T33 [0.1ms, 2138.1ms]. We note that the initial estimated WCET ranges of the two tasks were as follows: T30 [0.1ms, 900.0ms] and T33 [0.1ms, 20000.0ms]. SAFE therefore yielded safe WCET ranges representing a significant decrease of 49.11% and 89.31% from the initial maximum WCET estimates, respectively. This information is highly important and can be used to guide design and development.
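The feature-reduction step can be illustrated as follows (a sketch assuming scikit-learn; the dataset and the five-task system are synthetic, not ADCS): a random forest is fitted on labelled WCET points, and only tasks whose importance exceeds the mean importance are retained.

```python
# Sketch of average-importance feature selection: keep only the tasks whose
# random-forest importance score is above the mean score. Synthetic data:
# deadline misses are driven by the WCETs of tasks 0 and 1 only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(400, 5))        # WCET samples for 5 tasks
y = (X[:, 0] + X[:, 1] > 10.0).astype(int)       # 1 = deadline miss observed

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = forest.feature_importances_         # scores sum to 1
kept = [i for i, s in enumerate(importance) if s > importance.mean()]
print(kept)                                      # tasks 0 and 1 should dominate
```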
The answer to RQ3 is that SAFE helps compute safe WCET ranges that have a much lower maximum than practitioners' initial WCET estimates. Our case study showed that SAFE determined safe maximum WCET values that were only 51% or less of the original estimates. Further, these safe WCET ranges have a deadline miss probability of 1.97% based on the inferred logistic regression model. More restricted ranges can be selected to reduce this probability. SAFE took, on average, 25.14h to compute such safe WCET regions, which is acceptable for offline analysis in practice. Figure 9 shows the distributions (25%-50%-75%) of execution times obtained from 10 × 10 runs of SAFE, i.e., 10 runs for each of the 10 synthetic systems with the same experimental setting (see Section 5.4). The red solid lines in Figure 9 represent how the mean execution time of SAFE changes over varying values of the following control parameters: (a) the number of tasks, (b) the ratio of aperiodic tasks, (c) the range factor for inter-arrival times, (d) the number of WCET ranges, (e) the range factor for WCET ranges, (f) the maximum offset value, (g) the number of processing cores, and (h) the simulation time t. Note that all the experiments in EXP4 took at most 16.7h, which is acceptable as SAFE is an offline analysis technique.

RQ4.
As shown in Figures 9a, 9g, and 9h, the execution time of SAFE is linear in the number of tasks, the number of processing cores, and the simulation time t. However, regarding the ratio of aperiodic tasks (Figure 9b), the range factor for inter-arrival times (Figure 9c), the range factor for WCET ranges (Figure 9e), and the maximum offset value (Figure 9f), the results indicate no correlation with the execution time of SAFE. Therefore, we expect SAFE to scale well as the number of tasks, the number of processing cores, and the simulation time increase. Figure 9d shows that the execution time of SAFE is quadratically correlated with the number of WCET ranges in a system. Recall from Section 4 that the number of tasks characterising their WCETs as ranges (instead of point values) determines the size of a labelled dataset. Specifically, SAFE uses a second-order polynomial response surface model (RSM) to build a logistic regression model. An RSM contains linear terms, quadratic terms, and 2-way interactions between linear terms (see Section 4.2). Hence, for k WCET ranges, the number of coefficients in an RSM to be inferred by logistic regression is the sum of the numbers of constants, linear terms, quadratic terms, and 2-way interactions, i.e., 1 + k + k + k(k−1)/2 (see the RSM equation presented in Section 4.2), which impacts the execution time of logistic regression [40]. Hence, the execution time of logistic regression in SAFE is quadratically correlated with the number of WCET ranges. In addition, the results presented in Figure 9d show that the magnitude of the execution time variation increases as the number of WCET ranges in a system increases. As written in Section 4.2, SAFE employs a feature reduction technique and a stepwise regression technique to efficiently infer logistic regression models (see lines 1-4 of Algorithm 1). The outputs of the two techniques depend on the number of tasks whose WCETs significantly impact deadline misses.
Such outputs then impact the execution times of the following steps in Algorithm 1: imbalance handling, sampling, and regression. Note that the synthetic systems generated with 10 WCET ranges are more diverse in terms of WCET ranges than the systems created with a single WCET range.
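The quadratic growth of the RSM coefficient count can be checked directly; the small helper below simply evaluates the count given above for k WCET-range variables:

```python
# Coefficient count of a second-order RSM over k variables: one constant,
# k linear terms, k quadratic terms, and k*(k-1)/2 two-way interactions,
# hence quadratic growth in k.
def rsm_coefficients(k):
    return 1 + k + k + k * (k - 1) // 2

for k in (1, 2, 5, 10):
    print(k, rsm_coefficients(k))  # 3, 6, 21, 66
```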
In EXP4, SAFE analyses all tasks in a system. However, recall from Section 4 that SAFE provides the capability of selecting target tasks as engineers often need to focus on the most critical ones. In such cases, the execution time of SAFE can significantly decrease.
The answer to RQ4 is that the execution time of SAFE is linear in the number of tasks, the number of processing cores, and the simulation time. However, the execution time of SAFE is quadratically correlated with the number of tasks whose WCETs are given as ranges. Across our experiments, SAFE took at most 17.1h, which is acceptable for offline analysis in practice.
Benefits from a practitioner's perspective. Investigating practitioners' perceptions of the benefits of SAFE is a prerequisite for its adoption in practice. To this end, we draw on the qualitative reflections of three software engineers at LuxSpace, with whom we have been collaborating on this research. The reflections are based on the observations the engineers made throughout their interactions with the researchers.
SAFE produces a set of worst-case sequences of task arrivals (see Section 4.1), which engineers deemed useful for further examination by experts. The current practice is to use an analytical schedulability test [44] which proves whether or not a set of tasks is schedulable. Such an analytical technique typically does not provide additional information regarding possible deadline misses. In contrast, the worst-case task arrivals and safe WCET ranges produced by SAFE offer engineers insights into deadline miss scenarios and the conditions under which they happen.
Engineers noted that some tasks' WCETs are inherently uncertain and that such uncertainty is hard to estimate based on expertise. Hence, their initial WCET estimates were very rough and conservative. Further, estimating which WCET sub-ranges are safe is even more difficult. Since SAFE estimates safe WCET ranges systematically with a probabilistic guarantee, the engineers deem SAFE an improvement over existing practice. Also, SAFE allows engineers to choose system-specific safe WCET ranges from the (infinitely many) WCET ranges admitted by the safe border, rather than simply selecting the best-size WCET range automatically suggested by SAFE (Figure 8). This flexibility allows engineers to perform domain-specific trade-off analysis among possible WCET ranges and is useful in practice to support decision making with respect to their task design.
Since we have not yet undertaken rigorous user studies, the benefits highlighted above are suggestive rather than conclusive. We believe the positive feedback obtained from LuxSpace and our industrial case study shows that SAFE is promising and worthy of further empirical research with human subjects.

Threats to Validity
Internal validity. To ensure that our promising results cannot be attributed to the problem merely being simple, we compared SAFE with an alternative baseline using random search under identical parameter settings (see the RQ1 results in Section 5.7). Phase 1 of SAFE can indeed be replaced with a random search, as we did, or even an exhaustive technique if the targeted system is small. However, there are no alternatives for Phase 2 -our main contribution -which infers safe WCET ranges and enables trade-off analysis. We present all the underlying parameters and provide our full evaluation package [43] to facilitate reproducibility. We mitigate potential biases and errors in our experiments by drawing on an industrial case study in collaboration with engineers at LuxSpace.
Recall from Section 5.7 that we compared deadline miss probabilities computed by SAFE and by simulations. The results show that SAFE enables engineers to draw more conservative probabilistic interpretations of the estimated WCET ranges than simulation-based evaluations. However, depending on the system's characteristics, e.g., hard real-time systems, engineers may need absolute guarantees for the WCET estimates. In such cases, once engineers find safe WCET ranges using SAFE, they can, in theory, obtain an absolute guarantee using exhaustive verification techniques, e.g., UPPAAL [53], on whether tasks always meet their deadlines for the given WCET ranges. Note that we performed an experiment using UPPAAL as it has often been used in the literature [54,74,76]. We applied UPPAAL to verify whether ADCS tasks are schedulable for the given WCET values. However, our experiment showed that UPPAAL was not able to complete the analysis, even after five days of execution. Engineers should therefore consider such scalability issues when applying exhaustive analysis techniques to complement SAFE. Since this UPPAAL evaluation is not the main focus of this article, we point the reader to the UPPAAL specification of ADCS available online [43].
External validity. The main threat to external validity is that our results may not generalize to other contexts. We evaluated SAFE using early-stage WCET ranges estimated by practitioners at LuxSpace. However, SAFE can be applied at later development stages as well (1) to test the schedulability of the underlying set of tasks of a system and (2) to develop tasks under more precise constraints regarding safe WCETs. Future case studies covering the entire development process remain necessary for a more conclusive evaluation of SAFE. In addition, while motivated by ADCS (see Section 5.2) in the satellite domain, SAFE is designed to be generally applicable to other contexts. To evaluate the usefulness of SAFE in other contexts, in addition to our motivating case study system, we applied SAFE to two industrial systems from different domains, having very different system characteristics such as resource dependencies and multiple processing cores. As we described in Section 5.2, however, none of the public study subjects provide initial estimates of their task WCETs as ranges, which SAFE requires as input. Hence, we had to modify the study subjects to include WCET ranges in their task descriptions, but we attempted to minimise potential biases and errors in the experiments by converting each point WCET value to a WCET range in a systematic and straightforward way, described in detail in Section 5.2. We also made the modified task descriptions available online [43]. Nevertheless, the general usefulness of SAFE needs to be further assessed in other contexts and domains.
Bini et al. [14] propose a theoretical sensitivity analysis method for real-time systems accounting for a set of periodic tasks and their uncertain execution times. Brüggen et al. [69] present an analytical method to analyse the deadline miss probability of real-time tasks using probability density functions of approximated task execution times. In contrast to SAFE, most of these analytical approaches do not directly account for aperiodic tasks having variable arrival intervals; instead, they treat aperiodic tasks as periodic tasks using their minimum inter-arrival times as periods [24]. However, SAFE takes various task parameters, including irregular arrival times, into account without any unwarranted assumptions. Also, our simulation-based approach enables engineers to explore different scheduling policies provided by real RTOSs, whereas these analytical methods are typically only valid for a specific conceptual scheduling policy model.
Bernat et al. [12] introduce the concept of weakly hard real-time systems that can tolerate occasional deadline misses. They precisely define weakly hard deadline constraints, specifying a maximum number of deadlines that can be missed during a time window, and provide the theoretical analysis of the properties and relationships between tasks with the temporal constraints. Xu et al. [73] develop an algorithm that accounts for sporadic task overload when analysing the number of deadlines a task can miss in a given sequence of consecutive task arrivals. Pazzaglia et al. [60] present an extended weakly hard analysis method by accounting for additional uncertainties such as task offsets and release jitters. SAFE complements the above research strands on weakly hard real-time systems since, instead of analysing the number of deadlines a task can afford to miss over a time window, SAFE provides probabilistic guarantees for deadline misses based on logistic regression models inferred from search and simulation outputs.
Cimatti et al. [20] develop an approach that computes the regions of task parameter values guaranteeing that tasks are schedulable, i.e., schedulability regions. Their approach uses parametric timed automata and an SMT solver, i.e., NuSMT [19], and is applied to a system that contains two periodic tasks. Sun et al. [64] propose a method, named IMITATOR, that aims at computing schedulability regions. IMITATOR is based on model checking of parametric timed automata with stopwatches. They evaluate the method by applying it to two test-case systems that contain at most two free parameters, i.e., task execution times, defined as variables; the other parameters, e.g., task periods and deadlines, are fixed. The results show that IMITATOR covers the entire parameter space but does not scale well with the size of the problem. André et al. [4] developed a tool that translates a graphical specification of a real-time system into the input of IMITATOR, allowing the computation of some schedulability regions using IMITATOR. However, such methods, which exhaustively search the problem space, are often unable to analyse industrial systems in a scalable manner, as these systems typically contain many tasks, different task types, complex task relationships, and multiple processing cores.
Hansen et al. [34] present a measurement-based approach to estimate WCET and a probability of estimation failure. The measurement-based WCET estimation technique collects actual execution time samples and estimates WCETs using linear regression and a proposed analytical model. To our knowledge, most of the research strands regarding WCET estimation are developed for later development stages at which task implementations are available. Note that relatively few prior works aim at estimating WCET at an early design stage; however, these work strands still require access to source code, hardware, compilers, and program behaviour specifications [3,15,33]. In contrast, SAFE uses as input estimated WCET ranges and then precisely restricts the WCET ranges within which tasks are schedulable with a selected deadline miss probability, by relying on a tailored genetic algorithm, simulation, feature reduction, a dedicated sampling strategy, and logistic regression.
Testing and verification are important to successfully develop safety-critical real-time systems [1,17,27,42,53,77]. Some prior studies employ model-based testing to generate and execute tests for real-time systems [27,53,77]. SAFE complements these prior studies by providing safe WCETs as objectives to engineers implementing and testing real-time tasks. Constraint programming and model checking have been applied to ensure that a system satisfies its time constraints [1,42]. These techniques may be useful to conclusively verify whether or not a WCET value is safe. However, such exhaustive techniques are not amenable to address the analysis problem addressed in this article, which requires the inference of safe WCET ranges. To our knowledge, SAFE is the first attempt to accurately estimate safe WCET ranges to prevent deadline misses with a given level of confidence and offer ways to achieve different trade-offs among tasks' WCET values.

CONCLUSION
We developed SAFE, a two-phase approach applicable at early design stages, to precisely estimate safe WCET ranges within which real-time tasks are likely to meet their deadlines with a high level of confidence. SAFE uses a meta-heuristic search algorithm to generate worst-case sequences of task arrivals that maximise the magnitude of deadline misses, when they are possible. Based on the search results, SAFE uses a logistic regression model to infer safe WCET ranges within which tasks are highly likely to meet their deadlines, given a selected probability. SAFE is designed to be scalable through a combination of techniques: a genetic algorithm and simulation for the SAFE search (phase 1), and feature reduction, an effective sampling strategy, and polynomial logistic regression for the SAFE model refinement (phase 2). We evaluated SAFE on a mission-critical, real-time satellite system in collaboration with a satellite company, as well as on two industrial systems from different domains whose descriptions were retrieved from the literature. The results indicate that SAFE is able to precisely compute safe WCET ranges for which deadline misses are highly unlikely, these ranges being much smaller than the WCET ranges initially estimated by engineers. Further, we evaluated the scalability of SAFE using a number of synthetic systems. The results indicate that SAFE scales to complex systems. Across the experiments on industrial and synthetic systems, SAFE took at most 27h, which is acceptable in practice for an offline analysis method.
For future work, we plan to extend SAFE in the following directions: (1) developing a real-time task modelling language to describe the dependencies, constraints, and behaviours of real-time tasks and to facilitate schedulability analysis, and (2) building a decision support system that recommends a schedulable solution, e.g., priority re-assignments, if a set of tasks is not schedulable. In the long term, we would like to more conclusively validate the usefulness of SAFE by applying it to other case studies in different domains.