Hypertesting of Programs: Theoretical Foundation and Automated Test Generation

Hyperproperties are used to define correctness requirements that involve relations between multiple program executions. This makes it possible, for instance, to model security and concurrency requirements, which cannot be expressed by means of trace properties. In this paper, we propose a novel systematic approach for the automated testing of hyperproperties. Our contribution is both foundational and practical. On the foundational side, we define a hypertesting framework, which includes a novel hypercoverage adequacy criterion designed to guide the synthesis of test cases for hyperproperties. On the practical side, we instantiate this framework by implementing HyperFuzz and HyperEvo, two test generators targeting the Non-Interference security requirement that rely on fuzzing and search algorithms, respectively. Experimental results show that the proposed hypercoverage adequacy criterion correlates with the capability of a hypertest to expose hyperproperty violations, and that both HyperFuzz and HyperEvo achieve high hypercoverage and high vulnerability exposure with no false alarms (by construction). While both outperform the state-of-the-art dynamic taint analysis tool Phosphor, HyperEvo is more effective than HyperFuzz on some benchmark programs.


INTRODUCTION
In addition to functional correctness, software systems are required to respect critical non-functional properties, such as safety (e.g., a program should not threaten human life) and security (e.g., confidential information should not be leaked publicly). Most existing software Verification & Validation (V&V) approaches focus on functional correctness properties that belong to the family of so-called trace properties. These correspond to program requirements that can be checked by observing single program executions. Consider, as an example, a smartphone app and suppose we want to check whether the app crashes or not. If the app can indeed crash, then by observing its execution when inputs are smartly selected or generated, we can eventually notice the crash. There are, however, important program requirements that go beyond the family of trace properties. For instance, security requirements may need to compare more than one execution at a time in order to spot a defect. Consider a smartphone app dealing with users' confidential information that must not be leaked to public channels (e.g., unsecured connections). In order to precisely spot an information leakage, observing individual executions in isolation is not sufficient: we need to compare pairs of executions. Indeed, checking single executions may lead to false alarms, even in the case of dynamic techniques that are usually supposed to be precise. For instance, dynamic taint analysis may yield false alarms when running a single execution and propagating taint tags that reach, but do not have any effect on, public outputs. To precisely check information leaks, we have to execute the app at least twice, with different confidential information values, to assess whether these executions may generate public outputs that also differ, hence leaking information (i.e., in a secure program, when only confidential information changes, public information must remain the same).
Other requirements that cannot be expressed as trace properties include functional and safety properties of concurrent systems, where multiple parallel executions must satisfy atomicity of some operations or freedom from deadlock. Other examples comprise: security of cryptographic protocols [35]; planning in multi-agent systems [5]; robustness of robotic control mechanisms [37]; code obfuscation [25]; and certified compilation techniques [34].
Collectively, those complex requirements that involve more than one execution trace have been named hyperproperties [11]. Formally, while trace properties are defined in terms of sets of executions that satisfy a given correctness requirement, hyperproperties are defined in terms of sets of sets of program executions, with a specific hyperproperty being the set of all sets of executions (i.e., program semantics) that satisfy the associated requirement. In other words, a hyperproperty stipulates a property over the program semantics and not a property over individual program executions (as trace properties do). This added level of complexity makes it possible to specify relations between multiple, different executions of a program that are not expressible with simple trace properties. Not surprisingly, the gained expressiveness requires more complex V&V techniques.
Current state-of-the-art V&V approaches mainly deal with trace properties, so many crucial program correctness requirements are not properly considered. Some progress in this direction has been achieved by using static analysis techniques (e.g., based on abstract interpretation [2, 23]), but these solutions do not scale to large and complex software systems. Moreover, such analyses return a sound, conservative superset of the possible violations, which often includes a large proportion of false positives. Hence, the alternative of exposing hyperproperty violations by means of dynamic analysis techniques (e.g., automated test case generation) is extremely appealing [21].
In this paper, we aim at answering the following question: "How can we define a systematic framework for the dynamic verification of hyperproperties and for the automated generation of execution traces that violate them?" We first tackle the problem from a theoretical point of view, by developing a foundational theory for the systematic definition of testing strategies suitable for the verification of hyperproperties: the hypertesting framework, which includes a novel hypercoverage adequacy criterion as its cornerstone. Second, we propose two input generation approaches specifically targeting hyperproperties: one based on fuzzing, dubbed Hyper-Fuzzing, and another based on evolutionary search algorithms, dubbed Evolutionary Hypertesting. These two approaches are respectively implemented in two tools, HyperFuzz and HyperEvo, which instantiate the proposed hypertesting framework to test a specific security hyperproperty, Non-Interference (i.e., absence of leakage of confidential information). We validated the two tools on a state-of-the-art benchmark for security containing vulnerable and non-vulnerable Java programs. Results show that our approach is very accurate in detecting hyperproperty violations, outperforming the state-of-the-art taint analysis tool Phosphor [4] (which can be used to approximate Non-Interference).
Summary of Contributions. The paper contributes the following theoretical and practical results: • a theoretical framework for the systematic testing of hyperproperties, comprising a novel coverage criterion and structural search metaheuristics specific to hyperproperties; • two test input generation approaches for hyperproperties, one based on fuzzing and another based on search algorithms, together with novel crossover and mutation operators specific to hyperproperties; and • two automated tools, HyperFuzz and HyperEvo, customized to test a given security hyperproperty (Non-Interference).

Synopsis.
In Section 2 we present the proposed hypertesting framework. Then, in Section 3, we describe the hypertesting procedure, presented in two variants: fuzzing-based and search-based. The empirical evaluation of the approach is reported in Section 4, together with the collected results. In Section 5 we compare our approach with the related work. Finally, in Section 6 we draw conclusions and discuss future research directions.

HYPERTESTING FRAMEWORK
Given a set of values V and a set of variables X, a variables assignment (often called memory, store or state) is a function m ∈ M ≜ X → V ∪ {⊥} mapping variables to values (here ⊥ denotes an undefined value). In the following, we model program executions e ∈ E as finite sequences of memories, namely E ≜ ⋃_{n∈ℕ} M^n. Given a program P, we denote with inputVars(P) ⊆ X (resp. outputVars(P) ⊆ X) the set of its input (resp. output) variables. In this setting, an input (resp. output) for P is a variables assignment m that is defined for all input (resp. output) variables of P, namely such that dom(m) = inputVars(P) (resp. dom(m) = outputVars(P)), where dom(m) is the set of all variables x for which m(x) ≠ ⊥ (the domain of m). Given an execution e ∈ E of P, we denote with e^i ∈ M (resp. e^o ∈ M) the input (resp. output) of e. This means that program P produces the program execution e when running starting from the (input) memory e^i, yielding the (output) memory e^o.
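In an implementation, a memory can simply be a map from variable names to values and an execution a finite sequence of such maps; the following sketch illustrates the definitions above (the variable names and values are illustrative only):

```python
BOTTOM = None  # models the undefined value ⊥

def make_memory(assignments, variables):
    """A memory maps every variable to a value, or to ⊥ if undefined."""
    return {x: assignments.get(x, BOTTOM) for x in variables}

def domain(m):
    """dom(m): the set of variables on which m is defined."""
    return {x for x, v in m.items() if v is not BOTTOM}

# An execution is a finite sequence of memories; its input is the first
# memory and its output the last one.
variables = {"key", "log"}
e = [make_memory({"key": 0, "log": 6}, variables),
     make_memory({"key": 0, "log": 5}, variables)]
e_in, e_out = e[0], e[-1]
```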
The Control Flow Graph (CFG) of a program is an abstract representation of the program semantics (i.e., of all program executions), embedding control and data flow information of program variables. The nodes of such a graph (usually called basic blocks) represent the statements of the program, while arcs represent an execution flow between statements. For instance, in Figure 1 (on the right) we have the CFG of a simple program code snippet (on the left). As we can see from the figure, each statement of the program is labeled with a unique program point (the underlined reddish number on the left of the statement) and sequential statements (i.e., sequences of assignments that are not interleaved by conditionals) are grouped in multi-statement blocks. The program state at each program point is the one computed after the execution of the command pointed by such program point. In a CFG special blocks are added: an entry point (entry), indicating the beginning of the program (with typical label ℓ₀); and an exit point (exit), indicating the end of the program (with typical label ℓₑ). We assume, w.l.o.g., that each program has unique entry and exit points.
Hyperproperties are in general complex, usually modeled by using very expressive logical systems [10]. In this paper we focus on a particular subset of hyperproperties, the relational k-bounded ones [23], which are sufficient to express many security and concurrency requirements. Relational here means that the hyperproperty stipulates a relation between the input and the output of a program, while k-bounded means that the executions needed to refute the hyperproperty can be limited to a fixed finite number k. A k-bounded hyperproperty, k-hyperproperty for short, is hence of the form:

hp ≜ { T ∈ ℘(E) | ∀e₁, …, eₖ ∈ T : P^i(e₁^i, …, eₖ^i) ⟹ P^o(e₁^o, …, eₖ^o) }   (1)

such that P^i ⊆ M^k is a k-ary predicate on inputs and P^o ⊆ M^k is a k-ary predicate on outputs¹. Note that, being e₁, …, eₖ ∈ E, we have that, for instance, e₁ is a program execution, modeled as a sequence of program memories, where e₁^i is the input (memory) while e₁^o is the output (memory). Given a program P, the input predicate P^i and the output predicate P^o of Equation (1) are defined for P as predicates on P's variables at the entry and the exit point of P, respectively.

¹ The notation P(m₁, …, mₖ), with P ⊆ M^k, is a shorthand for (m₁, …, mₖ) ∈ P.

Snippet of code: see Figure 1 (left) for the full listing with its program points.

For instance, the classic notion of Non-Interference [12], which requires the lack of (semantic) dependency of public/non-confidential resources on confidential information, is defined as:

NI ≜ { T ∈ ℘(E) | ∀e₁, e₂ ∈ T : e₁^i =_L e₂^i ⟹ e₁^o =_L e₂^o }   (2)

where =_L says that two memories agree on the values of L (i.e., public) variables². The idea is that there exists a (possibly harmful) information flow from a (possibly confidential) variable x to a (possibly public) variable y in a program P whenever a change in x is conveyed to y by the execution of P. Equation (2) models the absence of (potentially harmful) information flows from confidential to public variables.
In Figure 1 (on the left) we have a simple snippet of Java-like code that does not satisfy Equation (2), when key is considered confidential and log is considered public. Indeed, the predicate =_L checks the value of public, i.e., L, variables (log in this case) at the beginning of the program (before the first line) and at the end of the program (after the last line). It is easy to find two executions that violate the hyperproperty. For instance, take two executions that have the following inputs: one having key equal to 0 and another having key equal to 1, with both executions having log equal to 6. If we execute the program on those inputs, we obtain the value 5 for log in one case and the value 1 in the other. Thus, Non-Interference is violated, since L-equivalent input variables are mapped to non-L-equivalent output variables.
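The two-run check on this example can be sketched as follows; the body of `program` is a hypothetical stand-in for the snippet of Figure 1, chosen only to reproduce the outputs (5 and 1) discussed above:

```python
def program(key, log):
    # Hypothetical stand-in for the snippet of Figure 1: key is the
    # confidential input, log the public one; the public output depends
    # on the confidential input, so information leaks.
    if key == 0:
        log = key + 5
    else:
        log = 1
    return log

def violates_ni(key1, key2, log_in):
    """Two-run check: same public input, different confidential inputs;
    differing public outputs witness a Non-Interference violation."""
    return program(key1, log_in) != program(key2, log_in)

print(violates_ni(0, 1, 6))  # True: log ends up 5 in one run, 1 in the other
```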
Syntactic dependencies, exploited by state-of-the-art approaches to track information flows (e.g., taint analysis), provide an approximation of Non-Interference, which models semantic dependencies. This is very often (implicitly) done by existing V&V approaches: the verification of a hyperproperty is approximated by considering a (simpler) trace property [23]. Being an approximation, it may of course lead to false positives (even in the case of dynamic approaches, which are usually supposed to be precise). Imagine removing the program points ℓ₄, ℓ₅, ℓ₆, ℓ₇ and ℓ₈ from the CFG of Figure 1, and replacing the conditional at ℓ₃ with the assignment log = 5;. The resulting program is secure since, independently from the initial value of key, the final value of log is always 5. Nevertheless, we have a syntactic dependency between key and log, potentially inducing a taint analysis to fire a (false) alarm.
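This kind of false alarm can be reproduced with a minimal (hypothetical) example, in which the confidential input occurs syntactically in the computation of the public output but has no semantic effect on it:

```python
def sanitized(key):
    # key occurs syntactically in the expression, but the public output
    # is semantically constant.
    log = key - key + 5   # always 5, whatever key is
    return log

# A taint tracker propagating tags syntactically would mark log as
# tainted by key and raise an alarm; the two-run semantic check instead
# shows that no information actually flows:
print(sanitized(0) == sanitized(1))  # True: both runs output 5
```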
To be refuted, a k-hyperproperty requires k properly chosen program executions, meaning that a testcase for a k-hyperproperty has necessarily to model k program executions. Hence, we model a test input for a k-hyperproperty as a tuple of (input) memories.

Definition 2.1 (Hypertest Input). Given a k-hyperproperty hp, with input predicate P^i, and a program P, a hypertest input for hp and P is a tuple (m₁, …, mₖ) such that: • dom(mᵢ) = inputVars(P), for all i ∈ [1..k]; and • P^i(m₁, …, mₖ) holds.

To assess whether a testcase is successful or not, we need a notion of oracle. In the case of k-hyperproperties, the oracle has to check the satisfaction of the output predicate on the executions determined by a hypertest input. Counterexamples are collected in the set Core_hp ≜ { (e₁, …, eₖ) ∈ E^k | P^i(e₁^i, …, eₖ^i) ∧ ¬P^o(e₁^o, …, eₖ^o) }, that is, the set of all counterexample executions (i.e., hyperproperty violations) for hp.
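For Non-Interference (k = 2), the oracle reduces to a low-equivalence check on the two output memories; a minimal sketch, with memories modeled as plain dictionaries:

```python
def ni_oracle(outputs, low_vars):
    """Output-predicate oracle for Non-Interference (k = 2): the two
    output memories must agree on all public (low) variables."""
    out1, out2 = outputs
    return all(out1[x] == out2[x] for x in low_vars)

# A counterexample: the public variable log differs in the two outputs.
print(ni_oracle([{"log": 5}, {"log": 1}], ["log"]))  # False: violation found
```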
We can define a coverage criterion for a given k-hyperproperty as follows. A variable definition is the program point of an assignment where the variable is on the left-hand side. We denote with defs(G, x) the set of all definitions of the variable x in the CFG G. For instance, if G_ni is the CFG of the program in Figure 1, the definitions of log include the program points ℓ₆ and ℓ₇. Given a predicate P, we denote with vars(P) the scope of P, that is, the set of variables on which P acts. Given a variable x and a program point ℓ in the CFG G, we denote with vals(x, ℓ, G) the set of all possible values x may take at ℓ in G. Given a set of program points {ℓ₁, …, ℓₙ} of a CFG G, we denote with dataSlice(G, {ℓ₁, …, ℓₙ}, d) its d-bounded data-slice [15] in G, that is, the set of all program points in G having a (transitive) data dependency of depth at most d on statements at program points ℓ₁, …, ℓₙ. A program point ℓ′ is data dependent on program point ℓ when the instruction at ℓ′ uses a variable x defined at ℓ and a path exists in the CFG between ℓ and ℓ′ containing no definition of x. A bounded data-slice transitively computes data dependencies in the backward direction, starting from the target program points, up to a given threshold d.
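A bounded backward data-slice can be sketched as a fixpoint iteration bounded by d; the sketch below simplifies the definition by approximating data dependence as "point p defines a variable that point q uses", ignoring the condition on intervening redefinitions along CFG paths:

```python
def data_slice(defs_at, uses_at, targets, d):
    """d-bounded backward data-slice over per-point def/use sets
    (a simplified sketch of dataSlice(G, targets, d))."""
    sliced = set(targets)
    frontier = set(targets)
    for _ in range(d):  # at most d transitive steps backward
        new = set()
        for q in frontier:
            for x in uses_at.get(q, set()):
                for p, defined in defs_at.items():
                    if x in defined and p != q:
                        new.add(p)  # q is data dependent on p via x
        frontier = new - sliced
        sliced |= frontier
    return sliced

# Point 3 uses b (defined at 2), point 2 uses a (defined at 1).
defs_at = {1: {"a"}, 2: {"b"}, 3: {"c"}}
uses_at = {2: {"a"}, 3: {"b"}}
```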
Definition 2.4 (Value-Insensitive Hypercoverage Goal). Given a k-hyperproperty hp, with output predicate P^o, and a program P, a value-insensitive hypercoverage goal for hp and P is a tuple (x, ℓ₁, …, ℓₖ), where ℓ₁, …, ℓₖ are not necessarily different, such that: • ℓ₁, …, ℓₖ are program points in the CFG G_P of P; • x ∈ vars(P^o); and • ℓⱼ ∈ defs(G_P, x), for all j ∈ [1..k].

Consider again the program in Figure 1. Variable log is in the scope of the Non-Interference output predicate =_L, and the program points ℓ₆ and ℓ₇ are definitions of log in G_ni. Hence, (log, ℓ₆, ℓ₇) is a value-insensitive hypercoverage goal.
Given a k-hyperproperty hp, with output predicate P^o, and a program P, we define the value-insensitive hypercoverage criterion for hp and P, written VIHCC(hp, P), as the set of all value-insensitive hypercoverage goals for hp and P. The idea behind hypercoverage is that coverage of all k-tuples of output variable definitions observed in k different executions ensures that no definition potentially leading to a violation of the output predicate is left untried. Nevertheless, covering all k-tuples of output variable definitions with all possible values is unaffordable from a computational point of view, as it represents a form of exhaustive testing (hence theoretically intractable, as computing all possible variable values at a program point is undecidable). For this reason, we rely on a value-insensitive hypercoverage criterion, providing an over-approximation of all possible violations (but effectively computable). Indeed, the value-insensitive hypercoverage criterion provides a necessary, but not sufficient, condition to expose a k-hyperproperty violation³.

Proposition 2.5. If P violates the k-hyperproperty hp, the k executions (e₁, …, eₖ) ∈ Core_hp that witness the violation cover one hypercoverage goal in VIHCC(hp, P).
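Enumerating the criterion is then a simple product over definitions; a sketch, where definitions and the output-predicate scope are given as plain data (the concrete points 6 and 7 echo the Figure 1 example):

```python
from itertools import product

def vihcc(defs, scope_vars, k):
    """Enumerate value-insensitive hypercoverage goals: for every output
    variable x in the scope of the output predicate, every k-tuple of
    (not necessarily distinct) definitions of x."""
    goals = set()
    for x in scope_vars:
        for points in product(defs.get(x, ()), repeat=k):
            goals.add((x,) + points)
    return goals

# With log defined at points 6 and 7 and k = 2, there are 2^2 = 4 goals.
goals = vihcc({"log": [6, 7]}, ["log"], 2)
```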
Among the value-insensitive hypercoverage goals, there are redundant elements, namely goals that may be ignored without affecting the possibility of counterexample generation. Dropping such elements produces a narrower search space, hence improving the testing performance. In other words, redundant goals represent execution tuples that do not belong to Core_hp, hence they do not contribute to the falsification of the hyperproperty. We can identify two sources of redundancy: infeasible goals and irrelevant goals.
Infeasible goals. They represent program point tuples that cannot be simultaneously covered by execution tuples in Core_hp: starting from executions e₁, …, eₖ such that P^i(e₁^i, …, eₖ^i), the program points in infeasible goals are not exercised in any of the possible k executions satisfying P^i. In Figure 1, infeasible program point pairs are linked by red dotted lines (e.g., the goal (log, ℓ₄, ℓ₇) is infeasible).

Definition 2.6 (Infeasible Goal). Given a k-hyperproperty hp, with input predicate P^i, and a program P, the goal (x, ℓ₁, …, ℓₖ) ∈ VIHCC(hp, P) is said to be infeasible when, for all executions e₁, …, eₖ of P such that P^i(e₁^i, …, eₖ^i) holds, we have that ℓ₁, …, ℓₖ are not all reachable in e₁, …, eₖ.
Irrelevant goals. They represent program point tuples having definitions that do not refute the hyperproperty. This can be seen as an output selection: executions e₁, …, eₖ such that P^o(e₁^o, …, eₖ^o) can be ignored, since they do not provide a counterexample for the hyperproperty. In Figure 1, irrelevant program point pairs are linked by green dashed lines (e.g., the goal (log, ℓ₂, ℓ₄) is irrelevant).

Definition 2.7 (Irrelevant Goal). Given a k-hyperproperty hp, with output predicate P^o, and a program P with CFG G_P, the goal (x, ℓ₁, …, ℓₖ) ∈ VIHCC(hp, P) is said to be irrelevant when all executions e₁, …, eₖ of P covering ℓ₁, …, ℓₖ, respectively, satisfy P^o(e₁^o, …, eₖ^o).

Efficient Hypercoverage Goals Computation. Infeasible and irrelevant goals still involve semantic aspects of a program, hence we cannot precisely compute such elements. However, we can settle for approximations of such sets that are efficiently computable.
In particular, we can approximate infeasible goals by computing a forward slice [6] of the program, using ⟨ℓ₀, vars(P^i)⟩, where ℓ₀ is the entry point of the program, as slicing criterion. This means that we cover only definitions affected by input variables or by decisions that depend on input variables.
Similarly, we can approximate irrelevant goals by performing a constant propagation [38], discarding the goals whose variable always holds constant values for the output variables. This is indeed a very coarse approximation, since we need such variables to be constant only when k paths satisfying the considered hypercoverage goal are traversed, not in the execution of k arbitrary paths. We plan to find a more clever strategy to prune irrelevant goals as future work.
Algorithm 1 computes the value-insensitive hypercoverage criterion (lines 3-9). Then, irrelevant and infeasible goals, if any, are removed. For the latter purpose, we compute a forward slice (line 10) and a constant propagation (line 11) of the program, in order to detect redundant goals (lines 12-15). This is done by checking either that the program points of a goal are not in the slice, or that the variable of a goal is constant (line 14). Redundant goals are removed from the value-insensitive hypercoverage criterion (line 16).
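The pruning step of Algorithm 1 can be sketched as follows, assuming the forward slice and the set of constant variables have already been computed by the two static analyses (all inputs here are illustrative):

```python
def prune_goals(goals, forward_slice, constants):
    """Drop redundant goals: a goal is discarded when its variable was
    found constant by constant propagation (irrelevant) or one of its
    program points lies outside the forward slice of the inputs
    (infeasible)."""
    kept = set()
    for goal in goals:
        x, points = goal[0], goal[1:]
        if x in constants:                              # irrelevant
            continue
        if not all(p in forward_slice for p in points):  # infeasible
            continue
        kept.add(goal)
    return kept

goals = {("log", 6, 7), ("log", 2, 4), ("x", 6, 6)}
pruned = prune_goals(goals, forward_slice={6, 7}, constants={"x"})
```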
Distance Metrics for Hyperproperties. To cover as many value-insensitive hypercoverage goals as possible, we define a distance metric for k-hyperproperties. The latter measures the distance between the execution traces of a candidate k-tuple of hypertests and a target, yet uncovered, hypercoverage goal. This distance is used to guide test generation, as explained in the next section.
To define such a distance metric for k-hyperproperties, we use standard structural search metaheuristics, such as approach level and branch distance [26], adapted to our setting. In particular, given a memory m and a program point ℓ, belonging to an implicitly referenced program P, we denote with: AL(ℓ, m), the minimum number of control nodes between a statement executed by P on m and the statement at ℓ (i.e., approach level); and BD(ℓ, m), the distance, computed according to any branch computation scheme [26], between the variable values involved in a condition whose truth value makes ℓ unreachable in the given execution and those achieving the opposite truth value, which would make ℓ reachable after execution of the given condition. Such branch distance is 0 if the statement at ℓ has been reached by executing P on m. Here, the considered condition is the boolean expression corresponding to the closest (w.r.t. the statement at ℓ) control node in a path not leading to the statement at ℓ. The single-run distance of the statement at ℓ for the memory m combines the two components: SRD(ℓ, m) ≜ AL(ℓ, m) + norm(BD(ℓ, m)), where norm is any normalization of the branch distance in [0, 1].

Approach level and branch distance are the basic components of the multi-run distance needed for k-hyperproperties. Indeed, to satisfy a hypercoverage goal we have to cover k program points in the k executions associated with a hypertest, which consists of k inputs. Hence, we need a distance metric considering k execution paths, not just one. In Figure 2, we consider the case of k = 2. In the picture, we aim at covering two program points, ℓ₁ and ℓ₂, by performing two program executions, one starting from m₁ and another starting from m₂. As we can see, the goal (ℓ₁, ℓ₂) is not covered, since we do not encounter ℓ₁ and ℓ₂ along the paths starting from m₁ and m₂. To measure how far we are from the goal, we compute the single-run distances of each execution from each program point and we take the minimum of their sums. Of course, one execution can cover only one program point, hence the only interesting combinations are those with different control points and different input memories, namely SRD(ℓ₁, m₁) paired with SRD(ℓ₂, m₂), and SRD(ℓ₁, m₂) paired with SRD(ℓ₂, m₁).
In the example, we considered the case of k = 2 for the sake of simplicity, but the definition can be given for an arbitrary k > 1.
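For arbitrary k, the multi-run distance is the minimum, over all assignments of input memories to program points, of the sum of single-run distances; a sketch, where the single-run distance function and its values are supplied by the caller (the table below is hypothetical):

```python
from itertools import permutations

def multi_run_distance(srd, points, memories):
    """Multi-run distance for a goal over k program points and k input
    memories: minimum over all memory-to-point assignments of the sum of
    single-run distances (srd is a user-supplied SRD function)."""
    return min(sum(srd(p, m) for p, m in zip(points, perm))
               for perm in permutations(memories))

# Hypothetical single-run distances for two points and two memories.
table = {("l1", "m1"): 2.0, ("l2", "m2"): 0.0,
         ("l1", "m2"): 1.0, ("l2", "m1"): 3.0}
d = multi_run_distance(lambda p, m: table[(p, m)], ["l1", "l2"], ["m1", "m2"])
# min(2.0 + 0.0, 1.0 + 3.0) = 2.0
```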
In Definition 2.8 we insert a penalty component Pₓ(v₁, …, vₖ) to give more importance to execution tuples that yield values for x falsifying the output predicate P^o. The penalty is defined as Pₓ(v₁, …, vₖ) ≜ ϵ · δₓ(v₁, …, vₖ), where ϵ is an arbitrary small constant, while δₓ ∈ V^k → {0, 1} is a function defined as: δₓ(v₁, …, vₖ) ≜ 1 if mⱼ(x) = vⱼ, for all j ∈ [1..k], for some m₁, …, mₖ such that P^o(m₁, …, mₖ), and 0 otherwise.

For instance, when k = 2 and P^o is the equality relation for the variable x, the penalty becomes Pₓ(v₁, v₂) = ϵ if v₁ = v₂, and 0 otherwise: tuples whose values for x already differ, and hence may falsify the output predicate, receive no penalty.

HYPERTEST INPUT GENERATION
To effectively test a k-hyperproperty, by instantiating the framework proposed in the previous section, we have to: (i) generate hypertest inputs that satisfy the hyperproperty input predicate; (ii) run the k executions for each hypertest input; and (iii) check the satisfaction of the hyperproperty output predicate. If check (iii) fails, we have a hyperproperty violation, and the corresponding hypertest input provides the counterexample.
To assess how much we have tested a program, we can exploit the value-insensitive hypercoverage criterion. Indeed, if we cover a large portion of hypercoverage goals but no violation is found, we may say with confidence that the program is likely to satisfy the hyperproperty. This hypertesting procedure is summarized in Algorithm 2, where HyperTester is a suitable strategy to craft hypertest inputs. We provide two such strategies in the following.
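Steps (i)-(iii) above can be sketched as a generic loop; the function and parameter names below are illustrative, not the actual interface of Algorithm 2:

```python
def hypertest(program, gen_inputs, in_pred, out_pred, budget=100):
    """Generic hypertesting loop (in the spirit of Algorithm 2): generate
    hypertest inputs satisfying the input predicate, run the k executions,
    and check the output predicate; a failed check is a counterexample."""
    for _ in range(budget):
        ms = gen_inputs()
        if not in_pred(ms):
            continue  # discard inputs violating the input predicate
        outs = [program(m) for m in ms]
        if not out_pred(outs):
            return ms  # hyperproperty violation exposed
    return None  # no violation found within the budget

# A leaky toy program (memories as dicts) and the NI predicates for k = 2.
def prog(m):
    return {"log": m["key"] + 5 if m["key"] == 0 else 1}

candidates = iter([[{"key": 0, "log": 6}, {"key": 1, "log": 6}]])
witness = hypertest(
    prog, lambda: next(candidates),
    in_pred=lambda ms: ms[0]["log"] == ms[1]["log"] and ms[0]["key"] != ms[1]["key"],
    out_pred=lambda outs: outs[0]["log"] == outs[1]["log"],
    budget=1)
```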

Fuzzing-based Hypertesting
A simple way to generate hypertest inputs is based on fuzzing, where values for the input k-tuples of memories are randomly generated. We expect this strategy to be sufficient to test simple programs, but it may exhibit low detection performance when program complexity increases and, hence, hypercoverage goals are harder to cover. We call this approach Hyper-Fuzzing, described in Algorithm 3. The procedure InputsWithConstraints generates a random initial set of inputs, consisting of k memories that satisfy the input predicate P^i. Given one (e.g., randomly generated) input memory m₁, the remaining k − 1 ones needed to create a hypertest input can be obtained by running an SMT solver that solves P^i(m₁, …, mₖ), with m₂, …, mₖ free variables.
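For Non-Interference, the input predicate is simple enough that no solver is needed: the memories must agree on public variables and differ on at least one confidential one. A sketch (variable sets and value ranges are illustrative):

```python
import random

def ni_inputs(high_vars, low_vars, k=2):
    """Random hypertest-input generation for Non-Interference: all k
    memories share the same public (low) values, while confidential (high)
    values are re-drawn until at least one differs. For arbitrary input
    predicates, the paper instead resorts to an SMT solver."""
    low = {x: random.randint(-100, 100) for x in low_vars}
    while True:
        highs = [{x: random.randint(-100, 100) for x in high_vars}
                 for _ in range(k)]
        if any(h != highs[0] for h in highs):
            break
    return [dict(low, **h) for h in highs]

m1, m2 = ni_inputs(["key"], ["log"])
```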
The procedure RunInputs executes the generated hypertest inputs. It updates the covered hypercoverage goals and it adds the corresponding covering hypertest inputs to the current archive of successful inputs (which will be part of the final hypertest suite).

Hypertesting as an Optimization Problem
As there is no analytical solution to the problem of finding the hypertest inputs that satisfy all hypercoverage goals, we may resort to meta-heuristic search-based algorithms [26], by restating hypertest input generation as an optimization problem. We first consider its single-objective and then its multi-objective formulation.
The fitness function in Definition 3.1 considers all goals at the same time, aggregating all corresponding distance metrics, yielding a single-objective minimization task. Following [32], we rewrite Definition 3.1 as a many-objective optimization problem.

Definition 3.2 (Many-Objective Optimization). Given a set G of hypercoverage goals, find a set of non-dominated hypertest inputs t that minimize the fitness vector ⟨f_g(t)⟩_{g∈G}.

In many-objective optimization, candidate solutions are evaluated in terms of Pareto dominance [13], which we can restate in the context of hypertesting as follows.

Definition 3.3 (Dominance).
A hypertest input t dominates another hypertest input t̃, w.r.t. the fitness vector ⟨f_g⟩_{g∈G}, if and only if both the following hold: • f_g(t) ≤ f_g(t̃), for all g ∈ G; and • f_g(t) < f_g(t̃), for at least one g ∈ G.

We write t ≺_G t̃ to indicate that t dominates t̃, when the set G of hypercoverage goals is considered. Among all possible hypertest inputs, the (Pareto) optimal ones are those non-dominated by any other possible hypertest input.

Definition 3.4 (Preference). Given a hypercoverage goal g ∈ G, with g = (x, ℓ₁, …, ℓₖ), a hypertest input t = (m₁, …, mₖ) is preferred over another hypertest input t̃ = (m̃₁, …, m̃ₖ), w.r.t. the fitness vector ⟨f_g⟩_{g∈G}, if and only if one of the following holds: • f_g(t) < f_g(t̃); or • f_g(t) = f_g(t̃) and the minimum single-run distance from the program points of g achieved by t is lower than the one achieved by t̃.

We write t ⋖_g t̃ to indicate that t is preferred over t̃, when the hypercoverage goal g is considered.
The preference criterion states that one hypertest input is preferred for a goal if it has lower multi-run distance than the other. When two hypertest inputs have the same multi-run distance, we prefer the one having the minimum single-run distance. The rationale is that such a hypertest input is closer to partially covering a hypercoverage goal (i.e., to covering one of the program points in the goal). Among all inputs, the best hypertest input for a hypercoverage goal is the one preferred over all others for that target. The preference criterion is used for selecting the best non-dominated testcases. Note that we do not select the best hypertest input covering a goal as the one with least complexity. The structural complexity of hypertest inputs is always the same, as we do not generate, for instance, sequences of method invocations, just k-tuples of input memories.
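The dominance relation of Definition 3.3 is straightforward to implement over a fitness vector; a sketch with a hypothetical fitness table over two goals:

```python
def dominates(t1, t2, fitness, goals):
    """Pareto dominance over the fitness vector: t1 is no worse on every
    goal and strictly better on at least one."""
    no_worse = all(fitness(g, t1) <= fitness(g, t2) for g in goals)
    better = any(fitness(g, t1) < fitness(g, t2) for g in goals)
    return no_worse and better

# Hypothetical multi-run distances of two hypertest inputs on two goals.
fit = {("g1", "ta"): 1.0, ("g2", "ta"): 0.5,
       ("g1", "tb"): 2.0, ("g2", "tb"): 0.5}
f = lambda g, t: fit[(g, t)]
```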

Evolutionary-based Hypertesting
Solving the optimization problem presented in Definition 3.2 results in a clever inspection of the program under test, in order to find potential hyperproperty violations. The more goals we cover, the more counterexamples for the hyperproperty (if any) we expect to find. It has been empirically shown [32] that multi-objective optimization outperforms the single-objective one for test generation. Hence, to solve the problem in Definition 3.2 we adopt the state-of-the-art many-objective search-based algorithm MOSA [32], with some modifications introduced to adapt it to the hyperproperty setting. These modifications yield what we call Evolutionary Hypertesting, described in Algorithm 4.
The algorithm follows the general pattern introduced by MOSA, and the components at lines 6 (preference sorting), 10 (crowding distance assignment) and 13 (final sorting) can be easily derived from the dominance (Definition 3.3) and preference (Definition 3.4) relations introduced in Subsection 3.2. The most important modifications w.r.t. standard MOSA are the red-highlighted procedures of Algorithm 4. InputsWithConstraints and RunInputs are the same procedures described in Subsection 3.1, while GenerateOffspring is in charge of generating new individuals (i.e., hypertest inputs), hopefully better ones, to be added to the current population.
In GenerateOffspring, we first select two groups of optimal hypertest inputs from the population (selection phase), by using the dominance (Definition 3.3) and preference (Definition 3.4) relations. Then, we apply crossover and mutation operators specifically designed for hyperproperties.
In the crossover phase, pairs of individuals, taken from the two selected groups, are recombined by the Pair-wise Memory Crossover, which exchanges the values of a randomly chosen variable between all memories in the two hypertest inputs.

Definition 3.5 (Pair-wise Memory Crossover). The Pair-wise Memory Crossover operator C for the hypertest input pair (t, t̃), with t = (m₁, …, mₖ) and t̃ = (m̃₁, …, m̃ₖ), is C(t, t̃) ≜ swap(t, x, t̃), for a variable x ∈ vars(P) randomly selected.

Here, the swap of x between the hypertest inputs t = (m₁, …, mₖ) and t̃ = (m̃₁, …, m̃ₖ) is defined as swap(t, x, t̃) ≜ (t′, t̃′), where each m′ⱼ agrees with mⱼ except that m′ⱼ(x) = m̃ⱼ(x), and each m̃′ⱼ agrees with m̃ⱼ except that m̃′ⱼ(x) = mⱼ(x), for all j ∈ [1..k]. Once the selected individuals have been scrambled, we perform the mutation phase, by applying a random value mutation with probability α. This perturbation implements the Single Memory Mutation, which randomly selects a variable and assigns it a new value, randomly chosen from the variable type.
Finally, since hypertest inputs must satisfy the input predicate P^i, the resulting individuals violating P^i are discarded. An alternative to discarding the individuals that violate P^i (not yet implemented in our tool) could be to repair them, e.g., by applying an SMT solver to P^i after replacing some concrete values with free variables.

Implementation
We have developed two tools that implement the proposed hypertesting procedure (Algorithm 2) for Java, HyperFuzz and HyperEvo, considering one specific hyperproperty, Non-Interference [12]. In particular, HyperFuzz adopts the Hyper-Fuzzing approach of Subsection 3.1, while HyperEvo adopts the Evolutionary Hypertesting approach of Subsection 3.3. Both tools can execute a Java program under two execution scenarios, whose inputs satisfy the input predicate for Non-Interference, i.e., in the two executions all public input variables have the same values, while confidential input variables differ on at least one value. Both can also check the output predicate for Non-Interference, i.e., whether any public output variable has a different value in the two executions, which indicates some information leakage from confidential to public variables. The difference between HyperFuzz and HyperEvo is that the latter uses the hypercoverage-based fitness function described in Subsection 3.2 as guidance, while the former has no guidance (i.e., it generates random hypertest inputs that satisfy the input predicate).

Discussion
A potential weakness of our approach is that the proposed coverage criterion is a necessary but not a sufficient condition to reveal hyperproperty violations; hence, our tools may yield false negatives. This is somewhat expected, given the dynamic nature of our approach. Nevertheless, the empirical evaluation conducted in Section 4 indicates that the approach is indeed effective in spotting hyperproperty violations (at least for Non-Interference).
In addition, our framework targets k-hyperproperties and may not generalize to other hyperproperties. We started with this subset of hyperproperties since Non-Interference, the prominent hyperproperty example, belongs to it (as do other important requirements, such as data races). We plan to extend our framework to other kinds of hyperproperties in future work.

EMPIRICAL EVALUATION
To empirically validate the value-insensitive hypercoverage criterion and the proposed hypertesting approach, we considered a specific hyperproperty, Non-Interference [12] (in short, there should be no information flow from confidential to public variables), and we instantiated our framework to test it. We chose Non-Interference among other possibilities because of its relevance for software privacy and security. In our empirical study, we address the following three research questions. With these research questions we gradually validate the hypotheses behind the proposed hypertesting approach: first, we check whether hypercoverage correlates with the exposure of hyperproperty violations (RQ 1), by adopting a standard correlation metric (Point-biserial); then, whether the proposed hypertest input generators can achieve high coverage (RQ 2), by measuring the number of hypercoverage goals covered; and, finally, whether the hypertest inputs generated under the guidance of hypercoverage can effectively expose hyperproperty violations (RQ 3), by computing standard information retrieval metrics (recall and accuracy). Since there are no dynamic verification approaches specifically targeting Non-Interference that provide a tool (see Section 5 for a qualitative comparison with the related work), we compare our approach with dynamic taint analysis, the most similar dynamic technique able to track information flows (and, hence, to serve as a baseline for our tools). For this comparison, we chose the state-of-the-art tool Phosphor [4]. Technically, the latter is not purely dynamic, since it leverages static program analysis (e.g., to track control-flow relationships, as described by Hough and Bell [19]).

Experiment Setting
Program Datasets. In our empirical evaluation, we used the Java classes provided by IFSpec [17], a collection of Java applications that are by design vulnerable or non-vulnerable to Non-Interference. In IFSpec, variables are already tagged with security levels, either public or confidential, by means of RIFL [3] specifications. IFSpec is intended as a benchmark to stress the capabilities of static analyzers that target Non-Interference vulnerabilities. For this reason, the programs in IFSpec use a large portion of the syntactic structures provided by Java. Since our implementation does not yet support some of them (e.g., exceptions), we selected the samples in IFSpec that can be managed by our tool instrumentation, resulting in 34 vulnerable and non-vulnerable Java classes (FullDataset). To answer the first research question only the vulnerable programs are needed, amounting to 14 samples (UnsecureOnlyDataset).
Experimental Procedure. To answer the previously mentioned research questions, we adopted the following methodology.
(RQ 1). We developed a tool that randomly generates a pool of POOL_SIZE = 1000 hypertest inputs for each program of UnsecureOnlyDataset. Then, the tool randomly assigns the elements of this pool to groups of size SAMPLING = 100. For each group, the tool considers its elements in random order, measuring the incremental hypercoverage reached and recording, for each coverage level, whether a Non-Interference violation was exposed or not. These hypercoverage level and violation flag pairs have finally been used to compute the Point-biserial correlation.
(RQ 2). Both HyperFuzz and HyperEvo take as input the source code of the Java class under test and a configuration file containing the security tags for class and method variables (which we manually retrieved from the RIFL specifications present in IFSpec). Then, the tools instrument and compile the input class on-the-fly and perform the hypertesting session. For each program of FullDataset, we ran HyperFuzz and HyperEvo with the same testing budget of MAX_CALLS = 2000 invocations of the method under test. After completion, we collected the level of hypercoverage reached by both tools. Due to the non-deterministic components of hypertest input generation, each tool has been run 5 times on each program, and we report the average hypercoverage.
(RQ 3). Phosphor requires a manual modification of the program source code, in order to insert the information needed to perform instrumentation. In particular, sources (confidential variables in our case) and sinks (public variables in our case) must be wrapped into specific calls to Phosphor's APIs (again, the variables' security levels have been manually retrieved from the RIFL specifications present in IFSpec). For each program of FullDataset, we ran HyperFuzz, HyperEvo and Phosphor (the first two with a testing budget of MAX_CALLS = 2000 invocations of the method under test, while the third does not require any analysis budget to be set). After completion, we retrieved the testing/analysis results for all tools. HyperFuzz and HyperEvo directly output whether Non-Interference violations have been found or not, while Phosphor outputs the possible taint tag of each sink. For Phosphor, we counted a Non-Interference violation when a public variable is tainted by a label corresponding to a confidential variable (indicating a dependence of the former on the latter). Due to the non-deterministic components present in all approaches, each tool has been run 5 times on each program.
Collected Metrics. To answer RQ 1 we compute the correlation between the number of value-insensitive hypercoverage goals covered and the detection of a Non-Interference vulnerability. Since in our dataset each sample contains only one vulnerability, the outcome of the testing is binary (violation found or not). Hence, we apply the Point-biserial correlation, a standard correlation coefficient (denoted as R) to be used when one variable is dichotomous.
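Point-biserial correlation is simply the Pearson correlation computed between a dichotomous (0/1) variable and a continuous one; a minimal sketch of its computation over coverage levels and violation flags (our own illustration, not the evaluation harness):

```java
// Point-biserial correlation R between a continuous variable (hypercoverage
// level) and a dichotomous one (violation exposed or not), computed as the
// Pearson correlation after encoding the boolean as 0/1.
public class PointBiserial {
    static double correlate(double[] coverage, boolean[] violated) {
        int n = coverage.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) {
            mx += coverage[i];
            my += violated[i] ? 1 : 0;
        }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = coverage[i] - mx;
            double dy = (violated[i] ? 1 : 0) - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy); // in [-1, 1]
    }
}
```

When violations are exposed exactly at the higher coverage levels, R is strongly positive, which is the pattern RQ 1 tests for.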
To answer RQ 2 we compute the coverage level reached by our hypertesting approach, for both the HyperFuzz and the HyperEvo implementations.
To answer RQ 3 we compare the Non-Interference violations (unsafe programs) found during the hypertesting sessions with the ground truth provided by IFSpec. In particular, we adopt the following standard information retrieval metrics.
True Positives (TP): the number of unsafe programs that are correctly detected as unsafe. False Positives (FP): the number of programs reported as unsafe that are actually safe (i.e., false alarms). False Negatives (FN): the number of unreported unsafe programs (i.e., missed violations). True Negatives (TN): the number of safe programs that are correctly detected as safe. While, by construction, HyperFuzz and HyperEvo report only hypertests that provably produce a Non-Interference violation, Phosphor might instead report a public variable as incorrectly tainted. This may happen for two reasons: because Phosphor approximates a hyperproperty (Non-Interference) with a trace property (taint propagation); or because of dynamic over-tainting, i.e., because it conservatively propagates the taint tag when the information flow is unknown (e.g., when information flows into native code or into black-box library components that cannot be instrumented). Hence, only Phosphor could potentially report false positives.
Since our technique has FP = 0 by construction, to evaluate its accuracy the most interesting metrics are TP and FN, which can be combined into the True Positive Rate (TPR) (aka recall). We also compute a single accuracy metric, Accuracy (ACC), that aggregates all correct predictions over all performed predictions: TPR ≜ TP / (TP + FN) and ACC ≜ (TP + TN) / (TP + TN + FP + FN). False positives FP are measured for Phosphor only, as they are by construction zero for HyperFuzz and HyperEvo. In our empirical evaluation, we consider a false (resp. true) negative for HyperFuzz and HyperEvo when they output LIKELY_SAFE or GIVE_UP for an unsecure (resp. secure) program.
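As a sanity check, the rates reported later can be reproduced from the benchmark composition (14 unsecure and 20 secure programs); the TP/FN/FP counts used below are our own reconstruction from the reported percentages, not figures taken from the paper's tables.

```java
// TPR and ACC exactly as defined above. The concrete counts in the usage
// below are reconstructed from the reported rates on the 34-program
// benchmark (14 unsecure, 20 secure) and are therefore an assumption.
public class Metrics {
    static double tpr(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    static double acc(int tp, int tn, int fp, int fn) {
        return (tp + tn) / (double) (tp + tn + fp + fn);
    }
}
```

For example, 12 detected out of 14 unsecure programs with no false alarms yields TPR ≈ 86% and ACC ≈ 94%, matching the figures reported for HyperFuzz in Section 4.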

Experimental Results
Table 1 shows the Point-biserial correlation R between the level of hypercoverage achieved by randomly generated hypertest inputs and the corresponding boolean variable stating whether a Non-Interference vulnerability was exposed or not by the hypertest. The p-values indicate that all correlations are significantly different from 0. Indeed, most of them indicate strong correlation, with a value greater than 0.7 (highlighted in green in Table 1). Based on these results, we can formulate the following answer to RQ 1.
RQ 1 (Correlation) The value-insensitive hypercoverage criterion helps in discovering Non-Interference violations, since there is an overall positive and significant correlation between the number of value-insensitive hypercoverage goals covered and the likelihood of detecting a Non-Interference violation.
The achieved level of coverage is generally high for both HyperFuzz and HyperEvo, which indicates that on this benchmark both proposed generation strategies (fuzzing and search-based) are generally effective. However, there is also evidence that the search-based strategy can be more effective than fuzzing: on Aliasing-ControlFlow-u and Arrays-ImplicitLeak-u, HyperEvo achieves 100% coverage, while HyperFuzz achieves 67% and 62% coverage, respectively. We manually investigated the reasons for this difference and found that the hypercoverage goals missed by fuzzing require a smart selection of the confidential input values, because a specific path, traversing code guarded by a non-trivial conditional, has to be taken to reach them. There are also two instances, ScenarioPassword-s and ScenarioPassword-u, in which full coverage is reached neither by HyperEvo nor by HyperFuzz. We manually investigated the hypercoverage goals missed by both implementations of our approach and found that they are infeasible goals (Definition 2.6), i.e., associated with paths that cannot be covered by any pair of inputs satisfying the input predicate. Based on these results, we can formulate the following answer to RQ 2.
RQ 2 (Coverage) The proposed hypertesting approach is very effective in covering value-insensitive hypercoverage goals, since it obtains full coverage in the majority of the considered case studies and, overall, the coverage reached is never below 43%.
Table 2 (Violations columns) shows the ground truth classification of each Java program, which can be secure (✓) or unsecure (✗). The outcome of each tool being compared is reported in the following columns. In particular, column Phosphor reports unsecure (✗) when the taint tag associated with confidential variables propagates to a public variable in a program execution (and secure (✓) otherwise), while columns HyperFuzz and HyperEvo report unsecure (✗) when an automatically generated hypertest input provides a counterexample violating Non-Interference, i.e., when the tool generated a pair of executions differing only in the value of some confidential input variables that eventually affect the value of some public output variables, which differ between the two executions (and secure (✓) otherwise).
We can notice that in many cases the three tools agree with the ground truth: they all find the vulnerability, if present, or report no alarm if the code is secure. Disagreements with the ground truth are indicated with a red background. There are two instances in which both Phosphor and HyperFuzz miss the vulnerability, while HyperEvo can detect it: Aliasing-ControlFlow-u and Arrays-ImplicitLeak-u. Not surprisingly, these are the same two cases where HyperEvo achieved higher hypercoverage than HyperFuzz, which shows the usefulness of hypercoverage as an adequacy criterion for hyperproperty testing: by achieving 100% hypercoverage, HyperEvo can also expose the vulnerabilities in these two Java programs, while HyperFuzz misses some hypercoverage goals and correspondingly also misses the vulnerability present in these two programs. Interestingly, Phosphor misses these two vulnerabilities as well, even though it does not rely on hypercoverage: we manually found that the taint propagation path that would lead Phosphor to expose the vulnerability is also involved in the hypercoverage goals missed by HyperFuzz, confirming that a search-based strategy might be needed to look for inputs that exercise specific, vulnerable paths. In three more cases Phosphor missed the vulnerability, while HyperFuzz and HyperEvo were able to detect it: BooleanOperations-u, HighCond.Incr.Leak-u and ScenarioPassword-u. By manually investigating these cases, we found that taint tags should have been propagated along a path that involves an implicit information flow (e.g., inside a loop). Implicit flows are hard to catch with (non-hyper) dynamic techniques, which use syntactic dependencies to approximate Non-Interference. Overall, HyperEvo achieves 100% TPR and ACC, HyperFuzz 86% TPR (94% ACC) and Phosphor 64% TPR (82% ACC). By construction, all vulnerabilities reported by HyperFuzz and HyperEvo cannot be false alarms, as the executions exposing the vulnerability are explicitly run and checked to be true positives during the test generation process. On the contrary, dynamic taint analysis is potentially subject to false alarms, in case of over-tainting. Indeed, Phosphor erroneously detects the program LostInCast-s as vulnerable. By manually investigating this case, we found that the information flow from a confidential variable to a public one is nullified, at some point during program execution, by a cast operation (that drops the four most significant bytes of the confidential variable). Such alarms are quite hard to rule out without comparing two executions of the program and, indeed, Phosphor conservatively marks this syntactic dependency as a (potential) violation. Based on these results, we can provide the following answer to RQ 3.
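Both failure modes of taint analysis can be illustrated with minimal sketches in the spirit of the IFSpec samples (not the exact benchmark code): an implicit flow, which leaks through control dependence only, and an over-tainting case, where a narrowing cast discards the tainted bytes so that the syntactic dependency carries no actual information.

```java
// Two illustrative sketches (our own, modeled after the IFSpec cases).
public class TaintSketches {
    // Implicit flow: the secret is never copied into pub by a tainted
    // assignment, yet it fully determines the public output. Taint
    // propagation over data dependencies alone misses this leak.
    static boolean implicitLeak(boolean secret) {
        boolean pub = false;
        if (secret) {       // control dependence only
            pub = true;
        }
        return pub;
    }

    // Over-tainting: the secret occupies only the upper 4 bytes of the
    // widened value, and the narrowing cast to int discards exactly those
    // bytes. The public result is always 0, so no information flows, but a
    // conservative taint analysis still flags the sink as tainted.
    static int lostInCast(int secret) {
        long widened = ((long) secret) << 32; // secret in the upper 4 bytes
        return (int) widened;                 // cast keeps only the lower 4 bytes
    }
}
```

A hypertest exposes the first case (two runs differing only on `secret` produce different public outputs) and, by the same comparison, proves the second case harmless.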
RQ 3 (Effectiveness) The proposed hypertesting approach is very accurate in detecting Non-Interference vulnerabilities, outperforming state-of-the-art dynamic taint analysis. In particular, HyperFuzz and HyperEvo reach an accuracy of 94% and 100%, respectively, while Phosphor reaches an accuracy of only 82%.

Threats to Validity
Internal Validity. Internal validity threats are due to the metrics chosen to answer the research questions. We adopted standard metrics from statistics (correlation), structural testing (coverage) and information retrieval (true positive rate, accuracy) that are directly related to the respective research questions. However, different metrics might provide different insights and viewpoints.
External Validity. External validity threats concern the generalizability of our findings beyond the considered benchmark. We do not claim any form of general validity of our results beyond the benchmark, and we believe that future replications and extensions of the empirical study are needed to corroborate our findings. We chose the subset of the standard benchmark IFSpec to which our tools could be applied, resulting in 34 Java programs.

RELATED WORK
Among all works in V&V, only some static approaches systematically deal with hyperproperties, in particular using abstract interpretation [2,23] or model-checking [16,20]. The latter define a hyperlogic, i.e., a temporal logic quantifying over sets of executions. Unfortunately, only a small fragment of this logic is decidable, hence statically verifiable. The drawback of static approaches to hyperproperty verification is their imprecision: hyperproperties are often quite complex, hence resulting in a very coarse analysis. Indeed, a dynamic approach would potentially be more effective but, to the best of our knowledge, there are only a few dynamic methods designed to verify hyperproperties [18, 27-31].
Muduli et al. [29] use fuzzing to generate test cases for generic hyperproperties in the context of Systems-on-Chip (SoC). Their approach randomly generates pairs of inputs and checks pairs of executions, but test generation is unguided (there is no target hypercoverage adequacy criterion, as in our approach) and the proposed technique is designed for a very narrow application context (SoC), which makes it difficult to compare with our approach.
Fuzzing-based approaches, such as DifFuzz [30], ct-fuzz [18] and QFuzz [31], test programs against side-channel leaks, which are hyperproperties (e.g., timing guarantees). Nevertheless, Non-Interference violations and side-channel leaks are in general not comparable. In such fuzzers, test generation, which exploits either multi-executions [7,30,31] or self-composition [18], is essentially random, and not guided by the hyperproperty to test as in our approach. Since test generation in these works is random, hence similar to that implemented in HyperFuzz, we believe that our empirical results already support an indirect comparison with these approaches.
Concerning information flows, the closest work is HyperGI [28], a technique that uses multiple program executions to measure information leaks and to repair them by means of genetic improvement. By resorting to entropy-based measures, HyperGI checks Quantitative Non-Interference [9,36]. Nevertheless, its focus is on program repair rather than test input generation (our paper's focus). Indeed, in HyperGI input generation is based on a binary search that iteratively halves the input space and selects some public inputs from each half. Then, confidential inputs are altered, in order to spot changes in the output. Since the approaches share several similarities, we would have compared HyperGI with HyperFuzz and HyperEvo, but, unfortunately, the tool is not available. HyperGI's follow-up work [27] proposes LeakReducer, which improves the repair phase of HyperGI by adopting a multi-objective approach, while keeping the test input generation phase unchanged. Again, we would have compared LeakReducer with HyperFuzz and HyperEvo, but the tool is not available. Finally, some other V&V works verify Non-Interference by using abstract interpretation [24] or hybrid monitors [22].
Metamorphic Testing. As in our approach, Metamorphic Testing (MT) [8] also exploits multiple program executions. In MT, some necessary properties of the program are identified, taking the form of metamorphic relations (MRs) among multiple inputs and their expected outputs. Such relations are used to transform existing (source) test cases into new (follow-up) ones, which by construction satisfy the input part of the MR. A bug is found when source and follow-up test cases satisfy the input but not the output part of an MR. Indeed, MT was proposed as a method to alleviate the oracle problem when testing programs whose expected behaviour is difficult or impossible to anticipate (e.g., machine learning techniques). Even when a thorough oracle cannot be defined, if the actual outputs of source and follow-up tests violate a certain MR, we can say that the program under test is faulty w.r.t. the program property associated with that relation.
In this respect, a metamorphic relation can be seen as a particular k-hyperproperty. However, MT does not provide any guidance on how to verify/refute such a relation, in contrast to our framework, which derives a hypercoverage adequacy criterion from the hyperproperty in order to craft specific inputs that may refute it. We believe that our framework may help improve MT approaches, by providing them with a guiding adequacy criterion for MR violation. Our framework represents a step toward the systematic testing of k-hyperproperties, which include metamorphic relations.
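As a concrete illustration of this correspondence, consider the classic MR sin(x) = sin(π − x): read as a 2-hyperproperty, its input predicate relates the inputs of the two runs (x and π − x) and its output predicate requires (approximately) equal outputs. The sketch below is our own illustration, not taken from the MT literature.

```java
// A metamorphic relation viewed as a 2-hyperproperty: the source execution
// runs Math.sin on x, the follow-up execution on PI - x, and the output
// part of the MR requires the two results to (approximately) coincide.
public class MetamorphicSketch {
    static boolean relationHolds(double x) {
        double source = Math.sin(x);              // source execution
        double followUp = Math.sin(Math.PI - x);  // follow-up execution
        return Math.abs(source - followUp) < 1e-9; // output part of the MR
    }
}
```

A hypertest input generator guided by hypercoverage could then search for pairs (x, π − x) exercising paths on which the relation fails, instead of picking source tests blindly.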
Mutation Testing. Mutation testing [1] can be formulated as a hyperproperty problem, where the multi-executions are given by the mutated and the non-mutated versions of the program and the predicate to check is equality. Indeed, Fellner et al. [14] exploit this correspondence to reuse hyperproperty formal verification machinery (i.e., model-checking) to perform mutation testing. So, differently from us, their goal is not to check a hyperproperty but, rather, to improve mutation testing. As mutation testing can be encoded into a hyperproperty, we may use our approach to craft inputs suitable for improving mutation testing as well.

CONCLUSION
We have proposed a novel framework for hyperproperty testing, consisting of an adequacy criterion, a structural search metaheuristic and a test generation approach. The adequacy criterion, called hypercoverage, was designed to force the exploration of the different variable value assignments possibly involved in a hyperproperty violation. The test generation approach has two instances, HyperFuzz and HyperEvo, based on fuzzing and search algorithms, respectively. The latter takes advantage of the proposed distance metaheuristic to lead test generation to the satisfaction of a hypercoverage goal, which allowed us to formulate hyperproperty testing as an optimization problem in the hypertest input space, solved by the HyperEvo multi-objective search algorithm.
Experimental results confirmed the validity of our framework, at least for the hyperproperty considered in the evaluation (i.e., Non-Interference), by showing that inputs achieving high hypercoverage have a higher chance of exposing hyperproperty violations, and that both HyperFuzz and HyperEvo achieve high hypercoverage and correspondingly detect a high number of vulnerabilities in the considered benchmark. They both outperformed the state-of-the-art dynamic taint analysis tool Phosphor. Between them, HyperEvo showed marginal advantages on Java programs that require specific input combinations both to reach the target hypercoverage goals and to expose the vulnerabilities they contain.
Even though the empirical evaluation has been conducted on a specific hyperproperty (i.e., Non-Interference), we believe that the proposed framework is applicable to any k-bounded hyperproperty [23]. In future work, we want to extend the applicability of HyperFuzz and HyperEvo to other hyperproperties, beyond Non-Interference, and to test them on additional, more complex, programs.

Figure 1: A snippet of code and the corresponding CFG.

Figure 2: Graphical explanation of the Multi-Run Distance.

Algorithm 3: Fuzzing-based hyperproperty testing.
Definition 3.1 (Single-Objective Optimization). Given a set G of hypercoverage goals, find a set T of hypertest inputs that minimizes the fitness function f_G.

RQ 1 (Correlation): Is there a relation between high value-insensitive hypercoverage and the detection of Non-Interference violations?
RQ 2 (Coverage): Is the proposed approach for hypertest input generation able to achieve high hypercoverage?
RQ 3 (Effectiveness): Is the proposed hypertesting technique effective at exposing Non-Interference violations? How does it compare to state-of-the-art dynamic taint analysis?

Table 2 (Coverage columns) shows the number of hypercoverage goals identified in each Java program (column Goals), followed by the proportion of such goals covered by HyperFuzz and HyperEvo, respectively.

Table 1: Correlation results.

Table 2: Hypercoverage and Accuracy results.