CIT4DNN: Generating Diverse and Rare Inputs for Neural Networks Using Latent Space Combinatorial Testing

Deep neural networks (DNN) are being used in a wide range of applications including safety-critical systems. Several DNN test generation approaches have been proposed to generate fault-revealing test inputs. However, the existing test generation approaches do not systematically cover the input data distribution to test DNNs with diverse inputs, and none of the approaches investigate the relationship between rare inputs and faults. We propose CIT4DNN, an automated black-box approach to generate DNN test sets that are feature-diverse and that comprise rare inputs. CIT4DNN constructs diverse test sets by applying combinatorial interaction testing to the latent space of generative models and formulates constraints over the geometry of the latent space to generate rare and fault-revealing test inputs. Evaluation on a range of datasets and models shows that CIT4DNN generated tests are more feature diverse than the state-of-the-art, and can target rare fault-revealing testing inputs more effectively than existing methods.


INTRODUCTION
Deep Neural Networks (DNN) are being developed for use in mission- and safety-critical systems, e.g., [6,36,61]. Similar to traditional programmed software components, these learned components require significant testing to ensure that they are fit for deployment.
DNNs learn from observed data that comprises multiple factors of variation, referred to as features, combinations of which give rise to the diversity in the data [7,13,17,31,38]. For example, consider an input domain including handwritten zeros, for which stroke thickness and slant are two features. A combination of these two features results in diverse samples with varying stroke thickness and slant, as shown in Figure 1. A study conducted on the real faults of deep learning systems identified that DNNs make incorrect predictions when they process inputs that are underrepresented in the training data [32,70,71]. Finding these faults requires methods that adequately test DNNs with diverse inputs that are representative of the feature combinations of the observed data distribution.
Feature combinations occur with varying probabilities in the observed data, and the inputs with a low probability of occurrence are referred to as rare inputs. Failing to test the system behavior for rare inputs can make deep learning systems unsafe for real-world deployment. For example, in a fatal Tesla crash, the Autopilot was not able to detect the white side of a tractor-trailer against a brightly lit sky, and a GM autonomous vehicle crashed into a bus in another rare circumstance [2,5]. There is a need for methods that adequately test DNNs with rare inputs to ensure the safety and trustworthiness of deep learning systems.
Much of the prior work in neural network test input generation has focused on fault detection and has targeted neither diverse nor rare inputs. DNN test generation methods that apply pixel-level manipulations on seed images to generate test inputs, e.g., [29,49,57,60,67], do not yield feature-diverse inputs [26,70]. More recent work has targeted the generation of diverse test inputs using two approaches. First, tests can be generated using a manually constructed model of features and their interactions, but this does not scale to complex datasets [16,27,70]. Second, tests can be generated using feature space models that are trained from observed data [14,15,35,68], but these provide no information about the degree to which test inputs cover that feature space. While, in principle, some of these techniques, e.g., [14,35], could generate rare inputs through rejection sampling, i.e., by generating test inputs first and then rejecting the ones that are not rare, this would be cost-prohibitive for complex datasets. Moreover, such approaches cannot systematically explore a subspace of the observed data distribution that comprises the target density of rare inputs.
In this research, we present a black-box test generation algorithm, cit4dnn, that addresses the limitations of the existing methods discussed above. The research objective of cit4dnn is to automatically generate diverse and rare inputs that systematically cover a target density of the feature space of the observed data distribution.
cit4dnn meets this research objective by adapting ideas from recently proposed work on input distribution coverage (IDC) [26]. IDC uses a generative model to automatically learn a low-dimensional representation of the features of the training data, called the latent space. IDC partitions the latent space and applies combinatorial interaction testing (CIT) [20] on the partitioned latent space to measure test coverage proportional to the t-way feature combinations present in a test set. While IDC measures test coverage, cit4dnn applies CIT on the partitioned latent space to generate covering arrays containing combinations of partitions, called test descriptions, from which test inputs can be generated. The test descriptions cover all the t-way feature combinations, resulting in test adequacy with respect to the diversity of the observed data distribution.
Rare inputs are spread over the low-probability regions of the latent space. To support testing with rare inputs, cit4dnn formulates constraints on the latent space and presents a constrained CIT algorithm to generate test descriptions that belong to the required target density.
The test descriptions generated by cit4dnn depend only on the dimensionality of the latent space. As a result, the test descriptions can be reused for DNNs trained on different datasets as long as the dimensionality of the latent space of the generative models of the datasets is the same. This results in the test description generation time being amortized across testing many different DNNs under test. In §3 we describe how permuting test descriptions can further increase test diversity with low overhead.
An evaluation of cit4dnn, in §4, shows that cit4dnn yields higher feature diversity than the state-of-the-art DNN testing approach DeepHyperion-CS [70]. In comparison to the state-of-the-art generative model-based testing approach SINVAD [35], cit4dnn yields 9 times more coverage and detects 6.5 times more faults while running 90 times faster. For rare test input generation, cit4dnn improves on SINVAD 107-fold for coverage and 111-fold for fault detection, and in comparison to random sampling, cit4dnn improves on coverage while generating 6 orders of magnitude fewer test inputs and running 56 times faster.
The contributions of this work include: (1) cit4dnn, a black-box DNN test generator capable of automatically producing diverse and rare test inputs while achieving 100% IDC test adequacy; (2) a constraint-based method to generate rare inputs without costly rejection sampling; (3) a method for applying CIT to the latent space of a generative model that enables reusing test descriptions to generate inputs with increasing diversity; and (4) the results of an evaluation across a range of datasets, DNNs, and instantiations of cit4dnn that demonstrates its beneficial characteristics.

BACKGROUND

Deep Generative Models
The latent space is a low-dimensional embedding that represents the factors of variation comprising the observed data [7,40]. Machine learning research has shown that deep generative models are effective in learning the latent space of real-world datasets [13,17,28,38,39]. We use a deep generative model called a variational autoencoder in this work [39].
A variational autoencoder (VAE) is comprised of a pair of networks, an encoder, E, and a decoder, D, that is trained to accurately reconstruct inputs from the data distribution, X, with the training objective min Σ_{x∈X} ∥x − D(E(x))∥ [39]. E converts samples from the data into vectors in the latent space, whereas D is a generator that converts samples from the latent space into inputs in the data space. The latent space of a VAE defines the parameters of a multivariate distribution of size equal to the dimension of the latent space. A loss term in training biases that distribution to match an assumed prior, commonly a standard Normal distribution. While early VAEs were known to suffer in the quality of their reconstructions, modern VAEs exceed the performance of many other generative models [44].

Combinatorial Interaction Testing
Thorough black-box testing of software with a high-dimensional input space is challenging. Applying methods such as category partitioning [47] to construct a finite partition for each input helps to a degree, but the combinatorics generally preclude complete coverage of input-partition combinations. Combinatorial interaction testing (CIT) is a method to generate test suites that systematically cover a partitioned input space up to a user-specified arity, t, which is referred to as the combinatorial strength [20]. CIT methods have been used both to generate test suites that achieve a desired strength [20] and to measure the strength of a given test suite to define coverage criteria [41].
Central to CIT is the notion of a covering array (CA), a matrix with a column for each input where the cells hold the appropriate partition values for the input. A row constitutes a test description defining a possible set of values, defined by the partitions, for each input. A t-way CA includes every possible combination of input-partition pairs in some row and thereby assures that any interaction among inputs up to t has a chance to be exposed by tests generated from the CA. Figure 1 shows an example.

In many systems, inputs are not completely independent, requiring that CIT methods take into account constraints over input-partition combinations [62]. Cohen et al. [21] defined a general framework that permits propositional constraints over such combinations to be incorporated into greedy CIT test input generation methods [20]. The resulting constrained CIT (CCIT) approach efficiently generates a constrained covering array (CCA), which is a CA whose rows are consistent with the constraints.
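To make the greedy construction concrete, the following is a minimal AETG-style sketch for the 2-way case. The function and parameter names are ours, not taken from the cited tools, and real implementations add many optimizations.

```python
import random
from itertools import combinations

def pairwise_covering_array(num_params, values, candidates=50, seed=0):
    """Greedy AETG-style construction of a 2-way covering array.

    Each of num_params columns ranges over `values`. Rows are added until
    every pair of values is covered for every pair of columns. Each new
    row is seeded with one uncovered pair, so progress is guaranteed.
    """
    rng = random.Random(seed)
    uncovered = {(c1, c2, v1, v2)
                 for c1, c2 in combinations(range(num_params), 2)
                 for v1 in values for v2 in values}
    rows = []
    while uncovered:
        c1, c2, v1, v2 = next(iter(uncovered))
        best_row, best_gain = None, -1
        for _ in range(candidates):
            row = [rng.choice(values) for _ in range(num_params)]
            row[c1], row[c2] = v1, v2  # guarantees at least one new pair
            gain = sum((a, b, row[a], row[b]) in uncovered
                       for a, b in combinations(range(num_params), 2))
            if gain > best_gain:
                best_row, best_gain = tuple(row), gain
        rows.append(best_row)
        for a, b in combinations(range(num_params), 2):
            uncovered.discard((a, b, best_row[a], best_row[b]))
    return rows
```

For 4 parameters over 3 values each, this typically needs on the order of a dozen rows, far fewer than the 81 exhaustive combinations.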

Input Distribution Coverage
A CIT coverage metric, called input distribution coverage (IDC), has been defined for DNN testing [26]. IDC's coverage domain is the latent space of a trained VAE with a standard Normal prior, N(0, 1). IDC defines an equal-density partition, P, of N(0, 1) as a set of intervals that each contain the same probability density. Choosing the size of the partition, |P|, allows the set of intervals to be computed using the quantile function for N(0, 1).
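The equal-density partition can be sketched with the quantile function from the Python standard library. The function name is ours; IDC's implementation may differ.

```python
from statistics import NormalDist

def equal_density_partition(w):
    """Split N(0,1) into w intervals that each hold probability mass 1/w.

    Interior boundaries are the quantiles at k/w for k = 1..w-1; the
    outermost intervals extend to -inf and +inf.
    """
    nd = NormalDist()  # standard Normal, mu = 0, sigma = 1
    bounds = [float("-inf")]
    bounds += [nd.inv_cdf(k / w) for k in range(1, w)]
    bounds += [float("inf")]
    return list(zip(bounds[:-1], bounds[1:]))
```

For w = 4, the interior boundaries fall at roughly -0.674, 0, and 0.674, so each interval carries exactly a quarter of the probability mass.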
IDC allows coverage to be computed over the portion of the latent space defined by a user-defined target density, d. This density defines a shell with inner, r_i, and outer, r_o, radii. IDC maps a test input to the latent space and then converts it to a hyper-rectangle defined as the product of the elements of P corresponding to the latent coordinates. The total t-way coverage metric measures how well a test set exercises combinations of partitions by accumulating the combinations present in the hyper-rectangles derived from the test inputs. Test suites with higher t-way coverage were shown to be more feature diverse, as judged against human-defined ground truth, and to improve fault-detection effectiveness [26].
A key challenge in IDC is to compute the ratio of the count of t-way combinations in a test set to the feasible t-way feature combinations, which allows coverage to be reported as a percentage. To calculate this quantity, it is necessary to account for the fact that a hyper-rectangle defined by the product of partitions may not overlap with the target density shell, in which case it is infeasible. A step in this calculation involves checking a quadratic distance constraint: ∃z ∈ ⟨P_1, . . . , P_n⟩ : ||z|| ∈ [r_i, r_o], where z is a latent space coordinate that is consistent with the intervals P_j ∈ P. To side-step the nonlinearity of this constraint, IDC uses an efficient SMT encoding that expresses the constraints over the squared latent coordinates, which yields a linear constraint that is efficient to solve.
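The feasibility condition underlying this encoding can be sketched without a solver: each squared coordinate ranges over an interval, and since the squared coordinates vary independently, the reachable values of their sum form an interval that either overlaps the squared shell radii or does not. This is a simplified sketch of the check; IDC's implementation uses an SMT encoding, and the names below are ours.

```python
def squared_interval(a, b):
    """Range of z**2 when z lies in the interval [a, b]."""
    lo = 0.0 if a <= 0.0 <= b else min(a * a, b * b)
    hi = max(a * a, b * b)
    return lo, hi

def feasible(rectangle, r_inner, r_outer):
    """Does the hyper-rectangle (list of per-dimension intervals)
    intersect the shell with radii [r_inner, r_outer]?

    Feasible iff some point z in the rectangle satisfies
    r_inner**2 <= sum(z_j**2) <= r_outer**2; the reachable sums form
    the interval [sum of lows, sum of highs].
    """
    lows, highs = zip(*(squared_interval(a, b) for a, b in rectangle))
    return sum(lows) <= r_outer ** 2 and sum(highs) >= r_inner ** 2
```

Unbounded outer partitions (with an infinite endpoint) are handled naturally, since their squared interval simply has an infinite upper bound.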
Whereas IDC only measures coverage, cit4dnn is the first DNN test generation approach to guarantee t-way coverage of a generated test suite. We describe how cit4dnn builds on the concepts introduced in IDC in the next section.

APPROACH
The goal of cit4dnn is to systematically generate diverse and rare inputs from a target density of the observed data distribution; Figure 2 sketches the components of cit4dnn. cit4dnn uses a generative model with a standard Normal prior, N(0, 1), to learn a representation of the features of the observed data in the latent space. cit4dnn leverages the properties of N(0, 1) to partition the latent space into equal-density partitions and parameterizes the geometry of the latent space to represent a target density using radial constraints. The Constrained Combinatorial Interaction Testing (CCIT) module shown in Figure 2 uses the partitioned latent space and the radial constraints that express the target density to generate a Radial Constrained Covering Array (RCCA) comprising the test descriptions representing partition combinations on the required target density. The Sample Partition module converts the RCCA into a set of latent samples, such that each sample belongs to the corresponding test description of the RCCA. The Generator module converts latent samples into test inputs, and, by construction, the generated test inputs achieve 100% t-way combinatorial coverage of the target density in the latent space.

Radial Constrained Covering Arrays
The latent space with a multivariate standard Normal prior has its probability concentrated in an annulus [9], and the probability density along the radial dimension of the annulus is described by a Chi distribution. Hence a target density can be converted to a pair of radii using the interval function of the Chi distribution [26], where the interval is the smallest one containing the target density. These radial bounds are used to specify the constraints for the CCIT, and the resulting covering array is referred to as a Radial Constrained Covering Array (RCCA).
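For a 2-dimensional latent space, the Chi distribution has a closed-form CDF, so the density-to-radii mapping can be sketched directly. This sketch uses an equal-tail interval for simplicity; the paper uses the smallest interval containing the density, which yields slightly different radii, and for general dimensions a Chi quantile function (e.g., scipy.stats.chi) is needed. The function name is ours.

```python
import math

def radii_for_density_2d(d_lo, d_hi):
    """Map a target density range [d_lo, d_hi] to shell radii [r_i, r_o]
    for a 2-dimensional standard Normal latent space.

    With 2 dimensions the radius follows a Chi(2) (Rayleigh) law with
    CDF F(r) = 1 - exp(-r**2 / 2), so quantiles invert in closed form.
    """
    def quantile(p):
        return math.sqrt(-2.0 * math.log(1.0 - p))
    r_i = 0.0 if d_lo <= 0.0 else quantile(d_lo)
    return r_i, quantile(d_hi)
```

For example, the density range [0.97, 0.99] maps under this equal-tail reading to radii of roughly 2.65 and 3.03.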
Definition 1 (Radial Constrained Covering Array). A radial constrained covering array, RCCA(N, t, P, [r_i, r_o]), is an N × n array where: 1) each column 1 ≤ j ≤ n contains elements of P; 2) the rows of each N × t subarray cover all t-way feature combinations that are feasible for the given radii, [r_i, r_o], at least once; and 3) all rows are feasible combinations for the radii.
The properties of an RCCA ensure that cit4dnn systematically covers a target density of the feature space to test DNNs with diverse and rare inputs; cit4dnn generates rare inputs when the target density represents a low-probability region of the latent space.
RCCAs differ from existing CCAs in two ways. First, each of the columns of the array can take on the same set of values, P. Second, they employ the quadratic distance constraints from IDC to model feasibility relative to a target density; specifically, we use the checkConstraint function of IDC [26]. We adapted the greedy AETG-SAT covering array generation algorithm from [21] with this modification to generate RCCAs.
Example 3.1. Figure 3 depicts a 2-dimensional latent space: the surface comprised of light-gray grid lines, with densities shown as intensities of orange and marginal distributions for each dimension shown separately, p(z_j). The gray rectangular regions on the plane show the 4-way partitioned latent dimensions within an outer target density of 0.99 (solid circle). The four black circles, e.g., z_1, depict samples from a 1-way RCCA that covers each dimension partition.
Restricting the target density to the range [0.97, 0.99], the shell between the dashed and solid circles, requires a larger RCCA.
The black square coordinates depict samples, e.g., z_2, from a 6-row RCCA. Some of these samples, e.g., z_3 and z_5, fall outside of the shell, and as a result they are infeasible with respect to the target density. These samples are projected (arrows) to the blue triangle coordinates, z_4 and z_6, using the sample-partition method discussed in Alg. 1, thereby producing latent samples on the target density.
3.1.1 Reusing RCCA. cit4dnn is designed to be applicable to any input domain for which a high-quality generator with a standard Normal prior can be trained, but it is largely independent of the learned mapping between the domain and the generator's latent space, i.e., the CCIT and Sample Partition modules in Figure 2 are independent of the Generator module. Instead, it relies only on the dimension of the latent space and the fact that the latent distribution is a close match to an isotropic standard Normal prior.
Whereas prior CIT work focuses mostly on 2-way combinatorial strength [58], for DNN test generation 3-way coverage yields more diverse test suites [26]. This presents a challenge since the cost of generating CAs grows combinatorially with t. Here we can leverage the fact that a t-way RCCA for an n-dimensional latent space partitioned w ways in a target density range [d_i, d_o] is independent of the VAE's encoder and decoder. Thus, an RCCA for an input domain modeled by an n-dimensional latent space can be pre-computed and reused, saving time.
Moreover, the isotropic latent space means that a column-wise permutation, π, of an RCCA yields an RCCA that, with high probability, has distinct rows. A t-way tuple over columns (c_1, . . . , c_t) is mapped by π to the same values over columns (π(c_1), . . . , π(c_t)). Every tuple is mapped in this way, and since π is a bijection on the columns, the permuted array contains a complete set of t-way tuples for every combination of columns; applying this mapping for all columns means that a permutation of an RCCA yields a covering array.
Radial constraints are formulated in terms of the sum of the squared intervals associated with the partitions in a row. Since addition commutes, the constraints hold on any permutation of a row. □ In general, there are n! possible permutations, but repeated partition values in rows reduce the number of distinct permutations. If a row contains m repeated values, the number of its distinct permutations is reduced by m! − 1, but the probability of m repeated values is 1/w^(m−1), which indicates that the probability of identical permutations is very low. Thus, reusing and permuting RCCA columns effectively guarantees distinct test descriptions in time that is linear in n, the time to permute column indices.
To illustrate the diversity of column permutation, we randomly selected a row of a 3-way covering array for MNIST with n = 9, w = 20, d_i = 0.9999, and d_o = 0.999999. Figure 4 shows the images generated from 8 random permutations of this row and shows the diversity possible from column permutation. Only two columns share a common partition value, meaning that there are 9! − (2! − 1) = 362879 distinct column permutations.
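The permutation property above can be checked on a toy array. This sketch uses a full factorial array, which is trivially 2-way covering, and verifies that column permutations preserve coverage; the helper names are ours.

```python
from itertools import combinations, product

def covers_all_pairs(array, values):
    """True iff every pair of values appears in every pair of columns."""
    n = len(array[0])
    all_pairs = {(v1, v2) for v1 in values for v2 in values}
    return all({(row[c1], row[c2]) for row in array} == all_pairs
               for c1, c2 in combinations(range(n), 2))

def permute_columns(array, perm):
    """Apply a single column permutation to every row of the array."""
    return [tuple(row[j] for j in perm) for row in array]

# A full factorial array over 3 binary columns is trivially a covering
# array; any column permutation of it must remain one.
ca = list(product([0, 1], repeat=3))
assert covers_all_pairs(permute_columns(ca, [2, 0, 1]), [0, 1])
```

Each permutation reuses the same rows, so generating additional test descriptions costs only the time to shuffle column indices.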

cit4dnn Algorithm
Alg. 1 defines cit4dnn as a pair of functions: cit4dnn and sample-partition. The entry point is the function cit4dnn, which first extracts the latent dimension and computes its partition (lines 2-3). Lines 4-11 compute the covering array, handling two cases. The simpler case is where the inner density, d_i, is zero and a single covering array is constructed (line 10) based on the radii calculated for d_o (line 4). The more complex case (lines 6-8) deals with the case where an inner target density is defined. Here a pair of covering arrays are constructed (lines 7-8), where one uses the inner radii and the other uses the outer radii to define the shells used to compute the arrays. We require that d_i < d_o, which means that their corresponding shells, whose radii are computed on lines 4 and 6, are concentric, requiring that the roles of the radii be transposed when computing the inner covering array (line 7). Lines 12-18 generate p samples from each row of the covering array. To yield more diverse test samples, we implement Proposition 3.2 and generate p column-wise permutations of the covering array (line 14) and draw a single sample from each row of each permutation (line 16). Since rows of the covering array(s) may be associated with different radii, we record radii along with each row to generate latent samples (line 15). The set of generated latent samples, Z, is decoded to generate a set of test inputs (line 19).
The sample-partition function samples coordinates, z, using the density associated with the partition interval, [a_j, b_j], for each dimension j (line 22). When inner and outer densities target a rare portion of the input distribution, a sample within the partition intervals may not lie between the target radii (line 23). This can be observed in Figure 3, where only a small arc of the [0.97, 0.99] shell intersects with any of the rectangular partition combinations. Lines 24-35 compute a sample that satisfies the partition and radial constraints using SMT. We build on the squared-distance constraint formulation used in the checkConstraint function of IDC [26], which works because distance constraints are insensitive to the orthant 1 within which a partition lies. We record a sample's orthant, o, as a vector that holds the polarity of each sample coordinate (line 24). Lines 26-33 compute squared-distance constraints and then solve them for a model in line 34. The use of radial constraints in covering array generation guarantees these constraints are solvable. The model generated holds the squared coordinates and lies in the positive orthant. Line 35 recovers the coordinates, maps them to the recorded orthant, and updates the sample z.
Example 3.5. Sampling from row (P_2, P_4) results in coordinate z_3 in Figure 3. The radius of this coordinate falls short of 2.898, the dotted circle. The orthant is recorded, o = ⟨−1, 1⟩, and the squared radial constraint 2.898² ≤ s[1] + s[2] ≤ r_o² is formulated, where s[1] = (z_1)² and s[2] = (z_2)² are constrained to the squared partition intervals. Solving the constraint yields a satisfying model where s[1] = 0.04 and s[2] = 9. Taking the square root and recovering the orthant yields the blue coordinate z_4 = (−0.2, 3). A similar calculation projects z_5 to the blue triangle z_6 in the target shell. Permuting the columns of the covering array to draw a second sample for z_4 results in the row (P_4, P_2), which is sampled and solved to yield the green diamond z_7.
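A projection of this kind can also be sketched without an SMT solver: since the only joint constraint is on the sum of the squared coordinates, a greedy pass over the squared intervals finds a model whenever the partition combination is feasible. This is a simplification of Alg. 1's SMT step, with function names of our own choosing.

```python
import math

def project_into_shell(sample, rectangle, r_i, r_o):
    """Move a latent sample into the shell [r_i, r_o] while keeping each
    coordinate inside its partition interval and its original orthant.

    Works on squared coordinates: each s[j] = z_j**2 ranges over an
    interval, the joint constraint is r_i**2 <= sum(s) <= r_o**2, and a
    greedy pass raises coordinates from their minimum until the inner
    radius is met. Assumes the partition combination is feasible.
    """
    orthant = [1.0 if z >= 0 else -1.0 for z in sample]
    bounds = []
    for (a, b), o in zip(rectangle, orthant):
        if a <= 0.0 <= b:  # interval spans zero: range depends on the sign
            bounds.append((0.0, a * a if o < 0 else b * b))
        else:
            bounds.append((min(a * a, b * b), max(a * a, b * b)))
    s = [lo for lo, _ in bounds]
    deficit = r_i ** 2 - sum(s)
    for j, (lo, hi) in enumerate(bounds):  # raise coordinates greedily
        if deficit <= 0:
            break
        bump = min(hi - lo, deficit)
        s[j] += bump
        deficit -= bump
    assert sum(s) <= r_o ** 2 + 1e-9, "infeasible partition combination"
    return [o * math.sqrt(v) for o, v in zip(orthant, s)]
```

For instance, a sample whose radius falls short of the inner radius is moved onto the shell boundary while staying inside its partition rectangle, analogous to the z_3 to z_4 projection in Figure 3.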

EVALUATION
We designed a set of experiments to explore the effectiveness of cit4dnn in generating realistic, feature-diverse, and rare inputs for testing DNNs by exploring a series of research questions: RQ1: How realistic are tests generated by cit4dnn? RQ2: How effective is cit4dnn in generating feature-diverse tests? RQ3: How does fault density vary with the latent distribution? RQ4: How cost-effective is cit4dnn in targeting normal and rare inputs?
1 An orthant is the high-dimensional analog of a quadrant.

Experimental Setup
Three classification datasets, MNIST [42], FashionMNIST [65], and SVHN [46], and two regression datasets, TaxiNet [34] and Udacity [33], are used in the studies. MNIST and SVHN are selected because they are used in the experimental studies of the baselines of our work, DeepHyperion-CS [71] and SINVAD [35]. FashionMNIST, TaxiNet, and Udacity are considered as they represent domains different from those of MNIST and SVHN. For each of the datasets, we train VAEs for instantiating cit4dnn, as shown in Table 2, and two DNN models, as shown in Table 1, to study fault-revealing test inputs in RQ3 and RQ4.
FashionMNIST contains 28x28 greyscale images of fashion products belonging to 10 categories. We use the FashionMNIST networks used in the IDC work [26]. SVHN contains 32x32 color images of digits in natural scenes; it has 73257 training inputs and 26032 test inputs. We train All-CNN-A and All-CNN-B networks for this dataset [55].
TaxiNet contains aircraft runway images with 16x32 resolution, with a cross-track position and heading angle for each. The TaxiNet dataset has 80k training and 20k test inputs. We use the network from an open-sourced artifact [63] as one of the models for this dataset. Since a second model for TaxiNet is not available, we developed a custom model by adding two extra fully connected layers to the first model and using ELU activation in one of the layers. We used the same training hyperparameters for training both models. The Udacity dataset is a self-driving car dataset generated in a simulation environment, as open-sourced by DeepCrime [33]. This dataset has 9800 training inputs and 2451 test inputs, where each input is a 160x320 color image. We train two models, NVIDIA's Dave-2 [10,33] and Epoch [1], to output steering angles for the inputs.

4.1.2 Test Oracles. cit4dnn uses a differential test oracle similar to DeepXplore [49] for identifying fault-revealing test inputs. For the classification datasets, the test oracle fails when the two DNN models trained on the same dataset predict different classes. For the regression datasets, we use the steering angle outputs of the DNNs for Udacity and the heading angle outputs of the DNNs for TaxiNet for formulating the test oracle. The test oracle fails when the outputs of the two models have different signs and the difference in their predictions is greater than 5% of the output range of the test dataset. There are other test oracles proposed in the literature, which we plan to study in our future work [22,56,59,66].
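The two oracle rules can be stated compactly; this is a sketch of the rules as described above, with function names of our own choosing.

```python
def classification_oracle_fails(pred_a, pred_b):
    """Differential oracle for classifiers: a test is fault-revealing
    when the two models predict different classes."""
    return pred_a != pred_b

def regression_oracle_fails(out_a, out_b, output_range):
    """Differential oracle for regression models: a test is
    fault-revealing when the two outputs have different signs and
    differ by more than 5% of the test-set output range."""
    return out_a * out_b < 0 and abs(out_a - out_b) > 0.05 * output_range
```

Note that the regression oracle requires both conditions: two outputs with opposite signs that differ by less than the 5% threshold do not fail the oracle.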
4.1.3 Quantitative Model Metrics. We use quantitative metrics to measure both how realistic a given test is and how diverse a set of tests is. For a test to be completely realistic, a consumer must be unable to discern whether the test came from a generative technique or from the original data. For a test set to be perfectly diverse, all possible underlying feature combinations must be expressed in the set. Unfortunately, evaluation of the fidelity of generative models is a difficult problem with significant active research [12].
The three metrics we use to assess our models are FID [30] (which is used in prior work [15,68]), Coverage, and Density [45]. While prior works have also used the Inception Score [15], we contend that this is not an effective metric for this use case: the Inception Score does not handle mode collapse well and is improved on by FID [12].
FID, the Fréchet Inception Distance, is the 2-Wasserstein distance of two distributions taken from a lower-dimensional "embedding" space; the embedding used is Inception v3 [30]. FID is highly variable across implementations; we use the torchmetrics FID implementation on Inception v3's 2048 layer. A critical issue with FID and related single-valued metrics is their inability to differentiate a lack of diversity from other failure modes. A recent line of work introduces a 2-dimensional metric: Density and Coverage [45]. Similar to FID, Density and Coverage use an embedding space, but instead of measuring the distribution distance, they measure how often generated points (in the embedding space) occur near real points, where "near" means within the manifold created by taking the k-nearest-neighbor ball of each point in the original dataset. Intuitively, Coverage is the percentage of real points that have at least one generated point near them. Density follows as the rate at which generated points are near real points. We include these as measures of cit4dnn's capability to generate realistic inputs in §4. We use the reference implementation [3] for these with K=5 and torchvision's pretrained vgg16 ImageNet model as the embedding.
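A minimal sketch of the Density and Coverage definitions on raw points follows; the evaluation applies them to points in the vgg16 embedding space, and the helper names here are ours.

```python
import math

def knn_radii(real, k):
    """Distance from each real point to its k-th nearest other real point."""
    radii = []
    for i, p in enumerate(real):
        dists = sorted(math.dist(p, q) for j, q in enumerate(real) if j != i)
        radii.append(dists[k - 1])
    return radii

def coverage_and_density(real, generated, k=5):
    """Coverage: fraction of real points whose k-NN ball contains at
    least one generated point. Density: how often generated points land
    inside real k-NN balls, normalized by k and the number of generated
    points (after the definitions of Naeem et al.)."""
    radii = knn_radii(real, k)
    cov = sum(any(math.dist(p, g) <= r for g in generated)
              for p, r in zip(real, radii)) / len(real)
    den = sum(math.dist(p, g) <= r
              for g in generated
              for p, r in zip(real, radii)) / (k * len(generated))
    return cov, den
```

Unlike a single-valued score, the pair separates failure modes: low Coverage signals missing parts of the real distribution, while low Density signals generated points that stray from the real manifold.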
For datasets smaller than 32x32, we upscale by repeating the undersized dimension until the image is at least 32x32; for single-channel datasets we repeat the greyscale channel 3 times.
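This preprocessing can be sketched with numpy. The sketch reflects our reading of the description (doubling undersized dimensions by repetition); the authors' exact repeat factors may differ.

```python
import numpy as np

def prepare_for_embedding(img):
    """Repeat a greyscale channel 3 times and double undersized spatial
    dimensions by repetition until both are at least 32 pixels."""
    if img.ndim == 2:                 # single channel -> 3 channels
        img = np.repeat(img[:, :, None], 3, axis=2)
    while img.shape[0] < 32:          # repeat rows
        img = np.repeat(img, 2, axis=0)
    while img.shape[1] < 32:          # repeat columns
        img = np.repeat(img, 2, axis=1)
    return img
```

For example, a 28x28 greyscale MNIST image becomes a 56x56x3 array suitable for an ImageNet-pretrained embedding network.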

cit4dnn Instantiation
cit4dnn has five configuration parameters, which yield a large experimental space. To control costs, we consider a range of parameter combinations selected to explore the RQs and leave a fuller consideration of the parameter space to future work.

Generator. cit4dnn uses a pre-trained Generator, and we use the decoder networks of variational autoencoders (VAE) as the Generator networks in the experiments. Since using higher-quality VAEs is beneficial, we explore the combination of two recent innovations in VAE architecture and training: a two-stage VAE [23], which ensures a better match to the prior, and a σ-VAE [54], which optimizes the balance between the loss terms that govern reconstruction accuracy and matching the prior.
We use three VAE architectures in the study: a basic VAE [39], denoted VAE; a basic VAE trained with an optimal variance estimate [54], denoted σVAE; and a Two-Stage VAE [24] trained with an optimal variance estimate, denoted 2sVAE. VAEs for the MNIST dataset are trained using the network configuration used by Burgess et al. [13], and all other VAEs are trained using the InfoGAN [18] network architecture. The network input layers are modified to fit the input sizes of the respective datasets. Each VAE configuration is trained for the five datasets (MNIST, FashionMNIST, SVHN, TaxiNet, and Udacity), resulting in the 15 VAE configurations shown in Table 2. We report the non-noise latent dimension, n, for 2sVAE, since using non-noise latent dimensions is recommended by IDC for formulating the test coverage domain [26].
Target Density. Across the experiments, we consider a range of target densities that include both high- and low-probability regions of the latent distribution to study both normal and rare input test generation. More specifically, we use the following overlapping higher-probability regions: D1 = [0, 0.99], D2 = [0.49, 0.99], D3 = [0.94, 0.99], and a disjoint set of low-probability regions: D4 = [0.99, 0.9999], D5 = [0.9999, 0.999999]. We distributed these target densities across the research questions to control the experimental cost. RQ1 uses only D1, which includes 99% of the distribution, since the aim of the study is to explore normal inputs. In RQ2, we study the diversity of normal inputs using D1 and also the improvement in diversity obtained by adding rare inputs from the tail of the distribution, D5, with a density of 0.99e-4. RQ3 uses all of the regions, as it studies the prevalence of faults across the regions with varying densities. RQ4 has two sub-studies that use both normal and rare inputs; we use D1 with D5 and D1 with D4, respectively, to include both low-probability regions in the experiments.
CIT Parameters. We vary t as well as w in RQ2, as that question studies aspects of the diversity of test sets. We fix t = 3 for RQ3/RQ4 and w = 20 for RQ1/RQ3/RQ4 since these were shown to be good choices for assessing test suite coverage [26]. We use p = 1 in all of the research questions except for RQ4, where we also explore the fault-revealing capability of the test sets generated for increasing p.

The inputs generated by the testing techniques should be representative of the input data distribution for testing to be effective [8,25]. We conduct a study to investigate whether the inputs generated by cit4dnn are realistic. We show that our selected models generate viable outputs with random sampling; we then show that sampling with cit4dnn preserves the fidelity of the underlying 2sVAE.

Results and Research Questions
We compute FID and Coverage scores relative to the full test set for each dataset to select the best VAE, and these values are presented in Table 2.While FID is sensitive to test set size, this is not an issue since we are only comparing within our trained model architectures.
We find that VAEs with optimal variance perform consistently better than those without on both metrics. Additionally, for each metric, 2sVAE performs better for 4/5 datasets. To reduce the cost of subsequent experiments, we use the 2sVAE to instantiate cit4dnn. It is common practice in the generative model literature to perform a visual study to verify the quality of the generated inputs [11,54]. Using the same approach, all the co-authors manually checked the random samples generated by the VAEs for visual similarity with their training inputs and for any visual anomalies. Figure 5 shows random samples for visual inspection. We also performed a qualitative comparison of randomly generated outputs from each VAE to random test samples. This analysis confirmed the quality of the 2sVAE; we show randomly generated samples from this VAE in the r rows of Figure 5 and random test samples in the t rows.
We then use Density, Coverage, and FID to assess the impact of selecting tests with cit4dnn as compared to randomly sampling the VAE. We are limited by the small size of cit4dnn's t=2 test sets (100% 2-way coverage is achieved with 528-832 tests depending on the dataset), so we test with random samples of size 500 from each set and repeat the trials 100 times each. The results are shown in Figure 6.
We find that sampling with cit4dnn, for either value of t, preserves the realism of the σVAE2 as measured by Density, Coverage, and FID. We confirmed this with a qualitative analysis. We provide example cit4dnn tests with t = 3 for each dataset in the rows labeled 3 in Figure 5 and compare them with randomly generated tests in the rows labeled r.

RQ1 Finding: Based on both the FID and qualitative analyses used by prior work and the state-of-the-art Coverage and Density metrics, we find that cit4dnn generates tests that are as realistic as any that can be generated by the VAE used to instantiate it.

RQ2: How effective is cit4dnn in generating feature-diverse tests?
To provide insight into the diversity of the tests generated by cit4dnn with respect to human-interpretable features, we need ground-truth features for the datasets used in our studies, which are unknown. DeepHyperion-CS [70,71] is a recently published approach that uses human-interpretable features to generate diverse tests. Its evaluation shows that DeepHyperion-CS is superior to DeepHyperion [70], DeepJanus [52], and DLFuzz [29] with respect to feature diversity. For these reasons, we use DeepHyperion-CS as the baseline in this study.
DeepHyperion-CS uses human assessors to identify the features of the input data; its experimental studies include the MNIST [42] and BeamNG [71] datasets. Of these, DeepHyperion-CS transforms inputs directly only for MNIST, so we compare cit4dnn on that dataset. DeepHyperion-CS generates feature maps of the generated test sets and uses two metrics computed over the feature maps, Filled Cells (FC) and Coverage Sparseness (CS), to measure the feature diversity of the test sets. We use these two metrics to compare the feature diversity of the tests generated by cit4dnn and DeepHyperion-CS to answer RQ2.
DeepHyperion-CS uses Luminosity (Lum), Moves (Mov), and Orientation (Or) as the features of the MNIST digits. We ran cit4dnn and DeepHyperion-CS for an hour each and generated feature maps of the test inputs produced by the two approaches for the feature pairs (Lum, Mov), (Lum, Or), and (Mov, Or). cit4dnn is run with k=20, t=3, n=1, and a target density range in {D1, D5, D1 + D5} to include both normal and rare inputs in the study. We ran DeepHyperion-CS for all 10 classes of MNIST and limited the overall runtime of the tool to one hour. cit4dnn runs for less than 2 minutes across D1 and D5; DeepHyperion-CS runs for the full hour. We note that cit4dnn could have been run for a full hour by using n = 25, but we only used n = 1 for these studies; therefore, the results presented underestimate the feature diversity that could have been achieved by cit4dnn given an hour. This experiment is repeated 10 times, with the FC and CS metrics measured for each of the feature maps. Box plots of the results are shown in Figure 7. We performed a Mann-Whitney U test; the results show that rare inputs (D5) have better FC and CS values than normal inputs (D1). Additionally, a test set with both rare and normal inputs (D1 + D5) outperforms both DeepHyperion-CS and the test sets generated for normal or rare inputs alone. All of these differences are statistically significant, with p-values less than 0.05.
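The statistical comparison above can be sketched with scipy. The FC values below are hypothetical placeholders for illustration; the actual values in the study come from the feature maps over 10 repetitions:

```python
from scipy.stats import mannwhitneyu

# Hypothetical Filled Cells (FC) counts over 10 repetitions.
fc_d1 = [41, 44, 43, 42, 45, 44, 43, 42, 44, 43]  # normal inputs (D1)
fc_d5 = [52, 55, 54, 53, 56, 55, 54, 53, 55, 54]  # rare inputs (D5)

# One-sided test: are the rare-input FC values stochastically greater?
stat, p = mannwhitneyu(fc_d5, fc_d1, alternative="greater")
print(f"U = {stat}, p = {p:.2e}")
```

The Mann-Whitney U test is a sensible choice here because it is non-parametric: it makes no normality assumption about the per-repetition metric values.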
Using the FC and CS metrics, we conduct two studies to demonstrate how cit4dnn's partition granularity, k, and CIT strength, t, impact diversity. For the partition-granularity study, test sets are generated for k ∈ {4, 8, 16, 20} while keeping the target density range, t, and n fixed at [0, 0.99], 3, and 1, respectively. Figure 8 shows the results.

RQ2 Finding: cit4dnn generates more diverse test inputs than DeepHyperion-CS. For the configurations of k and t studied in this research question, diversity increases with both k and t, with increasing t leading to better diversity than increasing k.

RQ3: How does fault density vary with the latent distribution?
For this study, we use the target density regions D1-D5 to generate normal and rare inputs. These density ranges span a broad range of cumulative densities in the latent space. We measure the number of fault-revealing test inputs in each density partition to study the distribution of faults in the latent space. MNIST, Fashion, SVHN, TaxiNet, and Udacity are used in the study. The study varies the target density range while k, t, and n are set to 20, 3, and 1, respectively. Table 3 shows the average number of faults identified by the differential test oracles for each density partition across 10 repetitions of the experiment. Because the standard deviation of the number of faults is low across all configurations, at most 45, we show only average values in the table.
Since D3 ⊆ D2 ⊆ D1, we note that faults detected in D1 might include faults present in D3. Nevertheless, we see that across the datasets there is an increase in the number of faults detected moving from D1 to D3, except for Udacity. Generally, the test set size increases across this range, again except for Udacity, which could explain its lack of increased fault detection. Recall that varying the target density can change the RCCA size, and the extent of that change varies with the latent dimension, which differs across the VAEs used for these datasets (Table 2). Comparing the rare-input region to the normal-density region (D5 vs. D1) reveals that the increase in test set size alone does not explain the increase in the number of faults detected. This is especially true for datasets like MNIST and SVHN.
RQ3 Finding: The results indicate an increase in faults detected that is out of proportion to the increase in test set size as input probability decreases, which suggests that fault density increases as input density decreases.

RQ4: How cost-effective is cit4dnn in targeting normal and rare inputs?
We demonstrate the effectiveness of cit4dnn in generating normal and rare inputs by comparing against random sampling in the latent space and SINVAD [35]. We consider SINVAD as a baseline because it is a recently published generative-model-based approach and its implementation is available. While manifold-based test generation [15] is also relevant and open-sourced, it only generates normal inputs, and it would be unfair to use it as a baseline for rare-input testing. Our primary cost metric is test generation time, but we also report the number of tests generated by each technique since this can influence clients of the generated tests, e.g., when running a test set multiple times. Our primary effectiveness metric is total 3-way coverage which, as RQ1 and RQ2 showed, yields realistic and feature-diverse tests; in the second part of this RQ we also report fault detection as an effectiveness metric.
Random Baseline. The random baseline generates tests by drawing m samples from the Gaussian prior of a VAE, N(0, I), as Byun et al. [15] proposed, and is equivalent to cit4dnn(σVAE2, 0, 1, 1, 1, m). For each approach, the experiment terminates either when it achieves 100% total 3-way coverage or when the runtime exceeds one hour. We used a server with an AMD EPYC 7742 2.25GHz CPU and 128GB of memory for the experiment. We formulated the random-baseline test generation using a strategy that ensures the coverage-measurement overhead does not dominate the study results. The strategy first incrementally adds 1000 test inputs to the test set until the algorithm meets the termination criteria. When random sampling fails to terminate with 100% total 3-way coverage, a strategy that adds one million tests per increment is used instead. cit4dnn is configured for two target densities, D1 and D5, to study normal and rare inputs, respectively, with k=20, t=3, and n=1. Table 4 reports the metrics for cit4dnn and the random baseline for five datasets. The number of tests is divided into the number sampled and the number that is feasible relative to the target density range. For the high-density region, D1, most random samples are feasible, but this is not the case for the low-density region, D5. The hit rate of random sampling allows it to achieve 100% coverage, reported as 1 in the table, within the timeout for D1, but it requires between 3.3 and 7.2 times the number of tests generated by cit4dnn. Moreover, cit4dnn generates those tests up to 7 times faster. In the low-density region, the story is very different. Random sampling cannot hit the target density range frequently, causing it to time out and fail to achieve 100% coverage. In contrast, while cit4dnn incurs increased cost in D5, those costs are comparable to the time for random sampling on D1 for 4 of the datasets. cit4dnn achieves 100% coverage by construction, and its projection technique never fails to map a sampled input to the target density range.
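The feasibility gap between D1 and D5 can be reproduced with a small Monte Carlo sketch (the latent dimension of 8 is a hypothetical choice): each prior sample's cumulative density is the chi-square CDF of its squared norm, and almost no random samples land in the D5 range.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
latent_dim = 8
z = rng.standard_normal((1_000_000, latent_dim))

# Cumulative density of each sample under the N(0, I) prior.
density = chi2.cdf((z ** 2).sum(axis=1), df=latent_dim)

hit_d1 = np.mean(density <= 0.99)                              # D1 = [0, 0.99]
hit_d5 = np.mean((density >= 0.9999) & (density <= 0.999999))  # D5
print(f"D1 hit rate: {hit_d1:.4f}, D5 hit rate: {hit_d5:.6f}")
```

With an expected hit rate of about 1 in 10,000 for D5, random sampling must draw enormous numbers of tests to populate the rare region, which matches the timeouts observed in Table 4.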
SINVAD. Since SINVAD only supports classification datasets, only MNIST, Fashion, and SVHN are used in this experiment. cit4dnn is configured for two target densities, D1 and D4, to study normal and rare inputs, respectively, with k=20, t=3, and n=1. We use the differential testing algorithm and the DNN networks provided in the SINVAD artifact for test generation, and its test oracles in the study [4]. We ran each technique until either the generated tests achieved 100% total 3-way coverage or the total runtime exceeded five hours. Since SINVAD has high runtime costs, we ran it on a server with an 11GB Nvidia GTX 1080Ti GPU, an Intel Xeon E5-2620 2.10GHz CPU, and 128GB of RAM. SINVAD timed out on all runs, whereas cit4dnn with n=1 ran in less than 15 minutes on a CPU alone for each dataset.
Figure 9 reports the detected faults and coverage ratios of SINVAD relative to cit4dnn across D1 and D4 for 10 repetitions of the experiment. The D1 results show that cit4dnn yields a 5-fold increase in total 3-way coverage compared to SINVAD and that cit4dnn detects more faults than SINVAD in 2 of 3 cases in the D1 study. For MNIST, SINVAD detected 842 faults and cit4dnn only 671. We then ran cit4dnn with n = 2 for these datasets and found that it detected 1339 faults, an improvement of 1.6 times relative to SINVAD. Running with n = 2 involves permuting the columns of the covering array, allowing tests to be generated for D1 in 50 seconds, 360 times faster than SINVAD, which ran for 5 hours. More broadly, we find that the cost of cit4dnn grows linearly with n, as does its fault-detection effectiveness. The D4 results in Figure 9 indicate that cit4dnn yields more than a 50-fold improvement in faults detected and total 3-way coverage across all datasets.
RQ4 Finding: cit4dnn is cost-effective at generating normal and rare inputs and at achieving 100% total 3-way coverage compared to random sampling and SINVAD. While cit4dnn is not a fault-directed technique, it outperforms SINVAD in fault detection while running more than 140 times faster.

Threats to Validity
To promote replicability, we release our implementation as open source at https://github.com/less-lab-uva/CIT4DNN.
To mitigate threats to internal validity, we reused an existing CIT algorithm as a starting point, as well as existing DNN model architectures and baselines. We used the default hyperparameters when training the models and running the baselines. For the components we developed, we manually cross-checked the results for anomalies in the data.
To mitigate threats to external validity, we designed the experimental studies to explore a range of configuration parameters of cit4dnn and recently published baselines. The study presented in §4 guides users in selecting VAEs when generalizing cit4dnn to other datasets. We used five datasets representing different data domains. However, all of the datasets use image inputs; generative models are available for other domains [50,69], and we plan to extend cit4dnn to speech and text datasets in the future.

RELATED WORK
DNN test generation approaches generate test inputs by exploring either the input space or the feature space [53]. Techniques such as DeepXplore [49], DeepTest [57], DLFuzz [29], and BET [60] work on the input space and generate test inputs by applying pixel-level transformations to seed inputs. The diversity of the generated tests is limited by the diversity of the seed inputs used by these methods, and they can generate out-of-distribution inputs [8,25].
Test generation approaches that work on the feature space produce in-distribution tests. However, these approaches require a model representing the features of the input data distribution. Given such a model, DeepHyperion [70,71], DeepJanus [52], and methods that apply traditional CIT to ML [19,27,48] have been shown to be effective in covering the input feature space and revealing faults. When such a model is available these approaches can be effective, but the models are costly for domain experts to construct and challenging to produce for high-dimensional input spaces like those found in image DNNs.
In contrast, generative-model-based approaches such as Manifold-Based Test Generation [15], SINVAD [35], and DeepTraversal [68] use the latent space of generative models as a feature domain for test generation, sidestepping the need for a human-defined model. cit4dnn falls into the generative-model-based category but differs from existing approaches in that it (a) guarantees systematic latent-space coverage, which yields a form of systematic feature-diversity coverage, and (b) can efficiently target low-probability input regions, which may harbor faults.
Khadka et al. developed a method that applies CIT to the partitioned latent space of a VAE to generate synthetic datasets for training machine learning models [37]. However, their goal is training-data generation, whereas cit4dnn focuses on DNN testing. While their work uses CIT, cit4dnn uses constrained CIT over the geometry of the latent space, which means that, unlike cit4dnn, their approach cannot generate rare inputs within a target density region of the latent space.

CONCLUSIONS AND FUTURE WORK
cit4dnn applies constrained combinatorial interaction testing [21] to the latent space of a generative model to produce diverse test inputs. Our experimental studies show that cit4dnn is effective in generating feature-diverse test sets compared to state-of-the-art approaches, is cost-effective for generating rare inputs, and is effective in revealing faults.
cit4dnn is the first cost-effective approach for validating model behavior on rare inputs. We plan to further study the relationship between input and fault density by investigating a broader set of models, test oracles, and target density shells. We expect that moving too far out on the tails of the distribution will yield inputs that do not resemble training inputs, and we plan to investigate methods to estimate the range of target density shells for which cit4dnn can produce valuable tests.

Figure 1 :
Figure 1: Diversity of digit 0 with variation in the slant and stroke thickness.

Figure 9 :
Figure 9: Number of fault-revealing inputs (Faults) and total 3-way coverage (Coverage) of the tests generated by SINVAD, normalized with respect to the values measured for cit4dnn, across datasets for D1 (left) and D4 (right).

Table 1 :
Models used in our studies with number of parameters, test accuracy or MSE (Mean Squared Error); "M" denotes millions of parameters.

Table 2 :
Coverage and FID scores for VAEs of different types for each dataset evaluated relative to test set of given size.
Best metric values are in bold. The latent dimension is shown for the σVAE2.

Table 4 :
Comparison of cit4dnn (σVAE2, d_l, d_u, 3, 20, 1) to the random baselines in terms of number of generated tests, number of feasible tests, 3-way coverage, and test generation time for D1 and D5.