DeepSample: DNN sampling-based testing for operational accuracy assessment

Deep Neural Networks (DNN) are core components for classification and regression tasks of many software systems. Companies incur high costs for testing DNN with datasets representative of the inputs expected in operation, as these need to be manually labelled. The challenge is to select a representative set of test inputs as small as possible to reduce the labelling cost, while sufficing to yield unbiased high-confidence estimates of the expected DNN accuracy. At the same time, testers are interested in exposing as many DNN mispredictions as possible to improve the DNN, which leads to the need for techniques pursuing a threefold aim: small dataset size, trustworthy estimates, and misprediction exposure. This study presents DeepSample, a family of DNN testing techniques for cost-effective accuracy assessment based on probabilistic sampling. We investigate whether, to what extent, and under which conditions probabilistic sampling can help to tackle the outlined challenge. We implement five new sampling-based testing techniques, and perform a comprehensive comparison of these techniques and of three further state-of-the-art techniques for both DNN classification and regression tasks. Results serve as guidance for the best use of sampling-based testing for faithful, high-confidence estimates of DNN accuracy in operation at low cost.

Testers submit operational inputs to the DNN to estimate its accuracy (i.e., the probability of not having mispredictions). This allows testers to establish a release criterion and to correct or tune the DNN until the criterion is met.
The reference scenario is the following: a DNN model meant to operate in a target context is trained with a training dataset. The goal of the tester is to select a small yet representative subset of (unlabelled) inputs from an operational dataset, to use as test cases to estimate the DNN accuracy [1]. Their manual labelling has a high cost. The challenge is to build a small test set able to provide an unbiased, high-confidence estimate of the DNN accuracy. At the same time, testers are interested in exposing DNN mispredictions, since these are input to DNN debugging and re-training [2]. The goal thus becomes threefold: build a small dataset, able to faithfully estimate DNN accuracy, and with a good ability to expose mispredictions.
Inspired by operational testing, a known practice in software reliability engineering [3][4][5][6][7], researchers proposed probabilistic sampling to test DNNs. The basic scheme is simple random sampling (SRS). Li et al. proposed a sampling scheme aimed at minimizing the cross-entropy between the selected tests and the operational dataset [1]. Guerriero et al. [2] leveraged adaptive sampling [8] to propose DeepEST, whose objective is to expose many DNN mispredictions while yielding good accuracy estimates. These techniques borrow basic concepts from sampling theory to derive algorithms working well for specific goals or contexts; for instance, CES and DeepEST outperform each other in their respective objectives (lower-variance estimates for the former, better failure exposure for the latter). However, better trade-offs can be achieved by exploiting advanced strategies from statistical sampling, e.g., by properly using the information available to drive the sampling process.
This work aims to give a high-level view of sampling-based DNN testing, to highlight the main knobs for tailoring a technique to one's needs and improving performance, exploiting advanced sampling theory concepts beyond the basic ones (e.g., auxiliary variables, unequal sampling, without-replacement schemes, stratification). To this aim:
• We propose DeepSample, a family of sampling-based DNN testing techniques differing from each other in the sampling strategy, in the auxiliary information used for sampling and for partitioning, and in the estimation process. The framework includes five new testing techniques, each implemented in three variants depending on the auxiliary information used to drive sampling.
• We present a comprehensive comparison of the new techniques and of three existing ones (SRS, CES, DeepEST), to evaluate their ability to assess DNN accuracy and select failing examples. The evaluation is conducted on classification and regression tasks, under 5 testing budgets and 3 datasets, with 3 models per dataset for classification, and 1 dataset and 2 models for regression. The new algorithms turn out to outperform the existing ones in almost all the contexts. Overall, the results allow us to draw guidelines for practitioners and researchers (on relevant factors like whether and which auxiliary information to use, and how to use it) for sampling-based DNN testing for high-accuracy, high-confidence estimates at low cost and with good misprediction exposure ability.

RELATED WORK
Probabilistic sampling is used in operational testing (OT) to estimate the expected reliability of a software system after release. In OT, test suites are built by selecting or generating tests according to the expected operational profile, a probabilistic characterization of the expected usage. OT was central in Cleanroom software engineering [3][4][5][6] and in the Software Reliability Engineering Test process [7].
Over the years, researchers proposed better sampling strategies to improve estimates or lower their cost. Cai et al. [9][10][11] developed Adaptive Testing, still based on the operational profile, but with an adaptive selection of test cases from partitions. Adaptive Testing with Gradient Descent [12] is one of the techniques considered in this study. Stratified sampling too has been used for reliability assessment [13, 14]. Later, Pietrantuono et al. [15, 16] stressed the use of unequal probability sampling to improve efficiency, formalizing several sampling schemes to this aim [17].
Li et al. [1] first proposed sampling for DNN operational accuracy assessment in the CES (Cross-Entropy Sampling) technique. Like OT, CES aims to select a small yet representative sample, by minimizing the cross-entropy between the selected sample and the operational dataset. A sample is expected to contain the same proportion of failing examples as the operational dataset. Guerriero et al. [2] observed that the mere imitation of operational inputs may be inefficient, especially for accurate DNNs, as much effort is wasted to label correctly classified inputs.
They propose DeepEST, exploiting an adaptive sampling algorithm for rare populations [8] to spot more failing examples, hence spending the labelling effort on examples useful for improvement besides assessment. The disproportional selection is balanced by an estimator that preserves unbiasedness.
A further technique is PACE (Practical ACcuracy Estimation) [18], a heuristic method that uses clustering to partition tests into groups, and then uses adaptive random selection of test inputs representative of the clusters. Zhou et al. proposed DeepReduce [19], a two-stage heuristic method exploiting neuron coverage to select a subset of inputs, then using the Kullback-Leibler Divergence to drive the second-stage selection. These techniques, however, are not based on probabilistic sampling like those compared in this work, and they do not guarantee unbiasedness and convergence.
• $O = \{o_1, \ldots, o_N\}$ is the operational dataset, an arbitrarily large set of examples with unknown labels, which are possibly given as input to the model $M$ in the operational phase. Its size is $N = |O|$;
• $T \subseteq O$, $T = \{t_1, \ldots, t_n\}$, is the subset of examples to select from $O$ and to be labelled. This set is used for estimating DNN accuracy, and can also be used to enlarge the training set and improve the DNN performance in new releases. Its size is $n = |T| \ll N$. When an example $t_i$ is submitted to the DNN, a human oracle assigns the expected output to $t_i$, and then compares it with the actual output.
In classification tasks, this gives a binary outcome $y_i$ (whether actual and expected labels match or not). In regression tasks, the comparison gives an offset $e_i$, which is the absolute difference between the true ($p_i$) and predicted ($r_i$) values; considering this a failure or not depends on the tolerable threshold. For our purposes, it suffices to focus on the value of $e_i$.
• $\theta = P(y_i = 1)$, with $i = 1, \ldots, |O|$, is, in classification tasks, the true failure probability on a randomly selected example from the entire operational dataset, and corresponds to the true (unknown) proportion $\theta = \frac{1}{N}\sum_{i=1}^{N} y_i$. Accuracy is defined as: $\xi = 1 - \theta$. In the case of regression, we look at the mean squared error between the true ($p_i$) and predicted ($r_i$) values over the entire operational dataset: $\Delta = \frac{1}{N}\sum_{i=1}^{N} (p_i - r_i)^2$, and $\xi = 1 - \Delta$. Its estimate is $\hat{\xi}$.
Given a sample size budget $n$, the goal of DeepSample is to select a subset $T$ able to give an unbiased (i.e., such that $E[\hat{\xi}] = \xi$) estimate of $\xi$ while maximizing the efficiency of the estimator (i.e., minimizing the variance of the estimate). In addition, the set $T$ is wanted to expose as many failing examples as possible.

Overview of DeepSample
DeepSample is a family of techniques leveraging prior knowledge available about the operational dataset, supposed to be correlated to the variable to estimate (namely, accuracy). Prior information is encoded in what are called auxiliary variables [20], here denoted as $a$; for instance, the confidence value provided by classifiers when predicting a label can be assumed to be (negatively) correlated with the failure probability $\theta$. Clearly, accuracy and efficiency of estimates depend on the extent to which these assumptions hold.
The DeepSample techniques are characterized by two dimensions: i) the sampling algorithm, and ii) the auxiliary variable.
The former specifies a sampling scheme, namely the sequence of steps required to select the tests $t_i$. The latter specifies the auxiliary variable $a$, if used by the sampling scheme (not all auxiliary variables can be used in all the schemes).
There are two ways of exploiting the auxiliary variables. The first is to partition the dataset into classes that are homogeneous with respect to the auxiliary variable (e.g., similar confidence), similarly to stratification in sampling theory [20]. If the variable is well correlated with the failure probability $\theta$ (or $\Delta$ for regression), partitions too should be homogeneous with respect to $\theta$ (or $\Delta$). This allows the tester to wisely allocate the number of examples to draw from each partition, with the aim of reducing the variance of the estimate. The second way is to let the sampling scheme select the examples proportionally to the auxiliary variable's value, so as to get the ones with higher expected failure probability. A proper estimator is then needed to correct the bias due to this unequal selection probability.
Techniques can be with or without replacement. The former (allowing an example to be selected multiple times) are associated with simpler estimators, a common choice in the literature [1] [9] [10] [11] [12]; the latter are expected to give higher efficiency, though the gain in large populations can be marginal (with-replacement schemes will unlikely select the same example twice).
The estimator takes the result of submitting the selected sample $T$ to the DNN $M$ (denoted as $y_i$ and $e_i$ for classification and regression, respectively) and yields an unbiased estimate of $\xi$ by counterbalancing the disproportional selection (cf. Sec. 3.4).

Auxiliary variables
We consider three auxiliary variables for classification problems, and three for regression as well. For classification, they are: Confidence, Distance-based Surprise Adequacy (DSA), and Likelihood-based Surprise Adequacy (LSA). We opted for these variables based on the literature [1, 2]. For regression, they are LSA, and two variables based on the reconstruction error of a simple autoencoder (SAE) and of a variational autoencoder (VAE), which have been shown to be effective in detecting inputs likely to cause failures [21].
Confidence $c_i$ of an input $x_i$ is the maximum value in the probability vector obtained from the last layer's output of the DNN; it is for classification problems only. DSA and LSA, defined by Kim et al. [22], exploit Activation Traces (AT), which are vectors of activation values of neurons belonging to a certain layer. DSA is defined as: $DSA(x_i) = dist_a / dist_b$, where $dist_a$ is the Euclidean distance between the ATs of the input $x_i$ (whose predicted class is $C_A$) and its nearest neighbour belonging to the same class $C_A$, and $dist_b$ is the distance between the ATs of $x_i$ and its nearest neighbour belonging to a different class $C_B$. It makes sense for classification models only. LSA uses Kernel Density Estimation (KDE) [23] to estimate the probability density of each activation value, obtaining the surprise of a new input with respect to the estimated density. LSA is a measure of rareness computed as: $LSA(x_i) = -\log(\hat{f}(x_i))$, where $\hat{f}(x_i)$ is the KDE applied to the new input $x_i$. LSA is for both classification and regression.
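As an illustration, the DSA computation above can be sketched as follows, assuming activation traces are available as NumPy arrays; function and variable names are illustrative, not those of the original implementation:

```python
import numpy as np

def dsa(at_x, pred_class, ref_ats, ref_classes):
    """Distance-based Surprise Adequacy of a single input.

    at_x: activation trace of the input (1-D array)
    pred_class: class predicted by the DNN for the input
    ref_ats: activation traces of the reference set (2-D array, one row per example)
    ref_classes: predicted class of each reference example (1-D array)
    """
    same = ref_ats[ref_classes == pred_class]
    other = ref_ats[ref_classes != pred_class]
    # dist_a: distance to the nearest neighbour of the same predicted class
    dist_a = np.min(np.linalg.norm(same - at_x, axis=1))
    # dist_b: distance to the nearest neighbour of a different class
    dist_b = np.min(np.linalg.norm(other - at_x, axis=1))
    return dist_a / dist_b  # higher DSA = more surprising input
```

A higher ratio means the input lies closer to the decision boundary, hence is more surprising with respect to the reference set.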
For SAE/VAE-based variables, we leverage the reconstruction error. We used the two best-performing autoencoders implemented by Stocco et al. [21]: SAE (Simple Autoencoder), with a single hidden layer, and VAE (Variational Autoencoder). We consider autoencoders as single-image reconstructors, computing their outputs for all the operational examples, and then calculating the reconstruction error as: $e_{rec}(x_i) = \frac{1}{w \cdot h \cdot c} \sum (x_i - x'_i)^2$, where $x_i$ is the original image, $x'_i$ is the reconstructed image, and $w$, $h$, and $c$ are width, height, and channels, respectively. The corresponding auxiliary variables are denoted for brevity as SAE and VAE, meaning the $e_{rec}$ value obtained by SAE and VAE.
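A minimal sketch of the per-pixel reconstruction error, assuming images are NumPy arrays of shape (w, h, c); the helper name is illustrative:

```python
import numpy as np

def reconstruction_error(x, x_rec):
    """Squared reconstruction error, normalized by width * height * channels."""
    w, h, c = x.shape
    return float(np.sum((x - x_rec) ** 2) / (w * h * c))
```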
All variables are assumed to be correlated to accuracy: lower confidence, higher surprise (DSA, LSA), and higher reconstruction error (SAE, VAE) are expected to be related to higher failure probability. To have all-positive variables (from which selection probabilities need to be derived), DSA and LSA for classification are min-max normalized. For regression, as the min-max normalization affects the distribution of test data, we just shift the values: $a_i = LSA_i + \lceil |\min_j(LSA_j)| \rceil$ (the same for SAE and VAE). All the above variables are denoted as $a_i$ in the following, when there is no need to distinguish them. In the case of confidence, $a_i = 1 - c_i$, since confidence is assumed to be negatively correlated with the failure probability.

Testing techniques
The characteristics of the eight compared testing techniques are summarized in Table 1; their description follows.

Without-partitioning techniques.
Simple Random Sampling (SRS). SRS with replacement, where all examples have the same probability of being selected, is the simplest, baseline technique [17] [1]. For SRS, unbiased estimators of $\theta$ (for classification) and $\Delta$ (regression) are, respectively, the observed proportion and mean squared error over the subset of selected tests:
$\hat{\theta} = \frac{1}{n}\sum_{i \in T} y_i$ (Eq. 1), $\hat{\Delta} = \frac{1}{n}\sum_{i \in T} e_i^2$ (Eq. 2).
Simple Unequal Probability Sampling (SUPS). This scheme leverages the auxiliary variable $a$ for selecting the examples. The selection probability $\pi_i$ for the $i$-th example $o_i$ is obtained by normalizing the auxiliary variable: $\pi_i = a_i / \sum_{j=1}^{N} a_j$; this is known as probability-proportional-to-size (PPS) sampling [20]. The selection is with replacement. An unbiased estimator is the sample mean of the observed values re-scaled by the inverse of their selection probability $\pi_i$ and by $N$, known as the Hansen-Hurwitz estimator [24]:
$\hat{\theta} = \frac{1}{nN}\sum_{i \in T} \frac{y_i}{\pi_i}$, $\hat{\Delta} = \frac{1}{nN}\sum_{i \in T} \frac{e_i^2}{\pi_i}$.
Note that this is a generalization of SRS, wherein the selection probability is $\pi_i = 1/N$ for all the examples.
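The SUPS scheme can be sketched on a synthetic operational dataset with a known failure probability; all names are illustrative, and `aux` plays the role of the auxiliary variable $a$:

```python
import numpy as np

def sups_estimate(y, aux, n, rng):
    """PPS sampling with replacement + Hansen-Hurwitz estimate of theta."""
    N = len(y)
    pi = aux / aux.sum()                    # selection probabilities pi_i
    sample = rng.choice(N, size=n, p=pi)    # with-replacement PPS draw
    # Hansen-Hurwitz: mean of y_i / pi_i over the sample, rescaled by 1/N
    return float(np.mean(y[sample] / pi[sample]) / N)

rng = np.random.default_rng(42)
y = np.zeros(10_000)
y[:500] = 1.0                               # true failure probability theta = 0.05
aux = np.where(y == 1, 5.0, 1.0)            # auxiliary variable correlated with failures
estimates = [sups_estimate(y, aux, 200, rng) for _ in range(300)]
```

Averaged over the repetitions, the estimate approaches the true $\theta = 0.05$, illustrating that the Hansen-Hurwitz correction removes the bias of the disproportional selection.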

RHC-Sampling (RHC-S). This is another unequal probability selection scheme, but without replacement, which uses the Rao, Hartley, and Cochran (RHC) estimator [25]. The scheme is as follows: (1) Given the budget of $n = |T|$ test cases, randomly divide the $N = |O|$ units of the operational dataset into $n$ groups, by selecting $N_1$ inputs with SRS without replacement for the first group, then $N_2$ inputs out of the remaining ($N - N_1$) for the second, and so on. This leads to $n$ groups of size $N_1, \ldots, N_n$ with $\sum_{g=1}^{n} N_g = N$. The group size is arbitrary, but we select $N_g = N/n$, as this minimizes the variance.
(2) One test case is then drawn by taking an input from each of these $n$ groups independently, with a PPS sampling according to the above-defined $a$ variable. (3) Denote with $\pi_{g,j}$ the selection probability associated with the $j$-th unit in the $g$-th group, and with $\Pi_g = \sum_{j \in g} \pi_{g,j}$ the sum in the $g$-th group. The unbiased estimators are:
$\hat{\theta} = \frac{1}{N}\sum_{g=1}^{n} \frac{\Pi_g}{\pi_{g,j}} y_{g,j}$, $\hat{\Delta} = \frac{1}{N}\sum_{g=1}^{n} \frac{\Pi_g}{\pi_{g,j}} e_{g,j}^2$,
where $y_{g,j}$ ($e_{g,j}$) refers to the unit selected in group $g$. Cross-entropy Sampling (CES). CES was proposed by Li et al. [1]. The CES algorithm builds the sample by first randomly selecting an initial set of examples, and then selecting the remaining examples trying to minimize the average cross-entropy between the probability distribution of the $k$-dimensional representation of neuron outputs computed on the operational dataset and on the selected images. The objective is to sample a set of examples as representative as possible of the operational dataset, namely one containing the same proportion of mispredictions as the operational dataset. For CES, the authors demonstrate that the estimator is the same as SRS (Eq. 1 and Eq. 2).
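The RHC-S scheme (steps 1-3) can be sketched as follows, with equal group sizes and one PPS draw per group; this is an illustrative sketch under those assumptions, not the authors' implementation:

```python
import numpy as np

def rhc_estimate(y, aux, n, rng):
    """Random grouping, one PPS draw per group, RHC estimator of theta."""
    N = len(y)
    groups = np.array_split(rng.permutation(N), n)   # n groups of size ~ N/n
    total = 0.0
    for g in groups:
        p = aux[g] / aux[g].sum()        # within-group PPS probabilities
        j = g[rng.choice(len(g), p=p)]   # one unit drawn from the group
        # Pi_g / pi_{g,j} reduces to (sum of aux in the group) / aux of drawn unit
        total += y[j] * aux[g].sum() / aux[j]
    return float(total / N)

rng = np.random.default_rng(7)
y = np.zeros(10_000)
y[:500] = 1.0                            # true theta = 0.05
aux = np.where(y == 1, 5.0, 1.0)
est = float(np.mean([rhc_estimate(y, aux, 200, rng) for _ in range(200)]))
```

Because each draw is confined to its own group, no example can be selected twice, which is what gives the without-replacement efficiency gain.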
Deep neural networks Enhanced Sampler for operational Testing (DeepEST). Guerriero et al. presented DeepEST [2], a technique for DNN operational testing with the twofold objective of accuracy estimation and accuracy improvement. DeepEST exploits adaptive sampling [8] to select a sample providing a close and efficient estimate and, at the same time, including a high number of failing examples. The original version of DeepEST works only for classification tasks; we hereafter extend it to regression too, defining the corresponding estimator. The auxiliary variable $a$ is used by DeepEST to define a weight $w_{i,j}$ between any pair of examples $o_i$ and $o_j$ of the operational dataset, used to explore the example space adaptively. The weight $w_{i,j}$ is the value of $a_j$ if $a_j$ exceeds a threshold (i.e., it means that $o_j$ is in an interesting cluster to explore), 0 otherwise. The thresholds are those of the original paper. The strategy acts as follows: the first input is selected via SRS; then, a weight-based sampling (WBS) is used with probability $r$ to sample the next example (or SRS with probability $1-r$). The example $o_i$ is selected at step $k$ with probability $P_{i,k}$:
$P_{i,k} = r \cdot \frac{\sum_{j \in S_k} w_{j,i}}{\sum_{l \notin S_k} \sum_{j \in S_k} w_{j,l}} + (1-r) \cdot \frac{1}{N - |S_k|}$
where:
• $r$: probability of using WBS;
• $S_k$: current sample (all examples selected up to step $k$);
• $w_{j,i}$: weight relating example $o_j$ in $S_k$ to example $o_i$;
• $|S_k|$: the size of the current sample $S_k$;
• $N$: the size of the operational dataset.
WBS selects an example $o_i$ proportionally to the sum of the weights $w_{j,i}$ of the already selected examples toward $o_i$. We compute the following step-by-step estimators to balance for the adaptive sampling, where $\hat{\theta}_1$ and $\hat{\Delta}_1$ are the estimates obtained at step $k = 1$ (hence when $|S_k| = 1$), and $\hat{\theta}_k$ and $\hat{\Delta}_k$ are the Hansen-Hurwitz estimates at step $k > 1$ for the total failures and for the mean squared error:
$\hat{\theta}_k = \frac{1}{N |S_k|} \sum_{i \in S_k} \frac{y_i}{P_{i,k}}$, $\hat{\Delta}_k = \frac{1}{N |S_k|} \sum_{i \in S_k} \frac{e_i^2}{P_{i,k}}$.
The final estimators (Eq. 8, 9) are the sample mean of the step-by-step estimators. For regression, the $k$-th MSE estimate is $\hat{\Delta}_k$.
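A simplified sketch of one DeepEST selection step, assuming the thresholded-weight definition above: with probability r the next example is drawn by weight-based sampling over the remaining examples, otherwise by SRS. All names are illustrative:

```python
import numpy as np

def next_example(selected, aux, threshold, r, rng):
    """Pick the index of the next example, DeepEST-style (simplified sketch)."""
    remaining = np.setdiff1d(np.arange(len(aux)), selected)
    # w_{j,i} = a_i if a_i exceeds the threshold, 0 otherwise; summing over the
    # current sample multiplies each candidate's weight by |selected|
    w = np.where(aux[remaining] > threshold, aux[remaining], 0.0) * len(selected)
    if rng.random() < r and w.sum() > 0:
        return int(rng.choice(remaining, p=w / w.sum()))  # weight-based sampling
    return int(rng.choice(remaining))                     # SRS fallback
```

When no remaining example exceeds the threshold, the step degenerates to SRS, which is also what keeps the selection probabilities strictly positive for the estimator.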

Partition-based techniques. Partition-based techniques split the operational dataset into classes to improve sampling. In sampling theory, stratification splits the population so as to have a small expected intra-stratum variance of the variable to estimate ($\theta$) and a large inter-strata variance, so as to sample more from partitions with higher variance. Since the true variance of $\theta$ is unknown, stratification can be done on an estimate of such variance (e.g., computed from a preliminary sample) [17]. However, this would require labelling a subset just for the purpose of estimating the variance and then applying stratification. Another common solution, which we adopt, is to stratify based on auxiliary variables. Although risky (performance depends on the extent to which they are correlated to $\theta$), this requires no prior knowledge about $\theta$. We used $k$-means clustering [26] on $a$, with $k$ set to 10 after a preliminary tuning on 30 random samples from MNIST, with $k = 6, 8, 10, 12$.
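Stratification on a scalar auxiliary variable can be sketched with a tiny one-dimensional k-means (Lloyd's algorithm); the study uses k = 10, shown here with k = 2 on synthetic values (illustrative code, not the original implementation):

```python
import numpy as np

def kmeans_1d(x, k, iters=50, seed=0):
    """Cluster scalar auxiliary values into k strata via Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=k, replace=False).astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # assign each value to its nearest center
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):          # keep empty clusters unchanged
                centers[c] = x[labels == c].mean()
    return labels, centers
```

Each resulting label defines a partition that is homogeneous in the auxiliary variable, as required by the stratified schemes below.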
Stratified Simple Random Sampling (SSRS). In this scheme, the number of examples to draw from each partition $p$ is computed by the Neyman allocation [20] applied to $a$, namely proportionally to the standard deviation of the (normalized) $a$ values for that partition and to the size of the partition, $N_p$. Selection within the partition is without replacement. The estimators are the weighted sum of the SRS estimates for the partitions:
$\hat{\theta} = \sum_{p=1}^{P} \frac{N_p}{N} \hat{\theta}_p$ (Eq. 12), $\hat{\Delta} = \sum_{p=1}^{P} \frac{N_p}{N} \hat{\Delta}_p$ (Eq. 13),
where $\hat{\theta}_p$ and $\hat{\Delta}_p$ are the within-partition SRS estimators (Section 3.4.1), and $P = k = 10$ is the number of partitions.
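The Neyman allocation can be sketched as follows: each partition receives a share of the budget proportional to $N_p \sigma_p$ (partition size times the standard deviation of the auxiliary variable in the partition). Illustrative code, with a largest-remainder fix for rounding:

```python
import numpy as np

def neyman_allocation(aux, labels, n):
    """Per-partition sample sizes, proportional to N_p * sigma_p."""
    parts = np.unique(labels)
    weight = np.array([(labels == p).sum() * aux[labels == p].std() for p in parts])
    if weight.sum() == 0:                       # degenerate case: proportional allocation
        weight = np.array([(labels == p).sum() for p in parts], dtype=float)
    alloc = np.floor(n * weight / weight.sum()).astype(int)
    frac = n * weight / weight.sum() - alloc
    alloc[np.argsort(-frac)[: n - alloc.sum()]] += 1   # largest-remainder rounding
    return dict(zip(parts.tolist(), alloc.tolist()))
```

A partition whose auxiliary values are constant (zero standard deviation) gets no budget: a single label from it would already characterize the whole stratum.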
Gradient-Based Sampling (GBS). Unlike SSRS, this technique does not initially allocate a sample size to each stratum, but decides step by step which partition the next example will be drawn from. Inspired by adaptive testing with gradient descent [12], at each step the partition is chosen so as to maximize the reduction of the variance $Var(\hat{\xi})$ of the estimator, by taking the partition with the largest negative gradient $-\partial Var(\hat{\xi})/\partial n_p$ (ties broken randomly), $n_p$ being the number of examples selected from partition $p$ up to the current step. The selection within the partition is then with replacement. The estimators are the same as SSRS (Eq. 12, 13). Note that the with-replacement SRS used in GBS and the without-replacement SRS used in SSRS have the same mean estimators; they differ in the variance of these estimators.

Two-stage Unequal Probability Sampling (2-UPS).
This technique implements a two-stage sampling scheme, where unequal probability sampling is adopted to select the partition (first stage), and SRS without replacement is adopted to select the example from the chosen partition (second stage). The selection probability for partition $p$ is proportional to the sum of the (normalized) $a$ values (denoted as $\pi_i$, as in SUPS and RHC-S) within that partition: $P_p = \sum_{i \in p} \pi_i$. Clearly, the selection of partitions is with replacement; $n_p$ is the number of times partition $p$ is selected. The estimator for this technique is the average over the $n$ estimates:
$\hat{\theta} = \frac{1}{nN} \sum_{p=1}^{P} n_p \frac{N_p \bar{y}_p}{P_p}$, $\hat{\Delta} = \frac{1}{nN} \sum_{p=1}^{P} n_p \frac{N_p \bar{e^2_p}}{P_p}$.
Inner terms $(N_p \bar{y}_p)/P_p$ and $(N_p \bar{e^2_p})/P_p$ are Hansen-Hurwitz estimates for the total number of failures and of squared errors in partition $p$, respectively. These estimates are summed up over all partitions and divided by the sample size $n$ to get an average total estimate; the division by $N$ gives $\hat{\theta}$ and $\hat{\Delta}$.
Over $R = 30$ repetitions, we measure the root mean squared error (RMSE) between the accuracy estimates $\hat{\xi}$ and the true accuracy $\xi$ computed on the operational datasets by labelling all the images:
$RMSE = \sqrt{\frac{1}{R}\sum_{r=1}^{R} (\hat{\xi}_r - \xi)^2}$,
where $\hat{\xi}$ for classification and regression is computed using $\hat{\theta}$ and $\hat{\Delta}$, respectively. Lower RMSE means higher confidence in the estimate. To answer RQ1 and RQ2, we consider a budget size of 200, as in [1, 2]; the total runs are 6,600 [11 models × 30 repetitions × (6 techniques × 3 auxiliary variables + the 2 techniques, CES and SRS, not using auxiliary variables)]. For RQ3, with 5 sample size values (50, 100, 200, 400, 800), there are additional 6,600 × 4 = 26,400 runs, for a total of 33,000 runs.
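The RMSE evaluation metric can be sketched directly (illustrative helper, not from the replication package):

```python
import numpy as np

def rmse(estimates, true_acc):
    """Root mean squared error of repeated accuracy estimates vs. the true accuracy."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_acc) ** 2)))
```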

Subjects
The evaluation is on 11 DNN models on popular datasets (Table 2). For classification, we consider 3 models for each of the following datasets: MNIST, CIFAR10, and CIFAR100. Recht et al. [30] showed that if the accuracy is computed on previously unseen data, it is actually smaller than the claimed one, by a value ranging from 3% to 15% on CIFAR10 and from 11% to 14% on ImageNet. Therefore, for a more realistic accuracy, each DNN is trained "from scratch" by separating training, verification, and operational sets, as in [31]. The verification set is the set used to evaluate the DNN. The operational set contains unlabelled images.
The three datasets are split as follows. For MNIST, 7,000 images are for training and 2,500 for verification; the remaining 60,500 entries are the operational dataset (big size). All models trained with this configuration achieve an accuracy greater than 90%. For CIFAR10, we use 24,000 images for training and 2,500 for verification; the remaining 33,500 entries are the operational dataset (medium size). For CIFAR100, 40,000 entries are for training and 5,000 for verification; thus, the operational dataset has 15,000 images (small size). The operational datasets are chosen so that MNIST (big) is almost double the size of CIFAR10 (medium) and four times that of CIFAR100 (small). The greater training set sizes for CIFAR10 and CIFAR100 are due to the higher complexity of the images, so as to pursue an acceptable accuracy. For regression models, we use the entire test dataset as operational dataset, as all its examples are unseen during training.

RQ1: operational accuracy assessment
5.1.1 RQ1.1: Classification. To check whether techniques have a pairwise statistically significant difference, we run the Friedman test [32] on all subject/auxiliary-variable pairs. The p-value is lower than α = 0.05 in all cases; hence, the null hypothesis of no difference among techniques is always rejected. For pairwise comparison, we run the non-parametric post hoc Dunn test [33] with the Holm adjustment. The results are in Figure 1, where gray squares mean no significant difference for the pair, and white (black) squares mean the technique on the row is statistically better (worse) than the one on the column. All exact p-values are in the replication package.
On MNIST, DeepEST and 2-UPS significantly differ from the other techniques (which perform similarly). We show three examples in Figures 2a-2c. The first is on Model A (top-left box in Fig. 1) with confidence as auxiliary variable. Here, the RMSE of 2-UPS is by far the worst; however, it is affected by few outliers, due to the inability of the estimator to balance, within the given budget, the examples whose auxiliary information is incoherent with the result (e.g., failures with high confidence). If we take the square root of the median of squared errors, called RMedSE, we see that 2-UPS is in line with the others. This causes 2-UPS to go unreported as significantly different by the Dunn test (non-parametric, hence robust to outliers). DeepEST, instead, shows to be significantly worse.
The second example (Fig. 2b) is on Model B with LSA (top-middle box in Fig. 1). 2-UPS performs worse than the others, the second worst being DeepEST, although the difference is not detected by the Dunn test. The third example (Fig. 2c) is on Model C with DSA (top-right box in Fig. 1). In this case, both DeepEST and 2-UPS perform worse. In the second and third examples, the values of the RMSE and RMedSE for 2-UPS are close (no outliers); this is attributable to the higher representativeness of LSA and DSA, which were more robust than confidence to misclassifications on inputs closer to the training set.
On CIFAR10 with confidence, the outliers in 2-UPS are even more pronounced (Fig. 2d). DeepEST and 2-UPS again give the worst estimates. The other algorithms are similar (Fig. 2e).
On CIFAR100 with confidence, GBS, SSRS, and SRS differ significantly from the other techniques. Consider Fig. 2f (Model I): GBS, SSRS, and SRS exhibit the best values. Outliers in 2-UPS are confirmed; they are more frequent, especially on less accurate models (the behaviour is more evident with CIFAR10 and CIFAR100, whose models are less accurate than those of MNIST). On the other hand, it is worth stressing that not all the algorithms relying on the auxiliary variable suffer from unstable results; RHC-S and SUPS are more stable. With LSA (Fig. 2g) and DSA (Fig. 2h), the previous results are confirmed; after DeepEST and 2-UPS, CES turns out to be the third worst.
5.1.2 RQ1.2: Regression. The Friedman test gives a p-value lower than α = 0.05 in all the cases, except for DO with the SAE auxiliary variable. Figure 3 reports the pairwise comparisons. With the VAE auxiliary variable, GBS differs from SRS, but it is almost equivalent to the other algorithms on the DO model. Figure 4 confirms that GBS has higher RMSE than the others. For the DD model, 2-UPS differs from SUPS and RHC-S: 2-UPS has the highest RMSE values, while RHC-S has the lowest ones (Figure 5).
With SAE, the Friedman test did not detect any difference for DO, while, for DD, 2-UPS is still the worst technique, although it is closer to GBS and SSRS than in the previous cases. As for RMedSE%, the worst values are always with the LSA variable: for DO, 2-UPS and DeepEST show the worst values (4.2% and 3.2%, respectively), while SSRS has the best value (0.4%); for DD, DeepEST and 2-UPS show 3.9% and 2.1%, respectively, against SUPS with 0.4%. The worst case with autoencoders is for DD with the SAE variable, with 2-UPS (2.0%), while the best one is SRS (0.4%). These results are in line with the classification ones.
Overall, the techniques are all equivalently effective in assessing the operational accuracy of DNN models for classification and regression tasks, except for DeepEST and 2-UPS. While DeepEST was expected to show worse results (its primary objective is failure exposure), 2-UPS shows many outliers, since it is strongly affected by the representativeness of the auxiliary variable.

RQ2: failing examples detection
For RQ2, we treat classification and regression differently. For the former, we count the number of misclassifications. For the latter, we count the number of examples whose offset $e_i$ (predicted vs. actual value) is greater than a threshold $t$, with $t \in [0°, 2.5°, 5°, \ldots, 25°]$: the higher the difference, the more severe the misprediction.
CIFAR10 with confidence, and SUPS for CIFAR100 with confidence. 2-UPS, SUPS, and RHC-S almost equivalently follow DeepEST. These results counterbalance the DeepEST and 2-UPS results on the estimates, which were worse than the others (RQ1.1). DeepEST assumes that failures belong to a rare population, and is conceived to spot them. The greater ability to find misclassifications causes a greater variability of the estimates, and more budget is needed to converge. A similar problem is observed for 2-UPS. We hypothesize that partitioning combined with unequal sampling (both based on the auxiliary variable $a$) can push toward failing examples, but the estimator needs more time to converge. Unlike DeepEST, 2-UPS showed many spikes in the accuracy estimation; the estimator generates spikes every time a failure is detected with "misleading" values of $a$, namely misclassified examples with values of $a$ that would indicate a correct classification, for instance failures with high confidence, or with low LSA/DSA. SUPS and RHC-S seem very good compromises between the two (more details in the final discussion). GBS, CES, and SRS detect fewer failures. For SRS and CES, this is likely because the former does not use any auxiliary variable, and the latter uses cross-entropy, which is not supposed to be related to failures. GBS and SSRS both use $a$ only for partitioning, but GBS detects fewer failures, likely because the algorithm is designed to minimize the variance of the estimate.

RQ2.2: Regression. Tables 4 and 5 report the histograms of the offset, starting from 12.5° to 25°. Compared to the classification case, the differences here are less pronounced. Looking at the sum of the bins, we notice that CES and SRS select fewer examples with a higher offset with respect to the others in the LSA case, while all the techniques are roughly equivalent with VAE and SAE. GBS shows similarly poor performance, but it performs much better when used with VAE (consistently with the more unstable RMSE, Fig. 4). SUPS is the best one with LSA. The good performance of partitioning-based techniques (which achieve or even outperform DeepEST) is attributable to a better effect of partitioning when applied to regression compared to classification (since the auxiliary variable, used for partitioning, and the offset are more correlated).

RQ3: efficiency analysis
5.3.1 RQ3.1: Accuracy assessment. We synthesize in Tables 6 and 7 the results for classification and regression. Besides the RMSE value at each point, we are interested in figuring out whether the techniques smoothly converge as the sample size increases. First, we report, for each dataset, technique, auxiliary variable, and model, how many times the minimum RMSE is reached under the given sample size.
For instance, 3/3/3 for GBS at sample size 800 on MNIST means that the minimum RMSE was reached for all the 3 models used with MNIST, using respectively confidence/LSA/DSA as the auxiliary variable. This is marked as green, and is the expected behaviour. When this is not true for at least one case, we mark it as red, and correspondingly mark as yellow those cells in the same row (with sample size smaller than 800) where the minimum was reached.
There are many cases where the minimum is not achieved with the largest sample size (red cells). For instance, the instability of 2-UPS makes it reach the best values even with a sample size of 50 (MNIST and CIFAR100) and a sample size of 100 (MNIST and CIFAR10). CES with CIFAR100 has the same convergence problem, while it is stable on MNIST and CIFAR10. In the remaining red cases, the minimum is at 400. SRS is the most stable technique, owing to its independence from auxiliary variables. GBS and DeepEST are stable for 2 of 3 datasets; in the bad case, they converge at size 400. For regression, performance is better; GBS is more unstable, while the others converge at 800, with few exceptions at 400 and one (CES) at 200. Tables 6 and 7 also report in how many cases the RMSE with budget 50 is smaller than that with budget 800 (red cells). We call these inversions, denoting convergence problems. There are 5 such cases: 3 for 2-UPS (2 with MNIST and 1 with CIFAR100), 1 for SUPS and 1
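To make the convergence check concrete, the logic behind the colour coding can be sketched as follows (a minimal illustration with made-up RMSE values, not the actual experimental data; `convergence_summary` is a hypothetical helper name):

```python
def convergence_summary(rmse_by_size):
    # rmse_by_size: dict mapping sample size -> RMSE of the accuracy estimate.
    # Returns the sample size where RMSE is minimal (green if it is the
    # largest budget) and whether an "inversion" occurs, i.e., the RMSE
    # at budget 50 is smaller than the RMSE at budget 800.
    best_size = min(rmse_by_size, key=rmse_by_size.get)
    inversion = rmse_by_size[50] < rmse_by_size[800]
    return best_size, inversion

# Made-up values for a smoothly converging technique
sizes = {50: 0.030, 100: 0.021, 200: 0.015, 400: 0.011, 800: 0.008}
best, inv = convergence_summary(sizes)  # -> 800, False
```

An unstable technique would instead return a small `best_size`, or `inversion == True`, matching the red cells in Tables 6 and 7.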

DISCUSSION
We analyze the results with respect to the main impacting factors, to provide guidance both to practitioners (to select the technique best fitting their needs) and to researchers (to design new techniques). The performance of a sampling technique depends on the tester's objective and on the application context.
As for the objective (set in the problem formulation, Sec. 3.1), while a tester is always interested in an unbiased assessment of the DNN accuracy, s/he can specifically focus on: 1○ High confidence (i.e., low variance), e.g., as a criterion to release a DNN, or to choose which DNN to deploy among various alternatives; a high-confidence estimate is usually required in critical domains. This can be achieved by reducing the RMSE or RMedSE: in the former case, one looks for a high-confidence estimate even in the presence of outliers; in the latter case, one neglects the negative effect of outliers. 2○ High failure exposure ability, e.g., when the tester needs to assess and improve the DNN accuracy efficiently, and the high-confidence requirement can be relaxed (e.g., in non-critical domains). The simultaneous assessment and improvement can help during subsequent re-training/fine-tuning iterations to efficiently track progress in the achieved accuracy.

3○ A trade-off between confidence in the accuracy estimate and number of exposed failures, e.g., when a good-confidence estimate is used to monitor the accuracy of a DNN and engineers want to use the exposed failing examples in the re-training actions (these may be triggered only when the accuracy drops under a certain threshold) [34].
As for the context, following our experimental design, the factors that we identified as potentially impacting are: the task (classification or regression), the sample size (hence the budget available), the dataset, and the auxiliary variable, if available, for sampling.
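The two error metrics behind objective 1○ differ only in how they aggregate the squared estimation errors over repeated test campaigns; a minimal sketch (the function names are ours, not from the DeepSample implementation):

```python
import math
import statistics

def rmse(estimates, true_acc):
    # Root Mean Squared Error: the mean is sensitive to outlier estimates
    return math.sqrt(sum((e - true_acc) ** 2 for e in estimates) / len(estimates))

def rmedse(estimates, true_acc):
    # Root Median Squared Error: the median neglects outlier estimates
    return math.sqrt(statistics.median((e - true_acc) ** 2 for e in estimates))

# A single outlying estimate inflates RMSE but barely affects RMedSE
estimates = [0.91, 0.89, 0.90, 0.92, 0.60]
print(rmse(estimates, 0.90) > rmedse(estimates, 0.90))  # True
```

Minimizing RMSE thus targets robustness even in the presence of outlier estimates, while RMedSE deliberately discounts them.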
Table 10 reports a two-way analysis of the ranking performance of the techniques. On the rows, we list the objectives. On the columns, we break down the results by the impacting factor (datasets and models are considered together; the average accuracy of the models on the datasets captures three distinct cases of low, medium, and high accuracy, Tab. 2). For each combination (e.g., RMSE with Classification, Table 10a), we count the number of times a technique was among the top-3 ones, and report the best 3 techniques according to this count. A practitioner should consider the combination best reflecting his/her needs and context. For instance, one might want a high-confidence, robust-to-outlier assessment (row 1) with a medium (200) labelling effort (Table 10b); or (s)he might not want to use LSA or DSA, which are more expensive to compute, preferring the use of confidence (Table 10d). Since exploring any n-way combination could be of interest too (e.g., small RMSE and small sample size and high-accuracy dataset), we release a notebook in our replication package to specify the factors of interest and query the results.
It is worth noting that the newly proposed algorithms (GBS, 2-UPS, RHC-S, SSRS, SUPS) appear among the best three in the vast majority of cases. The following specific considerations can be drawn.
SSRS is particularly good for high-confidence estimates; SUPS (and, to a lesser extent, RHC-S) outperforms the others for high failure exposure, where it even beats DeepEST, which is specifically conceived for that task via adaptive sampling.
SUPS and RHC-S give the best trade-offs. This indicates that they perform generally well for all the objectives.
The distinguishing feature of the new techniques is that they exploit the auxiliary variable for partitioning only (SSRS, GBS) and/or for input selection (RHC-S, SUPS, 2-UPS). This, in essence, makes it possible to direct the sampling toward higher-variance areas of the population, reducing the estimator variance and exposing more failures.
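As an illustration of this idea, auxiliary-variable-driven unequal-probability selection can be sketched as follows (a minimal sketch, not DeepSample's actual algorithms; the proportional weighting scheme is an assumption for illustration):

```python
import random

def aux_weighted_sample(population, aux, n, seed=42):
    # Draw n example indices with replacement, with selection probability
    # proportional to the auxiliary variable (e.g., 1 - confidence),
    # steering the sample toward higher-uncertainty regions.
    rng = random.Random(seed)
    total = sum(aux)
    probs = [a / total for a in aux]
    return rng.choices(range(len(population)), weights=probs, k=n)
```

Since the selection probabilities are unequal, an unbiased accuracy estimate then requires re-weighting each sampled example by the inverse of its inclusion probability, in the style of Horvitz-Thompson estimators.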
In the perspective of a researcher devising a new technique, attention has to be paid to these aspects: the auxiliary variable (whether and which one to use), partitioning, and the replacement scheme (Tab. 1).
Auxiliary variable. The performance of auxiliary variables is useful not only for selecting a technique, but also for designing new ones. The results in Table 10d highlight that the only techniques not using auxiliary variables (SRS and CES) are rarely among the top-3 ones, especially for the failure exposure ability.
Table 11 reports how many times each auxiliary variable yields the best RMSE and RMedSE, and the highest number of failures. For classification, DSA and confidence are the best variables for RMSE/RMedSE 1○ and number of failures 2○, respectively. It is important to highlight that confidence is cheaper to collect, as it comes with the output of the classification. For regression, LSA shows the best results 1○ 2○ 3○. The variables derived by SAE/VAE perform poorly.
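For concreteness, the confidence auxiliary variable is just the maximum softmax probability of the predicted class, available at no extra cost with every classification (a minimal sketch):

```python
import math

def confidence(logits):
    # Max softmax probability of the predicted class;
    # the max logit is subtracted for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)
```

LSA and DSA, by contrast, require processing activation traces collected over the training set, which is why confidence is the cheaper variable to obtain.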
Partitioning. Partitioning based on auxiliary variables is particularly beneficial for good accuracy estimates 1○; SSRS and GBS are the best ones for this aim. The benefit of partitioning is lower when the aim is to expose failures 2○ 3○; performance is better when partitioning with LSA, especially for regression, as it is better correlated with (in)accuracy.
Replacement. We found no remarkable advantage of without-replacement sampling; for instance, SUPS (with replacement) works well in all scenarios. This is likely due to the negligible sample size compared to the operational dataset, which makes sampling with replacement unlikely to pick the same example twice.

THREATS TO VALIDITY
As for the selection of the experimental subjects, we have considered publicly available DNNs [31]; we have however re-trained them from scratch to have realistic accuracy and to avoid the inflated-accuracy issue described in [30].
The choice of the sample size affects the results. We ran a sensitivity analysis with five values of the sample size (from 50 to 800). Different values could yield different results.
The evaluation does not include an extensive analysis of partitioning. We ran k-means, with k = 10 partitions, after a preliminary tuning on 30 random samples from MNIST with k = 6, 8, 10, 12. Extending the tuning of k to all cases could improve performance.
Despite extensive code inspection, the presence of defects in the algorithms cannot be excluded.
External validity is limited by the number of models and datasets; however, we considered state-of-the-art DNNs and widely-used datasets, and the replicability of the experiments mitigates this threat.

CONCLUSIONS
We presented DeepSample, a framework encompassing a set of sampling-based techniques for DNN operational accuracy assessment. We implemented techniques with and without partitioning, with and without replacement, and with and without auxiliary variables to drive the selection, and we empirically evaluated them in terms of accuracy estimation and number of failures, on both classification and regression problems.
The findings pertaining to the individual techniques, as well as to the key factors impacting the sampling algorithms, serve: i) as guidance for testers to select the technique depending on the needs and on the auxiliary information available to expedite sampling, and ii) for researchers to devise new techniques.
We conclude that the tester's objective and the application context are crucial in selecting a sampling technique. Techniques yielding high-confidence estimates (such as SSRS) are well suited to check the DNN against a release criterion, or to choose among different DNNs. Techniques with high failure exposure ability (such as SUPS and DeepEST) are well suited for simultaneous DNN accuracy assessment and improvement in iterative life-cycle models. Techniques exhibiting a good trade-off between high-confidence estimates and high failure exposure (such as SUPS and RHC-S) are appropriate for cost-effective assessment and retraining.
In devising new techniques, the use of auxiliary variables and partitioning is strongly encouraged, as they have been shown to be beneficial for both accuracy estimation and failure exposure: LSA was the best choice for regression, while confidence (for failure exposure) and DSA (for accuracy estimation) were the best ones for classification.

DATA AVAILABILITY
All results and the artefacts for replication are available at: https://github.com/dessertlab/DeepSample.git.

4 EVALUATION
4.1 Research questions and metrics
RQ1: How do the sampling techniques perform in assessing the operational accuracy of DNN models?
• RQ1.1: How do the techniques perform for classification?
• RQ1.2: How do the techniques perform for regression?
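The baseline of this comparison, simple random sampling (SRS), can be sketched as follows (a minimal sketch; `oracle` stands for the costly manual labelling, and the helper name is ours, not from the DeepSample implementation):

```python
import random

def srs_accuracy_estimate(operational_set, oracle, n, seed=0):
    # Draw n inputs uniformly without replacement, label them via the
    # (costly) oracle, and estimate the DNN accuracy as the fraction of
    # correct predictions, with its standard error.
    rng = random.Random(seed)
    sample = rng.sample(operational_set, n)
    correct = sum(1 for x in sample if oracle(x))
    p = correct / n
    se = (p * (1 - p) / n) ** 0.5
    return p, se
```

All the other techniques aim to beat this baseline, either by reducing the variance of the estimate for the same labelling budget n, or by exposing more failures within it.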

Table 1: Compared testing techniques
The sample size is directly related to the cost of labelling, as it determines the number of examples to be manually labelled.
• RQ3.1: How does the size affect the accuracy estimate?
• RQ3.2: How does the size affect the detection of failing examples?
RQ2: How do the sampling techniques perform in detecting failing examples?
An issue of some techniques, like CES, is that they deliberately try to have in the sample the same proportion of failures as in the operational dataset, to faithfully estimate accuracy (what is called the imitation bias [2]); but in highly-accurate DNNs, this entails very few failures exposed, which requires engineers to run further tests to expose failures, an issue addressed by DeepEST [2]. Thus, a desirable property is to expose a high number of failures, besides the ability to provide unbiased high-confidence estimates.
• RQ2.1: Classification task. How many failures (namely, misclassifications) are exposed by the techniques?
• RQ2.2: Regression task. How many examples with an inaccurate prediction are selected by the techniques? Since in regression we have continuous outputs, we measure the number of examples having a difference between true and predicted output (i.e., the offset, the absolute difference between predicted and true value) greater than or equal to a given value, ranging from 0° to 25° with a step of 2.5°.
RQ3: How does the budgeted sample size affect performance?
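The offset-based counting used for RQ2.2 can be sketched as follows (a minimal illustration with made-up predictions; the values are angle-like outputs in degrees):

```python
def count_severe_mispredictions(predicted, actual, threshold):
    # Count regression examples whose offset |predicted - actual|
    # is greater than or equal to the threshold (in degrees).
    return sum(1 for p, r in zip(predicted, actual) if abs(p - r) >= threshold)

# Made-up data; thresholds range from 0 to 25 degrees in steps of 2.5
pred = [1.0, 4.0, -3.0, 12.0]
true = [0.5, -1.0, -2.0, 0.0]
counts = [count_severe_mispredictions(pred, true, 2.5 * i) for i in range(11)]
```

Higher thresholds isolate the more severe mispredictions, which is how the histograms in Tables 4 and 5 are binned.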
Table 3 reports the number of misclassifications broken down by dataset and auxiliary variable (the best mean values are in bold). DeepEST exposes more failures than the others in 7 out of 9 cases; 2-UPS has the highest value only for CIFAR10 with confidence, and SUPS for CIFAR100 with confidence.

Table 11: Number of best-performing occurrences out of 270 (classification) and 60 (regression) configurations