UnWise: High T-Wise Coverage from Uniform Sampling

Configuration spaces of industrial product lines are typically too large to be tested exhaustively. Therefore, testing in practice is often carried out on samples, sets of configurations which satisfy the requirements of the testing scenario. For t-wise sampling, the objective is to cover all t-wise interactions between configurable options with as few configurations as possible. However, a trade-off needs to be made between t, sampling time, sample size, and achieved coverage. In addition, it is infeasible for larger systems to even compute the set of all 2-wise interactions in practicable time. In this work, we reevaluate the performance of uniform samplers in terms of 2-wise coverage and come to a more positive result than previous research. We also present completion and reduction algorithms that greatly improve said performance. As a baseline for comparison, we additionally evaluate the two state-of-the-art dedicated t-wise samplers Baital and YASA. In doing so, we are the first to evaluate and compare these samplers on a large set of industrial feature models.


INTRODUCTION
Contemporary industrial product lines commonly have thousands of configurable options (i.e., features) which give way to enormous configuration spaces (in recent work we encountered models with up to 10^1500 valid configurations [34]). As this makes exhaustive testing infeasible [13], sampling techniques are used to create sets of representative configurations of manageable size (i.e., samples) [32]. For purposes such as combinatorial interaction testing [7,24,25], one is interested in covering as many interactions as possible with the configurations in the sample. A t-wise coverage of 100 % means that every interaction between t features is present in at least one configuration in the sample (i.e., covered).
A number of specialized samplers have been developed with the goal of achieving high to full t-wise coverage with as few configurations as possible [2,4,17,22,29]. However, their time requirements [2,21,22,29], yielded sample sizes [4,29], or achieved coverages [4,29] commonly make them impractical when applied to real-world feature models [31,32]. For example, the state-of-the-art sampler YASA is capable of sampling hard models such as Linux 2.6.28 [34,36] within 30 minutes [22] and achieves 100 % 2-wise coverage, but requires more than 500 configurations to reach this coverage. For large but underconstrained models, such as Automotive02, YASA fails to scale, as it, in essence, needs to iterate over all valid 2-wise interactions and therefore takes days [22].
For scenarios such as continuous integration, where time is of the essence and resources are limited, both a sampling time of 30 minutes and a sample size of 500 configurations are impractical. In the case of JHipster, an open-source developer framework for web applications and microservices, there are only enough resources available to test 12 configurations per commit [13]. In 2016, more than 36,000 kernels were built per day for testing the Linux kernel [31], resulting in 25 configurations being tested per commit on average. Therefore, we propose to trade coverage for increased sampling speed and reduced sample size, which in turn lowers the time and resource requirements for testing.
In this work, we revisit work by Oh et al. and evaluate the coverages achieved by uniform sampling [29]. However, in contrast to their work, our approach achieves high (98 % on median) 2-wise coverages on all models to which uniform sampling, in this case the sampler Spur [1], scales. We achieve this by applying two post-processing steps to the uniformly generated sample. First, we utilize a SAT solver to generate configurations that cover uncovered literals and thereby complete the sample until we reach 100 % 1-wise coverage. Second, we apply a coverage-maintaining reduction to the sample, reducing its size while maintaining 100 % 1-wise coverage. Both post-processing steps typically finish in under one second. Alternatively, we also developed a more time-intensive reduction that maintains the 2-wise coverage, which we use to select a fixed number of configurations that are best, in terms of coverage, from a sample.
We evaluate our approach on 49 real-world feature models (Oh et al. evaluated only FinancialServices [12,29]) and compare it to the state-of-the-art t-wise samplers Baital [3,4] and YASA [22]. In doing so, we are the first to evaluate the performance of Baital and YASA on a large set of industrial feature models. To investigate the influence of the sample, we compare the uniform sampler Spur [1] against the non-uniform [14] sampler Quicksampler [10]. In particular, we investigate the following research questions:
RQ1 How well do state-of-the-art uniform and t-wise samplers scale to real-world feature models?
RQ2 What t-wise coverages are achieved by sampling uniformly?
RQ3 What t-wise coverages are achieved by samples of practical size?
The remainder of this work is structured as follows. We give background information in Section 2 and outline our sampling method in Section 3. We present the findings of our evaluation in Section 4 and compare them with previous work in Section 5.

BACKGROUND
In this section, we give a brief introduction to feature models, their analysis as well as uniform and t-wise sampling.

Feature Models
Feature models have emerged as the default for modeling the variability of configurable systems [5,6]. Configurable options are modeled as features, which are hierarchically organized in a feature diagram. In addition, cross-tree constraints can further restrict the set of valid configurations.
Consider Figure 1, which contains our running example, a product line for tasty toasties. Abstract features are used to organize features but do not impact derived products [37]. Selecting a child feature mandates selecting the parent feature as well. Likewise, mandatory features must be selected whenever their parent is selected. In or groups, at least one feature must be selected when the parent feature is selected; in alternative groups, precisely one. Finally, the root feature must always be selected.
Let C(m) denote the set of all valid configurations of a feature model m. Every configuration in C(m) needs to satisfy the constraints imposed by the feature diagram and the cross-tree constraints. For example, c1 = {White, Ham, Cheese} is a valid configuration for our running example, while c2 = {White, Pineapple} is invalid. In particular, c2 contains neither Ham nor Cheese, violating the or group of Toppings, even though Toppings is selected (as it is a mandatory child of the root feature). Furthermore, Pineapple must not be selected, as its selection violates the cross-tree constraint.
Features like Pineapple that are not part of any valid configuration are called dead features. Likewise, features that appear in every valid configuration are called core features [6]. While models like our running example can be analyzed by hand, this is infeasible for larger models [6]. Therefore, feature models are translated into Boolean formulas [28], which are then analyzed using, for instance, SAT solvers [6,28].

t-Wise Sampling
The goal of t-wise sampling is to cover all t-wise interactions between features. For t = 2, the theoretically possible interactions between features a and b are (a, b), (¬a, b), (a, ¬b), and (¬a, ¬b). However, not all of these interactions are necessarily valid, due to constraints in the feature model.
Let m be a feature model with a set of non-core and non-dead features F*(m), inducing a set of literals L(m) = {f, ¬f | f ∈ F*(m)}. Every literal l ∈ L(m) thus represents the selection or deselection of a feature. In addition, let φ = Φ(m) be a Boolean formula encoding m [6]. With this, we may define the set of all possible ordered t-wise interactions as I_t(m) = {(l_1, …, l_t) ∈ L(m)^t | l_1, …, l_t refer to pairwise distinct features} and the subset of all valid interactions as I*_t(m) = {i ∈ I_t(m) | φ ∧ l_1 ∧ … ∧ l_t is satisfiable}.
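For concreteness, these definitions can be instantiated by brute force on a tiny model. The sketch below is illustrative only (real models require SAT or #SAT solvers); the three-feature model, its single constraint, and all names are hypothetical stand-ins, not the running example or the paper's implementation. For simplicity, it counts unordered interactions.

```python
from itertools import combinations, product

# Hypothetical toy model over features 1..3, encoded as phi = (f1 -> f2),
# i.e., a single CNF clause; literals are signed integers as in DIMACS.
CLAUSES = [[-1, 2]]
FEATURES = [1, 2, 3]

def is_valid(config):
    """A configuration (set of literals, one per feature) satisfies phi."""
    return all(any(lit in config for lit in clause) for clause in CLAUSES)

# C(m): enumerate all valid configurations -- feasible only for tiny models.
configs = [set(c) for c in product(*[(f, -f) for f in FEATURES])
           if is_valid(set(c))]

# I_2(m), unordered: pairs of literals over two distinct features.
literals = [l for f in FEATURES for l in (f, -f)]
I2 = {frozenset(p) for p in combinations(literals, 2)
      if abs(p[0]) != abs(p[1])}

# I*_2(m): pairs that occur together in at least one valid configuration.
I2_valid = {i for i in I2 if any(i <= c for c in configs)}
```

In this toy model, only the interaction (f1, ¬f2) is invalid, so |I*_2| = |I_2| − 1.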

Uniform Sampling
In uniform sampling, configurations are repeatedly chosen uniformly at random (i.e., drawing with replacement, ignoring order) until a specified sample size n is reached [14]. If we assume that there are N valid configurations in total, then the probability for a configuration c to be included in a sample S ⊆ C(m) of size n is

P(c ∈ S) = 1 − (1 − 1/N)^n.    (1)

Given a reasonable sample size n, uniform sampling maintains certain characteristics of the feature model, such as the commonality of features (i.e., the ratio of configurations in the sample containing a feature converges to the commonality of the feature for large enough sample sizes). Consequently, if a feature does not appear in a uniform sample of non-trivial size, it will also be rarely selected in practice, as it only appears in few configurations. In short, configurations in uniform samples are statistically representative of the configuration space.
Based on Equation 1, one can predict the t-wise coverage achieved by a uniform sample [29]. Let i ∈ I*_t(m) be a t-wise interaction and let c(i) be the number of valid configurations including the interaction i (i.e., c(i) = |{c′ ∈ C(m) | c′ covers i}|). Then

P(i is covered by S) = 1 − (1 − c(i)/N)^n    (2)

denotes the probability that a uniform sample of size n covers the interaction i.
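As a sanity check, the coverage probability of Equation 2 (as reconstructed here) can be compared against a simulation; the concrete numbers below are invented for illustration only.

```python
import random

def p_covered(c_i, N, n):
    """Eq. 2: probability that a uniform sample of size n, drawn with
    replacement from N valid configurations, covers an interaction that
    occurs in c_i of those configurations."""
    return 1.0 - (1.0 - c_i / N) ** n

# An interaction occurring in only 5 % of all configurations is still
# covered almost surely by a sample of 100 configurations.
p = p_covered(c_i=50, N=1000, n=100)

# Monte-Carlo estimate of the same probability.
random.seed(0)
TRIALS = 20_000
hits = sum(any(random.randrange(1000) < 50 for _ in range(100))
           for _ in range(TRIALS))
estimate = hits / TRIALS
```

The closed form and the simulation agree closely, illustrating why interactions with many covering configurations are essentially free under uniform sampling, while rare interactions dominate the residual incompleteness.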

OUR APPROACH
In this section, we present our approach for achieving high t-wise coverages by post-processing uniform samples. First, we compute a sample using a uniform sampler. Second, we measure the 1-wise coverage of the sample using a SAT solver (in our case MiniSAT [11], the PySAT [16] default). Third, we complete the sample by computing additional configurations with the SAT solver until all literals are covered (i.e., 100 % 1-wise coverage is achieved). Fourth and last, we reduce the sample while maintaining 100 % 1-wise coverage, as described in Section 3.2. Alternatively, we can reduce the sample while maintaining its achieved 2-wise coverage.
The post-processing is, however, not limited to uniform samples and can also be applied to the samples generated by the t-wise samplers. In particular, the 2-wise reduction algorithm can be used to select a subsample of fixed size from a sample.

Completion
As the commonality of a feature directly translates to the frequency with which its literals appear in a uniform sample (cf. Section 2.3), it is likely that for features with commonality close to 0 % or 100 % only one of the feature's literals appears in the sampled configurations. As an uncovered literal directly causes all t-wise interactions including this literal to be uncovered, even a 1-wise coverage of 99 % has the potential to snowball into abysmal t-wise coverages for t ≥ 2.
Therefore, our approach harnesses the high performance of SAT solvers on feature models [23,28] to generate additional configurations until the sample achieves 100 % 1-wise coverage. Algorithm 1 depicts the completion procedure. For every literal that is not covered by the current sample, we call the SAT solver. As we already know that a configuration exists for an uncovered literal l (as otherwise it would not be in the set of valid literals I*_1), we are not interested in the solver's decision, but rather in the configuration it produces as proof of its decision. As other uncovered literals may appear in this configuration, we remove not only l but all newly covered literals from the set of uncovered literals.
Algorithm 1: 1-wise Completion

After completion, our sample does achieve 100 % 1-wise coverage but likely contains redundant configurations that cover literals which are also covered by other configurations. While these configurations might cover additional t-wise interactions (cf. Section 4), we prioritize a smaller sample size over coverage for now.
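A minimal sketch of the completion step, with the SAT solver abstracted behind a callable. The paper uses MiniSAT via PySAT; the `solve_with` brute-force "solver" and the toy model below are hypothetical stand-ins, not the actual implementation.

```python
from itertools import product

def complete(sample, valid_literals, solve_with):
    """Extend sample (a list of literal-sets) until every literal in
    valid_literals is covered, i.e., 100 % 1-wise coverage is reached."""
    uncovered = set(valid_literals)
    for config in sample:
        uncovered -= config
    while uncovered:
        lit = next(iter(uncovered))
        config = solve_with(lit)   # the model returned as proof of SAT
        assert config is not None  # lit is a valid literal, so a model exists
        sample.append(config)
        uncovered -= config        # one solver call may cover several literals
    return sample

# Hypothetical toy model (f1 -> f2) with a brute-force stand-in solver.
CONFIGS = [set(c) for c in product(*[(f, -f) for f in (1, 2, 3)])
           if -1 in c or 2 in c]
solve_with = lambda lit: next((c for c in CONFIGS if lit in c), None)
valid_literals = set().union(*CONFIGS)

sample = complete([{1, 2, 3}], valid_literals, solve_with)
```

Note that removing all newly covered literals after each solver call, rather than only the queried one, is what keeps the number of solver calls small in practice.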

Reduction
Our procedure to reduce the sample size is straightforward and favors being fast and "good enough" over producing samples of minimal size. As uniform sampling already produces samples that are non-minimal in terms of coverage, even a reduced sample of minimal size is likely to be larger than the samples produced by dedicated t-wise samplers that aim for 100 % 1-wise coverage.
The procedure itself, which is depicted in Algorithm 2, incrementally selects configurations from the sample that cover the most literals or interactions not yet covered by the configurations in the reduced sample. While this greedy selection could be augmented to account for the rarity of literals or interactions, we found in preliminary experiments that the improvement was marginal at best.
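For t = 1, the greedy selection can be sketched as follows; the function name is illustrative, not the paper's implementation.

```python
def reduce_1wise(sample):
    """Greedily pick configurations (literal-sets) covering the most
    not-yet-covered literals until the reduced sample matches the
    1-wise coverage of the full sample."""
    target = set().union(*sample)      # literals covered by the full sample
    reduced, covered = [], set()
    while covered != target:
        best = max(sample, key=lambda c: len(c - covered))
        reduced.append(best)
        covered |= best
    return reduced

# Four configurations over two features; two suffice for full 1-wise coverage.
reduced = reduce_1wise([{1, 2}, {-1, 2}, {1, -2}, {-1, -2}])
```

Greedy set cover is not guaranteed to be minimal, which matches the "good enough" goal above; it is, however, fast and deterministic for a fixed input order.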

Algorithm 2: 𝑡-wise Reduction
Input: sample S. Output: reduced sample S′.

While the implementation of 1-wise reduction can follow Algorithm 2 directly, already Line 1 is often infeasible for larger models and t ≥ 2. Therefore, we exploit the following observation in our implementation for t = 2. Let S′ = {a} and suppose we want to compute the number of 2-wise interactions newly covered by adding a valid configuration b to S′. Then it holds that

Δ(S′, b) = |a ∩ b| · |b \ a| + |b \ a| · (|b \ a| − 1) / 2.    (3)

Clearly, interactions built from literals in a ∩ b are already covered by a, while interactions built from b \ a are not. Adding b to S′ covers all interactions in (a ∩ b) × (b \ a), hence the first part of Equation 3, and the literals in b \ a also form interactions with each other, hence the second part (the division by two ignores permutations). All these interactions are trivially valid, as a and b are valid configurations. Now let S′ = {a, b}, L = a ∪ b, and suppose we want to compute Δ(S′, c), i.e., the number of interactions additionally covered by adding a configuration c to S′. Now, Equation 3 does not necessarily hold anymore, as, for instance, both literals of a variable may have appeared in a or b, and therefore new interactions are also possible between literals in the set c ∩ L. However, any interaction (l, k) with l ∈ c \ L is guaranteed to be uncovered, as l does not appear in any configuration in S′. Our 2-wise reduction algorithm exploits these observations to replace Lines 1 and 5 in Algorithm 2.
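The counting argument for S′ = {a} (Equation 3, as reconstructed above) can be checked against an explicit enumeration of unordered interactions; the two example configurations below are arbitrary.

```python
from itertools import combinations

def interactions(config):
    """All unordered 2-wise interactions contained in one configuration.
    Pairs automatically span distinct features, as a configuration holds
    exactly one literal per feature."""
    return {frozenset(p) for p in combinations(config, 2)}

def delta_single(a, b):
    """Eq. 3: interactions newly covered by adding b when S' = {a}."""
    k = len(b - a)
    return len(a & b) * k + k * (k - 1) // 2

# Configurations as literal-sets over four features.
a = {1, 2, 3, 4}
b = {1, 2, -3, -4}
direct = len(interactions(b) - interactions(a))
```

Here |a ∩ b| = 2 and |b \ a| = 2, so Equation 3 yields 2 · 2 + 1 = 5 newly covered interactions, matching the enumeration. Once S′ holds several configurations, only interactions involving a literal outside a ∪ b can be counted this way; the remainder must be checked explicitly, which is exactly where the closed form saves work.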

EVALUATION
In this section, we present the findings of our evaluation and answer our research questions.

Preliminaries
We start by introducing the samplers we used in our evaluation, followed by the subject systems, the execution environment and our methodology.
4.1.1 Samplers. We use the two state-of-the-art t-wise samplers Baital and YASA, which to the best of our knowledge have never been jointly evaluated, to establish baseline coverages. In addition, we evaluate the three uniform samplers Smarch, Spur, and Quicksampler.

Smarch. Smarch [30] builds a uniform sample by recursively invoking the #SAT solver sharpSAT [38], partitioning the configuration space with the cube-and-conquer method by Heule et al. [15]. While Smarch is known to scale poorly to most industrial feature models [14], we include it for comparison with the work by Oh et al. [29].

Spur. Spur [1] likewise depends on the #SAT solver sharpSAT [38] but scales considerably better (cf. Section 4.3).
Quicksampler. Quicksampler [10] computes samples probabilistically based on atomic mutations. However, both the sample's uniformity and the validity of its configurations are not guaranteed, but only statistically probable [10,14].
4.1.2 Subject Systems. For our evaluation, we use 49 real-world feature models from a variety of origins and domains. 36 models are provided by Oh et al. [30], containing models from the embedded and software systems domain. Furthermore, we include three representative models (minimum, median, and maximum number of features) from the large number of very similar CDL models [35], seven KConfig models, two automotive models [18], and one model from the financial domain [12]. 23 of the models have less than 1,000 features, 21 models have between 1,000 and 10,000 features, and the remaining four models (automotive02v4 (18,616 features), embtoolkit-smarch (23,516 features), freetz (31,012 features), and uclinux-config (11,254 features)) have up to 31,012 features.

Uniform Sampling. For every subject system, we attempted to compute a uniform sample of size 1024 = 2^10 with each of the three uniform samplers Smarch, Spur, and Quicksampler, with a timeout of five minutes. We chose this sample size based on previous work and the sample sizes yielded by Baital and YASA in preliminary experiments. On success, we removed duplicate configurations from the respective samples, which sporadically appear for small models. For Quicksampler, which does not guarantee the validity of the configurations in the sample, we additionally removed invalid configurations.

Achieved
Coverages.We were unable to count the number of valid 2-wise interactions for automotive02v4 and freetz within 24 hours and can therefore make no statement on the achieved coverages on these models.Figure 3 depicts the 1-and 2-wise coverages achieved by samples by the respective samplers without any post-processing.
The best results are achieved by 2-wise sampling with YASA, which achieved 100 % 2-wise coverage for all but three models, namely linux-2.6.33.3 (99.40 %), automotive01 (99.99 %), and buildroot (99.56 %).For Baital, the results are curious, as both 1-wise and 2-wise sampling appears to perform virtually equal with regards to 2-wise coverage.This is most likely due to its prioritization of sampling time over sample size and coverage.
For the two other samplers, Quicksampler and Spur, the difference between sampling non-uniformly and sampling uniformly are very apparent in the achieved coverages, even without taking post-processing into account.With a target sample size of 1024, Quicksampler achieves a 2-wise coverage of 52.3 % on median and 51.6±16.3% on average and does not on the other hand, achieves a median of 91.8 % and a mean of 88.9 ± 11.7 %, and nine times with 100 % 2-wise coverage 4.2.3Post-Processing.So far, we looked at the time and achieved coverages, without consideration of the sample sizes.Figure 4 depicts the sample sizes without post-processin after 1-wise completion, and after 1-wise completion and 1-wise or 2-wise reduction, respectively.Starting again with the -wise samplers, we see that without post-processing, YASA (t = 2) on median and average outperforms Baital significantly in terms of sample size.However, for four models, YASA requires more than 500 configurations to reach its target coverage of 100 %.The difference in sample size becomes even more apparent when comparing the sample size of YASA (t = 1) to Baital (t = 1), where Baital requires our post-processing to even come close to the sample sizes generated by YASA for full 1-wise coverage.The sample sizes of both are only marginally affected by the completion, due to their already high to full 1-wise coverages.
For the uniform samplers, Quicksampler produces more than the targeted 1,024 configurations for some models and also commonly loses 20 % or more due to duplicate or in- sample size (272.5 4 vs 483.8), and number of models for which 100 % 2-wise coverage was achieved (43 vs 16).

Achieved
Coverages.We were unable to count the number of valid 2-wise interactions for automotive02v4 and freetz within 24 hours and can therefore make no statement on the achieved coverages on these models.Figure 3 depicts the 1-and 2-wise coverages achieved by samples by the respective samplers without any post-processing.
The best results are achieved by 2-wise sampling with YASA, which achieved 100 % 2-wise coverage for all but three models, namely linux-2.6.33.3 (99.40 %), automotive01 (99.99 %), and buildroot (99.56 %).For Baital, the results are curious, as both 1-wise and 2-wise sampling appears to perform virtually equal with regards to 2-wise coverage.This is most likely due to its prioritization of sampling time over sample size and coverage.
For the two other samplers, Quicksampler and Spur, the difference between sampling non-uniformly and sampling uniformly are very apparent in the achieved coverages, even without taking post-processing into account.With a target sample size of 1024, Quicksampler achieves a 2-wise coverage of 52.3 % on median and 51.6±16.3% on average and does not achieve 100 % 2-wise coverage for any of the models.Spur on the other hand, achieves a median of 91.8 % and a mean of 88.9 ± 11.7 %, and nine times with 100 % 2-wise coverage.

4.2.3
Post-Processing.So far, we looked at the time and achieved coverages, without consideration of the sample sizes.Figure 4 depicts the sample sizes without post-processing after 1-wise completion, and after 1-wise completion and 1-wise or 2-wise reduction, respectively.Starting again with the -wise samplers, we see that without post-processing, YASA (t = 2) on median and average outperforms Baital significantly in terms of sample size.However, for four models, YASA requires more than 500 configurations to reach its target coverage of 100 %.The difference in sample size becomes even more apparent when comparing the sample size of YASA (t = 1) to Baital (t = 1), where Baital requires our post-processing to even come close to the sample sizes generated by YASA for full 1-wise coverage.The sample sizes of both are only marginally affected by the completion, due to their already high to full 1-wise coverages.
t-Wise Sampling. As a baseline, we used Baital and YASA to compute 1- and 2-wise samples for all of the subject systems, within time limits of both five minutes and one hour.
Coverage Calculation. Based on the samples obtained from both uniform and t-wise sampling, we measured the achieved 1- and 2-wise coverages. To investigate the impact of our post-processing, we additionally measured the achieved coverages and sample sizes after 1-wise completion together with no reduction, 1-wise reduction, and 2-wise reduction.

Results
In the following, we present the results of our experiments, grouped by the overall sampler performance in terms of number of successes, sampling time, and sample size, as well as the achieved coverages with and without post-processing.

Sampling Performance.
Out of the three uniform samplers in our evaluation, only Quicksampler (QS) was capable of computing samples for all of the 49 models within the time limit of 5 minutes. Spur performed second best but was unsuccessful for the models buildroot (QS: 78.7 s), embtoolkit-smarch (QS: 162.4 s), freetz (QS: 212.0 s), and linux-2.6.33.3 (QS: 13.0 s). Within a timeout of 1 hour, Spur was additionally able to compute a sample for embtoolkit-smarch (612.7 s).
Smarch, which was used as the uniform sampler by Oh et al. [29] in their measurement of t-wise coverage from uniform sampling, was only able to sample four trivial models. For example, Smarch required 230.6 s to compute 1,024 sample configurations for JHipster [13], which each of the other uniform samplers sampled in well under 1 s. Therefore, we exclude Smarch from further consideration.
Both t-wise samplers, Baital and YASA, are capable of producing intermediate results. Therefore, we count the availability of any non-empty sample at timeout as a success. With the configuration from above, Baital computed 1- and 2-wise samples for all of the 49 models within 30 s each. Within the time limit of 5 minutes, YASA was able to sample all models but embtoolkit-smarch (both 1- and 2-wise), for which it was successful within 1 hour.

Related Work
In this section, we summarize works on t-wise and uniform sampling and their relation to ours.Closest to our work is the work of Oh et al. that measured the 1-and 2-wise coverages achieved by samples generated with uniform sampling [29].They used their uniform sampler Smarch [30] to compute samples of different sizes up to 14 shows that this is an outlier limited to FinancialServices which can be overcome by our post-processing (cf.Section 4.3) and that uniform sampling achieves a 2-wise coverage of 88.9 % without and 96.2 % with post-processing for all other models in our evaluation.
Like their and our work, both YASA [21,22] and Baital [3,4] are answers to the sampling scalability challenge of Pett et al. [32].The challenge asks for submissions evaluating the -wise sampling of three models (FinancialServices, linux-2.6.33.3, and automotive02v4) with regards to time and memory requirements.We contribute to this cause by   contain a significant outlier in FinancialServices, which we therefore discussed separately in Section 4.3.

Related Work
In this section, we summarize works on t-wise and uniform sampling and their relation to ours.Closest to our work is the work of Oh et al. that measured the 1-and 2-wise coverages achieved by samples generated with uniform sampling [29].They used their uniform sampler Smarch [30] to compute samples of different sizes up to 14 shows that this is an outlier limited to FinancialServices which can be overcome by our post-processing (cf.Section 4.3) and that uniform sampling achieves a 2-wise coverage of 88.9 % without and 96.2 % with post-processing for all other models in our evaluation.
Like their and our work, both YASA [21,22] and Baital [3,4] are answers to the sampling scalability challenge of Pett et al. [32].The challenge asks for submissions evaluating the -wise sampling of three models (FinancialServices, linux-2.6.33.3, and automotive02v4) with regards to time and memory requirements.We contribute to this cause by

Achieved
Coverages.We were unable to count the number of valid 2-wise interactions for automotive02v4 and freetz within 24 hours and can therefore make no statement on the achieved coverages on these models.Figure 3 depicts the 1-and 2-wise coverages achieved by samples by the respective samplers without any post-processing.
The best results are achieved by 2-wise sampling with YASA, which achieved 100 % 2-wise coverage for all but three models, namely linux-2.6.33.3 (99.40 %), automotive01 (99.99 %), and buildroot (99.56 %).For Baital, the results are curious, as both 1-wise and 2-wise sampling appears to perform virtually equal with regards to 2-wise coverage.This is most likely due to its prioritization of sampling time over sample size and coverage.
For the two other samplers, Quicksampler and Spur, the difference between sampling non-uniformly and sampling uniformly are very apparent in the achieved coverages, even without taking post-processing into account.With a target sample size of 1024, Quicksampler achieves a 2-wise coverage of 52.3 % on median and 4 67.5 if one ignores the 5 models with sample sizes above 500, as YASA prioritizes 100 % coverage over sample size 51.6±16.3% on average and does not achieve 100 % 2-wise coverage for any of the models.Spur on the other hand, achieves a median of 91.8 % and a mean of 88.9 ± 11.7 %, and nine times with 100 % 2-wise coverage.

4.2.3
Post-Processing.So far, we looked at the time and achieved coverages, without consideration of the sample sizes.Figure 4 depicts the sample sizes without post-processing, after 1-wise completion, and after 1-wise completion and 1-wise or 2-wise reduction, respectively.Starting again with the -wise samplers, we see that without post-processing, YASA (t = 2) on median and average outperforms Baital significantly in terms of sample size.However, for four models, YASA requires more than 500 configurations to reach its target coverage of 100 %.The difference in sample size becomes even more apparent when comparing the sample size of YASA (t = 1) to Baital (t = 1), where Baital requires our post-processing to even come close to the sample sizes generated by YASA for full 1-wise coverage.The sample sizes of both are only marginally affected by the completion, due to their already high to full 1-wise coverages.
For the uniform samplers, Quicksampler produces more than the targeted 1,024 configurations for some models and also commonly loses 20 % or more due to duplicate or invalid configurations.Spur only loses configurations due to duplicates and always has a sample size of or below 1,024.While Figure 3 already suggested a negative impact from the missing uniformity of its sample, this is illustrated again by the fact that the increase in sample size due to 1-wise completion is more sizable for Quicksampler than for Spur. Figure 5 depicts the 2-wise coverages achieved by the samplers after the various post-processing scenarios have applied.The leftmost column per sampler configuration (i.e., no post-processing) is identical to the right column in Figure 3.One can see, that all samplers except YASA benefit from 1-wise completion.Quicksampler benefits the most (med: 52.4 % → 85.9 %, avg: 51.6±16.3% → 86.5±6.0 %) and Spur second most (med: 91.8 % → 98.1 %, avg: 88.9±11.7 % → 96.3±3.9 %).
Furthermore, while we see that 1-wise reduction is too aggressive and reduces the 2-wise coverage significantly, the resulting coverages after 1-wise reduction (cf. 3rd column in Figure 5) are within 6 % of each other for all samplers except Quicksampler. On the other hand, all samplers but YASA benefit from 2-wise reduction, but are still outperformed by YASA, which has the smallest sample sizes with the best 2-wise coverage for all models to which it scales. Table 2 contains the sample sizes and 2-wise coverages achieved with no post-processing, with 1-wise completion and 2-wise reduction, and with 1-wise completion and 2-wise reduction to size 16. While our choice of target size 16 is mostly arbitrary, it aligns well with sample sizes currently used in practice [13]. We see that the 2-wise reduction significantly reduces the sample sizes of Baital and the uniform samplers, while leaving the sample size of YASA largely unaffected. Finally, by limiting the sample size to 16, we see that the dedicated 2-wise samplers still achieve coverages above 93 %.
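A 2-wise reduction of this kind can be understood as a greedy set cover over the interactions the sample already covers: repeatedly keep the configuration that adds the most not-yet-kept interactions, and drop everything that becomes redundant. The sketch below is a minimal illustration under that reading, not the exact reduction implemented in the paper.

```python
from itertools import combinations

def pairwise_interactions(config):
    """All 2-wise interactions (two (feature, value) literals) in a configuration."""
    return {((i, config[i]), (j, config[j]))
            for i, j in combinations(range(len(config)), 2)}

def two_wise_reduction(sample):
    """Greedy set cover: keep a small subset of `sample` that still
    covers every 2-wise interaction the full sample covers."""
    target = set().union(*(pairwise_interactions(c) for c in sample))
    kept, covered = [], set()
    while covered != target:
        # configuration with the largest marginal interaction gain
        best = max(sample, key=lambda c: len(pairwise_interactions(c) - covered))
        kept.append(best)
        covered |= pairwise_interactions(best)
    return kept

# the duplicate configuration is dropped without losing any interaction
sample = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
reduced = two_wise_reduction(sample)
```

This also illustrates why reduction effort grows with the starting sample size: every greedy step scans the full sample and its interaction sets, which is what makes reducing the large uniform samples costly.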

Discussion
Sampler Scalability. With regard to RQ1 "How well do state-of-the-art uniform and t-wise samplers scale to real-world feature models?", our evaluation shows that all evaluated samplers except Smarch are capable of scaling to the vast majority of models. Quicksampler achieves its perfect scalability by sacrificing both the validity and uniformity [8,10,14,33] of its samples. Curiously, while both Smarch and Spur depend on sharpSAT [38], Spur scales to all 45 models for which sharpSAT scales [34], while Smarch only scales to four models.
Sample Quality. With regard to RQ2 "What t-wise coverages are achieved by sampling uniformly?", we find that uniform sampling with a state-of-the-art uniform sampler like Spur is capable of achieving high 2-wise coverages for all models but the non-software model FinancialServices. With our post-processing, the sample size can be reduced by up to a factor of 4 while 2-wise coverages of 96.3 % on average are achieved. However, the large starting sample size of uniform sampling results in a reduction effort that typically exceeds the sampling time (we encountered 2-wise reduction times of up to 30 minutes for larger models). In short, it is always better to just use YASA with our timeout augmentation.
Practical Sampling. Our experiment with a fixed sample size of 16 (cf. Table 2) shows that high 2-wise coverages of over 93.5 % on average can still be achieved. Therefore, as an answer to RQ3 "What t-wise coverages are achieved by samples of practical size?", we conclude that limiting the sample size to meet resource limitations is possible without losing too much coverage.
FinancialServices. As mentioned before, the non-software model FinancialServices constitutes a hard benchmark for t-wise sampling [4,21,22,32], due to its excessive use of alternative groups and cross-tree constraints [12]. Using Equation 2, Oh et al. estimated that a uniform sample size of 10^12 is required to achieve a 2-wise coverage of above 90 % and a sample size of 10^14 for 99.99 % [29]. As FinancialServices only contains about 9.7 · 10^13 configurations in total, this would de facto be enumeration. However, we found in previous research that 100 % 2-wise coverage of FinancialServices can be achieved with around 4,400 configurations [21]. Our evaluation confirms this (cf. Figure 4). In addition, we found that a 2-wise coverage of 97.0 % can be achieved with 330 configurations by combining Spur with our inexpensive 1-wise completion and 1-wise reduction.
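The intuition behind such estimates is a simple probabilistic argument: a uniform sample of n configurations misses an interaction that occurs in a fraction p of all configurations with probability (1 - p)^n. The sketch below restates this argument in code; it is our own simplification and not necessarily the exact form of Equation 2 from Oh et al. [29].

```python
def expected_coverage(ratios, n):
    """Expected t-wise coverage of a uniform sample of size n, where
    `ratios` lists, for each valid interaction, the fraction of all
    configurations that contain it. An interaction with occurrence
    ratio p is covered with probability 1 - (1 - p)**n."""
    return sum(1 - (1 - p) ** n for p in ratios) / len(ratios)

# a very rare interaction (1 in 10**12 configurations) is still missed
# with probability ~1/e even by a sample of 10**12 configurations
rare = expected_coverage([1e-12], 10 ** 12)

# a common interaction (present in 25 % of configurations) is almost
# surely covered by a small sample
common = expected_coverage([0.25], 64)
```

This makes the FinancialServices outlier plausible: its constraints make many interactions extremely rare, so the rare-interaction regime dominates and the estimated uniform sample sizes explode.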

Threats to Validity
Internal Validity. Due to the probabilistic and therefore nondeterministic nature of the samplers, distinct runs may compute samples of different sizes, with different coverages, and with different runtimes. However, in preliminary experiments on a subset of models, we found that the differences in outcome are marginal and do not change the overall outcome. In their evaluation of Baital, Baranov et al. made a similar observation after repeating their experiments [4]. Therefore, we decided against multiple repetitions of our experiment, due to the large investment of time and resources. Nonetheless, our artifact supports multiple repetitions.
In addition, our implementation of the wrappers for the samplers, the post-processing, or the coverage computation may be flawed. However, we verified for some smaller models that the coverage achieved by uniform sampling with Spur corresponds to the theoretically expected value (cf. Equation 2 [29]). Furthermore, we successfully verified the calculated coverages for some systems against the values computed internally by Baital and YASA.
External Validity. It is possible that our approach yields different results for models not included in our evaluation. However, we chose a large number of models with different sizes and complexities from a variety of domains, including non-software systems. All of these models have been used in previous works [4,14,22,34]. Our evaluation itself does contain a significant outlier in FinancialServices, which we therefore discussed separately in Section 4.3.

RELATED WORK
In this section, we summarize works on t-wise and uniform sampling and their relation to ours.
Closest to our work is the work of Oh et al., who measured the 1- and 2-wise coverages achieved by samples generated with uniform sampling [29]. They used their uniform sampler Smarch [30] to compute samples of different sizes of up to 10^14 configurations for the FinancialServices [12] model and concluded that uniform sampling alone is not enough to achieve high to full 2-wise coverage. Our evaluation, however, shows that this is an outlier limited to FinancialServices, which can be overcome by our post-processing (cf. Section 4.3), and that uniform sampling achieves a 2-wise coverage of 88.9 % without and 96.2 % with post-processing for all other models in our evaluation.
Like theirs and ours, both YASA [21,22] and Baital [3,4] are answers to the sampling scalability challenge of Pett et al. [32]. The challenge asks for submissions evaluating the t-wise sampling of three models (FinancialServices, linux-2.6.33.3, and automotive02v4) with regard to time and memory requirements. We contribute to this cause by being (to the best of our knowledge) the first to evaluate different strategies for t-wise sampling on a large number of industrial feature models. For instance, YASA and Baital have never been evaluated against each other before.
Heradio et al. evaluate a number of uniform samplers (including Quicksampler, Smarch, and Spur) in terms of sampling time and uniformity on nine industrial feature models. Our results regarding sampling time and sampling success are in line with theirs. While we are not primarily interested in a sample's uniformity, we attribute the coverage differences between Quicksampler and Spur to the lack of uniformity in Quicksampler's samples, as diagnosed by them [14].
Varshosaz et al. give an overview and classification of the literature regarding product sampling in the context of software product lines [39]. However, all samplers of relevance to our work, like ICPL [20], IncLing [2], or samplers based on Chvatal's algorithm [9], are outperformed by an order of magnitude or more by YASA [22].
Finally, Medeiros et al. compared ten sampling approaches in terms of fault detection, recommending that t-wise sampling with high values of t be used for rigorous testing [26]. Similar observations were made by Halin et al. in their case study of JHipster [13]. While generating 3-wise samples with high to full coverage will be feasible for a large number of models in our evaluation, measuring their coverage presents a challenge as of now, due to the required effort of counting the number of valid 3-wise interactions.
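The size of that counting effort follows directly from the combinatorics: for m features there are up to C(m, t) · 2^t candidate t-wise interactions before any validity check against the model. A short sketch (the feature count of 5,000 is an illustrative figure, not a model from the evaluation):

```python
from math import comb

def candidate_interactions(num_features, t):
    """Number of candidate t-wise interactions before validity checks:
    choose t features, each of which can be selected or deselected."""
    return comb(num_features, t) * 2 ** t

# growth from 2-wise to 3-wise for a hypothetical model with 5,000 features
pairs = candidate_interactions(5000, 2)    # tens of millions
triples = candidate_interactions(5000, 3)  # hundreds of billions
```

Since each candidate additionally requires a satisfiability check to decide whether it is a valid interaction, the roughly three-orders-of-magnitude jump from t = 2 to t = 3 explains why measuring 3-wise coverage is currently impractical for large models.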

CONCLUSION
In this work, we compared the 1- and 2-wise coverages achieved by dedicated t-wise and uniform samplers on a large set of industrial feature models. We find that it is always beneficial to use our timeout-augmented version of YASA, as it produces the smallest samples with the highest 2-wise coverages. Nevertheless, we found that uniform sampling is capable of achieving much higher coverages than previously reported [29], even without our post-processing. With the post-processing presented in this work, both sample sizes and 2-wise coverages of practical value can be achieved by means of uniform sampling. For the future, we plan to explore t > 2, more sophisticated means of sample size reduction, and the behavior of samples under model evolution.

Figure 1: Running Example of a Feature Model
VaMoS 2024, February 7-9, 2024, Bern, Switzerland. Heß, Schmidt, Ostheimer, Krieter, and Thüm.

Table 2: Sample Size and Average 2-Wise Coverage ± Standard Deviation for Different Post-processing Scenarios
