Incremental Identification of T-Wise Feature Interactions

Developers of configurable software use the concept of selecting and deselecting features to create different variants of a software product. In this context, one of the most challenging aspects is to identify unwanted interactions between those features. Due to the combinatorial explosion of the number of potentially interacting features, it is currently an open question how to systematically identify a particular feature interaction that causes a specific fault in a set of software products. In this paper, we propose an incremental approach to identify such t-wise feature interactions based on testing additional configurations in a black-box setting. We present the algorithm Inciident, which generates and selects new configurations based on a divide-and-conquer strategy to efficiently identify the feature interaction with a preferably minimal number of configurations. We evaluate our approach by considering simulated and real interactions of different sizes for 48 real-world feature models. Our results show that, on average, Inciident requires 80% fewer configurations to identify an interaction than using randomly selected configurations.


INTRODUCTION
The goal of software product lines is to manage configurable software, where the variability is determined by configurations (i.e., feature selections) [5,18]. Features can be combined in a configuration under adherence to constraints, which can be represented, for instance, in a feature model [7,8]. A feature model specifies all possible combinations of features that lead to valid configurations.
One of the biggest challenges associated with highly configurable software systems is to deal with the ever-increasing number of possible feature combinations [1,5,19,30]. A particular combination of features may interact, causing undesired behavior that was not intended or anticipated for the overall system. Conversely, some features are designed to work together and should therefore only occur in combination [38]. Unforeseen feature interactions can cause failures and security vulnerabilities [1,19,30,38,39,41], emphasizing the importance of identifying these interactions [38].
While it is infeasible to generate and test all possible configurations of a product line, there are several techniques, such as combinatorial interaction testing, to identify failing configurations [2,29,34,36,50]. T-wise sampling algorithms aim to generate a small but representative subset of configurations that covers all possible combinations of t features [6,22,25-29,36,40,43]. Although the majority of existing t-wise sampling algorithms generate a suitable sample, which contains failing configurations, they do not further investigate found failures. Thus, even if a failing configuration can be found, determining the responsible feature interaction requires additional manual effort. Furthermore, as the number of potential interactions between t features grows polynomially with the total number of features and exponentially with the value of t, it is difficult to pinpoint a failure to a specific feature interaction [5].
For example, in the Linux kernel, which is regarded as one of the most complex configurable systems, developers are not able to test every possible configuration [20]. A Linux developer reported at the FOSD Meeting that problematic configurations (typically random configurations) can be found automatically during continuous integration, but locating the origin of the problem (i.e., the feature interaction) is a labor-intensive manual process. To the best of our knowledge, there exists no good strategy to help developers find the involved features. This paper tackles the problem of identifying t-wise feature interactions that cause failing configurations.
We propose the algorithm Inciident (Incremental interaction identification) to identify faulty t-wise feature interactions by generating and testing further configurations. For this, we consider a black-box setting, which allows the algorithm to work independently of particular implementation artifacts and the concrete implementation technique (e.g., preprocessors or plug-ins). Given at least one failing configuration containing a specific fault, Inciident computes all potential feature-interaction candidates up to an assumed interaction size t. Inciident then employs a divide-and-conquer strategy, which generates and tests new configurations, to narrow down the set of potential interactions until it is able to identify a single interaction causing the fault. Ideally, using this strategy, the number of newly generated and tested configurations is minimal and grows only logarithmically in the number of features.

Figure 1: A feature model describing a breakfast scenario (cross-tree constraint: Cornflakes ⇒ Milk)
We consider Inciident to be incremental in two dimensions: (1) it incrementally generates new configurations and (2) it incrementally increases the assumed interaction size t up to a given maximum. In summary, we contribute the following:

• We present the concept of incremental identification of interactions and the algorithm Inciident to determine feature interactions causing failing configurations (see Section 3).

• We evaluate our concept in terms of efficiency and effectiveness by trying to identify simulated and real t-wise feature interactions in 48 real-world feature models (see Section 4).

• We provide an open-source implementation and a replication package of our experiments [10] (see Section 4.1).

BACKGROUND
In the following, we describe fundamentals of software product lines, focusing on the concepts of feature models (Section 2.1) and configurations (Section 2.2). Further, we explain the notion of feature interactions in a nutshell (Section 2.3).

Feature Models
A feature model describes the problem space of an entire software product line, which represents the requirements and views of the corresponding stakeholders [5,18]. A feature model defines the possible combinations of features, from which valid configurations can be inferred. In comparison, the implementation artifacts of a product line describe the solution space, which represents the actual products that can be derived from valid configurations [5,18].
In Figure 1, we depict an example feature model as a feature diagram [7,8]. For brevity, in the following, we abbreviate feature names by the first letter of their name.

Configurations
We create a configuration by selecting features for a desired software product. Our formal notation of a configuration is based on the inclusion and exclusion of features from a feature model.

Definition 2.2 (Configuration). Let M = (F, C) be a feature model with feature set F and constraints C. A configuration c = {l_1, ..., l_n} ∈ L(M) with {f, ¬f} ⊈ c is a set of literals containing at most one literal per feature (i.e., either the positive literal, the negative literal, or none). A feature f ∈ F is included in c iff c contains its positive literal, and excluded iff c contains its negative literal.
• A configuration c is called partial iff it does not contain a literal for every feature (i.e., |c| < |F|). Otherwise, it is called complete.

• A configuration c is called valid iff there exists a complete configuration c′ ⊇ c that satisfies all constraints in C. Otherwise, it is called invalid.

• The configuration space C(M) is the set of all valid configurations of a feature model M.

In the remainder of this paper, we use the term configuration to refer to a complete valid configuration, if not stated otherwise.
From the configuration space C(M), we can infer certain feature properties [7]. We call a feature core if there exists no valid configuration of the feature model in which the feature is deselected (e.g., the root feature in Figure 1). Analogously, we call a feature dead if there exists no valid configuration in which the feature is selected. A feature that is neither core nor dead we call a variant feature (e.g., Cornflakes in Figure 1).
Each configuration c describes a subset of the entire configuration space, consisting of all valid configurations that are a superset of the given configuration (i.e., C(M, c) = {c′ ∈ C(M) | c′ ⊇ c}). Note that for a partial configuration c there may exist another configuration c′ ⊃ c that describes the same subset of the configuration space. This happens when some features become conditionally core or dead [8] under a partial configuration; for instance, selecting Cornflakes makes Milk conditionally core due to the constraint Cornflakes ⇒ Milk. By adding all literals of conditionally core and dead features to a partial configuration c, we obtain the largest (partial) configuration ĉ that describes the same configuration space (i.e., C(M, ĉ) = C(M, c)).
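The notion of conditionally core and dead features can be illustrated with a small executable sketch. The following Python snippet is a minimal brute-force illustration over a hypothetical three-feature model with the single constraint A ⇒ B (all names and the model are our assumptions, not the paper's implementation). It computes the largest equivalent partial configuration ĉ by adding every literal that has the same value in all valid completions:

```python
from itertools import product

# Hypothetical 3-feature model: features A, B, C with the constraint A => B.
FEATURES = ["A", "B", "C"]

def is_valid(config):
    # config: dict mapping feature name -> bool; single constraint A => B
    return (not config["A"]) or config["B"]

def completions(partial):
    """All complete valid configurations that extend the partial one."""
    free = [f for f in FEATURES if f not in partial]
    for values in product([True, False], repeat=len(free)):
        full = dict(partial, **dict(zip(free, values)))
        if is_valid(full):
            yield full

def largest_equivalent(partial):
    """Add literals of conditionally core/dead features (the configuration ĉ)."""
    full_configs = list(completions(partial))
    extended = dict(partial)
    for f in FEATURES:
        values = {c[f] for c in full_configs}
        if len(values) == 1:          # same value in every valid completion
            extended[f] = values.pop()
    return extended

# Selecting A forces B (B is conditionally core under {A}):
print(largest_equivalent({"A": True}))  # {'A': True, 'B': True}
```

Here, deselecting B analogously makes A conditionally dead, so `largest_equivalent({"B": False})` also adds the negative literal for A.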

Feature Interactions
A feature interaction of one or more features can be seen as a particular behavior that cannot be deduced from the individual behaviors associated with the features involved [5]. In this work, we deliberately abstract from concrete implementation artifacts in the solution space and instead investigate feature interactions from a problem-space point of view.

Definition 2.3 (Feature Interaction). A t-wise feature interaction i is a valid (partial) configuration of size t (i.e., |i| = t) that contains only literals of variant features. Given a feature model M, the set I_t(M) contains all possible t-wise feature interactions of M.
We call an interaction with interaction size t = 1 a one-wise interaction (e.g., i = {f}), for t = 2 a pair-wise interaction (e.g., i′ = {f, ¬g}), and for all t > 2 a higher-order interaction. It is crucial not to omit deselected features, as the absence of features can also be responsible for unexpected software behavior [1,19]. We limit the feature set considered for interactions to variant features, due to the lack of variability induced by core and dead features.
Approaches for t-wise interaction testing generate and test a small but representative subset of all configurations (i.e., a sample S) [22,25-29,40,43]. These approaches aim to achieve full t-wise coverage of the tested system, which means that every interaction i that contains t features must be present (i.e., covered) in at least one configuration (i.e., ∀i ∈ I_t(M) ∃c ∈ S : i ⊆ c). This is a well-known problem in the field of sampling, where the numerous combination possibilities of features challenge the creation of a representative sample for the system [26,36].
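The coverage condition ∀i ∈ I_t(M) ∃c ∈ S : i ⊆ c can be checked mechanically. The following Python sketch is illustrative only (the string encoding, with "!X" for a deselected feature X, is our assumption): it enumerates the t-wise interactions contained in a configuration and tests whether a sample covers a given interaction set:

```python
from itertools import combinations

# Literals are strings; "!X" denotes the deselection of feature X.
def interactions_in(config, t):
    """All t-sized literal subsets (t-wise interactions) of a configuration."""
    return {frozenset(s) for s in combinations(config, t)}

def has_full_coverage(sample, interactions):
    """True iff every interaction is contained in at least one configuration."""
    return all(any(i <= set(c) for c in sample) for i in interactions)
```

For example, a single complete configuration covers only its own literal combinations, so a sample usually needs several configurations to reach full t-wise coverage.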
As an example, the feature model in Figure 1 has 39 feature combinations (e.g., W ∧ ¬M) that must be covered for full pair-wise coverage. Let us assume that a configuration c_0 fails during testing. From this information alone, we cannot deduce which pair-wise interaction contained in c_0 causes the failure. At this point, we would need to manually investigate the origin of the fault. Thus, we aim to improve this process by providing guidance to developers, proposing a set of potentially involved features to identify the feature interaction.

IDENTIFICATION OF INTERACTIONS
We propose the concept of incremental identification of interactions for finding the smallest t-wise feature interactions responsible for a given fault. In Algorithm 1, we propose our corresponding base algorithm. Given at least one failing configuration, the algorithm generates and tests additional configurations to incrementally decrease the set of t-wise feature interactions potentially responsible for the failure. As input, we require a feature model and a value for the presumed interaction size t. We start with a global configuration pool CP, which contains an initial set of tested configurations. First, we partition the set of configurations from CP into the sets C×, containing all failing, and C✓, containing all non-failing configurations. Second, we compute the set of all potential interactions PI_t for the input interaction size t based on the known configurations. To this end, we compute the literals common to all failing configurations and then generate all possible t-tuples over this set, excluding every tuple that is contained in any configuration from C✓. Third, we generate and test a new configuration c and, based on the test result, reduce the set of potential interactions PI_t. Each new configuration is added to either C× or C✓. The algorithm repeats the third step until it either cannot generate any new configuration c or PI_t contains fewer than two interactions. Finally, the algorithm either returns the set PI_t or no result.
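The steps of the base algorithm can be sketched as follows in Python (a simplified illustration, not the paper's Java implementation; `generate` and `oracle` are hypothetical stand-ins for the configuration-generation step and the black-box test oracle):

```python
from itertools import combinations

def potential_interactions(failing, passing, t):
    """t-tuples over the literals common to all failing configurations,
    minus every tuple contained in some passing configuration."""
    common = set.intersection(*(set(c) for c in failing))
    candidates = {frozenset(s) for s in combinations(common, t)}
    return {i for i in candidates
            if not any(i <= set(c) for c in passing)}

def identify(failing, passing, t, generate, oracle):
    """Loop of the base algorithm: test new configurations until fewer than
    two candidates remain or no further configuration can be generated."""
    candidates = potential_interactions(failing, passing, t)
    while len(candidates) > 1:
        config = generate(candidates)   # generate a new configuration to test
        if config is None:
            return candidates           # cannot be reduced any further
        if oracle(config):              # True means the configuration fails
            failing.append(config)
            candidates = {i for i in candidates if i <= set(config)}
        else:
            passing.append(config)
            candidates = {i for i in candidates if not i <= set(config)}
    return candidates
```

A failing configuration keeps only candidates it contains, while a passing configuration removes every candidate it contains, mirroring the reduction described above.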
To be able to apply our algorithm, we make the following important assumptions. The fault we are looking for (1) is caused by exactly one feature interaction and (2) can be detected in a reliable and reproducible manner. In this paper, we abstract the testing process by considering a black-box oracle with fault information, which can determine for any given configuration whether it contains the fault (e.g., via unit testing). We deliberately accept these assumptions, as we currently aim to demonstrate the feasibility of the overall approach of improving the manual process for identifying feature interactions.

Compute Potential T-Wise Interactions
It is crucial for our algorithm to compute all t-wise interactions that could be candidates for causing failing configurations. We define the set of all potentially faulty t-wise feature interactions, also called potential interactions, as follows.
Definition 3.1 (Potential Feature Interactions). Let c be a failing configuration and t a desired interaction size. The function PI_t(c) = {i ∈ I_t(M) | i ⊆ c} maps c to the set of all potential t-wise feature interactions causing the configuration c to fail.
In Table 1, we provide an overview of the procedure for further reducing the potential interactions PI_t after generating and testing configurations. We assume to have some procedure that generates an arbitrary configuration on demand. To reduce the initial set PI_2^0, we generate a new configuration c_1 and test whether it fails. Assume that c_1 passes the test; therefore, we exclude all interactions that are contained in c_1 and store c_1 in the set of non-failing configurations C✓. This reduces the set of all potential interactions PI_2^0 (i.e., |PI_2^0| = 10) to PI_2^1 (i.e., |PI_2^1| = 4), as shown in Table 1. We generate the next configuration c_2 and test whether it fails. Suppose c_2 fails; therefore, we exclude all interactions that are not contained in c_2, which reduces PI_2^1 to PI_2^2 (i.e., |PI_2^2| = 2), and we save c_2 in the set of failing configurations C×. We continue by generating a next configuration c_3 and assume that its test fails as well. After updating PI_2^2, only a single interaction remains, and the algorithm returns it as the cause of the failing configurations.

Table 1: Example of incremental identification of interactions for the failing configuration c_0

Generate New Configurations
The generation of new configurations can be done in several ways, for instance, by generating random configurations. Considering the configurations in the example, we have deliberately chosen helpful configurations, as each new configuration excludes further interactions and narrows down the set of potential interactions PI_t. However, with a random generation approach, we may not be able to exclude interactions from PI_t in every iteration. The choice of the next configuration is a crucial factor in our approach: a random approach can lead to testing many configurations that do not provide any new knowledge about the remaining potential interactions. Ideally, we would generate configurations that contain half of the remaining potential interactions and do not contain the other half. Thus, independent of the test result for each individual configuration, we would be able to exclude half of the potential interactions in each iteration. This approach would lead to a testing effort that increases only logarithmically in the number of interactions. Taking a closer look at our example in Table 1, we see that the chosen configurations fulfill these criteria and that our reduction of PI_t is approximately logarithmic.
Generating such configurations is not trivial in practice, which is why we use a greedy strategy: we generate multiple random configurations and test the one closest to covering about half of the remaining interactions. Note that, aside from the configuration we select for testing, we only generate configurations in the problem space without deriving any products in the solution space. Nevertheless, we limit the number of configurations that we randomly generate (i.e., in each iteration we generate ⌈2 · log_2 |PI_t^0|⌉ configurations). This limit is based on the initial set of potential interactions, such that more configurations are generated for larger feature models, but it is logarithmically bounded to maintain a feasible run time.
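This greedy selection can be sketched as follows (illustrative Python; `random_config` is a hypothetical stand-in for an arbitrary configuration generator, and `generation_limit` mirrors the stated bound ⌈2 · log_2 |PI_t^0|⌉):

```python
import math

def generation_limit(n_initial_candidates):
    """Number of random candidates per iteration: ceil(2 * log2 |PI_t^0|)."""
    return math.ceil(2 * math.log2(n_initial_candidates))

def pick_halving(candidates, random_config, n_tries):
    """Among n_tries randomly generated configurations, pick the one whose
    covered share of the remaining candidate interactions is closest to 1/2."""
    best, best_dist = None, None
    for _ in range(n_tries):
        config = random_config()
        covered = sum(1 for i in candidates if i <= set(config))
        dist = abs(covered / len(candidates) - 0.5)
        if best is None or dist < best_dist:
            best, best_dist = config, dist
    return best
```

Picking the configuration closest to the halfway point is what makes the reduction of PI_t approximately binary-search-like, regardless of whether the tested configuration passes or fails.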
In case we cannot find any random configuration that includes at least one potential interaction from PI_t and excludes at least one, we attempt to generate such a configuration using the satisfying assignment returned by a SAT solver. In particular, we create a propositional formula based on the feature model with the additional constraint that at least one interaction from PI_t must be selected and at least one must be deselected. To achieve this, we substitute each interaction from PI_t with a fresh variable. Given a feature model M = (F, C) with F = {f_1, ..., f_n} and PI_t = {i_1, ..., i_m}, the resulting formula, which we use as input for the SAT solver, is:

C ∧ ⋀_{j=1}^{m} (x_j ↔ ⋀_{l ∈ i_j} l) ∧ (x_1 ∨ ... ∨ x_m) ∧ (¬x_1 ∨ ... ∨ ¬x_m)

where the clause (x_1 ∨ ... ∨ x_m) requires including at least one interaction and (¬x_1 ∨ ... ∨ ¬x_m) requires excluding at least one. If the SAT solver is not able to find a configuration for this formula, the current set of potential interactions cannot be reduced any further, because the remaining candidates specify the same subset of the configuration space. In this case, we terminate the process and return the set PI_t.
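The effect of this SAT query can be illustrated with a brute-force stand-in (illustrative Python only; a real implementation would hand the formula above to a SAT solver instead of enumerating assignments, and the literal encoding is our assumption):

```python
from itertools import product

def find_splitting_config(features, valid, candidates):
    """Brute-force stand-in for the SAT query: find a valid complete
    configuration that contains at least one candidate interaction and
    excludes at least one; None means the candidates are equivalent."""
    for bits in product([True, False], repeat=len(features)):
        assignment = dict(zip(features, bits))
        if not valid(assignment):
            continue
        literals = {f if v else "!" + f for f, v in assignment.items()}
        contained = [i <= literals for i in candidates]
        if any(contained) and not all(contained):
            return literals
    return None
```

When `None` is returned, every valid configuration treats all remaining candidates alike, which is exactly the termination condition described above.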

Inciident
In practice, we usually do not know the interaction size t of the feature interaction causing failing configurations. To tackle this issue, Inciident adapts the input interaction size dynamically and runs multiple iterations of the base algorithm, starting with an input interaction size of t = 1 until it reaches a given limit t_in.
In Figure 2, we present the main steps of Inciident. In Step 1, we take the input data consisting of the feature model, a failing configuration, and the maximum input interaction size t_in. In Step 2, we set the current interaction size t (starting with t = 1). Afterward, we proceed with Step 3, the computation of all potential t-wise interactions. Next, we perform the configuration loop, which consists of Step 3 to Step 5. After generating (Step 4) and testing (Step 5) a configuration, we proceed with Step 3 and refine the set of potential interactions PI_t (cf. Section 3.1). We finish the current interaction-size loop if PI_t is empty, contains only one interaction, or if it is not possible to generate a configuration that includes and excludes at least one interaction from PI_t. After computing a result for the current interaction size t, we proceed with Step 2 and continue the interaction-size loop by increasing t. When we reach t_in, we proceed with Step 6, where we merge the results from each iteration into a single feature interaction (i.e., a partial configuration) and validate the result (cf. Section 3.4). In Step 7, our algorithm terminates by returning the computed feature interaction, or the empty set if the result could not be validated.
Inciident makes use of intermediate results from each interaction-size loop. In each loop, it re-computes the set of potential interactions, reusing all configurations from the set CP. This potentially reduces run time and memory usage, as more potential interactions can be excluded before generating and testing new configurations.
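The interplay of the interaction-size loop and the shared configuration pool CP can be sketched as follows (illustrative Python; `base_algorithm` is a hypothetical stand-in for Algorithm 1):

```python
def inciident(feature_model, failing_config, t_max, base_algorithm):
    """Interaction-size loop: run the base algorithm for t = 1..t_max,
    reusing the shared configuration pool CP across all iterations."""
    pool = {"failing": [failing_config], "passing": []}  # the pool CP
    results = []
    for t in range(1, t_max + 1):
        # The base algorithm reads and extends the pool, so later iterations
        # benefit from all configurations tested in earlier iterations.
        results.append(base_algorithm(feature_model, pool, t))
    return results  # merged into one partial configuration in Step 6
```

Because the pool is shared, a configuration tested for t = 1 already excludes candidates when the loop later runs for t = 2 and t = 3.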

Validate Resulting T-Wise Interactions
In Step 3, Inciident produces a set of potential interactions PI_t for each value of t. As a final step, we have to check whether the resulting output is correct. An output may be incorrect if the bound t_in is too small or any of our assumptions is violated.
Given a set PI_t, we determine a partial configuration c_t by computing the union over all interactions in the set PI_t (i.e., c_t = ⋃ PI_t). This is possible because all interactions in PI_t are subsets of the initial failing configuration c_0 and, thus, the resulting partial configuration c_t is also a subset of c_0. Then, we determine ĉ_t by adding all literals that c_t implies but does not already contain (cf. Section 2.2). We argue that this is a useful step to avoid returning a misleading result. Within our black-box approach (i.e., considering only the problem space), we cannot distinguish the observable behavior of ĉ_t and c_t, as every configuration that contains c_t also contains ĉ_t. Hence, we return ĉ_t, which contains all potentially relevant literals.
Second, we search for the smallest value t ≤ t_in for which ĉ_t still contains the feature interaction causing the fault. Given a value for t, we know that ĉ_t does not contain the complete interaction if there exists a non-failing configuration c✓^t that includes all interactions from PI_t and excludes at least one interaction from PI_{t+1}. Consequently, we look for the largest value t ≤ t_in for which there exists a c✓^{t-1} but no c✓^t. Starting with t = t_in, we try to compute c✓^{t-1}, subsequently decreasing t by one until we are successful, in which case we select ĉ_t as the correct partial configuration. Finally, we validate the selected configuration ĉ_t using a further test to cross-check whether it is a plausible output. We generate two further configurations, one containing and one not containing ĉ_t.

If the configuration containing ĉ_t fails, we continue by testing the configuration without ĉ_t. If it passes, we terminate the algorithm and output ĉ_t. If one of the considered configurations does not correspond to the expected result, we terminate our algorithm with no result, as an indicator that the chosen value t_in is too small.
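This final cross-check can be sketched as follows (illustrative Python; the two configuration arguments are hypothetical stand-ins for the generated configuration containing ĉ_t and the one not containing it):

```python
def validate(c_hat, config_with, config_without, oracle):
    """Cross-check: the configuration containing c_hat must fail and the one
    without it must pass; otherwise return the empty set (no result)."""
    if oracle(config_with) and not oracle(config_without):
        return c_hat
    return set()  # indicator that the chosen interaction-size bound is too small
```

If either test contradicts the expectation, the empty result signals that the identification should be retried with a larger bound.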

Discussion and Limitations
We start our algorithm upon receiving a particular test result for at least one failing configuration. For most real-world faults, we have more knowledge, such as particular error messages, which can sometimes help to find the origin of failing configurations. To apply our concept, we assume a black-box scenario with configuration tests that lead to reproducible results for each failing configuration. This may limit the effectiveness of our approach, because we cannot rely on having perfect fault information for real-world systems.
Considering our assumptions, we currently assume that only one feature interaction in our system causes failing configurations, which is not always given in practice. We accept this limitation to show the feasibility of our divide-and-conquer approach as presented. Therefore, when it comes, for instance, to error masking, our current approach does not guarantee to identify the interaction causing failing configurations, but it is conceivable to apply our concept by looking at each fault independently. Furthermore, we assume that the occurring fault can be detected in a reproducible manner, so that we obtain the same result for each configuration. We argue that these limitations constitute a good trade-off to show the feasibility of our approach to automate the manual process of identifying the origin of a fault.
When computing the partial configuration for a set of potential interactions, we may include more literals in the configuration than are involved in the interaction actually causing failing configurations. More precisely, due to the post-processing in Step 6 of Inciident, we always obtain the largest possible configuration that comprises the same configuration space as the partial configuration containing only the actual feature interaction.

EXPERIMENTAL EVALUATION
In this section, we present the results of evaluating our concept of incremental identification of t-wise interactions. We evaluate 48 real-world feature models by simulating one-, pair-, and three-wise feature interactions. We compare Inciident to a random approach (Random) for guiding the selection of the next configurations to be tested. We base our evaluation on criteria commonly used in the context of sampling algorithms to determine their testing effectiveness, testing efficiency, and sampling efficiency [39,49,50]. Our focus is to investigate the effectiveness of Inciident in identifying the interactions, the number of tested configurations, and the time required to perform the algorithm. In our evaluation, we address the following research questions.
RQ1: How effectively can we identify the feature interaction that leads to failing configurations?
RQ2: What computational effort is required for identifying the feature interaction?
RQ2.1: How many configurations have to be tested to identify the feature interaction?
RQ2.2: How much computation time is required to identify the feature interaction?
In RQ1, we investigate how precisely we can identify a particular feature interaction based on failing configurations. We distinguish whether the identified interaction is exact, a superset, a subset, or different, and how often we get no result.
The research question concerning computational effort is divided into two parts. In RQ2.1, we consider the number of configurations that have to be tested during our algorithm. This is crucial, because our approach is only applicable if the number of tested configurations is realistic for real-world systems. Related to RQ2.1, RQ2.2 considers the computation time required to perform our algorithm. Note that we exclude the time for deriving and testing actual products from generated configurations, as we abstract from the solution space, considering a black-box setting with perfect fault information.

Algorithms
As there exists no state-of-the-art algorithm that realizes interaction identification the way we do, we cannot compare our approach to existing algorithms. We expect that the strategy for generating further configurations, the input interaction size t_in, and the actual interaction size have an impact on the number of configurations tested (cf. Section 3.2), which we investigate in our evaluation. Therefore, we evaluate our concept by varying the most crucial step within our algorithm, the strategy for configuration generation (i.e., Step 4 from Figure 2), and the selection of the input interaction size.
Inciident performs the identification of interactions until the input interaction size t_in is reached. We deliberately choose the configuration to be tested next, as described in Section 3.2. The algorithm Random generates configurations randomly using the SAT solver Sat4J [31] (i.e., non-uniform random sampling).
To avoid testing every possible configuration, we set a maximum number of configurations to be tested (i.e., ⌈10 · log_2 |PI^0|⌉). Similar to the generation limit of Inciident, this limit allows generating more configurations for feature models with more potential interactions. For almost all experiments, this limit lies above the number of configurations required by Inciident. Thus, Random also terminates when the maximum number of configurations is reached.
We provide an open-source implementation of Inciident written in Java and based on the FeatJAR library. In addition, we provide a replication package with the algorithms used in our evaluation and the data generated by our experiments [10].

Experiment Setup
To evaluate our concept, we perform several experiments to identify multiple t-wise feature interactions for different systems.

4.2.1 Feature Models and Interactions. We consider 48 feature models from different sources and domains, namely finance, systems software, e-commerce, gaming, and communication [24,37,44-46]. The chosen models cover a wide range in the number of features (up to 3,296) and the number of constraints (13 to 15,692). To cover a wide range of feature-model sizes, we selected small- and medium-sized feature models from examples provided by the tool FeatureIDE [37]. We also used feature models from real-world Kconfig systems, provided by Pett et al. [44], for which we chose the earliest and latest versions of each system. In addition, we used more complex, real-world feature models [24,45,46]. Moreover, we use six models of different sizes of the eCos system from Knüppel et al. [24]. For all feature models containing fewer than 800 features (35 models), we simulate 100 one-wise, 100 pair-wise, and 100 three-wise interactions per feature model, per algorithm, and per input interaction size t_in. For feature models containing between 800 and 3,000 features (13 models), we simulate 10 one-, pair-, and three-wise interactions each. For the largest feature model, which contains 3,296 features, we simulate only 3 one-, pair-, and three-wise interactions due to the high computation time of Random for larger models.
To simulate an interaction, we generate a random configuration and, based on that, extract a partial configuration by randomly choosing a subset of t features that are neither core nor dead. Additionally, we use the random configuration as the failing input configuration c_0 for our algorithm. Note that, as we choose the partial configuration for the simulated fault randomly, it may imply additional literals (i.e., conditionally core and dead features). In this case, we add these literals to the configuration to ease the comparison with the output of the algorithms in our evaluation (cf. Section 2.2 and Section 3.4). To ensure good comparability of our results, we use the same set of random configurations per model for all one-, pair-, and three-wise feature interactions, making sure that each one-wise interaction is a subset of a pair-wise interaction, which in turn is a subset of a three-wise interaction.
In addition, we use 10 real-world faults caused by feature interactions in the systems BusyBox (v1.23.1) and Linux (v3.18.5). Nine of the faults are caused by a single one-, two-, three-, or four-wise interaction. One fault is caused by two alternative one-wise interactions (i.e., i_1 ∨ i_2), violating our first assumption.

4.2.2 Measurements. Our independent variables are the underlying feature model, the chosen algorithm, the actual interaction size t_a, and the input interaction size t_in. The seed for the pseudo-random number generator is a control variable to enable replication. The dependent variables are the number of tested configurations and the run time required for each execution of one experiment. As a limit for the input interaction size, we consider t_in ∈ {1, 2, 3}, because most known interactions do not contain more features [1,20,30]. Our test oracle checks whether a given configuration contains the simulated interaction.
Regarding RQ1, we analyze the effectiveness of our approach in identifying the simulated interactions. To this end, we differentiate between five result types. First, the found interaction is equal to the simulated interaction. Second, the found interaction is a subset of the simulated interaction (e.g., we simulated the interaction {¬f, g, h} and our result is {¬f, h}). Third, the found interaction is a superset of the simulated interaction (e.g., we simulated the interaction {¬f, h} and our result is {¬f, g, h}). Fourth, the found interaction is not empty and neither a subset nor a superset of the simulated interaction (e.g., we simulated the interaction {¬f, h} and our result is {¬f, g}). Fifth, the algorithm cannot compute a result.
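The five result types can be expressed as a small classifier over literal sets (illustrative Python; `found` and `simulated` are sets of literals, and the label strings are ours):

```python
def classify(found, simulated):
    """Map a found interaction to one of the five result types of RQ1."""
    if not found:
        return "no result"
    if found == simulated:
        return "equal"
    if found < simulated:   # proper subset of the simulated interaction
        return "subset"
    if found > simulated:   # proper superset of the simulated interaction
        return "superset"
    return "different"      # non-empty, neither subset nor superset
```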
For RQ2.1, we consider the number of configurations we have to test in order to correctly identify an interaction. Regarding RQ2.2, we inspect the time needed to perform the incremental identification of interactions for all experiments in which we identified the simulated interaction.

4.2.3 System Specifications. We ran our experiments on a server with the following specifications: CPU: 2x Intel Xeon E5-2630 v3 @ 2.4 GHz; RAM: 256 GB DDR3; OS: Ubuntu 22.04.2 LTS; Java: OpenJDK 17.0.6; JVM memory: 64 GB.

Effectiveness.
In Figure 3, we show how effectively each algorithm can identify an interaction for a given actual interaction size t_a and input interaction size t_in. We show the percentage distribution of the five possible result types per algorithm for all experiments. For this, Figure 3 is split into three parts: (left) t_a is greater than t_in, (middle) t_a is equal to t_in, and (right) t_a is less than t_in. We see that if t_in is smaller than t_a, none of the algorithms produces reliable results (< 18% of correctly identified interactions). If t_in equals t_a, Inciident correctly identifies 100% of all simulated interactions, while Random correctly identifies ≈ 86%. If t_in is greater than t_a, Inciident still identifies 100% of all simulated interactions, while Random only identifies ≈ 45% correctly. When Random returned a subset or superset of the actual interaction, on average a subset is 85% and a superset 210% as large as the simulated interaction. In comparison, for Inciident, subsets are on average 85% and supersets 152% as large as the simulated interaction.
Our experiments with real-world faults confirm these findings. If t is greater than or equal to the actual interaction size, Random identifies 6 and Inciident 8 out of 10 interactions correctly. Both algorithms can neither find the fault in the Linux system caused by a four-wise interaction within a reasonable time (i.e., 1 hour) nor the fault consisting of two alternative feature interactions, as the latter violates our first assumption.

Number of Configurations.
In Figure 4, we show the median number of tested configurations (y-axis) of Inciident and Random, using different values for t, over the number of features (x-axis) in each feature model for all experiments in which t is greater than or equal to the actual interaction size. In this plot, we can see two trends. First, the number of configurations increases with a larger value of t. Second, the number of configurations increases with the number of features of the underlying feature model. For most feature models, even for models with up to 2,000 features, the number of configurations stays below 100. The median number of configurations for t = 3 is at most 317. In comparison, Random tests on average more configurations than Inciident (≈ 387% for t = 3).

Run Time. Figure 5 depicts the median run time (y-axis) of Inciident and Random, using different values for t, over the number of features (x-axis) in each feature model for all experiments in which t is greater than or equal to the actual interaction size. For t = 3, the median time of Inciident is less than 11 s for any feature model, with a maximum time of 364 s for the second-largest model. On average, Random requires less time than Inciident for t = 1 (≈ 60%) and t = 2 (≈ 71%) and more time for t = 3 (≈ 121%).

Discussion
Regarding RQ1, we see that the effectiveness of Inciident depends on the chosen input interaction size t. A value lower than the actual interaction size yields unreliable results. However, if t is greater than or equal to the actual interaction size and our assumptions are satisfied, Inciident always identifies the correct interaction, even if t is strictly greater, which is especially useful in practice, where we often do not know the interaction size beforehand. In contrast, testing arbitrary configurations with a fixed value for t (i.e., Random) yields substantially worse results when t equals the actual interaction size and especially when t exceeds it. Thus, when t is high enough, Inciident is strictly more effective than Random.
Regarding RQ2.1, for most feature models, we see a logarithmic correlation between the number of features and the median number of tested configurations. In addition, in most cases, the number of tested configurations required by Inciident is substantially lower than with Random. Still, for larger systems, the number of additional test configurations might be too high for practical applications. This may be mitigated by samples that are already available for the corresponding system, which can be used as input for Inciident. Regarding RQ2.2, Inciident requires less than a minute of computing time even for the largest feature model in our experiments, which seems to be a feasible amount of time for most real-world applications.

Threats to Validity
Internal Validity. In our work, we use pseudo-random numbers to generate configurations and simulate interactions, which means that we may get good or bad results by coincidence. Due to the use of random numbers, measured values and values derived from them, such as averages, can be inaccurate. We address this problem by performing multiple iterations of experiments for each model, algorithm, input interaction size, and actual interaction size in order to draw more general conclusions from our experiments. Furthermore, we generate the interactions of lower sizes by constructing strict subsets of larger interactions. This way, we ensure independence per experiment and reduce the selection bias regarding the features contained in simulated interactions.
External Validity. In Section 3, we state the assumptions needed for our concept to work properly. However, in real-world scenarios, these assumptions may not always hold. We cannot guarantee that only one feature interaction causes a fault or that one interaction causes only one fault. In our experiments, we used simulated faults. Nevertheless, as we aim to demonstrate the feasibility of the overall approach, we argue that the limitations implied by these assumptions constitute a good trade-off. Our experiments have been performed on 48 real-world feature models, which does not automatically generalize our findings to all feature models. However, when selecting feature models for our experiments, we aimed for a large set of models that differ in their number of features, constraints, and domains. Many of the selected models are realistic models from industrial applications, which indicates that our concept is likely transferable to other industrial models.

RELATED WORK
Many sampling algorithms are concerned with generating a representative sample of configurations [12-14, 21-23, 33, 36, 42, 43]. Several literature surveys [39, 49, 50] consider combinatorial interaction testing, a promising approach to compute a small sample in which each combination of t features appears in at least one configuration [14, 21, 22, 28, 29, 43]. As in our approach, combinatorial interaction testing focuses on covering certain t-wise feature interactions. In contrast, we need neither a specific coverage criterion nor a fully t-wise-covered sample when generating configurations. Most sampling techniques only consider the problem space to generate suitable configurations. Including further information from the solution space, such as test artifacts or code coverage, has rarely been used and seems to be understudied [50]. Our approach considers only the problem space as well.
None of these sampling algorithms deal with the actual identification of the fault; they stop when detecting failing configurations. Our concept of incremental identification of interactions can be combined with any sampling algorithm. Moreover, we require fault information and configuration testing on demand. Several algorithms [3, 22, 28] proceed incrementally by increasing the coverage with each further configuration. In comparison, our goal is to incrementally exclude potential interactions as candidates for causing failing configurations.
Detecting and identifying feature interactions is challenging [5]. Many variability bugs involve multiple features and are hence feature-interaction bugs [1]. Calder and Miller [11] detect feature interactions by pair-wise analysis. Furthermore, Kuhn et al. [30] state that higher-order interactions are less likely to occur than pair-wise interactions. Currently, we do not know how common interaction faults are in practice or whether current interaction testing techniques are effective at finding them [19]. Abal et al. [1] provide a variability bug database with real-world interaction bugs, in which 41 faults are caused by single features and 57 faults by feature interactions of size t ≥ 2. In our evaluation, we focus on feature interactions with sizes from t = 1 to t = 3, as interactions of more than three features are rare [1, 20, 30]. However, our concept and tool are applicable to higher-order interactions. Besides, variability bugs may also involve non-locally defined features (i.e., features defined in another subsystem) [1]. Of course, we can never know what interaction size is required to detect all faults in a system [4, 30]. This problem is also inherent to our concept of incremental identification of interactions.
Colbourn and McClary [17] introduce the concepts of locating arrays and detecting arrays, which are special types of covering arrays [15, 16] that allow the localization of t-way interactions in a white-box scenario. In contrast, we consider a black-box scenario independent of the underlying implementation technique. Locating-array algorithms need an entire sample as input, whereas we do not. Martínez et al. [35] present an adaptive approach to locate faults through locating arrays that, similar to our approach, chooses each new test based on the outcome of all previous tests. When using locating arrays, Aldaco et al. [4] assume that a maximum number of existing faulty interactions is given in advance, which seems challenging. Moreover, none of the algorithms based on locating arrays consider constraints between features, whereas we are able to consider them. It could be possible to use locating arrays as input for our approach.
There are several machine-learning approaches to configuration classification and generation. Temple et al. [48] generate configurations and test them through an oracle to resolve failing configurations by inferring constraints. Siegmund et al. [47] present a tool that analyzes several configuration results and outputs which features interact with respect to non-functional properties. In contrast to the probabilistic fault prediction of these machine-learning approaches, which cannot guarantee exact results, we employ an analytic approach. Furthermore, while machine learning can easily classify already existing configurations, generating new configurations with machine-learning approaches is far from trivial.
There also exist static-analysis and model-checking approaches addressing the detection of interaction faults [32, 38, 49]. In contrast to our approach, these are white-box approaches. For instance, Meinicke et al. [38] test different configurations to find type errors, which, as in our approach, provide feature assignments as output. These approaches are expensive in terms of time and restrictive, whereas our approach is applicable for any given test suite.

CONCLUSION AND FUTURE WORK
We propose the concept of incremental identification of interactions to tackle the problem of identifying t-wise feature interactions that cause failing configurations. We use a greedy divide-and-conquer strategy that starts with a set of potential feature interactions and incrementally reduces this set by generating and testing additional configurations until it identifies the feature interaction responsible for a given fault. With Inciident, we provide an open-source implementation of our concept. Our evaluation shows that we can reliably identify any interaction that causes a particular fault, given that we choose an input interaction size t that is at least as high as the actual interaction size. Further, we demonstrate that on average Inciident requires 80% fewer configurations to be tested than Random, while taking less than a minute of computing time even for the largest feature model in our experiments.
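The divide-and-conquer idea can be illustrated with a toy sketch. This is not the actual Inciident algorithm: it ignores feature-model constraints and deselected-feature literals, assumes a single deterministic faulty interaction, and `fails` is a hypothetical black-box test oracle:

```python
from itertools import combinations

def identify_interaction(features, t, fails):
    """Greedy sketch: start with all t-wise candidate interactions and
    repeatedly test a configuration covering roughly half of them.
    A configuration is the set of selected features (others deselected);
    fails(config) is a hypothetical black-box test oracle."""
    candidates = [set(c) for c in combinations(features, t)]
    while len(candidates) > 1:
        # Greedily build a configuration covering about half the candidates.
        config, target = set(), len(candidates) // 2
        for cand in candidates:
            if sum(c <= config for c in candidates) >= target:
                break
            config |= cand
        covered = [c for c in candidates if c <= config]
        uncovered = [c for c in candidates if not c <= config]
        if not uncovered:  # degenerate split: isolate a single candidate
            config = set(candidates[0])
            covered, uncovered = [candidates[0]], candidates[1:]
        # A failing configuration must contain the faulty interaction,
        # so keep the covered half; otherwise the covered are exonerated.
        candidates = covered if fails(config) else uncovered
    return candidates[0] if candidates else None

# Toy oracle: the fault occurs whenever A and C are selected together.
found = identify_interaction(["A", "B", "C", "D"], 2,
                             lambda cfg: {"A", "C"} <= cfg)
print(sorted(found))  # ['A', 'C']
```

Each tested configuration discards part of the candidate set, which is why the number of required configurations grows far more slowly than exhaustively testing each candidate.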
In future work, we plan to improve our concept by addressing its current limitations, namely handling faults caused by multiple feature interactions and faults that cannot be detected reliably for every configuration. In addition, we plan to optimize the strategy for configuration generation by including domain knowledge or information from the solution space. This way, we might create more selective configurations and, thereby, reduce testing effort.

ACKNOWLEDGMENTS
This paper is based on the master's thesis by Sabrina Böhm [9]. We would like to thank the members of the Institute of Software Engineering and Programming Languages at the University of Ulm for feedback on presentations about this topic. Moreover, we thank the community of the FOSD Meeting 2023 for the enriching discussions and feedback. The source code was developed with help from Jens Meinicke, to whom we would like to express our thanks as well. This paper has been partly supported by the DFG (German Research Foundation) in project LO 2198/4-1.

Figure 2: Steps of the algorithm Inciident

Figure 3: Effectiveness of identifying the interaction

Figure 4: Median number of tested configurations by Inciident and Random per t and feature model

Figure 5: Median run time of Inciident and Random per t and feature model