Preprocessing is What You Need: Understanding and Predicting the Complexity of SAT-based Uniform Random Sampling

Despite its NP-completeness, the Boolean satisfiability problem gave birth to highly efficient tools that are able to find solutions to a Boolean formula and to compute their number. Boolean formulae compactly encode huge, constrained search spaces for variability-intensive systems, e.g., the possible configurations of the Linux kernel. These search spaces are generally too big to explore exhaustively, leading most testing approaches to sample a few solutions before analysing them. A desirable property of such samples is uniformity: each solution should get the same selection probability. This property motivated the design of uniform random samplers, relying on SAT solvers and counters and achieving different tradeoffs between uniformity and scalability. Though we can observe their performance in practice, understanding and accurately predicting the complexity these tools face is an under-explored problem. Indeed, structural metrics such as the number of variables and clauses in a formula poorly predict sampling complexity, and more elaborate ones, such as the minimal independent support (MIS), are intractable to compute on large formulae. We provide an efficient parallel algorithm to compute a related metric, the number of equivalence classes, and demonstrate that this metric is highly correlated with the time and memory usage of uniform random sampling and model counting tools. We explore the role of formula preprocessing on various metrics and show its positive influence on correlations. Relying on these correlations, we train an efficient classifier (F1-score 0.97) to predict whether uniformly sampling a given formula will exceed a specified budget. Our results allow us to characterise the similarities and differences between (uniform) sampling, solving and counting.


INTRODUCTION
Uniform Random Sampling (URS) is a family of techniques to sample from the set of solutions of a logical formula (often, a Boolean formula), such that each solution gets the same probability of being selected. URS is a problem of both theoretical and practical interest. In particular, when testing configurable systems with hundreds of options (inducing search spaces one cannot exhaustively explore), uniform sampling is interesting as we may not know where bugs are [30,31,37]. Other applications include deep learning verification, where inputs are drawn from an unknown distribution [5], or evolutionary algorithms, where De Perthuis de Laillevault et al. theoretically demonstrated the relevance of repeated uniform random sampling when initializing populations [12]. Improving URS thus benefits multiple research fields: SAT solving, software testing, machine learning, etc.
When evaluating URS techniques (or samplers), two quality criteria matter: uniformity and scalability. Uniformity evaluates how close the distribution of the sampled solutions is to the uniform distribution. Scalability refers to the ability of the sampler to produce samples within a specified amount of time, even for large formulae. Previous studies [37] demonstrated the difficulty for existing samplers to satisfy both quality criteria. Despite recent improvements [41], state-of-the-art samplers still fail to scale on complex real-world formulae (representing, e.g., the Linux kernel configurations) without sacrificing uniformity.
Why some formulae are harder to sample uniformly is a poorly understood problem. A simple but misleading approach to determining sampling complexity is to count the number of variables and clauses of formulae. As an example, UniGen3 [41] requires 8 seconds to produce 10,000 samples from the formula blasted_case64 (96 variables and 299 clauses) and 13.5 seconds for the same number of samples from the JHipster feature model (44 variables and 104 clauses). This indicates that these simple metrics do not adequately characterize the complexity of sampling. While there exist formula metrics that correlate with the complexity of SAT solving, although with varying success [3], the characteristics that make a formula easier or harder to sample from remain unknown.
In this paper, we assess and define meaningful metrics for understanding and predicting URS difficulty (time and memory consumption). In addition to simple metrics trivially computed from the formula structure, we consider metrics studied in the context of SAT solving (such as the size of the minimal independent support and the treewidth). We also provide an efficient algorithm to compute equivalence classes [9]. To evaluate the relevance of these metrics for assessing sampling and solving complexity, we consider two uniform random samplers, SPUR [2] and UniGen3 [41], as well as SAT solvers [11,14] and a model counter [25]. Motivated by previous studies showing that the formulae encoding the variability spaces of configurable systems tend to be harder to sample uniformly than others [16,37], we built a diversified dataset of 488 SAT formulae, 128 of which encode configurable systems.
Equipped with a set of metrics measured on various formulae, we measure correlations between these metrics and the performance of uniform samplers. We demonstrate the existence of strong correlations (Kendall coefficients > 60) between some (combinations of) metrics and sampling complexity (time and memory consumption). We also demonstrate the positive role of formula preprocessing, i.e., computing an independent support for the formulae and applying the metrics on the projected formulae, for complexity prediction. Next, we lean on these results and develop a classification model to predict whether a given sampling problem (i.e., using a given uniform sampler to sample from a given formula) is affordable for a given time and memory budget. We evaluate our model on all 488 subject formulae and show that it achieves at best a classification F1-score of 0.97 and an AUC-ROC of 0.98.
To summarize, this paper makes the following contributions:
(1) Correlation study. We study the correlation between the complexity metrics and the computational cost of sampling (time and memory). We demonstrate a strong correlation between the number of equivalence classes and sampling cost.
(2) Prediction. Based on these correlations, we build classification models (random forests) that leverage the metrics to classify formulae according to sampling cost, with an F1-score up to 0.97 and an AUC-ROC up to 0.98. We further analyze the feature importance of these models to increase our trust in the correlation study.
Open science policy. All our experimentation infrastructure is available at the following website: https://anonymous.4open.science/r/eqv_pred-9E64. The repository contains the artifacts we used to compute the treewidth and deficiency metrics. It also includes the program to compute the equivalence classes and example scripts to generate the files in the data folder. The data folder contains the resulting CSV files of our experiments. The Python scripts were used to compute the correlations and the prediction models. The repository also contains an archive with all the formulae in the DIMACS format.

BACKGROUND

Boolean formulae
A Boolean formula φ is defined over a set of Boolean variables Var(φ) and takes a Boolean value that can be true or false. A literal is either a variable x ∈ Var(φ) or its negation ¬x, such that if the variable x is set to true then the literal x evaluates to true and the literal ¬x evaluates to false. We use the notations Var(x) and Var(¬x) to refer to the variable corresponding to the literals x and ¬x, respectively, viz. the variable x.
A model m of φ (m |= φ) is a set of literals such that ∀x ∈ Var(φ), x ∈ m ⊕ ¬x ∈ m, and φ evaluates to true under m (with ⊕, the binary exclusive or operator). We say that a literal l evaluates to true in a model m if and only if l ∈ m; otherwise, we say that l evaluates to false in m. We define M_φ as the set of models of φ, i.e., m |= φ if and only if m ∈ M_φ, and |M_φ| as the size of the set M_φ.
A formula φ is in negation normal form (NNF) if negation only appears directly in front of variables. Furthermore, it is in conjunctive normal form (CNF) if it is written as a conjunction of disjunctions of literals (φ = ⋀_i ⋁_{l ∈ c_i} l). A deterministic decomposable NNF (d-DNNF) is an NNF where every conjunction is decomposable and every disjunction is deterministic. A conjunction ⋀_i φ_i is decomposable if for every pair (i, j) with i ≠ j we have Var(φ_i) ∩ Var(φ_j) = ∅. A disjunction ⋁_i φ_i is deterministic if for every pair (i, j) with i ≠ j we have M_{φ_i ∧ φ_j} = ∅.
A subset I ⊆ Var(φ) of the variables of a formula φ is an independent support if every model of φ can be uniquely distinguished by using the variables in I only [8,19]. An independent support is minimal (MIS) if removing any variable from it does not yield an independent support.
Based on the above, we define the concepts of backbone and equivalence class.

Definition 1 (Backbone). The backbone B_φ of a formula φ is the set of literals that appear in each model of the formula: B_φ = {l | ∀m ∈ M_φ, l ∈ m}. The backbone thus contains the literals that evaluate to true in every model of the formula.

If we generalize the idea of equivalence between a literal and the constant true to the idea of equivalence between literals, we obtain the notion of equivalence class:

Definition 2 (Equivalence class). An equivalence class E is a set of literals that evaluate to the same value in every model of φ: ∀l, l′ ∈ E, ∀m ∈ M_φ, l ∈ m ⇔ l′ ∈ m.

By this definition, if {a, b} is an equivalence class then {¬a, ¬b} is also an equivalence class. These two equivalence classes are redundant as they carry the same information. We define two equivalence classes E1 and E2 as redundant if and only if E1 = E2, E1 = {¬l | l ∈ E2}, E1 ⊆ E2, or E2 ⊆ E1; if E1 ⊆ E2, we keep E2 and discard E1. For the rest of the paper, without loss of generality, we only consider non-redundant equivalence classes. We define E_φ as the set of all non-redundant equivalence classes of a formula φ and |E_φ| as the number of equivalence classes of φ. Note that we necessarily have |E_φ| ≤ |Var(φ)| because we only consider non-redundant equivalence classes.
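To make these definitions concrete, the following brute-force sketch (for illustration only, not part of our tooling; function names are ours) enumerates M_φ for a small CNF formula and derives the backbone and the non-redundant equivalence classes. Clauses and literals follow the DIMACS convention (signed integers).

```python
from itertools import product

def models(clauses, n):
    """Enumerate all models of a CNF over variables 1..n.
    A clause is a list of signed ints; a model is the set of literals it satisfies."""
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            yield {v if bits[v - 1] else -v for v in range(1, n + 1)}

def backbone(clauses, n):
    """Definition 1: literals that evaluate to true in every model (phi satisfiable)."""
    ms = models(clauses, n)
    bb = set(next(ms))
    for m in ms:
        bb &= m
    return bb

def equivalence_classes(clauses, n):
    """Definition 2, brute force: group the literals of one model by their
    truth value across all models (yields non-redundant classes only)."""
    ms = list(models(clauses, n))
    sig = {}
    for l in ms[0]:
        key = tuple(l in m for m in ms)   # value pattern of l across M_phi
        sig.setdefault(key, set()).add(l)
    return list(sig.values())
```

For φ = x1 ∧ (¬x1 ∨ x2) over three variables, the backbone is {x1, x2} and the non-redundant classes are {x1, x2} and {¬x3}, illustrating |E_φ| ≤ |Var(φ)|.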
We next define three common problems for Boolean formulae, i.e., SAT solving, model counting, and URS.

Definition 3 (SAT solving). SAT solving is the problem of finding a model m for a given formula φ.

Definition 4 (Model counting). Model counting (#SAT) is the problem of computing the number of models |M_φ| of a given formula φ.

Definition 5 (Uniform random sampling). URS is the problem of drawing models from M_φ such that every model has the same probability 1/|M_φ| of being selected.

Despite their intrinsic links, these three problems are very different and require dedicated solutions to be addressed.

URS, Configurable Systems and Feature Models
URS is highly relevant for quality assurance activities on configurable systems, e.g., during testing [17,37], verification [10,32] and performance analysis [21]. Such support consists of computing a representative sample of variants to infer analysis results for the other variants (based on their common features with the sample). Because these variants are too numerous to be all considered for analysis, sampling offers an adequate compromise between completeness and efficiency.
Various artifacts can drive the sampling of system variants, such as feature models [22], source code, test suites, behavioral models, etc. Feature models, however, remain the most commonly used input for sampling techniques designed for configurable systems. The main reason is that the semantics of feature models can be expressed in first-order logic [6,39], whose set of solutions corresponds to the set of valid SPL variants. This makes feature models inherently amenable to URS.

OBJECTIVES AND METHODS
Our objective is to understand and predict the capability (or lack thereof) of state-of-the-art samplers to sample solutions from a Boolean formula uniformly.

Research Questions
Our first research question investigates the role of metrics in the complexity of uniform random sampling:

RQ1: Which metrics of Boolean formulae correlate with URS time and memory consumption?
In addition to simple characteristics like the number of variables and clauses, we consider concepts that are intensively used in the problems of SAT solving, model counting, and URS, e.g., the size of the minimal independent support and the number of equivalence classes.
We aim to exploit our analysis results to develop an approach that, based on the correlated characteristics, can predict whether a formula would be too costly to uniformly sample from (i.e., would exceed a predefined time and memory budget).This would enable engineers to estimate whether it is feasible to sample solutions with uniform samplers without wasting computation resources on intractable problems.

RQ2: Can the correlated characteristics be used to predict the affordability of URS in terms of time and memory consumption?
To answer this question, we train random forest models to classify Boolean formulae into "affordable" or "not affordable", based on different combinations of the characteristics we study.
Lastly, we study whether the intrinsic links between SAT solving, model counting and sampling translate into the same influence of formula characteristics on these three problems.
RQ3: Are the characteristics of Boolean formulae correlated to the complexity of URS as they are to SAT solving and model counting?
A positive answer to this question would pave the way to improve the efficiency of URS by working on the same formula transformations that reduce the difficulty of SAT solving and model counting.A negative answer would invalidate this path and call for specific solutions to reduce the complexity characteristics that impact sampling.

Complexity metrics
We consider simple metrics that are trivially computed from the structure of a Boolean formula:
• the number of variables #v
• the number of clauses #c
• the number of literals #l
We furthermore consider underlying concepts that SAT solvers, counters, and samplers have used to improve the performance of their algorithms. One such metric is #mis, the size of the Minimal Independent Support (MIS). The MIS is typically computed to improve the performance of model counters (like D4 [25] and sharpSAT [43]) that some URS tools invoke during sampling.
Unfortunately, computing the MIS itself may be unaffordable for complex formulae. As an alternative, we propose the number of equivalence classes (#eqv). Its advantage over the MIS is that computing the equivalence classes only requires a plain SAT solver. We further increase the efficiency of this computation through a parallel algorithm that we present hereafter. Using this algorithm, we compute #eqv for the Linux 2013 model (50,000 variables) [36] in less than 1.5 wall-clock hours, while computing #mis times out at 24 hours. Another way of approximating the MIS is to use Arjun [42], which is significantly faster than the computation of equivalence classes. Unfortunately, using Arjun to compute an independent support gave us lower correlations, so we decided to use the MIS [19] and #eqv. In addition, we consider other metrics that have been studied in the context of SAT solving, viz. treewidth [33] and deficiency [34]. Treewidth (tw) is used to bound the worst-case size of the decision DNNF (D-DNNF) during solving [33]. Deficiency (δ) was proven to have intrinsic links with the worst-case time complexity of SAT solving [34]. Though computing deficiency is an NP-hard problem, it can often be approximated as the number of clauses minus the number of variables.
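The simple metrics are indeed trivial to compute. As an illustration, the following sketch (function name and return format are ours, not part of our toolchain) extracts #v, #c, #l and the deficiency estimate from a DIMACS CNF string, assuming one clause per line:

```python
def dimacs_metrics(text):
    """Compute #v, #c, #l and the deficiency estimate delta' = #c - #v
    from a DIMACS CNF string (sketch: assumes one clause per line)."""
    n_vars = n_clauses = n_lits = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("c"):
            continue                      # skip comments and blank lines
        if line.startswith("p cnf"):
            n_vars = int(line.split()[2])  # declared number of variables
            continue
        lits = [t for t in line.split() if t != "0"]
        n_clauses += 1
        n_lits += len(lits)               # #l is the sum of clause lengths
    return {"#v": n_vars, "#c": n_clauses, "#l": n_lits,
            "delta'": n_clauses - n_vars}
```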

EQV: A parallel algorithm to compute the number of equivalence classes
In [9], the authors generalize the notion of backbone to equivalence classes and propose an algorithm to compute the equivalence classes. However, their algorithm requires adding a quadratic number of auxiliary variables to a formula with n variables; in our dataset, n can be as high as 486,193 variables. Assuming every variable requires 4 bytes of RAM to be stored, the algorithm would necessitate around 472 GB of RAM to store the additional variables. This is unaffordable and prevents us from computing #eqv on most of the formulae we use in our experiments.
We therefore propose an adapted algorithm that requires less memory and can improve efficiency via parallelization. Our algorithm divides the computation of [9] to reduce the number of added variables and to enable spreading the work over multiple cores. It introduces an overhead, though, as it may increase the number of intermediate solver calls. As a result, our approach would run slower than [9] on a single-core computer, but brings benefits on multi-core infrastructures.

Algorithm 1 EQV(φ)
Require: φ, a satisfiable Boolean formula; returns v, the set of equivalence classes.

Our method is depicted by Algorithm 1, with ⊕ being the logical exclusive or operator. The algorithm uses a SAT procedure which takes a Boolean formula as input and either returns UNSAT if the formula is not satisfiable or returns the set of literals that represents the solution found by the SAT solver. The algorithm works as follows. We start by making a first call to SAT; here, we suppose that the formula is satisfiable. We then suppose that the formula has only a single solution and thus consider all the literals of this first model m to form one equivalence class, stored in the set p. The set p represents the set of possible but unverified equivalence classes. We pick a pair (a, b) out of the candidate equivalence classes (lines 6 to 9) whose equivalence has not yet been proven. We call the SAT solver on φ ∧ (a ⊕ b), asking for a solution in which a and b differ, to disprove their equivalence. If the result is UNSAT, we have a proof that a and b are equivalent in all the models and we modify v accordingly. The set v thus represents the set of verified equivalence classes. On the other hand, if the SAT solver returns a solution m′, we know that there exists a model of our formula in which a and b are not equivalent, so a and b cannot be in the same equivalence class. We can also learn more from the solution m′ by looking at its difference with our first model m: if two literals l and l′ are supposedly equal in every model and l is present in both m and m′, then so should l′ be. In other words, every change from m to m′ affecting l should also affect l′. Using this information, we update p between lines 18 and 24. In summary, p contains the verified separations of equivalence classes and v contains the verified unions of equivalence classes. The set p thus allows us to avoid making unnecessary SAT calls if we have already found two models that disprove the equivalence of two literals.
The for loop on line 6 is the one that may be parallelized. The critical sections of Algorithm 1 may seem very large, but the data structures v and p can be updated efficiently (especially since v may be implemented using a union-find data structure). Moreover, the SAT calls are made outside of any critical section and thus run in parallel, which should grant a significant speedup.
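The candidate-refinement idea behind EQV can be conveyed by the following simplified, sequential Python sketch. It is not a transcription of Algorithm 1: the SAT oracle is replaced by naive enumeration, and v and p are rendered as plain sets (proven pairs and candidate blocks). Each inner SAT call is independent, which is the part a parallel implementation would spread over cores.

```python
from itertools import product

def solve(clauses, n, extra=()):
    """Brute-force stand-in for the SAT oracle: returns a model as a set
    of literals, or None when the formula (plus extra clauses) is UNSAT."""
    cnf = list(clauses) + list(extra)
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in cnf):
            return {v if bits[v - 1] else -v for v in range(1, n + 1)}
    return None

def eqv(clauses, n):
    """Start from one model, split the candidate classes with every
    disproving model, and certify the surviving pairs with UNSAT calls."""
    m = solve(clauses, n)                 # first model; phi assumed satisfiable
    blocks = [set(m)]                     # p: candidate (unverified) classes
    proven = set()                        # pairs certified equivalent (UNSAT)
    restart = True
    while restart:
        restart = False
        for blk in blocks:
            rep, *rest = sorted(blk, key=abs)
            for b in rest:
                if frozenset((rep, b)) in proven:
                    continue
                # ask for a model where rep and b differ: phi AND (rep XOR b)
                m2 = solve(clauses, n, extra=[[rep, b], [-rep, -b]])
                if m2 is None:
                    proven.add(frozenset((rep, b)))   # equivalent in all models
                else:
                    # refine every candidate class by the disproving model m2
                    blocks = [s for old in blocks
                              for s in (old & m2, old - m2) if s]
                    restart = True
                    break
            if restart:
                break
    return blocks
```

On φ = (x1 ↔ x2) over three variables, the sketch returns the two non-redundant classes {¬x1, ¬x2} and {¬x3} (polarities follow the first model found).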

EXPERIMENTAL SETUP
We detail below the general experimental protocol that applies to all research questions.The specific settings of each research question are detailed in Section 5.

Samplers
SPUR [2]: SPUR is built on top of sharpSAT [43], a #SAT solver. Since sharpSAT essentially walks through all the solutions of a formula to count them, one might think of using that walk to sample from the formula, which is exactly what SPUR does. Being tightly integrated into sharpSAT, SPUR can exploit the way sharpSAT walks through the solutions and can thus produce uniform samples. SPUR is also one of the few samplers that come with theoretical guarantees regarding uniformity.
UniGen3 [41]: a hashing-based algorithm. To improve on UniGen2, the authors investigated its bottlenecks and made key improvements both to the algorithm and to the way CryptoMiniSat handles XOR formulae, leading to better performance.
We use both SPUR and UniGen3 as they are the state-of-the-art samplers with theoretical guarantees of uniformity.
In our study, we would also like to explore the relationship between URS and SAT solving, and between URS and model counting. To compare URS with SAT solving, we used the two solvers MiniSAT [14] and Z3 [11]. To compare with model counting, we used the two state-of-the-art model counters D4 [25] and sharpSAT [43]. Since another sampler called KUS [40] is based on D4, this should also give us insights into the complexity of KUS. We do not evaluate KUS itself as most of the complexity of its sampling process is absorbed by the call to D4, as demonstrated in [40].
We added an implementation of bounded SAT solving (BSAT) using Z3. BSAT is a function BSAT(φ, k) defined as follows: the function repeatedly calls Z3 on φ and removes the returned model from the formula until either the formula becomes unsatisfiable or the number of iterations exceeds k. BSAT is thus a form of SAT-based sampler which is almost guaranteed to be very far from uniform.
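The solve-block-repeat loop of BSAT can be sketched as follows, with a naive enumeration-based solver standing in for Z3 (the actual implementation calls Z3; everything here is illustrative):

```python
from itertools import product

def solve(clauses, n):
    """Brute-force stand-in for the Z3 call; returns a model or None."""
    for bits in product([True, False], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return [v if bits[v - 1] else -v for v in range(1, n + 1)]
    return None

def bsat(clauses, n, k):
    """BSAT(phi, k): solve and block the returned model until the formula
    becomes UNSAT or k models have been collected."""
    cnf = [list(c) for c in clauses]
    samples = []
    while len(samples) < k:
        m = solve(cnf, n)
        if m is None:
            break
        samples.append(m)
        cnf.append([-l for l in m])   # blocking clause excludes model m
    return samples
```

Because the solver (here, the enumeration) returns models in a deterministic order, the collected samples are heavily biased towards that order, which is why BSAT is far from uniform.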

#SAT preprocessing
We would like to study the influence of formula preprocessing on the complexity of URS and on the correlations with our metrics. To this end, we use a preprocessor called Arjun [42]. Arjun computes an independent support I of the input formula φ and removes the variables that are not in I if the projection can be done in reasonable time and space. We thus obtain a new formula φ′ which is the projection of φ on the variable set I. Arjun ensures that M_φ′ is the projection of M_φ on the independent support I and that |M_φ′| = |M_φ|. Thus, using Arjun as a preprocessor to URS does not influence the uniformity of a sampler that is guaranteed to be uniform.

Dataset
We use well-known and publicly available models in our study, which are of various complexity and are either feature models or general Boolean formulae.

Feature model benchmark.
Overall, we use the feature models of 128 real-world configurable systems (Linux, eCos, toybox, JHipster, etc.) with varying sizes and complexity. We first rely on 117 feature models used in [23,24]. The majority of these feature models contain between 1,221 and 1,266 features. Of these 117 models, 107 comprise between 2,968 and 4,138 cross-tree constraints, while one has 14,295 and the other nine have between 49,770 and 50,606 cross-tree constraints [23,24]. Second, we include ten additional feature models used in [26] and not in [23,24]; they also contain a large number of features (e.g., more than 6,000). Third, we add the JHipster feature model [17,38] to the study, a realistic but relatively small feature model (45 variables, 26,000+ configurations). We later refer to these benchmarks as the feature model benchmarks. Once put in conjunctive normal form, these instances typically contain between 1 and 15 thousand variables and up to 340 thousand clauses. The hardest of them, modeling the Linux kernel configuration, has more than 6,000 variables and 340,000 clauses. It is generally seen as a milestone in configurable system analysis.

General Boolean formulae.
In addition to these feature models, we replicated the initial experiments on industrial SAT formulae as conducted in [13]. We use these results to ensure that we are using the tools with the same configurations that were previously compared. Moreover, since these formulae are much smaller than the feature models we use (typically a few thousand clauses), they provide a basis of results for statistical analysis in case a solver cannot produce enough samples on the harder formulae.

Infrastructure
The experiments regarding the computation of the equivalence classes and the MIS, as well as the time and memory usage of the samplers, were run on an HPC cluster containing 318 nodes, each of which has 256 GB of RAM and 2 AMD Epyc ROME 7H12 CPUs running at 2.6 GHz.
To measure the memory usage of the samplers, we developed a wrapper program which reads the appropriate file in the /proc folder, which contains information about the virtual memory usage of the program. We asked the samplers to compute 1,000 samples while using less than 64 GB of RAM and in under 5 hours.
The treewidth was computed with the tool described in [18]. The correlations were computed using the SciPy Python library. To train the predictors, we used Python and the scikit-learn library [35]. We used standard parameters for the random forests, viz. we set the number of trees to 100, used Gini impurity for splitting, and set the number of features to consider at each split to the square root of the total number of features.
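For reproducibility, these standard parameters correspond to the following scikit-learn configuration. The toy metric vectors and labels below are illustrative placeholders (rows of [#v, #c, δ′]), not data from our experiments:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: one row of metrics [#v, #c, delta'] per formula;
# label 1 means "not affordable" under the budget. Values are illustrative.
X = [[44, 104, 60], [96, 299, 203], [1300, 3500, 2200],
     [6000, 340000, 334000], [5000, 50000, 45000], [120, 400, 280]]
y = [0, 0, 0, 1, 1, 0]

clf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    criterion="gini",      # Gini impurity for splitting
    max_features="sqrt",   # sqrt(total features) considered at each split
    random_state=0,        # fixed seed for reproducibility
)
clf.fit(X, y)
```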

RQ1: complexity factors
Table 1 shows the Kendall rank correlation coefficients for the SPUR and UniGen3 samplers. The coefficients have been computed on the instances for which we successfully managed to compute 1,000 samples in less than 5 hours and using less than 64 GB of virtual memory. This means that the table was computed on 416 formulae for SPUR and 241 formulae for UniGen3. The columns #v, #c, and #l represent, respectively, the number of variables, the number of clauses, and the number of literals, with the number of literals being the sum of the lengths of all clauses. The time and mem columns indicate, respectively, the computation time and the amount of virtual memory used by a single call to Z3. There are two groups in the table: the regular group, where we compute the correlations over our formulae, and the (+Arjun) group, where we first preprocess the formula with Arjun [42] and then call SPUR or UniGen3 on the output of Arjun. Some solvers take advantage of a possible MIS declaration inside the DIMACS files, but not all of them do; we thus removed the MIS declarations from the DIMACS files. The results with the MIS declaration are nonetheless available on our companion GitHub [4]. There are no correlations between the (+Arjun) groups and the equivalence classes because Arjun automatically removes redundant variables. The time and memory usage of Arjun is ignored (the median runtime was 0.15 seconds, with the longest runtime being 17 minutes). All the p-values are lower than 10^-3. We computed the MIS using the tool in [19] on both the initial formulae and the preprocessed formulae. Although Arjun [42] returns an independent support, we find that the resulting correlations are worse; we thus decided to compute the MIS with [19].
For both SPUR and UniGen3, we observe that the metrics most correlated with the computation time or the virtual memory usage are the size of the MIS and the number of equivalence classes. However, if we add Arjun as a preprocessing step, we observe that the correlations differ between SPUR and UniGen3. SPUR (+Arjun) is highly correlated with the number of clauses and with δ, while UniGen3 (+Arjun) is highly correlated with the number of variables and the size of the MIS. This difference can be explained through their respective algorithms. UniGen3 adds clauses to the formula, and the size and number of the added clauses depend on the number of variables (or on the MIS if the MIS is declared in the DIMACS file). SPUR, on the other hand, is based on an exhaustive DPLL algorithm, which means that SPUR spends a lot of time doing Boolean constraint propagation, which is sensitive to the number of clauses.

Table 1: Kendall rank correlation coefficients of the used metrics with SPUR (416 data points), SPUR (+Arjun) (441 data points), UniGen3 (241 data points) and UniGen3 (+Arjun) (309 data points). All of the p-values are lower than 0.001.
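The correlation computation itself is a direct SciPy call. The sketch below uses made-up metric/time pairs (the first two echo the blasted_case64/JHipster example, where the smaller formula is the slower one), not measurements from our dataset; SciPy returns τ in [−1, 1]:

```python
from scipy.stats import kendalltau

# Hypothetical measurements: #eqv per formula and sampling time in seconds.
# The first two points invert the ranking, so tau < 1 despite a clear trend.
n_eqv = [96, 44, 1200, 3100, 6400, 15000]
time_s = [8.0, 13.5, 120.0, 400.0, 2100.0, 9800.0]

tau, p_value = kendalltau(n_eqv, time_s)
```

Here 14 of the 15 pairs of points are concordant and one is discordant, so τ = (14 − 1)/15 ≈ 0.87.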
Answer to RQ1: The number of equivalence classes and the number of variables in the MIS strongly correlate (> 62 for all formulae) with the computation time and memory usage of both UniGen3 and SPUR. If the formulae are preprocessed with Arjun, then the highest correlations are with the number of variables, the number of clauses, δ, and the size of the MIS.

RQ2: complexity prediction
We cover here the results regarding formula classification using our trained random forests. We consider binary classification: we selected the formulae processed within the following affordability limits: 30 minutes of computation time and less than 4 GB of virtual memory. This selection yielded balanced training data.
Table 2 shows the Gini importances (i.e., feature importances) of our different metrics in a random forest that contains 1,000 instances. The lines where the SAT sampler is suffixed with "(+Arjun)" are the lines where the formulae were first preprocessed with Arjun [42]. The time and memory used by a single Z3 call play a negligible role. The two main features are the number of equivalence classes and the size of the MIS. If, however, we use Arjun as a preprocessor, we observe that the number of variables, the number of clauses, the number of literals, and δ seem to be interesting choices as well, further confirming our initial correlations. The treewidth has a high importance for SPUR (+Arjun) but is expensive to compute, diminishing its value for large formulae.
In Table 3, we explore the F1-scores of a random forest containing 100 instances trained on different metrics. The "all" line indicates the predictor trained on all of the metrics. We also use δ′ instead of δ in some of the experiments, where δ′ is defined as δ′ = #c − #v. While this is only an estimation of δ, our experiments show that it is usually a very good estimation, and it is a lot faster to compute as well. As previously, we report sampler results with and without the Arjun preprocessing step. #eqv is always ignored when Arjun is used as a preprocessor: Arjun automatically simplifies the equivalence classes in the formula, so #eqv = #v for the preprocessed formulae, eliminating the need to compute #eqv. The table entries that involve both Arjun and #eqv are simply computed by ignoring #eqv. The predictions were done using a leave-one-out strategy and the F1-scores were evaluated on the predictions. This means that, for every data point d, we trained a model on the complete dataset excluding d and performed a prediction for d. The predictions are collected in a table and the scores are computed on that prediction table. Table 4 shows the ROC AUCs just like Table 3 shows the F1-scores.
We observe that while the model trained on all features seems to perform best, the models trained on only a fraction of the features perform almost identically. The tables also show that if we were to take only one metric, then the number of equivalence classes is the best, unless Arjun is used as a preprocessor, in which case δ′ and the number of clauses seem to be very good candidates. If we focus on easily computable metrics, then the most promising models are the ones trained on the number of variables, δ′, and the number of equivalence classes. If we preprocess the formulae with Arjun, then the number of variables and δ′ seem sufficient. Furthermore, we find that using Arjun increases both the F1-scores and the ROC AUCs.
In Table 5, we report the F1-scores of decision trees (DT) and random forests (RF) using a different number of instances. The models were trained using #v, δ′ and #eqv (if Arjun is not used) and were evaluated using a leave-one-out strategy. We observe that a random forest containing 100 instances performs slightly better than the other models.
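The leave-one-out protocol described above can be sketched as follows. The data below are illustrative placeholders (rows of [#v, δ′, #eqv] with made-up labels), not our 488 formulae:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneOut

# Hypothetical metric vectors [#v, delta', #eqv]; label 1 = exceeds budget.
X = np.array([[40, 10, 30], [90, 25, 60], [5000, 4000, 2500],
              [60, 15, 45], [7000, 5200, 3000], [80, 20, 55],
              [6500, 4700, 2800], [55, 12, 40]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# One prediction per data point, each from a model trained on all the
# other points; the scores are then computed once on the prediction table.
preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    preds[test_idx] = clf.predict(X[test_idx])

score = f1_score(y, preds)
```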
Answer to RQ2: We find that the number of equivalence classes alone forms an excellent predictor to classify sampling difficulty according to an affordability budget.Similarly, we observe that if Arjun is used to preprocess a formula, then prediction becomes easier.

RQ3: URS
Table 6 shows the Kendall rank correlation coefficients for the MiniSAT and Z3 SAT solvers as well as our implementation of BSAT using Z3 and the state-of-the-art model counters D4 and sharpSAT.
All 488 models have been used for the lines involving Z3, MiniSAT and BSAT, as all of them were processed in less than 5 hours and with less than 64 GB of virtual memory. The BSAT algorithm is strongly correlated with the size of the MIS as well as the number of equivalence classes, but it is even more correlated with the number of variables, clauses, and literals. We do find that D4 and sharpSAT have very similar correlation coefficients to both SPUR and UniGen3. This would indicate that, in practice, the complexities of model counters and uniform random samplers are very close.

Table 4: ROC AUCs with different features of a random forest containing 100 instances, estimated using LOO.
For both MiniSAT and Z3, we observe a strong correlation with the numbers of variables, clauses and literals. We do not, however, observe a high correlation with the size of the MIS or the number of equivalence classes. The correlation coefficients thus differ substantially between SAT solving and URS. BSAT, on the other hand, behaves like a combination of SAT solving and URS in terms of correlations.

Table 6: Kendall rank correlation coefficients of the used metrics with Z3 and MiniSAT (488 data points), as well as BSAT using Z3 (488 data points), D4 (437 data points) and sharpSAT (416 data points).
Answer to RQ3: SAT solving and URS correlate with different metrics and are thus different tasks. Model counting appears to be very close to URS. BSAT behaves like a combination of both SAT solving and URS in terms of correlations.
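As a side note on methodology, a Kendall rank coefficient like those discussed above can be sketched in a few lines of pure Python. This is the tau-a variant, which assumes no ties; on real measurements one would use a tie-aware tau-b such as `scipy.stats.kendalltau`. The metric and runtime values below are made up for illustration.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs.
    Assumes no ties; tied data would call for the tau-b variant."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)  # sign tells pair (dis)concordance
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical metric values vs. sampling times: mostly rising together,
# so the rank correlation is high but not perfect.
metric = [10, 25, 30, 50, 80]
runtime = [1.2, 3.5, 2.9, 7.0, 20.1]
print(kendall_tau(metric, runtime))  # -> 0.8
```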

Perspectives
Our results demonstrate that the number of equivalence classes of a formula strongly correlates with the computational complexity of sampling it. This opens the perspective of increasing sampling efficiency by transforming an input formula into an equivalent one with fewer equivalence classes, similarly to what Arjun does.
We also revealed that, though to a lesser extent than the number of equivalence classes, the size of the MIS shows strong correlations with sampling complexity. Therefore, efficient ways to compute the MIS and to project a formula onto it would also increase URS efficiency. The usage of Arjun demonstrates this and further confirms our results: Arjun allowed the samplers to solve more instances and increased the performance of our prediction models.

THREATS TO VALIDITY
As for any empirical study, there are a number of threats to consider.
Construct Validity. To assess the validity of our findings, we used the Kendall rank correlation coefficient on the existing metrics and on our new #EQV metric. The Kendall rank correlation coefficient is nonparametric (and therefore agnostic to the data distribution) and has been used in the past to establish a relationship between structural metrics and runtime measures [3]. Regarding the evaluation of our random forest predictors, we used both the F1-score and the area under the Receiver Operating Characteristic curve (ROC AUC) in order to cope with different classification thresholds. The main reason for this is that our data is highly imbalanced and the two metrics react differently to imbalance.
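To make the two evaluation measures concrete, here is a stdlib-only sketch computing both on a tiny, made-up imbalanced example (in practice, scikit-learn's `f1_score` and `roc_auc_score` compute the same quantities):

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall on the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores):
    """AUC = probability that a random positive outranks a random negative
    (ties count half); threshold-free, unlike the F1-score."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Imbalanced toy data: 2 "hard" formulae (1) among 3 "easy" ones (0).
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.7, 0.3, 0.2]        # hypothetical classifier scores
y_pred = [int(s >= 0.5) for s in scores]  # one particular threshold
print(f1_score(y_true, y_pred))  # -> 0.5
print(roc_auc(y_true, scores))   # -> 0.8333...
```

The example shows why both measures are reported: the F1-score depends on the chosen threshold (0.5 here), while the AUC summarises ranking quality across all thresholds.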
External Validity. We cannot guarantee that our findings generalize to every formula and to all tools in each category (sampling, solving, counting). The reason is the lack of general understanding of the complexity of SAT-based tasks [15], which we aim to address with new metrics. To mitigate this threat, we selected a range of SAT formulae from two different sources: SAT benchmarks used for the evaluation of uniform samplers [7, 8, 13] and feature models representing configurable systems of various types and sizes [1, 37]. In both the FM and non-FM categories, formulae encode different types of models: electronic circuits, algorithmic problems, etc. for the former, and Linux kernels, Unix command line tools or configuration tools [17] for the latter.

RELATED WORK
Complexity of SAT problems. As noted by Alyahya et al. [3] and Vardi et al. [15], studying the complexity of SAT-based tasks is not new. One of the first approaches was to characterise phase transitions linked to abrupt changes in solving complexity. Monasson et al. offered a structural metric, namely the ratio of clauses to variables [29]. They demonstrated that as this ratio increases, finding solutions for a randomly generated formula gets progressively harder up to a critical value, past which the formula becomes easy to solve again (often by proving it UNSAT). Alyahya's survey further covers metrics we also used in this study, such as treewidth, which correlates with solving time [27]. These metrics had so far not been assessed for URS techniques. Yet, the MIS is expected to play a role in the scalability of sampling [42]. We found that the MIS is indeed correlated with solving time and memory consumption, but the difficulty of computing it is an issue. This motivated us to offer a more scalable metric.
Regarding FM-based formulae specifically, the body of knowledge is more limited. Mendonca et al. did not observe such phase transitions for FM formulae: solving was easy throughout the range of ratio values [28]. Liang et al. [26] further confirmed these results on larger industrial FMs. Johansen discusses the implications of these findings for combinatorial interaction testing of software product lines [20]. Our study is the first to evaluate and offer metrics for uniform sampling of FM and non-FM formulae, for which we show that a direct comparison with solving does not hold.

CONCLUSION
To understand the complexity of SAT-based uniform sampling, solving and counting, we have proposed an efficient algorithm to compute the equivalence classes (EQV) of a Boolean formula. This metric possesses two desirable properties that other structural metrics fail to combine: i) a strong correlation with computation time and memory consumption, and ii) a computation that scales even on complex formulae, thanks to its ability to exploit parallel computing infrastructures. We showed that EQV can accurately (ROC AUC scores > 87%) predict whether a formula is going to be easy or difficult to sample uniformly.
Furthermore, we showed that preprocessing techniques like Arjun not only improve the scalability of samplers but also make the performance predictions of said samplers easier and more accurate, further motivating the development of efficient preprocessing techniques for URS and model counting.
We also highlighted that EQV helps understand where URS complexity stands compared to two other SAT-based tasks: solving and model counting. We found that, at least in practice, URS is closer to model counting than to SAT solving. On the one hand, this prevents the naive use of standard solvers as uniform samplers. On the other hand, it further motivates research at the intersection of model counting and uniform sampling [42]. We expect our metric, as well as Arjun, to play a role in this bidirectional relationship, e.g., by supporting the development of new knowledge compilation techniques.

Definition 4 (Model counting (#SAT)). Model counting is the problem of computing the number of models of a formula, i.e., the size of its solution set.

Definition 5 (Uniform Random Sampling). URS is the problem of sampling a model from the solution set of a formula such that every model has the same probability of being sampled, namely one over the number of models.
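These two definitions can be illustrated with a brute-force sketch over a small CNF formula. Exhaustive enumeration is of course only feasible for a handful of variables; real counters such as D4 or sharpSAT and samplers such as SPUR or UniGen3 avoid it entirely. The formula below is an arbitrary toy example.

```python
import random
from itertools import product

# A CNF formula as a list of clauses; literal k denotes variable |k|,
# negated if k < 0. Toy example: (x1 or x2) and (not x1 or x2).
cnf = [[1, 2], [-1, 2]]
n_vars = 2

def satisfies(assignment, cnf):
    """assignment maps variable index -> bool; clause = disjunction."""
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in cnf)

# Enumerate all 2^n assignments and keep the models (Definition 4's set).
models = [dict(zip(range(1, n_vars + 1), bits))
          for bits in product([False, True], repeat=n_vars)
          if satisfies(dict(zip(range(1, n_vars + 1), bits)), cnf)]

print(len(models))              # model counting (#SAT): -> 2
sample = random.choice(models)  # URS: each model drawn with probability 1/2
assert satisfies(sample, cnf)
```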

Table 2: Feature importances in a random forest containing 1000 instances

Table 3: F1-scores with different features of a random forest containing 100 instances estimated using LOO

Table 5: F1-scores with different models trained on #v, ′ and #eqv estimated using LOO