Fire: An Optimization Approach for Fast Interpretable Rule Extraction

We present FIRE, Fast Interpretable Rule Extraction, an optimization-based framework to extract a small but useful collection of decision rules from tree ensembles. FIRE selects sparse representative subsets of rules from tree ensembles that are easy for a practitioner to examine. To further enhance the interpretability of the extracted model, FIRE encourages fusing rules during selection, so that many of the selected decision rules share common antecedents. The optimization framework uses a fusion regularization penalty to accomplish this, along with a non-convex sparsity-inducing penalty to aggressively select rules. Optimization problems in FIRE pose a challenge to off-the-shelf solvers due to problem scale and the non-convexity of the penalties. To address this, we exploit problem structure to develop a specialized solver based on block coordinate descent principles; our solver performs up to 40x faster than existing solvers. We show in our experiments that FIRE outperforms state-of-the-art rule ensemble algorithms at building sparse rule sets, and can deliver more interpretable models compared to existing methods.


INTRODUCTION
Tree ensembles are popular for their versatility and excellent off-the-shelf performance. While powerful, these models can grow to massive sizes and become difficult to interpret. To improve model parsimony and interpretability, decision rules can be extracted from trained tree ensembles. Each leaf node in a decision tree represents a decision rule; the path of internal nodes from root to leaf in the tree forms a conjunction of if-then antecedents that assigns a prediction to a partition of the dataset. Extracting a sparse (or parsimonious) subset of decision rules (leaf nodes) from a tree ensemble can produce a compact and transparent model that performs well in terms of prediction accuracy [11].
In this paper, we present the Fast Interpretable Rule Extraction (Fire) framework, an optimization-based framework to extract an interpretable collection of rules from tree ensembles. The goal of Fire is to select a small subset of decision rules that is representative of the larger collection of rules found in a tree ensemble. In addition to sparsity, Fire allows for the flexibility to encourage fusion in the extracted rules. In other words, the framework can encourage the selection of multiple rules that are close together from within the same decision tree, so that the selected rules (leaf nodes) share common antecedents (internal nodes). As we discuss later, encouraging fusion appears to improve the parsimony and interpretability of the extracted rule ensemble. To better convey our intuition, Figure 1 presents an illustration. From the original tree ensemble (panel A), we extract 16 decision rules by encouraging only sparsity (panel B), and by encouraging fusion with sparsity (panel C). The 16 decision rules selected in the sparsity-only panel each come from a different decision tree, while the decision rules selected in the fusion with sparsity panel come from only 6 trees. As a result, the rule set extracted by encouraging both fusion and sparsity contains substantially fewer internal nodes, since leaf nodes from the same decision tree share internal nodes. This translates to fewer if-then antecedents for a practitioner to examine in the rule ensemble, suggesting improved interpretability.
Fire is based on an optimization formulation that assigns a weight to each decision rule in a tree ensemble and extracts a sparse subset of rules by minimizing a regularized loss function. This allows a practitioner to evaluate the trade-off between model compactness and performance by varying the regularization penalty, and to select an appropriately sized model. Fire uses a non-convex sparsity-inducing penalty, popular in high-dimensional linear models, to aggressively select rules, and a fused LASSO penalty [26] to encourage rule fusion. The fused LASSO is a classical tool for approximating a signal by a piecewise constant function using an $\ell_1$-based penalty; we present a novel exploration of this tool in the context of rule ensemble extraction.
Optimization problems in Fire pose a challenge to existing solvers due to problem size and the non-convexity of the penalties. On that account, we develop a novel optimization algorithm to efficiently obtain high-quality solutions to these optimization problems. Our algorithms leverage problem structure and block coordinate descent combined with greedy selection heuristics to improve computational efficiency. By exploiting the blocking structure of problems in Fire, our specialized solver scales and allows for computations that appear to be well beyond the capabilities of off-the-shelf solvers. In addition, our algorithms support warm start continuation across tuning parameters, which allows a practitioner to use Fire to rapidly extract rule sets of varying sizes. With our specialized solver, Fire is computationally fast and easy to tune. We show in our experiments that Fire extracts sparse rule sets that outperform state-of-the-art competing algorithms, by up to a 24% decrease in test error. We also demonstrate through a real-world example that Fire extracts decision rules that are easier to interpret compared to the rules selected by existing methods.

Figure 1 (panel C: Extracted Rules: Sparsity w/ Fusion): Fusion improves parsimony by reducing internal nodes. In both panels, 16 decision rules are selected but the sparsity with fusion panel contains 44% fewer internal nodes.
Our paper is organized as follows. We first overview rule extraction from tree ensembles. We then introduce our model framework and discuss the effects of our new penalties. We next present our specialized optimization algorithm along with timing experiments against off-the-shelf solvers. Finally, we present our experimental results and our interpretability case study. An open-source implementation of Fire, along with a supplement containing derivations and experimental details, can be found in the project repository.

Main Contributions
• We introduce the Fire framework for rule extraction. Fire selects sparse representative subsets of decision rules from tree ensembles and can encourage fusion so that the selected rules share common antecedents.
• Fire is based on a regularized loss minimization framework. Our regularizer comprises a non-convex sparsity-inducing penalty to aggressively select rules, and a fused LASSO penalty to encourage fusion. Our work is the first to explore this family of penalty functions, originating in high-dimensional statistics, in the context of rule extraction.
• We show how encouraging fusion (in addition to vanilla sparsity) when extracting rules improves the interpretability and compression of the selected model.
• Optimization problems in Fire are challenging due to problem scale and the non-convex penalties, so we develop a specialized solver for our framework. Our algorithm computes solutions up to 40× faster than off-the-shelf solvers on medium-sized problems (10000s of data points and decision variables) and can scale to larger problems.
• We show in our experiments that Fire extracts sparse rule sets that outperform rule ensembles built by state-of-the-art algorithms, with up to a 24% decrease in test error. In addition, Fire performs significantly better than RuleFit [11], a classical optimization-based rule extraction algorithm, with up to a 46% decrease in test error when extracting sparse models.

PRELIMINARIES & RELATED WORK
In this section, we provide a cursory overview of decision trees, rules, and tree ensembles, and survey existing work on rule extraction.

Decision Trees and Decision Rules
Given feature matrix $X \in \mathbb{R}^{N \times P}$ and target $y \in \mathbb{R}^{N}$, decision tree $\Gamma(X)$ maps $\mathbb{R}^{N \times P} \to \mathbb{R}^{N}$. A decision tree of maximum depth $d$ partitions the training data into at most $2^d$ non-overlapping partitions. Each partition, or leaf node, is defined by a sequence of at most $d$ splits, and data points in a partition are assigned the mean (regression) or majority class (classification) for predictions. Each split is an if-then rule that thresholds a single feature; splits partition the data based on whether the feature value of a data point falls above or below that threshold.
Decision rules are conjunctions of if-then antecedents that partition a dataset and assign a prediction to each partition [11]. Decision trees can be viewed as a collection of decision rules, where each leaf node in the tree is a rule. The decision path to each leaf node is a conjunction of if-then antecedents, and data points partitioned by these antecedents are assigned a prediction equal to the mean or majority class of the node. For example, consider the decision tree shown in figure 2 and the rule obtained from its leaf node $l_3$, where $\mathbb{1}(\cdot)$ is the indicator function and $v_3$ is the prediction value of leaf node $l_3$. More generally, the rule obtained from a leaf node whose decision path traverses $S$ splits can be expressed by:
$$ r(x) = v \prod_{s=1}^{S} \mathbb{1}(x \in \sigma_s), $$
where $\sigma_s$ is the set of data points partitioned along the decision path by split $s$ and $v$ is the value of the node.
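To make the conjunction form concrete, the following is a minimal sketch of a decision rule as a product of indicator functions. The feature names, thresholds, and leaf value are hypothetical and chosen only for illustration; `make_rule` is not part of any Fire implementation.

```python
from typing import Callable, Dict, List

Point = Dict[str, float]
Split = Callable[[Point], bool]  # one if-then antecedent on a single feature


def make_rule(splits: List[Split], value: float) -> Callable[[Point], float]:
    """Return r(x) = value * prod_s 1(x satisfies split s)."""
    def rule(x: Point) -> float:
        # The product of indicators is 1 only if every antecedent holds.
        return value if all(split(x) for split in splits) else 0.0
    return rule


# Hypothetical path of two splits ending in a leaf with value 2.1.
rule = make_rule(
    [lambda x: x["MedInc"] > 3.0, lambda x: x["AveRooms"] <= 5.5],
    value=2.1,
)
inside = rule({"MedInc": 4.0, "AveRooms": 5.0})   # both antecedents hold
outside = rule({"MedInc": 2.0, "AveRooms": 5.0})  # first antecedent fails
```

A point that fails any antecedent on the path receives zero from the rule, which is what allows rules to be combined additively in a rule ensemble.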

Tree Ensembles
Tree ensembles consist of a collection of $T$ decision trees, $\{\Gamma_t(X) : t \in [T]\}$. This collection of trees can be obtained via bagging [3], where trees are trained in parallel on bootstrapped samples of the data, or through boosting [10], where dampened trees are added sequentially and trained on the residuals of the prior ensemble. Rule ensembles (rule sets) are collections of decision rules with weights assigned to each rule. The prediction of a rule ensemble is obtained by taking the weighted linear combination of the rules. For example, we can obtain a rule ensemble from the decision tree in figure 2 by assigning each rule $r_j(x)$ weight $w_j$. The prediction of the rule ensemble can be expressed as $\sum_{j=1}^{4} w_j r_j(x)$. Tree ensembles result in large collections of decision rules, which can be extracted into sparse rule ensembles.
We use the following notation to discuss extracting rule ensembles from trees. Consider a decision tree $\Gamma_t$, fit on data matrix $X \in \mathbb{R}^{N \times P}$, with $R_t$ leaf nodes. Each leaf node $l_j$ has prediction value $v_j$ for $j \in [R_t]$. Recall that if data point $x_i$ reaches leaf node $l_j$, the data point is assigned prediction value $v_j$. We define a mapping matrix $M_t \in \mathbb{R}^{N \times R_t}$ whose $(i, j)$-th entry is given by:
$$ (M_t)_{ij} = \begin{cases} v_j & \text{if } x_i \text{ reaches leaf node } l_j, \\ 0 & \text{otherwise.} \end{cases} $$
Mapping matrix $M_t$ maps data points to predictions. The matrix is sparse, with density $1/R_t$, since each data point is routed to a single leaf in the decision tree. Let weight vector $w_t \in \mathbb{R}^{R_t}$ represent the weights assigned to each leaf node; $M_t$ and $w_t$ define the rule ensemble obtained from $\Gamma_t$. The prediction of this rule ensemble is given by $M_t w_t$.
For an ensemble of $T$ trees, define $M_t$ for each tree $t \in [T]$. Let $R = \sum_{t=1}^{T} R_t$ denote the total number of rules (leaf nodes) in the ensemble and denote the mapping matrix $M \in \mathbb{R}^{N \times R}$ as $M = [M_1, M_2, \ldots, M_T]$. Matrix $M$ is also sparse, with density $T/R$. Given weight vector $w \in \mathbb{R}^{R}$, the prediction of this rule ensemble is $Mw$.
To extract rules, we fit weight vector $w$; setting an entry of $w$ to zero prunes the corresponding rule from the ensemble.
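The mapping-matrix construction can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: the leaf assignments and leaf values for the two trees below are made up, and `mapping_matrix`, `hstack`, and `predict` are hypothetical helper names.

```python
def mapping_matrix(leaf_of_point, leaf_values):
    """Build M_t (N x R_t): row i holds leaf_values[j] in the column j of
    the leaf that point i reaches, and zeros elsewhere."""
    n_leaves = len(leaf_values)
    return [[leaf_values[j] if j == leaf else 0.0 for j in range(n_leaves)]
            for leaf in leaf_of_point]


def hstack(blocks):
    """Concatenate per-tree matrices column-wise: M = [M_1, ..., M_T]."""
    return [sum((blk[i] for blk in blocks), []) for i in range(len(blocks[0]))]


def predict(M, w):
    """Rule-ensemble prediction Mw."""
    return [sum(m * wj for m, wj in zip(row, w)) for row in M]


# Two toy trees on N = 4 data points (assignments and values are invented).
M1 = mapping_matrix([0, 0, 1, 1], [1.0, 3.0])        # R_1 = 2 leaves
M2 = mapping_matrix([0, 1, 1, 2], [0.5, 2.0, 4.0])   # R_2 = 3 leaves
M = hstack([M1, M2])                                  # N x R with R = 5

w = [0.5, 0.5, 0.5, 0.0, 0.5]  # the zero weight prunes leaf 2 of tree 2
y_hat = predict(M, w)

# Each row has one nonzero per tree, so the density equals T / R = 2 / 5.
density = sum(v != 0.0 for row in M for v in row) / (len(M) * len(M[0]))
```

In this toy example every leaf value is nonzero, so the density matches $T/R$ exactly; a leaf whose prediction value is zero would lower the count of stored nonzeros.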

Related Work
Extracting decision rules from tree ensembles was first introduced in 2005 by RuleFit [11]. Following the notation introduced above, RuleFit selects a subset of rules by minimizing the $\ell_1$-regularized optimization problem (aka LASSO):
$$ \min_{w} \; \frac{1}{2} \| y - Mw \|_2^2 + \lambda \| w \|_1, \tag{2} $$
which penalizes the $\ell_1$ norm of the weights $w$ with regularization parameter $\lambda$. RuleFit uses LASSO solvers to compute a solution $w$ to Problem (2). Subsequently, various algorithms to post-process or generate rule ensembles have been proposed. Node Harvest [19] uses the non-negative garrote and quadratic programming to select rule ensembles from tree ensembles. More recently, SIRUS [2] uses stabilized random forests to build and aggregate rule sets, and GLRM [29] uses column generation to create rules from scratch. To the best of our knowledge, Fire is the first framework that incorporates improved sparse selection and rule groupings within a holistic optimization framework. We show in our experiments that Fire outperforms SIRUS and GLRM at selecting sparse human-readable rule sets.
Existing optimization-based rule extraction algorithms, such as RuleFit, face several challenges due to the structure of tree ensembles. The number of variables in the optimization problem (i.e., the number of leaves in the tree ensemble) increases exponentially with tree depth, as shown in the left plot in figure 3. RuleFit uses cyclic coordinate descent to solve the LASSO, which becomes expensive when the number of coordinates is large. As a result, RuleFit is restricted to shallow tree ensembles. We show in §4 that our specialized optimization algorithm for Fire is robust to the depth of the ensemble and scales substantially better than the LASSO solvers used by RuleFit.
An important difference between Fire and RuleFit stems from a simple yet critical observation: shallow tree ensembles contain many correlated decision rules. The right plot in figure 3 shows the pairwise correlations between the columns of $M$ on a depth 2 ensemble of 500 trees; many pairs of columns have correlation scores close to 1. This further complicates rule extraction, since LASSO performs poorly at sparse selection on highly correlated features [12,13,25]. Earlier work in high-dimensional statistics proposes the use of non-convex penalties, which perform better at sparse selection in the presence of correlation [18,30].

PROPOSED MODELING FRAMEWORK
In this section, we present our model framework and discuss in detail the effects of our non-convex sparsity-inducing penalty and fusion penalty on decision rule extraction.
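Consider a tree ensemble with $T$ decision trees $\{\Gamma_t(X) : t \in [T]\}$. Each decision tree has $R_t$ leaf nodes and mapping matrix $M_t \in \mathbb{R}^{N \times R_t}$. The ensemble has $R = \sum_{t=1}^{T} R_t$ leaf nodes and mapping matrix $M \in \mathbb{R}^{N \times R}$. Fire selects a sparse subset of rules by learning weights $w$ by solving:
$$ \min_{w} \; f(w) + h(w, \lambda_s) + g(w, \lambda_f), \tag{3} $$
where $f(w) = (1/2) \| y - Mw \|_2^2$ is the quadratic loss that measures data fidelity, $h$ is the sparsity penalty with regularization parameter $\lambda_s$, and $g$ is the fusion penalty with parameter $\lambda_f$.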

Sparsity-Inducing Penalty
We discuss possible choices for sparsity-inducing penalty $h$. One baseline choice is the LASSO from RuleFit, where $h(w, \lambda_s) = \lambda_s \sum_{j=1}^{R} |w_j|$. However, this $\ell_1$-penalty encourages heavy shrinkage and bias in $w$, which makes the LASSO a poor choice for sparse variable selection in the presence of correlated variables [12,13]. Since we intend to perform sparse selection from a collection of correlated rules, we use a non-convex penalty which incurs less bias than LASSO.
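Many unbiased and nearly unbiased penalty functions exist, such as the $\ell_0$-penalty [12], the smoothly clipped absolute deviation (SCAD) penalty [6], and the minimax concave plus (MCP) penalty [30]. Fire uses the MCP penalty since it is continuous and subdifferentiable, properties that will come in handy when we develop our optimization solver. We set $h$ as:
$$ h(w, \lambda_s) = \sum_{j=1}^{R} P_\gamma(w_j, \lambda_s), $$
where $P_\gamma(w_j, \lambda_s)$ is the MCP penalty function defined by:
$$ P_\gamma(w_j, \lambda_s) = \begin{cases} \lambda_s |w_j| - \dfrac{w_j^2}{2\gamma} & \text{if } |w_j| \le \gamma \lambda_s, \\[4pt] \dfrac{1}{2} \gamma \lambda_s^2 & \text{if } |w_j| > \gamma \lambda_s, \end{cases} $$
and $\gamma > 1$ is a hyperparameter that (loosely speaking) controls the concavity of the penalty function. As $\gamma \to \infty$, the penalty behaves like the LASSO, and as $\gamma \to 1^+$ it operates like the $\ell_0$-penalty.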
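As an illustration of the two regimes, here is a small sketch of the MCP penalty in its standard textbook form, compared against the LASSO penalty. The function names and the numeric inputs are illustrative assumptions.

```python
def mcp(w, lam, gamma):
    """Standard MCP penalty P_gamma(w, lam) for gamma > 1."""
    a = abs(w)
    if a <= gamma * lam:
        # Concave region: the LASSO term minus a quadratic correction.
        return lam * a - a * a / (2.0 * gamma)
    # Flat region: large weights incur a constant penalty (no extra shrinkage).
    return 0.5 * gamma * lam * lam


def lasso(w, lam):
    """LASSO penalty lam * |w|, which keeps growing with |w|."""
    return lam * abs(w)


# With gamma close to 1 the penalty saturates quickly, like a scaled l0 penalty.
saturated_small = mcp(2.0, 1.0, 1.1)
saturated_large = mcp(10.0, 1.0, 1.1)

# With a very large gamma the penalty is nearly identical to the LASSO.
near_lasso = mcp(0.5, 1.0, 1e6)
```

Because the penalty is constant beyond $\gamma \lambda_s$, large weights are not shrunk further, which is the source of the reduced bias relative to the LASSO.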

Fusion Penalty
In addition to sparse selection, we also present a framework to encourage rule fusion. To this end, we use a fused LASSO penalty [26] to encourage contiguity in the leaf nodes (rules) selected or pruned from within each tree. The fused LASSO penalizes the sum of absolute differences between the coefficients and is commonly used for piecewise constant signal approximation [14]. Here, we explore how this classical penalty function can be used to improve rule extraction.
Let $w_t \in \mathbb{R}^{R_t}$ represent the sub-vector of weights in $w$ that correspond to tree $\Gamma_t$ and let $(w_t)_j$ denote the $j$-th entry of $w_t$. We set $g$ as:
$$ g(w, \lambda_f) = \lambda_f \sum_{t=1}^{T} \| D_t w_t \|_1, $$
where $D_t \in \{-1, 0, 1\}^{(R_t - 1) \times R_t}$ is the tree fusion matrix with $(D_t)_{ij} = -1$ for all $i = j$ and $(D_t)_{ij} = 1$ for all $i = j - 1$. We have that:
$$ \| D_t w_t \|_1 = \sum_{j=2}^{R_t} \left| (w_t)_j - (w_t)_{j-1} \right|. $$
This penalizes the absolute value of the differences of the fitted weights in each tree. As a result, in problem (3), for larger values of $\lambda_f$, the nodes assigned zero weights and pruned (and the nodes assigned non-zero weights and selected) are grouped together in each tree. In what follows we provide some intuition into why this can be appealing to a practitioner.
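The effect of the fusion penalty on grouped versus scattered selections can be seen in a short sketch. The weight vectors below are invented for illustration, and `fusion_penalty` is a hypothetical helper that sums absolute differences between consecutive leaf weights within each tree.

```python
def fusion_penalty(blocks, lam_f):
    """lam_f * sum_t ||D_t w_t||_1, i.e. the sum over trees of absolute
    differences between the weights of consecutive leaves."""
    return lam_f * sum(
        abs(w_t[j] - w_t[j - 1])
        for w_t in blocks
        for j in range(1, len(w_t))
    )


# One tree with 8 leaves; both selections keep 4 rules with weight 1.
scattered = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
grouped = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

penalty_scattered = fusion_penalty([scattered], lam_f=0.5)  # 7 jumps
penalty_grouped = fusion_penalty([grouped], lam_f=0.5)      # 1 jump
```

Both selections have the same number of nonzero weights, so a pure sparsity penalty cannot distinguish them, while the fusion term charges the scattered selection seven times as much in this example.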

Compressing Trees.
By grouping pruned leaf nodes together, we increase the number of internal nodes removed from a tree, since an internal node whose children are pruned is removed as well. Consider the example in figure 4.

Figure 4 (left panel: Sparsity Only; right panel: Sparsity w/ Fusion): Grouping pruned leaves increases the number of internal nodes removed from a decision tree.
In both plots, we prune 16 out of the 32 leaf nodes from a depth 5 decision tree fit on the California Housing Prices dataset [28]. In the left plot, the pruned leaves are noncontiguous, so 0 internal nodes are removed; the pruned tree contains 47 total (leaf + internal) nodes. In contrast, the pruned leaves are grouped in the right plot. Consequently, 13 additional internal nodes are removed and the pruned tree contains 34 total nodes. Both trees incur the same rule sparsity penalty of 16 leaves, but the right tree contains 28% fewer total nodes. The fusion penalty $g$ encourages grouping in the leaf nodes pruned from each tree, which further compresses tree ensembles compared to only regularizing for sparsity in the leaves.
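The node counts above can be checked with a small sketch that counts the nodes remaining after pruning. It assumes an idealized complete binary tree with leaves indexed left to right, which is an assumption made for illustration and need not match the fitted tree in figure 4; an internal node is removed when every leaf beneath it is pruned.

```python
def nodes_after_pruning(depth, pruned):
    """Count nodes left in a complete binary tree of the given depth after
    removing the leaves in `pruned` and every internal node whose entire
    subtree of leaves has been pruned."""
    n_leaves = 2 ** depth
    level = [i not in pruned for i in range(n_leaves)]  # True = node kept
    total = sum(level)
    while len(level) > 1:
        # A parent survives if at least one of its two children survives.
        level = [level[i] or level[i + 1] for i in range(0, len(level), 2)]
        total += sum(level)
    return total


# Prune 16 of 32 leaves in a depth-5 tree in two different patterns.
alternating = nodes_after_pruning(5, set(range(0, 32, 2)))  # scattered leaves
contiguous = nodes_after_pruning(5, set(range(16)))         # one grouped run
```

With alternating leaves pruned, no internal node is removed and 47 nodes remain, matching the left plot. Perfectly contiguous pruning in this idealized tree leaves 32 nodes, slightly fewer than the 34 reported for the right plot, where the grouping in the fitted tree is presumably less regular.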

Grouping Rules.
Grouping the rules selected from each tree improves model interpretability since grouped rules share antecedents. In figure 5, we select 4 rules from the decision tree shown in figure 4. The left plot selects rules that are spread out and the right plot selects rules that are grouped.
Consider the task of interpreting all interaction effects in the rule ensemble up to depth 3. The 4 rules in the right ensemble share the antecedents $\mathbb{1}(\text{MedInc} > s_1)$, $\mathbb{1}(\text{AveRooms} > s_2)$, and $\mathbb{1}(\text{Longitude} > s_3)$, where $s_1$, $s_2$, and $s_3$ are split thresholds. A user would need to analyze 3 antecedents to interpret the interactions. For the left ensemble, a user would need to analyze 7 antecedents to interpret all depth 3 interactions.
Fusion regularizer $g$ (7) introduces a more natural way of penalizing the complexity of the selected rules. Consider selecting two leaf nodes; the first leaf node shares a parent node with a leaf that has already been selected and the second leaf node is on a branch where no leaves have been selected. Selecting the first leaf adds no internal nodes (antecedents) to the rule ensemble, while selecting the second leaf can add up to $d$ new antecedents. Sparsity regularizer $h$ penalizes both choices equally, but the fusion regularizer incurs an additional penalty for the second choice.

Hyperparameters
We discuss how to select good values for hyperparameters in Fire. We denote $\lambda_s$ as the sparsity hyperparameter, $\gamma$ as the concavity hyperparameter, and $\lambda_f$ as the fusion hyperparameter.
3.3.1 Sparsity. Sparsity hyperparameter $\lambda_s$ generally controls the number of rules extracted from the tree ensemble, i.e., the number of nonzero entries in $w$. We can use warm start continuation to efficiently compute the entire regularization path of $w$ across $\lambda_s$ with the other hyperparameters held fixed. We start with a value of $\lambda_s$ sufficiently large such that $w^* = 0$ and decrement $\lambda_s$, using the previous solution as a warm start to problem (3), until the full model is reached [8]. This procedure allows a practitioner to quickly evaluate rule ensembles of different sizes. Given any fixed configuration of $\gamma$ and $\lambda_f$, it is easy to select $\lambda_s$; we compute the regularization path for the hyperparameter and find the value of $\lambda_s$ that minimizes validation loss.
The left plot in figure 6 demonstrates the effect of concavity hyperparameter $\gamma$ on extracting rules from a tree ensemble fit on the US Communities and Crime dataset [28]. We conduct a sensitivity analysis on the MCP penalty by varying $\gamma \in \{1.1, 3, 10\}$ and computing the regularization path of $\lambda_s$ for each value of $\gamma$. The horizontal axis shows the number of rules extracted and the vertical axis shows the test performance of the selected model. The right plot in figure 6 shows the corresponding shape of the MCP penalty function.
When $\gamma \to 1^+$, the MCP penalty performs better at selecting very sparse subsets of rules, due to the reduced bias of the more concave penalty [30]. However, this aggressive selection can result in overfitting in low-signal regimes. Increasing $\gamma$ increases the shrinkage imparted by the MCP penalty, which regularizes the model and reduces overfitting. As $\gamma$ increases, the model is less likely to overfit in the low regularization regime (RHS of figure 6) but performs worse at sparse selection (LHS of figure 6).
Selecting the best value for $\gamma$ depends on the use case for Fire. When using Fire to select very sparse rule ensembles, we recommend setting $\gamma$ close to $1^+$. Otherwise, we consider a small number of possible $\gamma$ values and, for each value, we compute the regularization path for $\lambda_s$. We use validation tuning to select an appropriate $(\lambda_s, \gamma)$ pair. Our framework's ability to use warm-start continuation to compute regularization paths for $\lambda_s$ makes this 2-dimensional tuning computationally efficient.
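Warm start continuation can be sketched as follows. To keep the example short, this sketch swaps in the $\ell_1$-penalty and plain cyclic coordinate descent as a stand-in solver; it is not the paper's algorithm, and the matrix columns and targets are invented toy data.

```python
def soft(z, t):
    """Soft-thresholding operator."""
    return z - t if z > t else z + t if z < -t else 0.0


def lasso_cd(cols, y, lam, w, sweeps=200, tol=1e-10):
    """Cyclic coordinate descent for (1/2)||y - Mw||^2 + lam * ||w||_1,
    started from the warm start w. `cols` holds the columns of M."""
    w = list(w)
    resid = [yi - sum(c[i] * wj for c, wj in zip(cols, w))
             for i, yi in enumerate(y)]
    for _ in range(sweeps):
        max_change = 0.0
        for j, c in enumerate(cols):
            z = sum(ci * ci for ci in c)
            if z == 0.0:
                continue
            # Correlation of column j with the residual that excludes rule j.
            rho = sum(ci * ri for ci, ri in zip(c, resid)) + z * w[j]
            delta = soft(rho, lam) / z - w[j]
            if delta != 0.0:
                resid = [ri - ci * delta for ri, ci in zip(resid, c)]
                w[j] += delta
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return w


def regularization_path(cols, y, lambdas):
    """Solve from the largest penalty to the smallest, reusing each
    solution as the starting point for the next value."""
    w = [0.0] * len(cols)
    path = []
    for lam in sorted(lambdas, reverse=True):
        w = lasso_cd(cols, y, lam, w)
        path.append((lam, list(w)))
    return path


# Toy mapping-matrix columns for two trees on 4 points (made-up values).
cols = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 3.0, 3.0],
        [0.5, 0.0, 0.0, 0.0], [0.0, 2.0, 2.0, 0.0], [0.0, 0.0, 0.0, 4.0]]
y = [1.0, 1.0, 3.0, 5.0]
path = regularization_path(cols, y, [0.01, 1.0, 10.0, 30.0])
```

The first penalty value exceeds the largest absolute correlation between a column and `y` (24 here), so the path starts at the all-zero solution, and each subsequent solve begins from the previous weights rather than from scratch.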

Fusion.
Fusion hyperparameter $\lambda_f$ influences the interpretability of the extracted rule ensemble. Increasing $\lambda_f$ encourages more fused rules, which are easier to interpret. The best value of $\lambda_f$ is use-case specific, and we observe empirically that values of $\lambda_f \in [0.5 \lambda_s, 2 \lambda_s]$ work well. We show in our case study in §6 the effect of $\lambda_f$ on the interpretability of the selected ensemble. Increasing $\lambda_f$ also adds regularization, which may be useful for preventing overfitting on noisy datasets.

OPTIMIZATION ALGORITHM
We present our specialized optimization algorithm to efficiently obtain high-quality solutions to problem (3). Note that the smooth loss function and non-smooth regularizers in problem (3) are separable across blocks $w_t$, where each block represents a tree in the ensemble. Motivated by the success of block coordinate descent (BCD) algorithms for large-scale sparse regression [9,12], we apply the method to problem (3). As we discuss below, our proposed algorithm has notable differences: we make use of a block structure to perform updates, taking advantage of a structure that naturally arises from the tree ensemble. Also, a direct application of cyclic coordinate descent approaches can be quite expensive, so we use a greedy selection rule motivated by the success of greedy coordinate descent for LASSO problems [7,17]. This results in important computational savings, as our experiments in §4.4 show.

Block Proximal Updates
We make use of a natural blocking structure that arises in our tree ensemble. Specifically, each block $t$ corresponds to a tree in the ensemble with mapping matrix $M_t$ and associated weights $w_t \in \mathbb{R}^{R_t}$. For a fixed block $t$, let $\delta$ denote the other blocks. The goal of a block update is to update weights $w_t$ while holding everything else constant. The optimization criterion for each block update is:
$$ \min_{w_t} \; f(w_t) + h(w_t, \lambda_s) + g(w_t, \lambda_f), \tag{8} $$
where $f(w_t) = \frac{1}{2} \| \hat{y} - M_t w_t \|_2^2$ and $\hat{y} = y - M_\delta w_\delta$. This composite criterion has smooth loss function $f$ and non-smooth regularizers $h$ and $g$, so we apply (block) proximal gradient updates [1,21].
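The function $w_t \mapsto f(w_t)$ has Lipschitz continuous gradient and satisfies $\| \nabla f(u) - \nabla f(v) \| \le L_t \| u - v \|$ for all $u$ and $v$, where $L_t$ is the largest eigenvalue of $M_t^\top M_t$. At point $w_t^k$, each proximal update minimizes the quadratic approximation of objective (8) and can be expressed as:
$$ \theta^* = \arg\min_{\theta} \; \frac{L_t}{2} \| \theta - \hat{\theta} \|_2^2 + h(\theta, \lambda_s) + g(\theta, \lambda_f), \tag{9} $$
where $\hat{\theta} = w_t^k - \frac{1}{L_t} \nabla f(w_t^k)$. Our choice of step size $\frac{1}{L_t}$ ensures that the proximal updates lead to a sequence of decreasing objective values [1]. We show that univariate problem (9) has closed-form minimizers $\theta^*$ for all choices of sparsity penalty $h$ when $g = 0$, and can be rapidly solved using dynamic programming when fusion penalty $g$ is introduced.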
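The gradient step that produces the centre of the proximal update can be sketched directly. One observation used here, which follows from each data point reaching a single leaf of a tree, is that the columns of a single tree's mapping matrix have disjoint supports, so its Gram matrix is diagonal and the largest eigenvalue is simply the largest squared column norm. The toy matrix and the helper names are illustrative assumptions.

```python
def block_lipschitz(M_t):
    """Largest eigenvalue of M_t^T M_t. The columns of a single tree's
    mapping matrix never share a nonzero row, so the Gram matrix is
    diagonal and its eigenvalues are the squared column norms."""
    n_cols = len(M_t[0])
    return max(sum(row[j] ** 2 for row in M_t) for j in range(n_cols))


def gradient_step(M_t, y_hat, w_t):
    """Return (theta_hat, L_t) with theta_hat = w_t - grad f(w_t) / L_t,
    where f(w_t) = (1/2) * ||y_hat - M_t w_t||^2."""
    L_t = block_lipschitz(M_t)
    resid = [yi - sum(m * w for m, w in zip(row, w_t))
             for row, yi in zip(M_t, y_hat)]
    # grad f(w_t) = -M_t^T (y_hat - M_t w_t)
    grad = [-sum(row[j] * r for row, r in zip(M_t, resid))
            for j in range(len(w_t))]
    return [w - g / L_t for w, g in zip(w_t, grad)], L_t


# Toy block: 4 points, 2 leaves with values 1 and 3 (made-up numbers).
M_t = [[1.0, 0.0], [1.0, 0.0], [0.0, 3.0], [0.0, 3.0]]
y_hat = [1.0, 1.0, 3.0, 5.0]
theta_hat, L_t = gradient_step(M_t, y_hat, [0.0, 0.0])
```

The vector `theta_hat` is then passed to the proximal operator of the penalties, which is where the sparsity and fusion terms act.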

Sparsity with Fusion.
Now consider the case where the fusion penalty $g$ is nonzero. We start with the sparsity penalty set to zero: $\lambda_s = 0$ and $h(\theta, \lambda_s) = 0$. Problem (9) can be re-expressed as:
$$ \theta^*(0, \lambda_f) = \arg\min_{\theta} \; \frac{1}{2} \| \theta - \hat{\theta} \|_2^2 + \frac{\lambda_f}{L_t} \| D_t \theta \|_1, \tag{11} $$
which is equivalent to the 1-dimensional fused lasso signal approximation (FLSA) problem, with fusion penalty parameter $\lambda_f / L_t$. This FLSA problem can be solved efficiently using the dynamic programming algorithm proposed by [16], in worst-case time linear in the length of the signal, which here is the number of leaves $R_t$ in the block.
Given the solution to problem (11), $\theta^*(0, \lambda_f)$, we can find the solution to problem (9), $\theta^*(\lambda_s, \lambda_f)$, for any $\lambda_s > 0$ by applying the soft-thresholding operator to $\theta^*(0, \lambda_f)$ if $h$ is the $\ell_1$-penalty, or by applying the MCP thresholding operator if $h$ is the MCP penalty. We derive these procedures in the supplement (suppl. C.3 & C.4).
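For reference, the two thresholding operators can be written in their standard textbook forms for a unit-curvature problem, $\min_\theta \frac{1}{2}(\theta - z)^2 + \text{penalty}(\theta)$. This is a sketch and not the derivation in the supplement; the function names are illustrative, and applying the operators to the scaled problem (9) would require rescaling the parameters by the step size.

```python
def soft_threshold(z, lam):
    """Minimizer of (1/2) * (theta - z)^2 + lam * |theta|."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def mcp_threshold(z, lam, gamma):
    """Minimizer of (1/2) * (theta - z)^2 + P_gamma(theta, lam), gamma > 1."""
    a = abs(z)
    if a <= lam:
        return 0.0                                   # small inputs are zeroed
    if a <= gamma * lam:
        # Shrink, then rescale: less bias than plain soft thresholding.
        return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return z                                         # large inputs are untouched


z_values = [0.5, 1.5, 3.0]
soft_out = [soft_threshold(z, 1.0) for z in z_values]
mcp_out = [mcp_threshold(z, 1.0, 2.0) for z in z_values]
```

Soft thresholding subtracts the penalty level from every surviving entry, whereas MCP thresholding leaves entries larger than $\gamma \lambda_s$ unshrunk, which is the behaviour that the reduced-bias argument for the MCP penalty relies on.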
For improved computational performance, we conduct 5-10 proximal gradient iterations for each block update by solving (9). This problem either has a closed-form minimizer or can be solved in $O(R_t)$ time complexity, so blocks can be efficiently updated in constant or linear time. In the following section, we present a method to prioritize the selection of blocks.

Block Selection
We first discuss unguided block selection methods. Cyclic block selection cycles through blocks $\{1, \ldots, T\}$ and updates them one at a time until convergence, while randomized block selection updates a random block at each iteration. BCD algorithms that use unguided block selection are typically slow; guided greedy block selection can greatly reduce computation time [22]. We present a novel greedy block selection heuristic for problem (3).
Greedy selection uses heuristics to find the best block or coordinate to update at each iteration. For example, the Gauss-Southwell steepest direction (GS-s) rule picks the steepest coordinate as the best coordinate to update. For smooth functions, this corresponds to the coordinate with the largest gradient magnitude. For composite functions, the steepest direction is computed with respect to the subgradients of the regularizers [17]. For our composite objective in problem (3), define the direction vector $d \in \mathbb{R}^{R}$ elementwise by:
$$ d_j = \min_{s \in \partial_j (h + g)(w)} \left| \nabla_j f(w) + s \right|. \tag{12} $$
The GS-s rule selects the entry of $d$ with the largest magnitude as the best coordinate to update. To find the best block to update, we modify the GS-s rule following Nutini et al. (2017) [23] and select the block whose direction vector has the largest magnitude. Let $[T]$ represent the set of all blocks and $d_t \in \mathbb{R}^{R_t}$ represent the elements of $d$ associated with block $t$. Select the best block to update, $t^*$, via:
$$ t^* = \arg\max_{t \in [T]} \; \| d_t \|. \tag{13} $$
Problems (12) and (13) form our greedy block selection heuristic. Our heuristic is only useful if problem (12) can be efficiently solved. Karimireddy et al. (2019) [17] derive a closed-form minimizer for this problem when $h$ is the $\ell_1$-penalty and $g = 0$. Our algorithm is novel in that we derive closed-form minimizers to find $d$ when fusion penalty $g$ is introduced, and when $h$ is the MCP penalty. This requires computing the subgradients of a modified fused LASSO problem [27] and the MCP penalty function; the derivation is lengthy and is presented for all penalties in the supplement (suppl. D).
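For the simplest case, the $\ell_1$-penalty with no fusion term, the steepest-direction vector has a well-known closed form, and block selection then reduces to comparing block norms. The sketch below covers only that case; the use of the Euclidean norm for the block magnitude is an assumption, and the gradient values and block sizes are made up.

```python
import math


def gs_s_direction(grad, w, lam):
    """Per-coordinate minimum-magnitude subgradient of f(w) + lam * ||w||_1.
    At a nonzero weight the subgradient of lam*|w_j| is lam * sign(w_j);
    at zero it is the interval [-lam, lam]."""
    d = []
    for g, wj in zip(grad, w):
        if wj > 0:
            d.append(abs(g + lam))
        elif wj < 0:
            d.append(abs(g - lam))
        else:
            d.append(max(abs(g) - lam, 0.0))
    return d


def best_block(d, block_sizes):
    """Index of the block whose slice of d has the largest Euclidean norm
    (norm choice assumed for illustration)."""
    best, best_norm, start = None, -1.0, 0
    for t, size in enumerate(block_sizes):
        norm = math.sqrt(sum(v * v for v in d[start:start + size]))
        if norm > best_norm:
            best, best_norm = t, norm
        start += size
    return best


# Two blocks (trees) with 2 and 3 leaves, all weights currently zero.
grad = [-2.0, -24.0, -0.5, -8.0, -20.0]
d = gs_s_direction(grad, [0.0] * 5, lam=10.0)
chosen = best_block(d, block_sizes=[2, 3])
```

Coordinates whose gradient magnitude is below the penalty level contribute zero, so blocks that cannot improve the objective are never selected.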

Discussion.
Our greedy block selection heuristic drastically reduces the number of BCD iterations, as we demonstrate below. We fit a tree ensemble of 250 trees (blocks) with 7500 leaves (rules) on the Stock Price Prediction dataset [28], which contains 1000 rows. We select a sparse rule ensemble from (3) for various choices of $h$ and $g$, with $\lambda_s = 1$, $\lambda_f = 0.5$, and $\gamma = 1.1$, and compare the progress of our algorithm using cyclic block selection versus greedy block selection.
From figure 7 we observe that greedy BCD requires 2 orders of magnitude fewer iterations compared to cyclic BCD. Greedy BCD iterations are costlier than cyclic BCD iterations, since finding the steepest direction vector at each iteration requires computing the full gradient. However, greedy BCD drastically reduces computation time. Here, greedy BCD takes 8.5 seconds for (3) with the MCP and fusion penalty, while cyclic BCD takes 702 seconds. The timing results for the other configurations are shown in figure 7.
The block selection problems have closed-form minimizers and the block update problems either have closed-form minimizers or can be solved in linear $O(R_t)$ worst-case time complexity. We include step 9 as an optional step where we conduct cyclic block coordinate descent sweeps to ensure that our algorithm converges. In practice, we observe that usually only a single pass over the blocks is needed to verify convergence.

Computation Time Experiment
Here we compare the computation time of our greedy block coordinate descent algorithm (GBCD) against existing off-the-shelf optimization solvers. Since the MCP and fusion penalties are novel to our framework, we use GBCD to solve problem (3) with $h$ set as the $\ell_1$-penalty and $g = 0$. This optimization problem is the same as the one in RuleFit, so we can directly compare our GBCD algorithm against existing LASSO solvers.
4.4.1 Medium-Sized Problems. We build a random forest of 250 trees grown to depths 3, 5, and 7 and use GBCD to sparsify the ensemble with $\lambda_s = 0.1$. Under this configuration, GBCD typically selects 20% of the rules. This represents a realistic use case for Fire; users compute the regularization path up to some sparsity level and select the best model. With $\lambda_s = 0.1$, we show the computation time required for a single solve near the middle of the path. We compare GBCD against the Python implementation of RuleFit [20], which uses the LASSO coordinate descent solver in Scikit-learn [24]. Also, we compute the full regularization path for $\lambda_s \in [0.01, 1000]$ using our GBCD algorithm with warm start continuation and the LASSO coordinate path function in Scikit-learn, and compare the timing results. Finally, we compare GBCD for a single solve against Node Harvest implemented using CVXPY [4]. Node Harvest solves a different optimization problem than GBCD, but we include this algorithm to compare GBCD against other optimization-based rule extraction methods. We conduct this timing experiment on a personal laptop with a 2.80 GHz Intel processor.
Table 1 shows the results of our experiment across various problem sizes. We see that GBCD is much faster than Scikit-learn RuleFit (SKLRF) and Node Harvest, up to 40× faster on high-dimensional problems. In addition, GBCD with warm start continuation computes the entire regularization path around 10× faster than SKLRF. We think that a main reason behind GBCD outperforming the SKLRF LASSO solver is that we exploit the block structure of the problem. The leaf nodes (coordinates) in a tree ensemble are naturally grouped into trees (blocks). As tree depth increases, the number of coordinates grows exponentially, but the number of blocks remains the same. GBCD updates blocks instead of coordinates and leverages greedy block selection heuristics, while SKLRF relies on cyclic coordinate descent. As a result, GBCD computes solutions much faster than SKLRF. Computation times of GBCD on problem (3) with the MCP and fusion penalties are shown in the supplement (suppl. E).

Large Problems.
As an aside, we also compare the computation time of GBCD against SKLRF for much larger problems. We modify the experimental setup in the section above to extract rules from depth 20 random forests. The corresponding optimization problems contain hundreds of thousands to millions of decision variables. Table 2 shows the results of this timing experiment; GBCD remains much faster than SKLRF. For the largest problem instance (10955 row dataset, >1 million decision rules), SKLRF fails to reach a solution after a day of computation. GBCD, on the other hand, reaches a good solution in hours. Our specialized GBCD algorithm allows Fire to extract decision rules from problem instances beyond the capabilities of existing off-the-shelf solvers.

PERFORMANCE EXPERIMENTS
In this section, we compare the performance of Fire against competing state-of-the-art algorithms for building rule ensembles. We evaluate Fire against RuleFit in greater detail to better understand the effects of the MCP and fused LASSO penalties on rule extraction.

Fire v. Competing Methods
To evaluate the performance of Fire, we design an experiment to recreate how rule ensembles are used in practice. Rule ensembles are typically used in situations where model interpretability and transparency are important. In these situations, for a rule ensemble to be useful, the set of extracted rules must be human readable. As such, we restrict our extracted rule set to contain fewer than 15 rules, with a maximum interaction depth of 3. We use Fire and our competing methods to extract rule ensembles under these parameters and compare the test performances of the selected models.
We repeat this procedure on 25 datasets from the OpenML repository [28] using 5-fold cross-validation; the full list of datasets with metadata can be found in the supplement. We first fit a random forest of 500 depth 3 trees and initialize Fire with the MCP penalty and fusion penalty. Since we are interested in selecting very sparse subsets of decision rules, we set the concavity parameter γ = 1.1, close to 1⁺, as discussed in §3.3. We are only interested in the performance of the selected sparse ensemble for this experiment, so we set the fusion parameter λ_f = 0.1, a low constant value. We use GBCD with warm start continuation to compute the entire regularization path for λ_s under these Fire configurations. We then select the value of λ_s that produces the best model, evaluated on a validation set, among those with fewer than 15 decision rules, and record the test performance of the selected model.
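The selection step at the end of this procedure can be sketched as follows. This is a minimal illustration, not the authors' code: the dictionary layout of a path entry, the helper names, and the toy numbers are all assumptions made for the example.

```python
def count_rules(weights, tol=0.0):
    """Number of selected rules, i.e. weights with magnitude above `tol`."""
    return sum(1 for w in weights if abs(w) > tol)


def select_under_budget(path, max_rules=15):
    """Pick the model with the lowest validation error among those with
    fewer than `max_rules` rules.

    `path` is a list of dicts with keys 'lam', 'weights', and 'val_error'
    (a hypothetical layout for one point on the regularization path).
    Returns None when no model satisfies the budget.
    """
    feasible = [m for m in path if count_rules(m["weights"]) < max_rules]
    if not feasible:
        return None
    return min(feasible, key=lambda m: m["val_error"])


# Toy regularization path: smaller penalties keep more rules.
path = [
    {"lam": 1.0, "weights": [0.0] * 18 + [0.7, -0.2], "val_error": 0.30},
    {"lam": 0.5, "weights": [0.0] * 8 + [0.5] * 12, "val_error": 0.22},
    {"lam": 0.1, "weights": [0.3] * 20, "val_error": 0.18},
]
best = select_under_budget(path)
print(best["lam"], count_rules(best["weights"]))
```

In this toy path the densest model has the lowest validation error but uses 20 rules, so the 12-rule model is selected instead.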
We compare the performance of the model above against the following competing algorithms: RuleFit, GLRM with and without debiasing, and SIRUS. For RuleFit, we extract decision rules from the same tree ensemble as Fire. We tune the LASSO parameter for this algorithm on the validation set to select the best rule ensemble with fewer than 15 rules. SIRUS builds stabilized tree ensembles and GLRM builds decision rules using column generation. For these state-of-the-art competing algorithms, we again tune their sparsity hyperparameters on a validation set to find the best-performing rule ensemble with fewer than 15 rules. We record the test performances of the competing methods and compare them against Fire.
Figure 8 presents the results of our experiment. The vertical axes show the percent decrease in test error between Fire and our competing methods; large positive values indicate that Fire performs better than our competing algorithms. The distribution in each boxplot is obtained across all datasets and folds in our experiment. We observe that the interquartile ranges (IQRs) of all of the boxplots lie above zero. This indicates that Fire consistently performs better than our competing algorithms, with median percent decreases in test error of 42% compared to GLRM without debiasing, 24% compared to GLRM with debiasing, 18% compared to SIRUS, and 46% compared to RuleFit. These results strongly suggest that Fire is a competitive algorithm for extracting sparse decision rule sets compared to state-of-the-art methods.
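For concreteness, one way to compute the percent decrease in test error is sketched below. The text does not spell out the formula, so taking the competitor's error as the baseline is an assumption, and the error values are made up purely for illustration.

```python
from statistics import median


def percent_decrease(err_fire, err_other):
    """Relative reduction in test error, in percent.

    Assumption: the competing method's error is the baseline, so positive
    values mean Fire has the lower test error.
    """
    return 100.0 * (err_other - err_fire) / err_other


# Hypothetical test errors, one entry per (dataset, fold) pair.
fire_errors = [0.10, 0.20, 0.30]
other_errors = [0.20, 0.25, 0.30]

decreases = [percent_decrease(f, o) for f, o in zip(fire_errors, other_errors)]
print(median(decreases))
```

Each boxplot then summarizes the distribution of such values over all datasets and folds for one competing method.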

Notably, the optional debiasing step in GLRM, where the rules are re-weighted after generation, greatly improves the performance of the algorithm. The improvement of GLRM over RuleFit found in Wei et al. (2019) [29] may be partially due to this step, since the LASSO selection in RuleFit introduces bias. We are encouraged to observe that Fire can outperform both GLRM with debiasing and SIRUS, two recently developed state-of-the-art algorithms for building rule ensembles.

Further Analysis of Fire v. RuleFit
Our goal here is to understand how the MCP and fusion penalties in Fire affect extracting decision rules from tree ensembles. To highlight the effect of our new penalties, we design this experiment to compare Fire with MCP and fusion against RuleFit, which extracts rules using only the LASSO penalty, across various problem sizes.
On the same datasets and folds mentioned in the section above, we fit random forests of 500 trees of depths 3, 5, and 7. We initialize two versions of Fire. For the first version (MCP only), we set γ = 1.1 and λ_f = 0 and use GBCD with warm start continuation to compute the entire regularization path for λ_s. This version of Fire only uses the MCP penalty, and since γ is close to 1⁺ the penalty performs aggressive selection. For the second version (MCP w/ fusion), we set γ = 1.1 and set the fusion hyperparameter λ_f = 0.5λ_s. Again we use warm start continuation to compute the entire regularization path for λ_s. This version of Fire applies a small fusion penalty which works in conjunction with the aggressive selection encouraged by the γ = 1.1 MCP penalty. For both versions of Fire, we record the test performance of the extracted ensemble across various sparsity levels. We compare the test performances of Fire against RuleFit, computed along the regularization path for the sparsity parameter.
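To illustrate why a concavity parameter close to 1⁺ selects aggressively, the sketch below implements the standard univariate MCP penalty and its thresholding operator next to LASSO soft-thresholding. The function names are ours, the concavity parameter is written `gamma`, and the formulas follow the usual textbook form of MCP for a unit-curvature quadratic; this is an illustration of the penalty, not the Fire solver.

```python
def mcp_penalty(w, lam, gamma):
    """Standard MCP penalty: lam*|w| - w^2/(2*gamma) up to |w| = gamma*lam,
    then constant at gamma*lam^2/2."""
    a = abs(w)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def soft_threshold(z, lam):
    """LASSO thresholding: minimizer of 0.5*(w - z)**2 + lam*|w|."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def mcp_threshold(z, lam, gamma):
    """Minimizer of 0.5*(w - z)**2 + mcp_penalty(w, lam, gamma), gamma > 1."""
    if gamma <= 1.0:
        raise ValueError("gamma must be greater than 1")
    if abs(z) > gamma * lam:
        return z  # large inputs pass through without shrinkage
    return soft_threshold(z, lam) / (1.0 - 1.0 / gamma)


# Compare the two operators at lam = 1 with gamma = 1.1.
for z in (0.5, 1.05, 1.5, 3.0):
    print(z, soft_threshold(z, 1.0), mcp_threshold(z, 1.0, 1.1))
```

With `gamma = 1.1`, any input larger than 1.1 in magnitude is left unshrunk, so the operator behaves almost like hard thresholding, whereas soft-thresholding shrinks every surviving coefficient by `lam`. As `gamma` grows large, `mcp_threshold` approaches `soft_threshold`.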
Figure 9 shows the results of this experiment. The plots compare the best model selected by Fire against the best model selected by RuleFit given a budget of rules, shown on the horizontal axis. The vertical axes show the percent decrease in test error between Fire and RuleFit; values above 0% indicate that Fire performs better than RuleFit. The distributions for each boxplot are again obtained across all folds and datasets in the experiment. We observe that Fire with the MCP penalty performs substantially better than RuleFit at selecting sparse models (LHS of figure 9). This is expected due to the behavior of the MCP penalty compared to the LASSO penalty when γ → 1⁺. We also note that Fire with only the MCP penalty performs slightly better than Fire with both the MCP and fusion penalties in this regime. This is likely because the additional regularization imparted by the fusion penalty causes the model to underfit when performing very sparse selection. Consequently, we suggest keeping λ_f small when extracting sparse rule sets.
When the sparsity regularization penalty is reduced (i.e., the rule budget is increased), we observe that Fire with only the MCP penalty begins to overfit. This effect is especially pronounced when extracting decision rules from the depth 7 tree ensemble (bottom panel of figure 9), due to the inherent complexity of the deeper rules. Fire with both the MCP and fusion penalties avoids this issue since the fusion penalty adds regularization. We see in figure 9 that Fire with both the MCP and fusion penalties outperforms RuleFit across all rule budgets; all of the green boxplots in the figure lie above 0%. By combining the aggressive selection of the MCP penalty with the regularization added by fusion, Fire outperforms RuleFit at extracting rule ensembles across all model sparsities.

INTERPRETABILITY CASE STUDY
We conclude with a case study to showcase the improved interpretability of Fire in a real-world example. We follow the work by Ibrahim et al. (2021) [15] and use the Census Planning Database to predict census tract response rates. The US Census Bureau wants to understand what features influence response rate so that low-response tracts can be targeted; previous efforts have found that tree ensembles perform well but are difficult to interpret [5]. We use Fire to extract an interpretable set of decision rules.
We first build a random forest of 500 depth 3 trees. The full model achieves a test MSE of 8.64%. We use Fire with the MCP and fusion penalties to extract around 10 decision rules with γ = 1.1 and λ_f = 0.5λ_s (low) or λ_f = 2λ_s (high). The 10 rules selected by Fire with low λ_f perform the same as the full ensemble (8.64% MSE) and the 10 rules selected when λ_f is high perform slightly worse (9.44% MSE). In contrast, when we select 11 rules with RuleFit, the model performs much worse (12.42% MSE).
The rule ensembles extracted by Fire are substantially more interpretable than the RuleFit ensembles. The bottom plot in figure 10 shows the 11 rules selected by RuleFit. These rules are selected across different trees and are not grouped in any meaningful manner; it is difficult to interpret this model from the figure alone. In comparison, the middle plot contains the 10-rule ensemble selected by Fire with low λ_f. This rule ensemble contains a partial decision tree which reveals that the feature NH-White-alone is important, since 4 rules share the same split on this feature. Increasing λ_f yields the most interpretable rule ensemble, shown in the top plot, which consists of a single decision tree and 2 additional rules.
As an attempt to quantify interpretability, we can count the number of antecedents (the colored nodes in figure 10) that a user must analyze to interpret a rule ensemble. The RuleFit ensemble contains 33 antecedents, while the Fire ensemble with high λ_f contains just 13.
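The antecedent count can be sketched in a few lines of Python. Here a rule is represented by the internal nodes on its root-to-leaf path, and antecedents shared by several rules are counted once. The heap-style node indexing, the helper names, and the two toy layouts (eleven depth 3 rules from eleven different trees; one full depth 3 tree plus two rules from other trees) are assumptions chosen to be consistent with the description above, not the actual selected rules.

```python
def rule_path(tree_id, leaf, depth=3):
    """Internal nodes on the root-to-leaf path of one rule.

    Assumes heap indexing within a tree: the root is node 1 and the
    children of node i are 2*i and 2*i + 1, so depth 3 leaves are 8..15.
    """
    path = []
    node = leaf
    for _ in range(depth):
        node //= 2  # move to the parent
        path.append((tree_id, node))
    return path


def count_antecedents(rules):
    """Number of distinct internal nodes across all selected rules."""
    return len({node for rule in rules for node in rule})


# Eleven rules, each taken from a different tree: nothing is shared.
scattered = [rule_path(tree_id, 8) for tree_id in range(11)]

# All eight leaves of one tree, plus one rule from each of two other trees.
fused = [rule_path(0, leaf) for leaf in range(8, 16)]
fused += [rule_path(1, 8), rule_path(2, 8)]

print(count_antecedents(scattered), count_antecedents(fused))
```

Under these assumed layouts the scattered ensemble needs 3 antecedents per rule, while the eight rules of the fused tree share its 7 internal nodes, which is where the reduction comes from.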

CONCLUSION
Fire is a novel optimization-based framework to extract decision rules from tree ensembles. The framework selects sparse representative subsets of rules from an ensemble and allows for the flexibility to encourage rule fusion during the selection procedure. This improves the interpretability and compression of the extracted model since many of the selected rules share common antecedents. Fire uses a non-convex MCP penalty to aggressively select rules in the presence of correlations and a fused LASSO penalty to encourage rule fusion. To solve the large non-convex optimization problems in Fire, we develop a specialized GBCD solver that computes high-quality solutions efficiently. Our solver exploits the blocking structure of the problem and leverages greedy block selection heuristics to reduce computation time. As a result, our solver scales well and allows for computation beyond the capabilities of off-the-shelf methods. Our experiments show that Fire performs better than state-of-the-art algorithms at building human-readable rule sets and that Fire outperforms RuleFit at extracting rule ensembles across all sparsity levels. Altogether, these features and findings make Fire a fast and effective framework for extracting interpretable rule ensembles.

Figure 2: This depth 2 decision tree yields 4 decision rules. The antecedents of each rule are obtained by traversing the tree from root to leaf.

Figure 3: Bagging ensemble of 500 trees fit on the Elevators dataset [28]. The number of decision rules in the ensemble scales exponentially with tree depth. Shallower ensembles contain highly correlated rules.

Figure 5: Extracting 4 rules from a decision tree fit on the California Housing Price dataset. In the right plot, the rules are grouped together and are more interpretable.

Figure 6: Effect of γ on the MCP penalty. Varying γ controls the trade-off between shrinkage and selection.

Figure 8: Fire outperforms SOTA competing algorithms at selecting sparse human-readable rule ensembles.

Figure 9: Fire v. RuleFit across various problem sizes. The MCP penalty in Fire performs better at selecting sparse sets of rules and the fusion penalty helps prevent overfitting.

Figure 10: US Census case study: Fire extracts more interpretable, better performing ensembles compared to RuleFit.

Table 1: Timing results in seconds. The fastest methods are highlighted in green and red cells indicate that the method did not finish within 30 minutes.

Table 2: Computation time of GBCD v. SKLRF for extracting rules from depth 20 tree ensembles.