Verifiable Learning for Robust Tree Ensembles

Verifying the robustness of machine learning models against evasion attacks at test time is an important research problem. Unfortunately, prior work established that this problem is NP-hard for decision tree ensembles, hence bound to be intractable for specific inputs. In this paper, we identify a restricted class of decision tree ensembles, called large-spread ensembles, which admit a security verification algorithm running in polynomial time. We then propose a new approach called verifiable learning, which advocates the training of such restricted model classes which are amenable for efficient verification. We show the benefits of this idea by designing a new training algorithm that automatically learns a large-spread decision tree ensemble from labelled data, thus enabling its security verification in polynomial time. Experimental results on public datasets confirm that large-spread ensembles trained using our algorithm can be verified in a matter of seconds, using standard commercial hardware. Moreover, large-spread ensembles are more robust than traditional ensembles against evasion attacks, at the cost of an acceptable loss of accuracy in the non-adversarial setting.


INTRODUCTION
Machine learning (ML) is now phenomenally popular and has found an incredible number of applications. The more ML becomes pervasive and applied to critical tasks, however, the more important it becomes to verify whether automatically trained ML models satisfy desirable properties. This is particularly relevant in the security setting, where models trained using traditional learning algorithms have proved vulnerable to evasion attacks, i.e., malicious perturbations of inputs designed to force mispredictions at test time [3,15,34].
Unfortunately, verifying the security of ML models against evasion attacks is a computationally hard problem, because verification must account for all the possible malicious perturbations that the attacker may perform. In this work, we are concerned with the security of decision tree ensembles [5], a well-known class of ML models particularly popular for non-perceptual classification tasks, which has already received significant attention from the research community. Kantchelian et al. [23] first proved that the problem of verifying security against evasion attacks for decision tree ensembles is NP-complete when malicious perturbations are modeled by an arbitrary L_p-norm. In more recent work, Wang et al. [41] further investigated the problem and observed that the existing negative result largely generalizes to the apparently simpler case of decision stump ensembles, i.e., ensembles including just trees of depth one. They thus proposed incomplete verification approaches for decision tree and decision stump ensembles, which can formally prove the absence of evasion attacks, but may incorrectly report evasion attacks also for secure inputs. This conservative approach is efficient and provides formal security proofs; however, it is approximated and can draw a pessimistic picture of the actual security guarantees provided by the ML model. Complete verification approaches against specific attackers, e.g., modeled in terms of the L_∞-norm, have also been proposed [12,31]. They proved to be reasonably efficient in practice for many cases; however, they have to deal with the NP-hardness of security verification, hence they are inherently bound to fail in the general setting, especially when the size of the decision tree ensembles increases. As a matter of fact, prior experimental evaluations show that security verification does not always terminate within reasonable time and memory bounds, leading to approximated estimates of the actual robustness of the decision tree ensemble against
evasion attacks.

Table 1: Summary of notation. In the definitions, we assume that the decision tree and tree ensemble we are predicating upon are clear from the context.

We make the following contributions: (1) We identify the restricted class of large-spread decision tree ensembles, which admits efficient security verification against attacks modeled in terms of an arbitrary L_p-norm, thus moving away from existing NP-hardness results (Section 3). (2) We propose a new training algorithm that automatically learns a large-spread decision tree ensemble amenable for efficient security verification. In short, our algorithm first trains a traditional decision tree ensemble and then prunes it to satisfy the proposed large-spread condition (Section 4). (3) We implement our training algorithm and experimentally verify its effectiveness on four public datasets. Our large-spread ensembles are more robust than traditional ensembles against evasion attacks and admit a much more efficient security verification, at the cost of just an acceptable loss of accuracy in the non-adversarial setting (Section 5).

BACKGROUND
In this section we review a few notions required to appreciate the rest of the paper. To improve readability, we summarize the main notation used in this paper in Table 1.

Supervised Learning
Let X ⊆ R^d be a d-dimensional vector space of real-valued features. An instance x ∈ X is a d-dimensional feature vector ⟨x_1, x_2, ..., x_d⟩ representing an object in the vector space X. Each instance is assigned a class label y ∈ Y by an unknown target function g: X → Y. As common in the literature, we focus on binary classification, i.e., we assume Y = {−1, +1}. A classifier h: X → Y is trained on a training set D_train and evaluated on a test set D_test, both drawn from the same data distribution. For example, the standard accuracy measure a(h, D_test) counts the percentage of test instances where the classifier h returns a correct prediction.

Decision Trees and Tree Ensembles
In this paper, we focus on traditional binary decision trees for classification [5]. Decision trees can be inductively defined as follows: a decision tree t is either a leaf λ(ŷ) for some label ŷ ∈ Y or an internal node σ(f, v, t_l, t_r), where f ∈ {1, ..., d} identifies a feature, v ∈ R is a threshold for the feature, and t_l, t_r are decision trees (left and right child). We just write σ(f, v) to represent an internal node when t_l, t_r are unimportant. Decision trees are learned by initially putting all the training set into the root of the tree and by recursively splitting leaves (initially: the root), identifying the threshold therein leading to the best split of the training data, e.g., the one with the highest information gain, thus transforming the split leaf into a new internal node. At test time, the instance x traverses the tree t until it reaches a leaf λ(ŷ), which returns the prediction ŷ, denoted by t(x) = ŷ. Specifically, for each traversed tree node σ(f, v, t_l, t_r), x falls into the left sub-tree t_l if x_f ≤ v, and into the right sub-tree t_r otherwise. Fig. 1 represents an example decision tree of depth 2, which assigns label +1 to the instance ⟨12, 7⟩ and label −1 to the instance ⟨8, 6⟩.
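As an illustration of the inductive definition and the prediction procedure, here is a minimal Python sketch. The encoding and the depth-2 example tree are our own (the actual tree of Fig. 1 is not reproduced here); its thresholds are chosen only so that it matches the two predictions quoted in the text. Features are 0-indexed in code.

```python
# Sketch (ours, not the paper's code) of a binary decision tree:
# a tree is either a leaf carrying a label, or an internal node
# sigma(f, v, left, right) routing on feature f with threshold v.

class Leaf:
    def __init__(self, label):
        self.label = label              # y in {-1, +1}

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature = feature          # index f in {0, ..., d-1}
        self.threshold = threshold      # v in R
        self.left = left                # taken when x[f] <= v
        self.right = right              # taken when x[f] >  v

def predict(tree, x):
    """Route instance x to a leaf and return its label."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.label

# Invented depth-2 tree reproducing the two example predictions:
t = Node(0, 10,
         Node(1, 5, Leaf(-1), Leaf(-1)),
         Node(1, 6, Leaf(-1), Leaf(+1)))
```

With this tree, `predict(t, (12, 7))` returns +1 and `predict(t, (8, 6))` returns −1, mirroring the Fig. 1 discussion.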
To improve their performance, decision trees are often combined into an ensemble T = {t_1, ..., t_m}, which aggregates individual tree predictions, e.g., by performing majority voting. We write T(x) for the prediction of T on x and we let n stand for the number of nodes of the ensemble T when such ensemble is clear from the context. For simplicity, we focus on majority voting to aggregate individual tree predictions, assuming that the number of trees m is odd to avoid ties. While ensembles trained using existing frameworks (like sklearn) may use more sophisticated aggregation techniques, our focus on large-spread ensembles trained using a custom algorithm gives us freedom in the choice of the aggregation strategy, and majority voting already proves effective in practice. Notable ensemble methods include Random Forest [4] and Gradient Boosting [26].

Robustness
Classifiers deployed in adversarial settings may be susceptible to evasion attacks, i.e., malicious perturbations of test instances crafted to force prediction errors [3,34]. To capture this problem, the robustness measure has been introduced [29]. Below, we follow the presentation in [31].
An attacker A: X → 2^X is modeled as a function from instances to sets of instances, i.e., A(x) represents the set of all the adversarial manipulations of the instance x, corresponding to the possible evasion attack attempts against x. The stability property requires that the classifier does not change its original prediction on some input for all its possible adversarial manipulations. Stability is certainly a desirable property for classifiers deployed in adversarial settings; however, a classifier that always predicts the same class for all the instances trivially satisfies stability for all the attackers, but it is useless in practice because it lacks any predictive power. Robustness improves upon stability by requiring the classifier to also perform correct predictions. Based on the definition of robustness, for a given attacker A, we can define the robustness measure r_A(h, D_test) by computing the percentage of test instances where the classifier h is robust.
In the following, we focus on attackers represented in terms of an arbitrary L_p-norm, i.e., the attacker's capabilities are defined by some p ∈ N ∪ {0, ∞} and the maximum perturbation k. For fixed p and k, we assume the attacker A_{p,k}(x) = {z ∈ X | ||z − x||_p ≤ k}.
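The attacker model above amounts to a membership test on the perturbation norm. A small sketch (our own helper names, not from the paper), covering the L_0, L_p and L_∞ cases:

```python
# Sketch of the attacker A_{p,k}: z is a valid adversarial manipulation
# of x iff ||z - x||_p <= k.

def lp_norm(delta, p):
    if p == 0:                          # L_0: number of perturbed features
        return sum(1 for d in delta if d != 0)
    if p == float("inf"):               # L_inf: largest per-feature change
        return max((abs(d) for d in delta), default=0.0)
    return sum(abs(d) ** p for d in delta) ** (1.0 / p)

def is_valid_manipulation(x, z, p, k):
    return lp_norm([zi - xi for zi, xi in zip(z, x)], p) <= k
```

For example, under the L_∞-norm with budget k = 2, moving a feature from 11 to 13 is a valid manipulation, while moving it to 13.5 is not.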

EFFICIENT ROBUSTNESS VERIFICATION
We first review results regarding the robustness verification problem for single decision trees (Section 3.1). We then generalize the result to tree ensembles by introducing large-spread decision tree ensembles (Section 3.2), which enable robustness verification in O(n + m log m) time, where n and m are the number of nodes and trees in the ensemble. This is a major improvement over traditional decision tree ensembles, for which robustness verification is NP-complete [23].

Decision Trees
The robustness verification problem can be solved in O(nd) time for a decision tree with n nodes when the attacker is expressed in terms of an arbitrary L_p-norm [41]. This generalizes a previous result for the L_∞-norm [12]. The key idea of the algorithm is that stability on the instance x can be verified by identifying all the leaves that are reachable as the result of an evasion attack attempt z ∈ A_{p,k}(x); hence, stability holds iff all such leaves predict the same class. This set of leaves can be computed by means of a simple tree traversal. Correspondingly, assuming that x has label y, a decision tree t is robust on x iff t(x) = y and there does not exist any reachable leaf assigning to x a label different from y. The algorithm operates in two steps: (1) tree annotation and (2) robustness verification.

3.1.1 Step 1 - Tree Annotation. The first step of the algorithm is a pre-processing operation, performed only once, where each node of the decision tree is annotated with auxiliary information for the second step. The annotations are hyper-rectangles that symbolically represent the set of instances which may traverse the nodes upon prediction. The algorithm first annotates the root with the d-dimensional hyper-rectangle (−∞, +∞]^d, meaning that every instance will traverse the root. Children are then annotated by means of a recursive tree traversal: concretely, if the father node σ(f, v, t_1, t_2) is annotated with (l_1, u_1] × ... × (l_d, u_d], then the annotations of the roots of t_1 and t_2 coincide with it, except for their f-th component, which becomes (l_f, min{u_f, v}] for t_1 and (max{l_f, v}, u_f] for t_2. The annotation process terminates when all the nodes have been annotated. Note that the complexity of this annotation step is O(nd), because all n nodes are traversed and annotated with a hyper-rectangle of size d.
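The annotation step can be sketched as a recursive traversal that narrows one interval per node. The tuple encoding of trees below is ours; to keep the sketch close to the later optimization, a single hyper-rectangle is mutated and restored rather than copied per node:

```python
# Sketch of Step 1 (tree annotation): each leaf ends up paired with the
# hyper-rectangle (l_1, u_1] x ... x (l_d, u_d] of instances reaching it.
# Internal nodes are (f, v, left, right) tuples; leaves are ('leaf', y).

import math

def annotate(tree, d, hyper=None, out=None):
    if hyper is None:
        hyper = [(-math.inf, math.inf)] * d   # the root is reached by every x
    if out is None:
        out = []
    if tree[0] == 'leaf':
        out.append((tree[1], list(hyper)))    # (label, annotation) per leaf
        return out
    f, v, left, right = tree
    l, u = hyper[f]
    hyper[f] = (l, min(u, v))                 # left child:  x_f <= v
    annotate(left, d, hyper, out)
    hyper[f] = (max(l, v), u)                 # right child: x_f >  v
    annotate(right, d, hyper, out)
    hyper[f] = (l, u)                         # restore before backtracking
    return out

# Example (invented tree): split on feature 0 at 10, then feature 1 at 6.
t = (0, 10, ('leaf', -1), (1, 6, ('leaf', -1), ('leaf', 1)))
leaves = annotate(t, 2)
```

On this example, the +1 leaf is annotated with (10, +∞] × (6, +∞], matching the traversal rule.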

3.1.2 Step 2 - Robustness Verification. Given an annotated decision tree and an instance x, it is possible to identify the set of leaves which may be reached by x upon prediction in the presence of adversarial manipulations.
Let H = (l_1, u_1] × ... × (l_d, u_d] be the hyper-rectangle annotating a leaf λ(ŷ'). The minimal perturbation δ required to push x into H can be computed component-wise: δ_i = 0 if x_i ∈ (l_i, u_i], δ_i = u_i − x_i if x_i > u_i, and δ_i = l_i − x_i if x_i ≤ l_i (up to the arbitrarily small quantity needed to cross the open bound). Thus, given the instance x with label y, it is possible to compute the set L (Eq. 4) of the leaves λ(ŷ') with ŷ' ≠ y whose minimal perturbation δ satisfies ||δ||_p ≤ k. In other words, during the visit we find the leaves with a wrong class where x might fall as the result of adversarial manipulations by the attacker A_{p,k}, and we compute the norms ||δ||_p of the minimal perturbations δ to be applied to x to push it there. Hence, the tree is robust against the attacker A_{p,k} iff L = ∅. This computation can be performed in O(nd) time, since we have O(n) leaves and each vector δ with its norm can be computed in Θ(d) time.
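The two-step check can be sketched as follows. The leaf representation (label, hyper-rectangle) is assumed to come from the annotation step; function names are ours, and, following the text, the open lower bound is treated as reachable at distance l_i − x_i, ignoring the arbitrarily small extra amount:

```python
# Sketch of Step 2: compute the minimal perturbation pushing x into a
# leaf's hyper-rectangle, then declare the tree robust on (x, y) iff no
# wrong-label leaf is reachable within budget k.

def lp(delta, p):
    if p == float("inf"):
        return max((abs(d) for d in delta), default=0.0)
    return sum(abs(d) ** p for d in delta) ** (1.0 / p)

def min_perturbation(x, hyper):
    delta = []
    for xi, (l, u) in zip(x, hyper):
        if xi <= l:
            delta.append(l - xi)        # push up to the (open) lower bound
        elif xi > u:
            delta.append(u - xi)        # push down to the (closed) upper bound
        else:
            delta.append(0.0)           # already inside the interval
    return delta

def is_robust(x, y, leaves, p, k):
    """leaves: list of (label, hyper-rectangle) pairs from the annotation."""
    inside = [lab for lab, h in leaves
              if all(l < xi <= u for xi, (l, u) in zip(x, h))]
    if inside != [y]:                   # unperturbed prediction must be correct
        return False
    return all(lp(min_perturbation(x, h), p) > k
               for lab, h in leaves if lab != y)
```

On the annotated example tree from before, the instance ⟨12, 7⟩ with label +1 is robust under L_∞ for k = 0.5 but not for k = 1, since a wrong leaf sits at distance 1.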

Generalization to Tree Ensembles
The robustness verification problem is NP-complete for tree ensembles when the attacker is expressed in terms of an arbitrary L_p-norm [23]. Of course, this negative result predicates over arbitrary tree ensembles, but it does not exclude the possibility that restricted classes of ensembles may admit a more efficient robustness verification algorithm. In this section we introduce the class of large-spread tree ensembles, which rule out the key source of complexity from the robustness verification problem and allow robustness verification in O(n + m log m) time.
Figure 2: Example of tree ensemble with three decision trees.

Key Intuitions. The proposed large-spread condition allows one to verify the robustness guarantees of the individual decision trees in the ensemble and compose their results to draw conclusions about the robustness of the whole ensemble.
To understand why composing robustness verification results is unfeasible for arbitrary ensembles, consider the ensemble T in Fig. 2 and an instance x with label +1 such that x_1 = 11. Consider the attacker A_{1,2}, who can modify feature 1 by at most ±2; then for every adversarial manipulation z ∈ A_{1,2}(x) we have z_1 ∈ [9, 13]. We observe that the trees t_1 and t_2 are not robust on x, because there exists an adversarial manipulation that forces them to predict the wrong class −1. However, the whole ensemble T is robust on x, because T(x) = +1 and for every adversarial manipulation z ∈ A_{1,2}(x) we have T(z) = +1: either t_1 or t_2 alone is affected by the attack, hence at least two out of the three trees in the ensemble always perform the correct prediction. The example is deliberately simple to show that attacks against two different trees might be incompatible, i.e., an attack working against one tree does not necessarily work against the other tree and vice-versa. This implies that the combination of multiple non-robust trees can lead to the creation of a robust ensemble.
The key intuition enabling our compositional reasoning is that interactions among different trees are only possible when the thresholds therein are close enough to each other. Indeed, in our example we showed that there exists an instance x which can be successfully attacked in both t_1 and t_2, yet no attack succeeds against both trees at the same time. The reason why this happens is that the thresholds in the roots of the trees (10 and 12 respectively) are too close to each other when taking into account the possible adversarial manipulations: an adversarial manipulation can corrupt the original feature value 11 to produce an arbitrary value in the interval [9, 13], which suffices to enable attacks in both t_1 and t_2. However, none of the attacks against t_1 works against t_2 and vice-versa. Conversely, it is not possible to find any instance x which can be attacked in both t_2 and t_3: for every adversarial manipulation z ∈ A_{1,2}(x) we have z_1 ∈ [x_1 − 2, x_1 + 2], and the distance between the thresholds in the two trees (17 − 12 = 5 > 4) is large enough to ensure that the problem of incompatible attacks cannot exist, because feature 1 can be attacked just in one of the two trees. For example, if x_1 = 14, then only t_2 can be attacked, while if x_1 = 16 only t_3 can be attacked; if x_1 = 15, instead, neither t_2 nor t_3 can be attacked.
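The Fig. 2 discussion can be replayed numerically. We model the three trees as stumps on feature 1 with root thresholds 10, 12 and 17; the leaf labels below are our reconstruction, chosen so that x_1 = 11 is classified +1 by every tree, as in the text:

```python
# Numeric sketch of the incompatible-attacks example (leaf labels invented).

def t1(x1): return +1 if x1 > 10 else -1    # root threshold 10
def t2(x1): return +1 if x1 <= 12 else -1   # root threshold 12
def t3(x1): return +1 if x1 <= 17 else -1   # root threshold 17

def ensemble(x1):                           # majority voting over the stumps
    return +1 if t1(x1) + t2(x1) + t3(x1) > 0 else -1

# The attacker moves x_1 = 11 anywhere in [9, 13]; for stumps it suffices
# to probe a few points covering every decision region in that interval.
probes = (9, 10, 11, 12, 13)
attacked_t1 = any(t1(z) == -1 for z in probes)       # z <= 10 flips t1
attacked_t2 = any(t2(z) == -1 for z in probes)       # z > 12 flips t2
ensemble_robust = all(ensemble(z) == +1 for z in probes)
```

Both `attacked_t1` and `attacked_t2` are true, yet `ensemble_robust` also holds: the two attacks pull x_1 in opposite directions, so no single manipulation flips two trees at once.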

3.2.2 Large-Spread Ensembles. We formalize this intuition by defining the p-spread of a tree ensemble T as the minimum distance between the thresholds of the same feature across different trees, according to the L_p-norm. If Λ_p(T) > 2k, where k is the maximum adversarial perturbation, we say that T is large-spread.

Definition 3.1 (Large-Spread Ensemble). Given the ensemble T = {t_1, ..., t_m}, its p-spread Λ_p(T) is the minimum of ||v − v'||_p over all pairs of nodes σ(f, v) ∈ t_i and σ(f, v') ∈ t_j with i ≠ j predicating on the same feature f. We say that T is large-spread for the attacker A_{p,k} iff Λ_p(T) > 2k.
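Since the p-spread compares scalar thresholds, every L_p-norm reduces to the absolute difference, so a single function covers all p. A brute-force sketch (our own encoding: trees as lists of (feature, threshold) pairs):

```python
# Sketch: p-spread of an ensemble = minimum distance between thresholds of
# the same feature appearing in two *different* trees.

import math

def p_spread(trees):
    best = math.inf
    for i in range(len(trees)):
        for j in range(len(trees)):
            if i == j:
                continue
            for f, v in trees[i]:
                for g, w in trees[j]:
                    if f == g:              # same feature, different trees
                        best = min(best, abs(v - w))
    return best

def is_large_spread(trees, k):
    return p_spread(trees) > 2 * k
```

On the Fig. 2 stumps (thresholds 10, 12, 17 on feature 1), the spread is 2, so the ensemble is not large-spread for k = 2; dropping the first stump yields spread 5 > 2k, recovering the condition.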
A large-spread ensemble T allows one to compose attacks working against individual trees to produce an attack against the ensemble as follows. Assuming z_i = x + δ_i is an attack against a tree t_i ∈ T and z_j = x + δ_j is an attack against a different tree t_j ∈ T, the large-spread condition guarantees that δ_i and δ_j target disjoint sets of features, i.e., they are orthogonal (δ_i · δ_j = 0). Indeed, each feature can be corrupted by at most k, but the same feature can be reused in different trees only if the corresponding thresholds are more than 2k away, hence it is impossible for any feature value to traverse more than one threshold as the result of an evasion attack (we formalize and prove this result in Appendix A). The disjointness of attacks implies that z = x + δ_i + δ_j is an attack working against both t_i and t_j (assuming ||δ_i + δ_j||_p ≤ k), because t_i(z) and t_j(z) take the same prediction paths as t_i(z_i) and t_j(z_j) respectively, which are successful attacks against the two trees. Note that this does not hold for arbitrary tree ensembles, like the one in Fig. 2. Indeed, for that ensemble and an instance x such that x_1 = 11, the attack against t_1 subtracts 2 from feature 1 and the attack against t_2 adds 2 to feature 1, hence the sum of the two attacks would leave the instance x unchanged.

3.2.3 Robustness Verification of Large-Spread Ensembles. This compositionality result is powerful, because it allows the efficient robustness verification of large-spread ensembles. The intuition is that, since the ensemble T is large-spread, the minimal perturbations {δ_i}_i enabling attacks against the individual trees {t_i}_i can be summed up together to obtain a perturbation δ enabling an attack against the whole ensemble. More precisely, let T' ⊆ T be the set of trees in T which may suffer from a successful attack; then:
(1) If |T'| < (m−1)/2 + 1, then the number of trees performing a wrong prediction under attack is too low to identify a successful attack against the whole ensemble.
(2) Otherwise, it suffices to check whether the minimal perturbations of the (m−1)/2 + 1 trees in T' which are easiest to attack can be combined into a single perturbation of norm at most k.
Note that the complexity of this algorithm is O(nd + m log m), because we annotate each of the n nodes in the ensemble with a hyper-rectangle of size d and we compute the minimal perturbations along with their norms, as explained in Section 3.1. Moreover, to find the perturbations with the smallest norms, we have to sort the pairs (δ_i, ||δ_i||_p) in non-decreasing order of L_p-norm in O(m log m) time. We now show that the large-spread condition enables a more efficient algorithm, running in O(n + m log m) time.

Optimization. If the minimal perturbations {δ_i}_i are pairwise orthogonal vectors, then the following facts hold:
Fact 1 (p = 0): ||δ_i + δ_j||_0 = ||δ_i||_0 + ||δ_j||_0;
Fact 2 (p = ∞): ||δ_i + δ_j||_∞ = max{||δ_i||_∞, ||δ_j||_∞};
Fact 3 (p ∈ N): ||δ_i + δ_j||_p^p = ||δ_i||_p^p + ||δ_j||_p^p.
Note that the proof of Fact 1 and Fact 2 is immediate, hence we just prove Fact 3 for the L_p-norm, with p ∈ N.
Proof. We show the equivalence for two vectors; the case of more vectors is a simple generalization. By definition of L_p-norm, ||δ_1 + δ_2||_p^p = Σ_{i=1}^d |δ_{1,i} + δ_{2,i}|^p. Since δ_1 and δ_2 are orthogonal, in each term at most one of δ_{1,i} and δ_{2,i} is non-zero, hence all the mixed terms of the binomial expansion of |δ_{1,i} + δ_{2,i}|^p vanish. The latter sum can thus be rewritten as Σ_{i=1}^d |δ_{1,i}|^p + Σ_{i=1}^d |δ_{2,i}|^p = ||δ_1||_p^p + ||δ_2||_p^p. The generalization to an arbitrary number of vectors involves a multinomial theorem instead of a binomial theorem. □

We introduce the following operator to have a suitable way of referring to the result of the three facts above: for norms a, b ≥ 0 of pairwise orthogonal perturbations, let a ⊕_p b = a + b if p = 0, a ⊕_p b = (a^p + b^p)^{1/p} if p ∈ N, and a ⊕_p b = max{a, b} if p = ∞ (Eq. 5). Facts 1, 2, and 3 imply that we do not actually need to explicitly compute an adversarial perturbation if we just want its L_p-norm, which is exactly our case because we just need to check whether such norm does not exceed k. Since any adversarial perturbation against a large-spread ensemble results from the sum of pairwise orthogonal vectors, we can use Eq. 5 to compute the norm directly from the norms of the orthogonal vectors, i.e., the verification algorithm can operate on scalars rather than vectors, thus reducing its complexity by a factor of d.
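The facts above reduce norm composition to scalar arithmetic. A minimal sketch (the function name `combine` is ours), valid only under the stated disjoint-support assumption:

```python
# Sketch of the norm-composition operator: for perturbations with disjoint
# supports, the norm of the sum is computable from the two norms alone.

def combine(a, b, p):
    """Norm of delta_1 + delta_2 from the norms a, b of two perturbations
    acting on disjoint sets of features."""
    if p == 0:
        return a + b                    # counts of perturbed features add up
    if p == float("inf"):
        return max(a, b)                # the larger per-feature change wins
    return (a ** p + b ** p) ** (1.0 / p)

# Cross-check against explicit vectors with disjoint supports:
d1, d2 = [3, 0], [0, 4]
s = [u + v for u, v in zip(d1, d2)]
direct = sum(abs(v) ** 2 for v in s) ** 0.5     # ||d1 + d2||_2 = 5.0
```

Here `combine(3, 4, 2)` equals the directly computed `direct`, illustrating why the verification algorithm can operate on scalars rather than vectors.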
In light of these considerations, we now revisit the tree traversal from Section 3.1 to show that we can compute for each leaf of the tree just a scalar Δ = ||δ||_p, where δ = dist(x, H) and H is the hyper-rectangle which would normally annotate the leaf. Similarly to the linear-time tree visit described in [12] for the L_∞-norm, the idea is to maintain one global hyper-rectangle during the visit instead of one hyper-rectangle per node. Ultimately, this reduces the time complexity from O(nd) to the optimal O(n), since the hyper-rectangle is not copied from parent to children. The optimized variant of the algorithm is described in the Reachable procedure of Algorithm 1. This O(n)-time algorithm for an arbitrary L_p-norm is, in fact, a combination of the O(n)-time algorithm of [12] (which works only for the L_∞-norm) with the generalization to any L_p-norm of [41] (which however runs in O(nd) time).

Algorithm 1: Optimized robustness verification algorithm for decision trees.

We implement H as an initially-empty map (e.g., using a hash table): H_f ∈ R^2 is the entry associated to the f-th feature. If the map does not contain an entry for the f-th feature, then it is implicitly assumed to be (−∞, +∞]. Let H be the state of the hyper-rectangle when visiting node u = σ(f, v, t_1, t_2). When moving to a child t_i of u, with i ∈ {1, 2}, note that the distance vector δ changes only in its f-th component δ_f, since only the f-th component (l_f, u_f] of the hyper-rectangle H changes. We can therefore update Δ efficiently as follows. Let Δ' and H' = (l'_1, u'_1] × ... × (l'_d, u'_d] be the perturbation distance and hyper-rectangle associated to any of u's children, and let δ'_f be the quantity defined in Eq. 3. We extend the linear-time algorithm of [12] to an arbitrary L_p-norm by noting that Facts 1, 2, and 3 imply that Δ' can be obtained from Δ by replacing the contribution of |δ_f| with that of |δ'_f|, e.g., Δ'^p = Δ^p − |δ_f|^p + |δ'_f|^p when p ∈ N (Eq. 6); this Update-Norm operation is computed in O(1) time. The correctness of the case p = ∞ (as also discussed in [12]) follows from the fact that it must be ||δ||_∞ ≤ k iff |δ_f| ≤ k for every feature f, hence it suffices to keep track of the components exceeding k. In conclusion, we spend O(1) time per node and the time complexity of the whole visit is therefore O(n). Hence, the set L in Eq. 4 is computed in O(n) time rather than O(nd) as we previously described in Section 3.1. This also lowers the time complexity of the robustness verification for decision trees shown in the Robust-Tree procedure of Algorithm 1 to just O(n) rather than O(nd). Since robustness verification for large-spread ensembles builds on the verification algorithm of the individual trees therein, this optimization reduces the complexity of our final algorithm.

Final Algorithm. We conclude this section with Algorithm 2, our robustness verification algorithm for large-spread ensembles, whose correctness is stated in the following theorem and proved in Appendix A. It follows the description in Section 3.2.3, revised to operate with norms (scalars) rather than vectors.

Theorem 3.2. Let x be an instance with label y. A tree ensemble T such that Λ_p(T) > 2k is robust on x against the attacker A_{p,k} iff Robust(T, p, k, x, y) returns True.
Observe that the complexity of Algorithm 2 is O(n + m log m), where n and m are, respectively, the total number of nodes and trees in the ensemble. Verifying the robustness of the m individual trees in the ensemble and updating the vector Δ of per-tree perturbation norms takes O(n) time thanks to the linear-time Algorithm 1. Afterwards, the algorithm sorts Δ in O(m log m) time and computes the minimum norm required to attack at least (m−1)/2 + 1 trees in O(m) time.
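The final step of Algorithm 2 can be sketched as follows. This is our reconstruction, not the paper's pseudocode: it assumes the per-tree analysis already produced, for each tree, the norm of the cheapest perturbation flipping it (infinity if the tree cannot be flipped), and that the ensemble's clean prediction is correct:

```python
# Sketch of the scalar combination step: sort the per-tree attack norms and
# check whether the (m-1)/2 + 1 cheapest attacks fit together within budget k,
# combining norms as allowed by the orthogonality facts.

import math

def ensemble_robust(per_tree_norms, p, k):
    m = len(per_tree_norms)            # m odd: majority flips need (m-1)/2 + 1
    need = (m - 1) // 2 + 1
    norms = sorted(per_tree_norms)     # O(m log m)
    if norms[need - 1] == math.inf:    # too few attackable trees
        return True
    total = 0.0                        # combine disjoint attacks (Facts 1-3)
    for a in norms[:need]:
        if p == 0:
            total += a
        elif p == math.inf:
            total = max(total, a)
        else:
            total = (total ** p + a ** p) ** (1.0 / p)
    return total > k                   # robust iff the combined attack exceeds k
```

For instance, with per-tree norms [1, 2, ∞] under the L_2-norm, flipping a majority costs √5 ≈ 2.24, so the ensemble is robust for k = 2 but not for k = 2.5.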

TRAINING LARGE-SPREAD ENSEMBLES
We have described an efficient robustness verification algorithm for large-spread ensembles in Section 3. However, traditional decision tree ensembles trained using, e.g., sklearn, do not necessarily enjoy the large-spread condition. Here we discuss possible ideas for training algorithms designed to enforce the large-spread condition and we present a specific solution from the design space.

Design Space
While reasoning about the design of a training algorithm for large-spread ensembles, we considered different approaches falling in three broad classes:
(1) Custom ensemble learning algorithms. Develop new learning algorithms in the spirit of Random Forest [4] or Gradient Boosting [26], designed to constrain the ensemble shape so as to satisfy the large-spread condition. For example, one might train each tree while taking into account the thresholds already present in the previously trained trees, to then remove the training data which might lead to learning thresholds too close to the existing ones. Indeed, recall that thresholds are learned from the training data, hence all the possible thresholds are known a priori.
(2) Training set partitioning. Pre-compute a partition of the training data so that each decision tree in the ensemble is trained over highly separated instances, thus leading to an ensemble of trees satisfying the large-spread condition. The simplest instantiation of this idea would be partitioning the set of features and training different trees over different subsets of features, so that the large-spread condition is trivially satisfied, but more fine-grained strategies based on instance partitioning would also be feasible.
(3) Pruning techniques. Train a standard decision tree ensemble, e.g., using the Random Forest algorithm, and prune it so as to keep only trees satisfying the large-spread condition. A variant of this technique might perform different types of mutations of the available trees to improve the effectiveness of pruning.
Although we consider all these routes to be viable and worth investigating, in this work we decide to prioritize the third class of solutions. Compared to the first class, pruning leads to a range of simple and intuitive solutions, which take advantage of state-of-the-art implementations of existing training algorithms, e.g., those available in sklearn. This simplifies the deployment of an efficient and robust implementation. Moreover, pruning does not necessarily require a massive amount of training data and features, as needed for an effective training set partitioning (second class). In the last part of this section, we also discuss how to leverage feature partitioning to improve the effectiveness of our pruning-based learning algorithm in those settings where a high number of features is available (hierarchical training).

Proposed Training Algorithm
Here we present our training algorithm. We motivate its design, describe how it works, and discuss a few relevant aspects of the proposed solution.
4.2.1 Preliminaries. Our problem of interest can be formulated as follows: given a decision tree ensemble T and a size 0 < m ≤ |T|, determine whether there exists an ensemble T' ⊆ T such that T' is large-spread and |T'| = m. We refer to this problem as the large-spread subset problem for decision tree ensembles. Unfortunately, we can prove that this problem is NP-hard; the proof is provided in Appendix B. This result implies that it is computationally hard to train large-spread ensembles by pruning when the desired number of trees therein is enforced a priori, which is normally the case because the number of trees is a standard hyper-parameter of ensemble methods. One might argue that this negative result is not a showstopper, because training is performed just once and one might devise efficient heuristic approaches to approximate the large-spread subset problem; however, preliminary experiments on public datasets suggest that any training approach purely based on pruning is likely ineffective in practice. Indeed, we empirically observed on our datasets that traditional random forests trained using sklearn are not directly amenable for pruning, because any two trees in the ensemble already violate the large-spread condition when joined into an ensemble of size two. Our understanding of this phenomenon is that there exist some important features which are pervasively reused across different trees, which often learn the same thresholds, thus making the identification of a large-spread ensemble unfeasible. Our training algorithm thus integrates a greedy heuristic approach to pruning with a mutation operation, which perturbs thresholds so as to actively enforce the large-spread condition even when it would not be possible by pruning alone.

Training Algorithm.
The proposed training algorithm takes as input a training set D_train, a number of trees m, a norm p and a maximum perturbation k. In addition to the classic hyper-parameters of tree learning such as tree depth, the algorithm relies on a few specific hyper-parameters: a maximum number of iterations max_iter ∈ N, a multiplicative factor c ∈ N and a real-valued interval I ⊆ R × R. From a high-level point of view, the algorithm operates by training a standard random forest F including c · m trees, to then select a set of m trees constituting a large-spread ensemble T*. This is done by a combination of pruning and mutation of the trees in F. After picking a random tree of F to begin with, the algorithm iteratively tries to identify the other m − 1 trees by means of a greedy approach. The candidate tree t to be inserted in T* is always the tree in F minimizing the number of feature overlaps with T*, i.e., the number of features violating the large-spread condition in T* ∪ {t}. If the number of feature overlaps is greater than zero, the ensemble is fixed to enforce the large-spread condition by iteratively removing the overlaps. In particular, let σ(f, v) and σ(f, v') be two nodes from different trees such that ||v − v'||_p ≤ 2k. We sample a perturbation s ∈ I, we subtract s from min(v, v') and we add s to max(v, v') in the attempt to fix the overlap. Since this change might introduce new overlaps, we then iterate through the ensemble until all the overlaps have been fixed (i.e., the ensemble is large-spread) or the maximum number of iterations max_iter has been reached. If all the overlaps of T* ∪ {t} have been fixed, i.e., the resulting tree-based ensemble is large-spread, then the extended large-spread ensemble becomes the new large-spread ensemble T*; otherwise T* is not extended and the tree t is discarded. Then the algorithm tries to extend T* with another tree in F, unless T* has reached the desired number of trees or all the trees in F have been selected for extending the large-spread ensemble. The pseudocode of the training algorithm is presented in Algorithm 3.
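The greedy selection and the threshold-mutation routine can be sketched as follows. This is a simplified reconstruction of the structure just described, not the paper's Algorithm 3: trees are encoded as lists of mutable [feature, threshold] pairs, and all function names are ours.

```python
# Sketch of greedy pruning + threshold mutation for large-spread selection.

import random

def close_pair(forest, k):
    """Indices of two nodes in different trees violating the large-spread
    condition, or None if the forest is large-spread."""
    for i in range(len(forest)):
        for j in range(i + 1, len(forest)):
            for a, (f, v) in enumerate(forest[i]):
                for b, (g, w) in enumerate(forest[j]):
                    if f == g and abs(v - w) <= 2 * k:
                        return (i, a), (j, b)
    return None

def fix_forest(forest, k, max_iter, interval, rng):
    """Mutate overlapping thresholds until large-spread or budget exhausted."""
    for _ in range(max_iter):
        pair = close_pair(forest, k)
        if pair is None:
            return True
        s = rng.uniform(*interval)
        lo, hi = sorted(pair, key=lambda n: forest[n[0]][n[1]][1])
        forest[lo[0]][lo[1]][1] -= s          # push the two thresholds apart
        forest[hi[0]][hi[1]][1] += s
    return close_pair(forest, k) is None

def train_large_spread(pool, m, k, max_iter=50, interval=(0.5, 1.0), seed=0):
    rng = random.Random(seed)
    pool = [[list(node) for node in tree] for tree in pool]
    selected = [pool.pop(rng.randrange(len(pool)))]   # random starting tree
    while pool and len(selected) < m:
        def n_overlaps(tree):                          # greedy criterion
            return len({f for f, v in tree for t in selected
                        for g, w in t if f == g and abs(v - w) <= 2 * k})
        tree = min(pool, key=n_overlaps)
        pool.remove(tree)
        candidate = [[list(node) for node in t] for t in selected] + [tree]
        if fix_forest(candidate, k, max_iter, interval, rng):
            selected = candidate                       # keep the fixed ensemble
    return selected
```

Running it on the three Fig. 2 stumps with m = 2 and k = 2 returns two trees whose thresholds are more than 2k apart.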

Complexity. Recall that each tree has at most n nodes and we fix c to be a small constant, e.g., c ∈ [2, 6]. TrainLargeSpread calls GetBestTree and FixForest O(m) times. The former function GetBestTree iterates at most |F| ∈ O(m) times (t ∈ F, line 23) the construction of the set overlaps. A naive way of building this set is to iterate over all the nodes of t (at most n nodes) and compare their thresholds with all the thresholds appearing in the nodes of T* (at most mn nodes), leading to O(mn^2) time to build one instance of overlaps. We observe that it is easy to speed up this step using balanced search trees, but we leave optimizations to further extensions of this work. To sum up, GetBestTree takes O(m^2 n^2) time and, hence, the O(m) calls to GetBestTree cost overall O(m^3 n^2) time. Function FixForest iterates max_iter times the for loop at line 35. Each iteration of the for loop costs O(1) time and there are at most m^2 n^2 iterations, because the loop iterates over all possible combinations of nodes σ(f, v) and σ(f', v') belonging to two distinct trees of T*: since there are at most mn nodes in T*, the number of iterations is at most m^2 n^2. To this cost, we have to add the max_iter evaluations of "T* is not large-spread" (line 33); this predicate can be evaluated in O(m^2 n^2) time by comparing all pairs of thresholds appearing in T*. We conclude that the running time of the O(m) iterations of FixForest is in total O(max_iter · m^3 n^2). This dominates the running time of the O(m) iterations of GetBestTree, so we conclude that O(max_iter · m^3 n^2) is also the running time of TrainLargeSpread. This cost is paid in addition to the cost of training the standard random forest at line 2.
As noted above, although it is feasible to reduce this complexity using appropriate data structures, we observe that (i) training is often performed only once, so any optimization offers limited benefits and is left to future work, and (ii) the number of trees and nodes is often small enough to make a cubic complexity acceptable in practice. As a matter of fact, our experimental evaluation gives evidence of the acceptable empirical efficiency of the proposed training algorithm.

Hierarchical Training. We observe that our training algorithm can fail, in particular when it is not possible to add one tree to the current large-spread ensemble and reduce the overlaps to zero by our mutation routine, i.e., the number of overlaps resulting from adding a tree to the large-spread ensemble is too high. However, we show in our experimental evaluation (see Section 5) that it is possible to train large-spread ensembles of different dimensions after some parameter tuning. In particular, we propose an intuitive and effective technique to mitigate the risk of failures during training. A key insight is that the larger the ensemble is, the more difficult it becomes to avoid violations of the large-spread requirement, because ensembles including many trees also include many thresholds, hence overlaps become harder to avoid. We thus propose a hierarchical training approach as follows: (1) we first partition the set of features into k disjoint subsets and we build k different projections of the training set D_train, based on such feature sets; (2) we train a large-spread ensemble of size B/k on each of the k different training sets using Algorithm 3 and we finally merge all the trained ensembles into a single ensemble of B trees. Note that the final ensemble is indeed large-spread, because each of the merged ensembles ensures the large-spread condition on its own trees, and trees from different ensembles cannot violate the large-spread condition because they are built on disjoint sets of features. For example, an ensemble of 100 trees can be trained by building 4 disjoint projections of the training data (based on feature partitioning) and training an ensemble of 25 trees on each of them. We empirically observed that this approach may improve the effectiveness of the training process, by enabling the construction of larger ensembles in practice. We report on experiments confirming this observation in the next section.
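The hierarchical scheme can be sketched as follows; `train_large_spread` stands in for Algorithm 3, and all names here are hypothetical.

```python
# Sketch of hierarchical training: partition the features into k disjoint
# subsets, train a sub-ensemble on each projection, then merge. Trees built
# on disjoint features cannot violate the large-spread condition across
# sub-ensembles, so the merged forest is large-spread by construction.
import random

def hierarchical_train(features, n_trees, k, train_large_spread):
    features = list(features)
    random.shuffle(features)
    # (1) partition the features into k disjoint subsets
    parts = [features[i::k] for i in range(k)]
    # (2) train n_trees // k trees per projection and merge
    forest = []
    for part in parts:
        forest.extend(train_large_spread(part, n_trees // k))
    return forest
```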

EXPERIMENTAL EVALUATION

5.1 Experimental Setup
To show the practical relevance of our theory, we develop two tools on top of it and we demonstrate their effectiveness on public datasets.
5.1.1 Tools. Our first tool CARVE is a C++ implementation of the proposed robustness verification algorithm for large-spread ensembles (Algorithm 2). It takes as input a random forest classifier, a norm p, a maximum perturbation ε and a test set D_test, and returns as output the robustness score of the classifier on D_test. CARVE assumes that the input classifier is large-spread and implements majority voting as the aggregation scheme for the individual tree predictions. Our second tool LSE is a sequential Python implementation of the proposed training algorithm for large-spread ensembles (Algorithm 3). Starting from a training set D_train, a number of trees B, a norm p and a maximum perturbation ε, it returns a large-spread ensemble of B trees (unless the training algorithm fails by returning ⊥). The random forest trained before pruning is created using sklearn.

5.1.2 Methodology. Our experimental evaluation is performed on four public datasets: Fashion-MNIST, MNIST, REWEMA and Webspam. Since Fashion-MNIST and MNIST are datasets associated with multiclass classification tasks and we focus on binary classification tasks in this work, we consider two subsets of them. In particular, for Fashion-MNIST we consider the instances with class 0 (T-shirt/top) and 3 (Dress), while for MNIST we keep the instances representing the digits 2 and 6. The key characteristics of the chosen datasets are reported in Table 2. The chosen datasets are representative for different reasons: Fashion-MNIST, MNIST and Webspam have already been considered in the robustness verification literature [1,12,31,41]; moreover, REWEMA and Webspam are associated with security-relevant classification tasks (malware and spam detection, respectively) for which the robustness verification of the employed classifier is critically important. In general, we choose datasets with a high number of features, where it may be useful to train large tree ensembles to reach the best performance. Each dataset is partitioned into a training set and a test set, using 70/30 stratified random sampling.
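A minimal pure-Python sketch of the 70/30 stratified split (in practice this is typically done with sklearn's `train_test_split` and its `stratify` argument; this stand-alone version just illustrates the idea):

```python
# Stratified 70/30 split: shuffle the indices of each class separately,
# then take 70% of each class for training, so that class ratios are
# preserved in both splits. Illustrative only.
import random
from collections import defaultdict

def stratified_split(y, train_frac=0.7, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_frac * len(idxs))   # per-class 70% cut point
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return train_idx, test_idx
```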
In our experimental evaluation we make use of two training algorithms to learn different types of classifiers: (i) a majority-voting classifier based on a traditional random forest (RF) trained using sklearn, and (ii) a majority-voting classifier based on a large-spread tree ensemble trained using LSE. Moreover, we consider tree-based classifiers of different sizes: (i) small ensembles with 25 trees of maximum depth 4; (ii) large ensembles with 101 trees of maximum depth 6. We only consider ensembles with an odd number of trees in order to avoid ties in classification.
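As a small illustration of the majority-voting aggregation and of why an odd ensemble size avoids ties on binary labels:

```python
# Majority voting over an odd number of trees: with an odd vote count the
# binary vote can never tie, so the classifier always returns a label.
def majority_vote(predictions):
    """predictions: iterable of 0/1 votes, one per tree (odd count)."""
    votes = list(predictions)
    assert len(votes) % 2 == 1, "odd ensemble size avoids ties"
    return int(sum(votes) > len(votes) // 2)
```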
Robustness verification is then performed using CARVE and SILVA, a state-of-the-art verifier for traditional decision tree ensembles based on abstract interpretation [31]. Note that SILVA can be applied to arbitrary ensembles, while CARVE can only be used on large-spread ensembles. Since SILVA leverages the hyper-rectangle abstract domain for verification, which does not introduce any loss of precision for ℓ∞-norm attackers but might lead to an over-approximation for generic ℓp-norm attackers, we only focus on ℓ∞-attackers in our comparison. For the sake of completeness, in our evaluation of CARVE we also consider robustness against ℓ1-attackers and ℓ2-attackers for large-spread ensembles.
Finally, in our evaluation we consider different perturbations ε ∈ {0.0050, 0.0100, 0.0150} for the MNIST, Fashion-MNIST and REWEMA datasets, while we assume ε ∈ {0.0002, 0.0004, 0.0006} for Webspam. We choose different perturbations for the Webspam dataset to be aligned with previous work and to obtain roughly the same decrease in robustness observed on the other three datasets for the considered tree-based classifiers. Indeed, Chen et al. [11] showed in their experimental evaluation that the certified minimum adversarial perturbation obtained for the Webspam dataset is one order of magnitude smaller than the one obtained for the MNIST dataset, i.e., models trained over Webspam would be too fragile to be usable when tested against larger perturbations.

5.1.3 LSE Setup. Our tool LSE requires the user to specify the value of some additional parameters (described in Section 4.2) with respect to the traditional implementation of the training algorithm for random forests by sklearn. The norm p and the perturbation ε depend on the assumed attacker's capabilities, so they do not require particular tuning. Still, other parameters, such as the number of partitions k for hierarchical training and the maximum number of iterations max_rounds of the FixForest procedure, require some tuning. Indeed, although partitioning the features may enable the training of larger ensembles, too many partitions might negatively affect the accuracy of the resulting large-spread ensemble, because each sub-forest has only a partial view of the set of available features and some patterns may not be learned. In the same way, the maximum number of rounds max_rounds has an impact on the success of the training procedure, since a minimum number of rounds is required to adjust the thresholds of the ensemble, but too many rounds may modify the thresholds too much and degrade the predictive power of the model. We perform some experiments in order to assess the influence of these parameters on the success of training a large-spread ensemble and on the accuracy of the resulting model, to then pick the best-performing models in our experimental evaluation. For space reasons, we discuss the details in Appendix C.

5.2 Accuracy and Robustness Results
In our first experiment we assess whether large-spread ensembles are effective at classification and we analyze their robustness properties. Indeed, the large-spread condition enforced on the ensemble constrains the model shape, thus potentially reducing its predictive power with respect to traditional tree ensembles. Since we are not just concerned with accuracy but also target robustness, we additionally analyze how large-spread ensembles fare against evasion attacks. Our evaluation consists of two parts. We first compare the accuracy and robustness of the large-spread ensembles against traditional random forests of the same size, considering an ℓ∞-attacker. The robustness of the traditional models is computed using SILVA, since CARVE can only be used for verifying large-spread ensembles. We set a timeout of one second per instance, as in [31]. Then, we use CARVE to verify the robustness of large-spread ensembles against ℓ1-attackers and ℓ2-attackers, which are not supported by SILVA.

5.2.1 Comparison for ℓ∞-norm Attackers. Table 3 shows the experimental results of our comparison. Note that the reported robustness may be approximate, since SILVA may not be able to verify robustness on some instances within the time limit; for these cases, we provide lower and upper bounds of robustness, using the ± notation. The results highlight that large-spread ensembles are reasonably accurate and often more robust than random forests of the same size. In particular, the accuracy of the large-spread ensembles is at most 0.03 lower than the accuracy of the corresponding traditional model in the majority of the cases, while the improvement in robustness is at least 0.04 in around half of the cases. This is reassuring, because accuracy was at stake: the large-spread condition restricts the shape of the ensemble and might be associated with a reduction of predictive power. The increase in robustness is an interesting byproduct of the large-spread condition: since thresholds in different trees are far away, evasion attacks are empirically harder to craft. Observe that the accuracy and robustness values of the large-spread ensembles on the MNIST and Fashion-MNIST test sets show that large-spread models perform better overall than the traditional ensembles. The accuracy of the large-spread ensembles on these two test sets is usually equal to that of the traditional ensembles, while robustness improves by at least 0.06 in half of the cases, in particular when the largest considered perturbation ε is used as the attacker's capability. For example, the robustness of the large-spread ensemble with 101 trees of maximum depth 6 under perturbation 0.0150 is at least 0.22 higher than the robustness of the corresponding random forest, while accuracy decreases by at most 0.04. When the value of the perturbation ε is the lowest considered, the results are still positive, since the large-spread ensembles present the same accuracy and a higher
robustness than the ones of the traditional ensembles.
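For intuition on the ± notation: instances that time out during verification are unknown, so they contribute 0 to the robustness lower bound and 1 to the upper bound. A hypothetical helper (not part of SILVA or CARVE) computing the reported estimate ± uncertainty:

```python
# Robustness bounds under verification timeouts: n_robust instances proved
# robust, n_timeout instances unresolved, n_total instances overall.
def robustness_bounds(n_robust, n_timeout, n_total):
    lower = n_robust / n_total                 # timeouts counted as fragile
    upper = (n_robust + n_timeout) / n_total   # timeouts counted as robust
    mid = (lower + upper) / 2
    return mid, (upper - lower) / 2            # estimate, ± uncertainty
```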
Table 3: Accuracy and robustness measures for traditional and large-spread ensembles. Robustness is computed against ℓ∞-norm attackers. We highlight in bold the cases in which the gap between the accuracy and the robustness of the traditional tree-based ensemble and the large-spread ensemble is at least 0.05.

We see a slightly different trend in the results for the REWEMA and Webspam datasets: the robustness of large-spread ensembles is always equal to or greater than the robustness of the traditional ensembles, but the gap in accuracy with respect to the traditional ensembles may increase, in particular when considering large adversarial perturbations, which make it harder to enforce the large-spread condition. For example, the large-spread ensembles of 101 trees with maximum depth 6 trained on the two datasets present 0.88 and 0.82 robustness with perturbation 0.015 and 0.0006 (respectively +0.10 and +0.01 over the robustness of the corresponding traditional tree ensembles), but their accuracy is 0.88 and 0.85 (respectively −0.10 and −0.09 compared to the accuracy of the traditional tree ensembles). This confirms that an improvement in robustness often comes at the price of a decrease in accuracy, because of the classic trade-off between accuracy and robustness [30,37]. Even in these cases, though, adopting large-spread ensembles remains useful: the accuracy is always way above the majority class distribution, so the model is usable in the non-adversarial setting, while being normally more robust than the traditional counterpart and amenable to efficient security verification. To explain the observed drop in accuracy for large-spread models, we compare the permutation feature importance [4] of traditional and large-spread ensembles, to assess which features have more predictive power according to the different models. The analysis is quite interesting. For REWEMA, it shows that traditional models give significant importance to a few numerical features which are less
important for large-spread models; large-spread models, in turn, privilege some categorical/ordinal features which are less important for traditional models. For Webspam, instead, it shows that both traditional and large-spread models privilege numerical features with many distinct values. However, the traditional models also give importance to some features with an empirical distribution heavily skewed towards the value 0, while the large-spread ensembles give more importance to features with scattered values. This explains why large-spread models sacrifice some predictive power, but show better robustness in general: categorical/ordinal features and, in general, features with more scattered values are harder to target for ℓp-norm attackers, because their sparse nature makes them more robust to adversarial perturbations, i.e., larger perturbations are required to actually traverse thresholds and thus affect predictions.

5.2.2 Additional Attackers. Table 4 shows the robustness of the trained large-spread ensembles against different ℓp-attackers for p ∈ {1, 2, ∞}. As expected, the large-spread ensembles trained on MNIST and Fashion-MNIST are generally more robust against the weakest ℓ1-attacker and less robust against the strongest ℓ∞-attacker. Instead, for the large-spread ensembles trained on REWEMA and Webspam, we observe that the robustness values are almost the same for every considered attacker. This is explained by the fact that large-spread models trained over such datasets make a more significant use of categorical/ordinal features and features with more scattered values, as discussed in the previous section. The attacker thus cannot perturb the test instances to cross thresholds of features that are important for prediction, independently of the chosen ℓp-norm. We remark here that the effectiveness of CARVE does not depend on p: robustness verification is always exact and the complexity of the analysis is independent of p. This motivates why the rest of our evaluation only considers the case p = ∞.

5.3 Efficiency of Robustness Verification
We now compare the SILVA and CARVE robustness verification tools along two different dimensions: verification time and memory consumption. For simplicity, we only focus on the verification of large ensembles with 101 trees and maximum depth 6 on the MNIST dataset with ε = 0.0150. As emerged from the results in Section 5.2, this is a setting where a state-of-the-art approach like SILVA clearly shows its limits: indeed, SILVA could not provide a precise estimate of the robustness of this model (±0.05). In order to measure the verification time per instance and to set timeouts in the same way for both tools, we use the GNU commands time and timeout, which measure the elapsed wall-clock time. The former command is also used to measure the maximum amount of physical memory allocated to the verifier. When we need to cap the amount of physical memory available to the process, we use the cgroup feature of the Linux kernel. All the experiments are performed on a virtual machine with 103 GB of RAM running Ubuntu 20.04.4 LTS, hosted on a server with an Intel Xeon Gold 6148 CPU at 2.40 GHz.
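A hedged Python sketch of this per-instance measurement loop (the actual evaluation relies on the GNU time and timeout commands; the command lines here are purely illustrative):

```python
# Run one verification command with a hard wall-clock timeout, returning
# the elapsed time and the exit code (None if the timeout was hit).
import subprocess
import time

def verify_with_timeout(cmd, timeout_s):
    start = time.perf_counter()
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        return time.perf_counter() - start, proc.returncode
    except subprocess.TimeoutExpired:
        # instance counted as "not verified" within the limit
        return timeout_s, None
```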

5.3.1 Time Efficiency. In our first experiment we compare the robustness verification times for traditional tree ensembles using SILVA and the robustness verification times for large-spread ensembles using CARVE. This way, we compare a state-of-the-art approach for traditional tree ensembles (i.e., what we would do today) against our custom algorithm designed to take advantage of the large-spread condition (i.e., what we put forward in this paper).
In the experiments of Section 5.2, we set the maximum verification time per instance of SILVA to one second. However, SILVA may complete the verification of more difficult instances if more time is granted, e.g., 60 seconds [31]. In order to perform a fair comparison, we compare how many instances of the MNIST test set can be verified under growing time limits per instance, from one second to 10 minutes. This methodology allows us to figure out on how many instances the verification is really difficult. Note that a timeout of 10 minutes per instance is already extremely generous, since test sets normally include thousands of instances.
Figure 3a shows the results of our experiment. The plot shows that SILVA is not able to verify the robustness of the traditional tree ensemble on 434 instances within one second and on 190 instances within one minute, providing just approximate robustness estimates with an uncertainty of 0.10 (±0.05) and 0.05 (±0.025) respectively. On the other hand, our tool CARVE requires less than one second per instance to verify the robustness of the large-spread ensemble on all the instances of the test set, providing an exact estimate of the robustness of the model. As the maximum amount of verification time per instance increases, the number of instances on which SILVA is not able to verify the robustness of the model further decreases, e.g., 168 instances with a timeout of 120 seconds and 166 instances with a timeout of 180 seconds. Even though the robustness estimate of SILVA becomes more precise as the timeout per instance increases, i.e., the uncertainty on robustness decreases to 0.04 (±0.02) with a timeout per instance of 180 seconds, this process eventually hits a wall: the remaining 166 instances cannot be verified even when the timeout increases to 10 minutes per instance. Moreover, the improved precision comes at the cost of a higher total verification time: with a timeout of 120 seconds, SILVA requires in total 22,220 seconds to verify the traditional tree ensemble on the entire MNIST test set, while CARVE requires just 129 seconds in total, i.e., a reduction of two orders of magnitude. As expected, the results show the pitfalls of complete robustness verification on traditional tree ensembles and the improvements in verification time enabled by the large-spread condition. Since the verification problem is NP-complete, there may be instances on which the verification time grows exponentially, while the large-spread condition allows one to train tree ensembles whose robustness can always be verified in polynomial time.

5.3.2 Memory Efficiency. Our first experiment provides only a partial picture of the efficiency of robustness verification and the reasons for the potential inefficiency of SILVA. Indeed, memory constraints should also be taken into account during robustness verification, since high memory consumption may make verification unfeasible on standard commercial systems.
In our second experiment, we compare the memory efficiency of SILVA and CARVE. In particular, we compare how many instances can be verified under a growing maximum memory consumption limit per instance, setting the maximum verification time per instance to 10 minutes. The results of our experiment are shown in Figure 3b. The results highlight that SILVA may consume a lot of memory in order to provide precise robustness estimates. In the best scenario, with 100 GB of memory available, SILVA is still unable to verify the robustness of the model on 168 instances, providing just an approximate estimate of the robustness of the traditional tree-based ensemble with an uncertainty of 0.04 (±0.02). Even though the uncertainty interval of the robustness approximation is not so large in this setting, the plot shows that the number of instances that SILVA cannot verify increases as the memory consumption limit decreases, which also widens the uncertainty of SILVA's robustness estimates. For example, SILVA is not able to verify the robustness of the model on 216 and 342 instances with memory consumption limits of 32 GB and 4 GB respectively, leading to an uncertainty in the robustness estimates of 0.05 (±0.025) and 0.08 (±0.04). Instead, CARVE manages to verify the robustness of the large-spread ensemble on the whole MNIST test set using less than 4 GB of memory per instance, providing an exact value of robustness. More precisely, the maximum memory consumption of CARVE is less than 1 GB in practice. The results confirm the memory efficiency of our proposal and the unfeasibility of obtaining an exact value of robustness on traditional tree ensembles using a state-of-the-art verifier like SILVA when memory consumption constraints are imposed. We finally perform a comparison between CARVE and SILVA when enforcing both a maximum verification time limit and a maximum memory consumption limit. In particular, we compare the total verification time, the maximum
memory consumption and the number of instances on which each tool is not able to return an answer, given a maximum verification time of 60 seconds per instance and a maximum memory consumption limit of 64 GB. Table 5 shows the results of our experiment. The results confirm the observations from the previous sections: CARVE is far more efficient than SILVA in terms of both verification time and memory consumption. In particular, CARVE outperforms SILVA on the total verification time on the MNIST test set, verifying the large-spread ensemble on all the instances in just 129 seconds, thus being 112 times faster than SILVA (which requires 14,448 seconds). Moreover, the memory consumption of CARVE is more than 2,000 times lower than that of SILVA, using just 0.03 GB of memory, thus CARVE is usable on commodity hardware. Finally, SILVA is not able to provide an answer on 190 instances of the test set, yielding an approximate robustness estimate with an uncertainty of 0.05 (±0.025), while CARVE provides the exact robustness value. This provides clear evidence of the challenges of robustness verification for traditional tree ensembles: since robustness verification is NP-hard in general, even a state-of-the-art tool like SILVA is bound to fail on specific inputs.

5.4 Efficiency of the Training Algorithm
Finally, we evaluate the time efficiency of the training algorithm for large-spread ensembles (Algorithm 3). Intuitively, the difficulty of enforcing the large-spread condition depends on two factors: the model size and the adversarial perturbation ε. Indeed, the larger ε is, the larger the distance to be enforced across thresholds in different trees. We then perform two experiments, each for different values of ε: in the first, we fix the maximum tree depth at six and we vary the number of trees in {25, 51, 75, 101}; in the second, we fix the number of trees at 101 and we vary the maximum depth of the trees in {3, 4, 5, 6}. The presented times are measured for a specific hyper-parameter choice enabling successful training in all settings (max_rounds = 500, among other parameters).

5.4.1 Number of Trees. Figure 4 shows the results of our first experiment. We observe that the time required for training a large-spread ensemble depends on the dataset, most likely because enforcing the large-spread condition may be easier or harder for different training data. When considering at most 75 trees, the time required for training a large-spread ensemble is less than 150 seconds for all the considered datasets and adversarial perturbations. For example, the time required for training a large-spread ensemble of 75 trees is 28 seconds on MNIST and 145 seconds on Webspam when considering the largest adversarial perturbation. Similarly, training a large-spread ensemble is efficient when considering smaller adversarial perturbations: for the smallest perturbations, training time ranges from one second on the REWEMA dataset to 16 seconds on the Webspam dataset. This result is encouraging, because the trained models already obtain a reasonable accuracy on the test set and the range of adversarial perturbations might be small in practical cases.
On the downside, when considering larger models with 101 trees, the role of the adversarial perturbation on the training time becomes more significant. For example, training a large-spread ensemble with 101 trees under the largest adversarial perturbation required 137 seconds on MNIST and 1,835 seconds on Webspam. The motivation is that the cost of adding a tree to the ensemble increases as the size of the ensemble increases, because all the thresholds of the current ensemble must be adjusted with respect to the new tree. Fixing such violations of the large-spread condition is difficult for larger adversarial perturbations, because thresholds must be pushed farther away. This fact particularly affects the time required for training large-spread ensembles on the Webspam dataset: since some important features for the ensemble have a very skewed empirical distribution, the thresholds learned by the traditional tree-based ensembles for these features are close to each other, thus separating them effectively is difficult and may require the training algorithm to perform many iterations.

5.4.2 Maximum Tree Depth. Figure 5 shows the results of our second experiment. We observe that training a large-spread ensemble of depth at most five requires at most 122 seconds for all the considered datasets and adversarial perturbations. For example, training a large-spread ensemble of 101 trees with maximum depth five takes 55 seconds on the Fashion-MNIST dataset and 122 seconds on the Webspam dataset. Moreover, the results confirm that training a large-spread ensemble under small adversarial perturbations is efficient, e.g., the maximum time required for training a large-spread ensemble of 101 trees with maximum depth six, considering the smallest adversarial perturbation for each dataset, is 35 seconds.
However, we observe that, when considering large-spread ensembles with deeper trees, choosing a larger adversarial perturbation may determine a considerable increase in training time. The worst case is observed on the Webspam dataset, where training a large-spread ensemble of 101 trees with maximum depth six and ε = 0.0006 takes 1,835 seconds. Indeed, increasing the depth of the trees in the ensemble causes an exponential growth in the number of nodes, and enforcing the large-spread condition for larger perturbations is more difficult, thus more violations of the large-spread condition need to be fixed to add a single tree to the ensemble.

5.4.3 Discussion. Our experimental evaluation shows that the training algorithm for large-spread ensembles is efficient when the model size is relatively limited (≤ 75 trees) or the adversarial perturbation is small. Concretely, the most challenging model including 75 trees could be trained in 145 seconds, while the most challenging model for the smallest adversarial perturbation could be trained in 35 seconds. When combining a large model size with large adversarial perturbations, however, the training time can become higher. The worst case was observed on the Webspam dataset, where a model with 101 trees required 1,835 seconds to be trained under the largest adversarial perturbation. Nevertheless, this price is paid just once, at training time: once the model is trained, robustness can be verified in polynomial time for thousands of instances. Also, such extreme cases only occurred on the Webspam dataset: for example, the most challenging models to train on Fashion-MNIST and REWEMA took just 113 seconds and 13 seconds respectively. We find these results appropriate for our first evaluation of large-spread ensembles, in particular because our implementation of LSE is not heavily optimized, and we plan to design more efficient training algorithms for large-spread ensembles as future
work.

5.5 Take-Away Messages
Our experimental evaluation shows that:

• Large-spread ensembles sacrifice some predictive power with respect to traditional tree ensembles, yet their accuracy remains way higher than the majority class of the test set. Even better, in several cases the accuracy of large-spread ensembles is equal to the accuracy of traditional tree ensembles.

• Large-spread ensembles are generally more robust than traditional tree ensembles. This empirical observation is a useful byproduct of the large-spread condition, which makes it harder to craft evasion attacks that are effective against multiple trees in the ensemble.

• Our verification tool for large-spread ensembles, CARVE, is much more efficient than SILVA, a state-of-the-art verifier for traditional tree ensembles. Improvements concern both verification time and memory consumption. Moreover, we showed that SILVA can provide just approximate robustness estimates in some experimental settings, even when provided with extremely generous time and memory bounds (10 minutes per instance, 100 GB of RAM). Conversely, CARVE can compute the exact value of robustness using just limited time and memory (1 second per instance, 1 GB of RAM). This shows the effectiveness of the verifiable learning paradigm: models trained with formal verification in mind can be verified in a matter of seconds even on standard commercial hardware, contrary to traditional machine learning models, which cannot be accurately verified even when extremely powerful servers are available.

RELATED WORK
We already mentioned that prior work studied the complexity of the robustness verification problem for decision tree ensembles [23,41]. This problem was proved to be NP-complete for arbitrary ℓp-norm attackers, even when restricting the model shape to decision stump ensembles [1,41]. To the best of our knowledge, we are the first to identify a specific class of decision tree ensembles enabling robustness verification in polynomial time. Prior work on robustness verification for decision tree ensembles proposed different techniques, such as exploiting equivalence classes extracted from the tree ensemble [36], integer linear programming [23], a reduction to the max-clique problem [12], abstract interpretation [7,31] and satisfiability modulo theories (SMT) solving [16,18,33]. Though effective in many cases, these techniques still have to deal with the exponential complexity of the robustness verification problem, so they are bound to fail for large ensembles and complex datasets. We experimentally showed that a state-of-the-art verifier like SILVA [31] is much less efficient than our verifiable learning approach, which supports verification in polynomial time, and can only compute approximate robustness estimates in practical cases. Moreover, our LSE training algorithm produces tree ensembles that are in general more robust than their traditional counterparts, as a side-effect of imposing that the thresholds of different trees are sufficiently far away. Several papers in the literature discussed new algorithms for training tree ensembles that are robust to evasion attacks [1, 8-11, 13, 20, 23, 32, 38-40], but our work is complementary to them. Indeed, our primary goal is not enforcing robustness, which is a byproduct of our training algorithm, but supporting efficient robustness verification of the trained models. We also acknowledge that our work solely focuses on the classic definition of robustness, known as local robustness in more recent literature discussing global robustness
and related properties [6,14,27].This line of research aims to achieve security verification independently of the choice of a specific test set, enhancing the credibility of security proofs.Given that local robustness remains popular and is easier to deal with, we stick to it in this paper and we leave the extension of our framework to global robustness as future work.
It is worth mentioning that a lot of work has been done on the robustness verification of deep neural networks (DNNs). Classic approaches for exact verification often do not scale to large DNNs, just as for tree ensembles, and they are typically based on SMT [21,24,25] and integer linear programming [2,17,28,35]. To mitigate the scalability problems of robustness verification, different proposals have been made, such as shrinking the original DNN through pruning [19] and finding specific classes of DNNs that empirically enable more efficient robustness verification [22]. Xiao et al. [42] proposed the idea of co-designing model training and verification, i.e., training models that show reasonable accuracy and robustness, while better enabling exact verification. In particular, their work proposes a training algorithm for DNNs that encourages weight sparsity and ReLU stability, two properties that improve the efficiency of verification through SMT solving. There are significant differences between these lines of work and ours. First, prior techniques only provide empirical efficiency guarantees, while our proposal leads to a formal complexity reduction of the robustness verification problem through the design of a polynomial-time algorithm. Moreover, our research deals with tree ensembles rather than DNNs.
Finally, we observe that recent work explored the adversarial robustness of model ensembles [43]. The main result of this work proved that the combination of "diversified gradient" and "large confidence margin" is a sufficient and necessary condition for certifiably robust ensemble models. While this result cannot be directly applied to non-differentiable models such as decision tree ensembles, the intuition of diversifying models is similarly captured by our large-spread condition. We plan to explore the intriguing connections with this proposal as future work.

CONCLUSION
We introduced the general idea of verifiable learning, i.e., the adoption of training algorithms designed to learn restricted model classes amenable for efficient security verification. We applied this idea to decision tree ensembles, identifying the class of large-spread ensembles. We showed that this class of ensembles admits robustness verification in polynomial time, whereas the problem is NP-hard for general decision tree models. We then proposed a pruning-based training algorithm to learn large-spread ensembles from traditional decision tree ensembles. Our experiments on public datasets show that large-spread ensembles sacrifice a limited amount of the predictive power of traditional tree ensembles, while being normally more robust and much more efficient to verify. This makes large-spread ensembles appealing in the adversarial setting.
As future work, we plan to investigate the use of verifiable learning for other popular model classes, e.g., neural networks. Moreover, we want to explore different training algorithms for large-spread ensembles and compare their effectiveness against the pruning-based approach proposed in this paper.
A PROOF OF THEOREM 3.2

C PARAMETER TUNING FOR LSE TRAINING
Our training algorithm for large-spread ensembles has four hyper-parameters: the maximum number of iterations used to fix violations of the large-spread condition, the multiplicative factor determining the size of the initially trained forest, the interval of the perturbation applied to fix the forest, and the size of the feature partition. Each hyper-parameter can affect the performance of the trained large-spread ensemble, as well as the successful termination of the training algorithm.
As is customary for tree-based models, we deal with hyper-parameter tuning by means of grid search, i.e., we try out all the possible combinations of specific hyper-parameter values to identify the one performing best on a validation set including 20% of the training data, extracted via stratified random sampling. Specifically, we look for the combination of hyper-parameters optimizing the average between accuracy and robustness on the validation set. Table 6 reports, for each dataset, model size (number of trees and maximum depth) and perturbation, the hyper-parameter values leading to the best performance on the validation set. By looking at the results, we can gather some insights about the influence of each hyper-parameter on training large-spread ensembles.
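The selection procedure just described can be sketched as an exhaustive search maximizing (accuracy + robustness) / 2 on the validation set. The grid below is purely illustrative: the hyper-parameter names and candidate values are our assumptions, since the excerpt does not list the actual grid; `evaluate` stands for training an LSE with a given combination and measuring it on the validation split.

```python
from itertools import product

# Hypothetical hyper-parameter grids (names and values are illustrative;
# the paper's actual candidate values are not reported in this excerpt).
grid = {
    "max_iter": [100, 500],        # iterations to fix large-spread violations
    "mult_factor": [2, 4],         # size multiplier of the initial forest
    "noise_interval": [0.01, 0.05],  # perturbation interval for threshold fixes
    "partition_size": [1, 4],      # number of feature subsets
}

def select_best(evaluate, grid):
    """Return the hyper-parameter combination maximizing the average of
    accuracy and robustness on the validation set, via exhaustive search."""
    best_combo, best_score = None, float("-inf")
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        combo = dict(zip(keys, values))
        acc, rob = evaluate(combo)   # train an LSE and measure it
        score = (acc + rob) / 2
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score
```

With four small grids the search is cheap; its cost grows as the product of the grid sizes, which is why the candidate sets are kept small.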
We first examine the maximum number of iterations needed to train the best-performing large-spread ensembles. We observe that the chosen value depends on the size of the model: typically, only 100 iterations are needed to train the best-performing large-spread ensembles of 25 trees, while 500 iterations are needed for the best-performing large-spread ensembles of 101 trees. The intuitive reason is that more iterations are needed to successfully train large-spread ensembles with many trees, since more thresholds need to be adjusted to fulfill the large-spread condition. Large-spread ensembles with fewer trees can instead be trained with just 100 iterations, and a lower number of iterations is often beneficial there, since less noise needs to be applied to adjust the original thresholds.
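A minimal sketch of how such an iterative repair loop might look, under the assumption that thresholds violating the large-spread condition receive additive noise drawn from the perturbation interval [-beta, beta]. This is a speculative reconstruction for illustration, not the paper's exact algorithm; the encoding of trees as lists of (feature, threshold) pairs is also an assumption.

```python
import random

def fix_violations(trees, eps, beta, max_iter, seed=0):
    """Perturb thresholds until no two trees use thresholds on the same
    feature that are within 2*eps of each other, or give up after
    max_iter iterations. Returns True on success, False otherwise."""
    rng = random.Random(seed)
    for _ in range(max_iter):
        changed = False
        for i, t1 in enumerate(trees):
            for j, t2 in enumerate(trees):
                if i == j:
                    continue
                for a, (f1, v1) in enumerate(t1):
                    for f2, v2 in t2:
                        if f1 == f2 and abs(v1 - v2) <= 2 * eps:
                            # violation: nudge the threshold with noise
                            t1[a] = (f1, v1 + rng.uniform(-beta, beta))
                            changed = True
        if not changed:
            return True    # large-spread condition satisfied
    return False           # training failed within the iteration budget
```

This also explains the trend in the results: more trees mean more threshold pairs to separate, hence more iterations before the loop converges.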
As to the size of the feature partition, the results show that a partition size of 1 leads to more than half of the best-performing large-spread ensembles. In particular, a partition size of 1 is used for training almost all the best large-spread ensembles on MNIST, Fashion-MNIST and REWEMA. This suggests that avoiding partitioning the features is the best choice on most datasets: an ensemble trained on all the available features may exhibit better accuracy and robustness than an ensemble built of sub-forests trained on subsets of features, since the sub-forests have only a partial view of the set of available features and some patterns might not be learned. Nevertheless, partitioning the features can be useful for training the best-performing large-spread ensembles in some cases. For example, when training the best-performing large-spread ensembles with 101 trees and maximum depth six, the best choice is a partition size of 4 on the MNIST dataset when considering the perturbation 0.0150, while a partition size of 5 or 6 is chosen for all the models trained on the Webspam dataset. This result highlights that feature partitioning can still pay off in specific settings.
Finally, we observe that the values of the multiplicative factor and the perturbation interval used for training the best-performing large-spread ensembles are strongly dependent on the specific experiment and do not show particularly insightful patterns. It is common to identify different optimal values of these hyper-parameters even for the same dataset and model size. Grid search is a standard practice to deal with their unpredictable nature.
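The feature-partition mechanism discussed above can be sketched as follows: the feature indices are split into k disjoint subsets, and a sub-forest is trained on each subset. How the partition is drawn is not specified in this excerpt, so the random split below (and the function name) is an assumption for illustration only.

```python
import random

def partition_features(n_features, k, seed=0):
    """Split feature indices into k disjoint subsets of roughly equal
    size; each subset would feed one sub-forest of the ensemble.
    (Illustrative: the actual partitioning scheme is not specified.)"""
    idx = list(range(n_features))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

With k = 1 the single "subset" contains all features, matching the most frequent best choice in the experiments; larger k trades per-tree expressiveness for easier threshold separation across sub-forests.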
Figure 3: Comparison of the time and memory efficiency of SILVA and CARVE on the MNIST dataset (we consider ensembles with 101 trees of maximum depth 6). Panel (b): number of verified instances of the test set when varying the maximum memory consumption limit for verification.

Figure 4: Efficiency of LSE when varying the number of trees of the large-spread ensemble.

Figure 5: Efficiency of LSE when varying the maximum depth of the trees of the large-spread ensemble.

Table 4: Robustness measures for large-spread ensembles against different ℓp-norm attackers.

Table 5: Comparison of total verification time and maximum memory consumption of SILVA and CARVE on the MNIST test set. The last column reports the number of instances on which the verifier was not able to provide an answer because it exceeded the time or memory limits.

Table 6: Grid search results for large-spread ensembles trained using the LSE tool. For each dataset, model size (number of trees and maximum depth) and perturbation, the table reports the hyper-parameter values leading to the highest accuracy on the validation set. The large-spread ensembles are trained with the ℓ∞-norm.