Gradient-less Federated Gradient Boosting Trees with Learnable Learning Rates

The privacy-sensitive nature of decentralized datasets and the robustness of eXtreme Gradient Boosting (XGBoost) on tabular data raise the need to train XGBoost in the context of federated learning (FL). Existing works on federated XGBoost in the horizontal setting rely on the sharing of gradients, which induces per-node-level communication and serious privacy concerns. To alleviate these problems, we develop an innovative framework for horizontal federated XGBoost that does not depend on the sharing of gradients and simultaneously boosts privacy and communication efficiency by making the learning rates of the aggregated tree ensembles learnable. We conduct extensive evaluations on various classification and regression datasets, showing that our approach achieves performance comparable to the state-of-the-art method and effectively improves communication efficiency by lowering both the number of communication rounds and the communication overhead by factors ranging from 25x to 700x. Project Page: https://flower.ai/blog/2023-04-19-xgboost-with-flower/


INTRODUCTION
Federated Learning (FL) enables the training of a global model using decentralized datasets in a privacy-preserving manner, contrasting with the conventional centralized training paradigm [7,23,34,36,37]. Existing FL research [18,21,24,28] and the techniques developed to optimize model convergence and reduce systematic privacy risks and costs [3,13,32,39] mainly focus on neural networks (NNs). Efforts to develop FL algorithms that support other machine learning (ML) models, on the other hand, remain under-explored.
EXtreme Gradient Boosting (XGBoost) [5] is a powerful and interpretable gradient-boosted decision tree (GBDT) model. In most cases, XGBoost outperforms deep learning methods on medium-sized tabular datasets with fewer than 10k training examples [10,29,35]. In the context of cross-silo FL, where the client pool typically consists of 2 to 100 organizations [13], there is a growing need to deploy federated XGBoost systems for specific tasks such as survival analysis [1] and financial fraud detection [6,31].
Existing works on federated XGBoost usually follow two settings (Fig. 1). The horizontal setting is defined as the case where clients' datasets share an identical feature space but have different sample IDs. The central server sends the global model to all clients and then aggregates the updated model parameters after each communication round. As for the vertical setting, first proposed by SecureBoost [6,27], the concepts of passive parties and one active party were introduced: passive parties and the active party share an identical sample space but possess different features. As only the active party owns the data labels, it naturally acts as the server.
Although the horizontal setting remains more common [12], training a horizontal federated XGBoost turns out to be harder, not easier, because finding the optimal split condition of XGBoost trees depends on the order of the data samples [26,30] as we iterate over the feature set and partition the data samples into left and right according to the feature constraints. Since all clients share the same sample IDs in the vertical setting, the passive parties only need to send the order of the samples to the active party. However, since the sample IDs differ across clients in the horizontal setting, at every splitting point each client needs to transmit the gradients, hessians, and/or sample splits based on the feature values to the server to find the optimal splitting condition [25]. Hence, we identify two key problems imposed by this vanilla approach.
1. Per-node-level communication frequency. The server needs to communicate with all clients at every splitting point. We denote the depth of each tree as $d$ and the number of trees in the tree ensemble as $T$. The number of nodes in the tree ensemble can scale up to $T \times 2^d$ [4], and so can the number of communication rounds. Since a trained XGBoost model commonly has a depth of 8 and 500 trees [5], the number of communication rounds can reach ∼100K. Moreover, in a real application of federated XGBoost, the server may conduct more than one round of communication per node [22] and carry out extra cryptographic calculations. Thus, the high communication overhead makes it difficult to deploy horizontal federated XGBoost for practical uses.
2. Serious privacy concerns. The sharing of gradients and even confidence information has been shown to be insecure in the distributed training of ML models [9,40]. As the training data can be reconstructed from gradients, such sharing needs to be protected.
Existing research on horizontal federated XGBoost tackles the two problems above by seeking a trade-off between privacy and communication costs. Some works adopt stronger defenses against privacy leakage. FedXGB [22] develops a new secure aggregation protocol by applying homomorphic encryption and secret sharing directly to the shared parameters. However, this induces high communication and computation overhead at a per-node-level communication frequency. Other works decrease the resolution of the raw data distribution by generating a surrogate representation using gradient histograms [4,12,25,30]. Histogram-based methods accelerate the training process by building a quantile sketch approximation, but the communication frequency still correlates with the depth of the trees. Besides, they can still leak privacy because the gradients related to the bins and the thresholds can be inferred [33]. Yet other works obfuscate the raw data distribution with methods including clustering-based k-anonymity [38] and locality-sensitive hashing [17]. Although the required communication overhead is lower than for encryption-based methods, these approaches trade off model performance against the number of clients.
In this work, we ask the fundamental question: is it possible to construct a federated XGBoost in the horizontal setting without relying on the sharing of gradients and hessians? In this way, we can simultaneously boost privacy and remove the per-node-level communication frequency. We find it to be possible by formulating an important intuition: as the local datasets of clients can be heterogeneous in the horizontal setting, using a fixed learning rate for each tree may be too weak, since each tree can make different amounts of mistakes on unseen data with distribution shifts. To this end, we make the learning rates of the aggregated tree ensembles learnable by training a small one-layer 1D CNN whose kernel size and stride equal the number of trees in each client's tree ensemble. We use the prediction outcomes of the trees directly as inputs. This novel framework preserves privacy: the clients only need to send their constructed tree ensembles to the server, and the sharing of gradients and hessians, which may leak sensitive information, is not required. In addition, the number of communication rounds is independent of any hyperparameter of the trained XGBoost model. In practice, we find 10 communication rounds to be sufficient for the global federated XGBoost model to reach performance comparable to the state-of-the-art method. Moreover, the total communication overhead to train a global federated XGBoost model is independent of the dataset size. Our approach induces a total communication overhead lower than previous works by factors on the order of tens to hundreds.
The main contributions of this work are summarized as follows:
• We propose FedXGBllr, a novel privacy-preserving framework for federated XGBoost with learnable learning rates in the horizontal setting, which does not rely on the sharing of gradients and hessians.

PRELIMINARIES
EXtreme Gradient Boosting (XGBoost) XGBoost is a gradient boosting tree method and an additive ensemble model. It adopts forward stagewise regression and consistently learns new trees to fit the residuals until a stopping condition is met. Given a dataset $\{x_i, y_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}$ represent the features (with dimension $m$) and label of the $i$-th sample, the final prediction is calculated by summing the predictions of all $K$ trees with a fixed learning rate $\eta$:
$$\hat{y}_i = \sum_{k=1}^{K} \eta\, f_k(x_i),$$
where $f_k(x_i)$ is the prediction made by the $k$-th tree.
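As a concrete illustration, here is a minimal sketch of the additive prediction with a fixed learning rate; the toy lambdas stand in for fitted trees and are not the xgboost library API:

```python
import numpy as np

def ensemble_predict(trees, x, eta=0.1):
    """Additive XGBoost-style prediction: sum of all tree outputs scaled by a fixed eta."""
    return sum(eta * tree(x) for tree in trees)

# Toy stand-ins for fitted trees f_1, f_2, f_3.
trees = [lambda x: 1.0, lambda x: 0.5, lambda x: -0.2]
print(ensemble_predict(trees, np.array([0.3, 0.7])))  # 0.1 * (1.0 + 0.5 - 0.2) = 0.13
```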
The objective of XGBoost is to minimize the sum of the losses over all data samples. It first calculates the first-order gradient $g_i$ and second-order hessian $h_i$ of every sample:
$$g_i = \partial_{\hat{y}_i^{(t-1)}} \ell\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}} \ell\big(y_i, \hat{y}_i^{(t-1)}\big),$$
where $\hat{y}_i^{(t-1)}$ is the prediction made by the previous trees and $\ell(y_i, \hat{y}_i^{(t-1)})$ is the loss function. Then the gradient sums of the instance set $I_j$ on each node $j$ can be calculated by:
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i.$$
The optimal weight $w_j^*$ and objective $Obj^*$ are derived from the objective function involving regularization terms:
$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad Obj^* = -\frac{1}{2} \sum_{j=1}^{L} \frac{G_j^2}{H_j + \lambda} + \gamma L,$$
where $L$ is the number of leaf nodes and $\lambda$ and $\gamma$ are the regularization terms for the leaf weights and the number of leaves, respectively. From the root to the leaf nodes, the best split is found by maximizing the gain, i.e., the reduction of the objective achieved by the split:
$$\mathcal{G} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,$$
where $G_L$, $H_L$, $G_R$, and $H_R$ are the sums of the gradients and hessians of the data samples partitioned into the left and right branches based on the splitting point's feature constraint.
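To make the split criterion concrete, the following minimal sketch (our own helper, not part of the xgboost API) evaluates the gain of one candidate split under the squared-error loss, for which $g_i = \hat{y}_i^{(t-1)} - y_i$ and $h_i = 1$:

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of one candidate split under the regularized XGBoost objective."""
    def score(G, H):
        return G ** 2 / (H + lam)
    return 0.5 * (score(g_left.sum(), h_left.sum())
                  + score(g_right.sum(), h_right.sum())
                  - score(g_left.sum() + g_right.sum(),
                          h_left.sum() + h_right.sum())) - gamma

y = np.array([1.0, 1.0, 5.0, 5.0])
y_hat = np.zeros_like(y)           # prediction of the previous trees
g, h = y_hat - y, np.ones_like(y)  # squared-error loss: g_i = y_hat_i - y_i, h_i = 1
print(split_gain(g[:2], h[:2], g[2:], h[2:]))  # ~2.93: splitting {1,1} | {5,5} is worthwhile
```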

METHOD
In this section, we provide a detailed description of our approach. We first formulate our intuitions in Section 3.1. We then facilitate our intuitions in Section 3.2 and discuss how to learn the learning rates using the proposed, interpretable one-layer 1D CNN in Section 3.3. Finally, we develop the new framework FedXGBllr to train a federated XGBoost in Section 3.4.

Intuitions
A fixed learning rate is too weak. Local datasets of clients participating in FL can be heterogeneous (i.e., non-IID). A model trained on a client's local dataset converges to its local optima. When the model is sent to other clients and evaluated on their local datasets, it suffers from degraded performance because different clients' local optima are divergent. The adverse effects of data heterogeneity in FL on NN-based approaches are widely researched [14,16,18,20]. More recent works demonstrate that XGBoost also experiences deterioration in model performance with heterogeneous local datasets [8,12]. We argue that the core reason for this performance degradation, when a built XGBoost model is evaluated on unseen datasets with distribution shifts, is that each tree in the tree ensemble makes a different amount of mistakes.
Consider the example illustrated in Fig. 2. We have an XGBoost model consisting of $K$ trees in total, where $f_k$ denotes the $k$-th tree, $k = 1 \ldots K$. The XGBoost model is trained on the dataset $\{x_i^*, y_i^*\}_{i=1}^{n}$ for a regression task. We send this XGBoost model to two other clients and evaluate it on their respective local datasets, $D_1$ and $D_2$.
The prediction outcomes of the first three trees in the XGBoost tree ensemble on two data samples, $\{x^{D_1}, y^{D_1}\} \in D_1$ and $\{x^{D_2}, y^{D_2}\} \in D_2$, are labeled in Fig. 2. Their ground truths are equal such that $y^{D_1} = y^{D_2} = 100$. Since the local datasets $D_1$ and $D_2$ belong to two heterogeneous clients, the trees perform differently across the data samples. The first tree $f_1$ gives a good initial prediction for $x^{D_2}$ (110) but not for $x^{D_1}$ (60). The second and third trees $f_2$ and $f_3$, on the contrary, sufficiently correct the residuals left by $f_1$ for $x^{D_1}$ (30, 5) but not for $x^{D_2}$ (-1, 20). In this case, a fixed learning rate (e.g., $\eta = 0.3$) may be too weak because, ideally, we want a higher learning rate for $f_2(x^{D_1})$ and $f_3(x^{D_1})$ but a lower learning rate for $f_2(x^{D_2})$ and $f_3(x^{D_2})$.

Moving towards the global optima. As explained previously, data heterogeneity causes the XGBoost models trained on different clients' local datasets to converge to local optima that are far from each other. Consequently, given an unseen data sample, these XGBoost tree ensembles output different prediction results. However, among all XGBoost tree ensembles, some can give more accurate predictions because the unseen data sample may be closer to the underlying distribution of their training datasets. Thus, applying a weighted sum to the diverse prediction results given by all XGBoost tree ensembles can lead to a more accurate final prediction, helping us move towards the global optima.
It is important to point out that the approach of utilizing a weighted sum to converge towards the global optima has been proved effective in the previous literature. FedAvg [24] uses a weighted sum of the aggregated model parameters according to the number of data samples present in the clients' local datasets, and a large body of literature has given theoretical convergence guarantees for the method [14,19]. Later FL strategies such as FedProx [18] also adopt the weighted sum of aggregated model parameters.

Tree Ensembles Aggregation
Suppose there are $N$ clients participating in the training of the federated XGBoost, denoted as $(c_1, c_2, \ldots, c_N)$. All clients' local datasets have different sample IDs but the same feature dimension $m$. Each client trains an XGBoost tree ensemble consisting of $M$ trees using its local dataset, where $f_k^{(n)}$ denotes the $k$-th tree constructed by client $n$, $k = 1 \ldots M$ and $n = 1 \ldots N$. To facilitate our intuitions, the final prediction result for an arbitrary data sample with feature dimension $m$ is calculated by the weighted sum of all trees from all $N$ clients, as shown in Fig. 3. Each vertical tree chain is the tree ensemble built by one client, where $\eta_k^{(n)}$ is the learning rate assigned to $f_k^{(n)}$ and $w_n$ is the weight applied to the prediction result calculated by client $c_n$'s tree ensemble. We refer to this system as the aggregated tree ensemble. Both $\eta_k^{(n)}$ and $w_n$ are learnable, which will be revealed in Section 3.3.
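Spelled out in the notation above (this explicit form is our own rendering of Fig. 3), the aggregated tree ensemble's output for a data sample $x$ is:
$$\hat{y}(x) = \sum_{n=1}^{N} w_n \sum_{k=1}^{M} \eta_k^{(n)} f_k^{(n)}(x).$$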

Final prediction result
For all clients to calculate the final prediction result, each client needs to receive the aggregated tree ensemble with the help of the server. First, each client ensures that within its tree ensemble, all trees are sorted (i.e., if the tree ensemble is stored in an array, the $k$-th tree is at the $k$-th position). Then, as shown in Fig. 4(a), each client sends its built XGBoost tree ensemble and client ID ($ID = n$) to the server. The server sorts and concatenates all tree ensembles using the IDs such that the $n$-th tree ensemble is always adjacent to both the $(n-1)$-th and $(n+1)$-th tree ensembles, as illustrated by the input layer of Fig. 4(b). Finally, the server broadcasts the sorted, aggregated tree ensemble to every client.
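A minimal sketch of this server-side step (our own illustrative Python; the actual message transport is handled by the FL framework):

```python
def aggregate_tree_ensembles(client_payloads):
    """Round 0, server side: sort client ensembles by client ID and concatenate them.

    client_payloads: list of (client_id, tree_ensemble) pairs, where each
    tree_ensemble is that client's ordered list of M trees.
    """
    ordered = sorted(client_payloads, key=lambda p: p[0])  # sort by client ID
    aggregated = []
    for _, ensemble in ordered:
        # the n-th ensemble stays adjacent to the (n-1)-th and (n+1)-th ensembles
        aggregated.extend(ensemble)
    return aggregated  # broadcast back to every client
```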

Learnable Learning Rates by One-layer 1D CNN
We develop a method to learn the learning rate $\eta_k^{(n)}$ assigned to each tree $f_k^{(n)}$ by transforming the aggregated tree ensemble in Fig. 3 into a one-layer 1D CNN, as shown in Fig. 4(b). In the first 1D convolution layer, the inputs are the prediction outcomes of all trees, and the network output is the final prediction $\mathrm{FC}\big(\mathrm{Flatten}(\sigma(\mathrm{Conv1D}(\mathbf{p})))\big)$, where $\mathbf{p}$ stacks the prediction outcomes of all $N \times M$ trees and $\sigma$ is the chosen activation function.

Interpretability
The small-sized model is interpretable. The kernel size and stride of the 1D convolution are equal to the number of trees, $M$, in each client's tree ensemble. Thus, each channel of the 1D convolution holds the learnable learning rates ($\eta_k^{(n)}$) for all trees $f_k^{(n)}$ in the tree ensemble of a specific client $n$, and the number of convolution channels can be understood as the number of learning rate strategies that can be applied. The classification head, a fully connected (FC) layer, contains the weighting factors ($w_n$) that balance the prediction outcomes of each client's tree ensemble and calculate the final prediction result; it is also updated during training. The incentive for introducing the activation $\sigma$ is to avoid overfitting, because a portion of the learned strategies will be deactivated. We set $\sigma$ to the most widely used activation function, ReLU.
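The following is a minimal PyTorch sketch of such a model; the class and argument names are ours, and details such as bias terms are assumptions:

```python
import torch
import torch.nn as nn

class TreeEnsembleCNN(nn.Module):
    """One-layer 1D CNN over the concatenated per-tree predictions.

    Input shape: (batch, 1, N * M), i.e. the prediction outcomes of all M trees
    from each of the N clients, ordered by client ID.
    """
    def __init__(self, n_clients: int, n_trees_per_client: int, channels: int = 64):
        super().__init__()
        # kernel_size = stride = M: each output position covers exactly one client's
        # tree ensemble, so each channel corresponds to one learning-rate strategy.
        self.conv = nn.Conv1d(1, channels, kernel_size=n_trees_per_client,
                              stride=n_trees_per_client)
        self.act = nn.ReLU()
        self.flatten = nn.Flatten()
        # The FC head holds the weighting factors w_n across clients.
        self.head = nn.Linear(channels * n_clients, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.flatten(self.act(self.conv(x))))

# Example: 5 clients, 100 trees each, a batch of 8 samples.
model = TreeEnsembleCNN(n_clients=5, n_trees_per_client=100)
tree_preds = torch.randn(8, 1, 5 * 100)   # placeholder per-tree predictions
print(model(tree_preds).shape)            # torch.Size([8, 1])
```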

FedXGBllr
We introduce the new framework, FedXGBllr, to train a global federated XGBoost model by learning the learning rate for each tree with FL. The global federated XGBoost model consists of all clients' locally trained XGBoost tree ensembles and the globally trained one-layer 1D CNN. The detailed procedure is shown in Algorithm 1. At round 0 (lines 1 to 7), each client first trains its local XGBoost tree ensemble. The server then conducts tree ensemble aggregation and CNN initialization. After receiving the aggregated tree ensemble, all clients calculate the prediction outcomes of the aggregated tree ensemble on their local data samples. The calculated prediction outcomes are the inputs of the CNN. It is worth noting that the clients only build XGBoost models at round 0, and the aggregated tree ensemble is fixed after round 0. For the federated training of the one-layer 1D CNN from round 1 onwards (line 8), the protocol follows the standard FL algorithm, and we use FedAvg [24].
In FedXGBllr, the number of communication rounds is equal to the number of FL training rounds ($R$) because we only send the trees (at round 0) and the CNN's model parameters (from round 1 onwards).
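A sketch of the round-0 client-side steps, under our own assumptions; in particular, extracting per-tree outputs via `iteration_range` is our way of obtaining the prediction outcomes, not necessarily the authors':

```python
import numpy as np
import xgboost as xgb

def train_local_ensemble(X_local, y_local, n_trees, max_depth=8, eta=0.1):
    """Round 0, client side: fit the local XGBoost tree ensemble."""
    model = xgb.XGBRegressor(n_estimators=n_trees, max_depth=max_depth, learning_rate=eta)
    model.fit(X_local, y_local)
    return model

def tree_prediction_outcomes(aggregated_models, X_local):
    """Per-tree predictions on local samples; these become the CNN inputs."""
    per_tree = []
    for model in aggregated_models:                    # one model per client, sorted by ID
        for k in range(model.n_estimators):
            # prediction contribution of tree k alone
            per_tree.append(model.predict(X_local, iteration_range=(k, k + 1)))
    outcomes = np.stack(per_tree, axis=1)              # shape (n_samples, N * M)
    return outcomes[:, None, :]                        # add channel dim for the Conv1d
```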

EXPERIMENTS
In this section, we conduct extensive experiments to validate the effectiveness of our approach. We start by describing the experiment setup and implementation details in Section 4.1. We then discuss the experimental results with comparisons to the centralized baseline and the state-of-the-art method in Section 4.2. Finally, we provide ablation studies and analysis to justify the interpretability and low communication overhead of our approach in Section 4.3.

Experiment Setup and Implementations
Comparison methods. We benchmark our method against one of the state-of-the-art and most influential works on horizontal federated XGBoost, SimFL [17], which adopts locality-sensitive hashing in the context of FL. As opposed to our method, SimFL trains the global XGBoost model by sharing weighted gradients across rounds. We also use a centralized XGBoost trained on the whole dataset as the baseline.
Datasets. Following SimFL [17], we evaluate our method on the same six tabular classification datasets. We also conduct experiments on four tabular regression datasets. All datasets can be downloaded from the LIBSVM data website. The information of each dataset is summarized in Table 1. For all datasets, the training-to-test-set ratio is 0.75 : 0.25 with random shuffling. The test set is used as the global test set on the server side. We divide the training set equally according to the number of clients and assign the partitioned datasets to the clients as their local datasets. SimFL [17] only conducts experiments with 2 clients; we also provide results of our method with 5 and 10 clients.
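A sketch of the data preparation described above (the helper and its arguments are ours):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_partitions(X, y, n_clients, seed=0):
    """75/25 train/test split with shuffling, then an equal split of the training
    set across clients; the test set stays on the server as the global test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              shuffle=True, random_state=seed)
    order = np.random.default_rng(seed).permutation(len(X_tr))
    chunks = np.array_split(order, n_clients)
    clients = [(X_tr[idx], y_tr[idx]) for idx in chunks]
    return clients, (X_te, y_te)
```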

Evaluation metric
We report the performance on the classification and regression datasets using accuracy and mean squared error (MSE), respectively, which is common practice.

Implementation details
We use the Python package xgboost to train the local XGBoost models. Following SimFL [17], the maximum depth of all trees is set to 8. For our implementation, we set the number of trees in each tree ensemble to 500 divided by the number of clients. The initial learning rate ($\eta$) is the same for all trees and is set to 0.1. Note that $\eta$ is a fixed hyperparameter of XGBoost (explained in Section 2) and is not one of the learnable learning rates ($\eta_k^{(n)}$) that are refined by the one-layer 1D CNN (explained in Section 3.3). For the globally trained one-layer 1D CNN and the FL infrastructure, including clients and a server, we implement our method with PyTorch on top of Flower [2], an end-to-end FL framework. For the CNN, we employ Kaiming initialization [11] and set the number of convolution channels to 64. For each client, we train the CNN using Adam [15] with learning rate 0.001, $\beta_1$ momentum 0.5, and $\beta_2$ momentum 0.999. The number of local training epochs $E$ is set to 100 (see Section 4.2).
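A sketch of the corresponding CNN setup, reusing the `TreeEnsembleCNN` sketch from Section 3.3; applying Kaiming initialization only to the convolution weights is our assumption:

```python
import torch.nn as nn
import torch.optim as optim

model = TreeEnsembleCNN(n_clients=5, n_trees_per_client=100)   # 500 trees / 5 clients

# Kaiming initialization for the convolution weights.
nn.init.kaiming_normal_(model.conv.weight)

# Local client-side optimizer with the stated hyperparameters.
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.5, 0.999))
```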

Experimental Results
Table 2 presents the quantitative results of FedXGBllr.
The number of communication rounds ($R$) is set to 10. For all experiments, we report the average over 5 runs. From the results, our approach outperforms or reaches accuracy comparable to SimFL [17] and the centralized baseline on all six classification datasets with 2 clients. For the regression datasets, our approach achieves MSE comparable to or slightly higher than the centralized baseline.
For both the classification and regression datasets, our method performs better on larger datasets. We hypothesize that this is due to the generalization capability of the CNN scaling up with the volume of data. Additionally, as the number of clients increases from 2 to 5 and 10, the performance slightly decreases. We consider this reasonable because FL becomes harder with more clients [13].
The results suggest that 10 rounds are sufficient for FedXGBllr to build a good global federated XGBoost model. However, it is worth mentioning that the number of rounds needed to reach good performance correlates with the number of local epochs ($E$) used to train the one-layer 1D CNN on the client side. A higher $E$ may require fewer communication rounds (consider the extreme case of $R = 1$). Our implementation uses $E = 100$; we leave the optimal trade-off between $R$ and $E$ for future studies.

Ablation Studies and Analysis
Communication overhead. We compare the total communication overhead required to build a global federated XGBoost model with our approach against the baseline, SimFL [17]. For all comparisons, we assume the number of clients $N$ to be 10 and the total number of built XGBoost trees $T$ to be 500 with a depth $d$ of 8, in order to be consistent with SimFL's efficiency experiments. The communication overhead of our approach is independent of the dataset size $n$ and can be expressed as
$$2 \times T \times S_{tree} + 2 \times R \times S_{CNN},$$
where $R$ is the number of FL training rounds, $S_{tree}$ is the size of each tree in bytes, and $S_{CNN}$ is the size of the one-layer 1D CNN in bytes. The term $2 \times T \times S_{tree}$ is the communication overhead of the tree ensembles aggregation at round 0, and $2 \times R \times S_{CNN}$ is the communication overhead of the federated training of the CNN from round 1 to $R$. We assume $R$ to be 10 because this number is sufficient for our approach to reach good performance, as explained in Section 4.2. $S_{CNN}$ is 0.03 MB (Table 5). In practice, $S_{tree}$ is negligible, as the size of the 500 trees built by the xgboost package is only 48 bytes. The communication overhead of SimFL [17], by contrast, grows linearly with the dataset size $n$: every instance is hashed with $h$ hash functions during preprocessing, and per-tree information proportional to the $(2^d - 1)$ tree nodes is exchanged among the parties during training. Table 3 shows the comparison of the total communication overhead. We include results on the six classification datasets because SimFL [17] provides exact values for them. The communication overhead of our approach is significantly lower, especially as the dataset size scales up: we reduce the communication cost by a factor of at least 25, and by up to a factor of 700.
Although we do not compare exact numbers, our communication overhead is also significantly lower than that of encryption-based methods such as FedXGB [22], whose training shares encryption keys and whose communication cost scales linearly with both the input size and the number of clients.
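A back-of-the-envelope sketch of the overhead expression above; the per-tree byte size below is a placeholder, not a measured value:

```python
MB = 1024 ** 2

def fedxgbllr_overhead_bytes(n_trees_total, tree_bytes, n_rounds, cnn_bytes):
    # 2 * T * S_tree: upload plus broadcast of the tree ensembles at round 0
    # 2 * R * S_CNN : upload plus broadcast of the CNN parameters per FL round
    return 2 * n_trees_total * tree_bytes + 2 * n_rounds * cnn_bytes

total = fedxgbllr_overhead_bytes(n_trees_total=500, tree_bytes=100,   # hypothetical tree size
                                 n_rounds=10, cnn_bytes=0.03 * MB)
print(total / MB)   # ~0.7 MB; the dataset size n never enters the expression
```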

Model interpretability
We want to know whether the interpretability of our one-layer 1D CNN goes hand in hand with its high performance. We replace the first 1D convolution layer, whose kernel size and stride equal the number of trees in each client's tree ensemble, with: 1) a standard convolution with kernel size 3 and stride 1, and 2) an FC layer with dimension 256 (removing the flatten layer). The number of communication rounds is set to 10. We fix the number of clients to 5. The results are shown in Table 4. We also report the number of parameters and the total size of each model in Table 5.
From the results, our one-layer 1D CNN achieves the best performance on all datasets despite having the smallest number of parameters and total size. This suggests the effectiveness and interpretability of our CNN model. We find that, for all datasets, the performance gap between our interpretable CNN and the 2-layer FCNN is much larger than the gap between our interpretable CNN and the CNN with a standard kernel size and stride. Also, the gap widens as the dataset size increases. We argue that, in addition to our reasoning in Section 3.3, this is because the CNN can leverage the temporal information across the tree ensembles built by the clients, and our interpretable CNN has the right amount of temporal resolution (i.e., kernel size = stride).

CONCLUSION AND FUTURE WORKS
We propose a novel framework, FedXGBllr, for horizontal federated XGBoost that does not rely on the sharing of gradients and hessians. Extensive evaluations show that our approach is robust, interpretable, and communication efficient. Specifically, we reach performance comparable to the state-of-the-art method and reduce the communication cost by factors ranging from 25x to 700x. We use FedAvg [24] in this work to learn the learnable learning rates by training a small one-layer 1D CNN. It is important to point out that more advanced FL training algorithms can also be applied and better performance may be achieved; we leave this for future studies as it is not the focus of this research. Future work also includes extending FedXGBllr to the vertical setting.

Figure 3: The aggregated tree ensemble. The final prediction is given by the weighted sum of all trees.

Figure 4: The pipeline. (a) Tree ensembles aggregation and (b) the one-layer 1D CNN that learns the learning rates and outputs the final prediction result.

Figure 2: An example of the impact of local data heterogeneity on the performance of an XGBoost model.

Table 1: Summary of datasets.

Table 2: Quantitative results of FedXGBllr compared to SimFL and the centralized baseline: Accuracy ↑ for the first six classification datasets, MSE ↓ for the last four regression datasets.