Local Boosting for Weakly-Supervised Learning

Boosting is a commonly used technique to enhance the performance of a set of base models by combining them into a strong ensemble model. Though widely adopted, boosting is typically used in supervised learning where the data is labeled accurately. However, in weakly supervised learning, where most of the data is labeled through weak and noisy sources, it remains nontrivial to design effective boosting approaches. In this work, we show that the standard implementation of the convex combination of base learners can hardly work due to the presence of noisy labels. Instead, we propose $\textit{LocalBoost}$, a novel framework for weakly-supervised boosting. LocalBoost iteratively boosts the ensemble model from two dimensions, i.e., intra-source and inter-source. The intra-source boosting introduces locality to the base learners and enables each base learner to focus on a particular feature regime by training new base learners on granularity-varying error regions. For the inter-source boosting, we leverage a conditional function to indicate the weak source where the sample is more likely to appear. To account for the weak labels, we further design an estimate-then-modify approach to compute the model weights. Experiments on seven datasets show that our method significantly outperforms vanilla boosting methods and other weakly-supervised methods.


INTRODUCTION
Weakly-supervised learning (WSL) has gained significant attention as a solution to the challenge of label scarcity in machine learning. WSL leverages weak supervision signals, such as labeling functions or other models, to generate a large amount of weakly labeled data, which is easier to obtain than complete annotations. Despite achieving promising results in various tasks including text classification [1], sequence tagging [2], and e-commerce [3], an empirical study [4] reveals that even state-of-the-art WSL methods still underperform fully-supervised methods by significant margins, with an average performance discrepancy of 18.84% measured by accuracy or F1 score.
On the other hand, boosting is one of the most commonly used approaches to enhance the performance of machine learning models by combining multiple base models [5][6][7][8][9]. For example, AdaBoost [5] dynamically adjusts the importance weight of each training example to learn multiple base models and uses a weighted combination to aggregate these base models' predictions. XGBoost [8] iteratively computes the gradients and Hessians defined on a clean training set to fit base learners and combines their predictions via weighted summation. Despite their encouraging performance, these boosting algorithms usually assume the availability of a clean labeled dataset. In WSL, however, the imperfect supervision signals interfere with the reweighting of training data importance, which further prevents us from computing an accurate weight for each base learner. If we naively apply these supervised boosting methods to a weakly labeled dataset, we observe a phenomenon called "weight domination", where the assigned weight of the initial base model is too large and dominates the ensemble model prediction, as shown in Figure 1.
A key challenge of adapting boosting methods to the WSL setting is to accurately compute the importance of each example in the weakly labeled training data for each base learner. Previously, when a clean dataset is provided, the goal of the data importance reweighting process is to prioritize instances with large errors for subsequent base learner training. This effectively localizes the base learner to the error region in the label space. However, in WSL, the noisy labels hinder the accurate identification of error instances, and thus we need to shift our focus from the label space to the training data space. One potential approach is to partition the weakly-labeled training data into subsets and construct a mixture of expert models (MoE) [10] where each expert is localized for one training data subset. Along this line, Tsai et al. [11] propose partitioning the unlabeled dataset into latent semantic subsets and using multiple expert models to discriminate instances. However, this approach assumes the input data naturally reside in a homogeneous feature space and requires a hyper-parameter search to appropriately localize the expert models. Additionally, the off-the-shelf clusters do not adapt during the learning process, which conflicts with the philosophy of boosting methods.
We investigate the problem of boosting in the context of weakly supervised learning, where most of the data are labeled by weak sources and only a limited number of data points have accurate labels. To address the difficulties posed by this setting, we introduce LocalBoost, a novel iterative and adaptive framework for WSL boosting. LocalBoost retains the essential concepts of the traditional boosting approach while incorporating adaptations specifically designed for the WSL scenario, described as follows:
• Base Learner Locality. Motivated by the challenges posed by the data reweighting approach in AdaBoost for WSL and the limitations of hard clustering in MoE methods, we propose a new approach to base learner localization. In AdaBoost, large-error instances are assigned larger weights for model training; however, this approach does not account for the fact that error instances exist in multiple feature regimes that are difficult to capture with weak labels. Additionally, the rigid clusters and fixed expert models in MoE cannot adapt during the iterative learning process, hindering the framework's ability to dynamically target weak feature regimes and build upon preceding models.
To address these issues, our proposed framework LocalBoost assigns base learners to adaptively updated local regions in the embedding space, thereby introducing locality to the base learners.
• Two-dimension Boosting. Effective aggregation of localized base learners in WSL goes beyond the simple convex combination used in supervised settings (as shown in Sec. 4.2). To account for potential label noise from weak sources, we aim to learn multiple complementary base learners in LocalBoost. To fulfill this goal, we introduce a weighting function to compute the conditional probability of the weak sources that are more likely to annotate a given data instance. We further design a two-dimensional boosting framework in LocalBoost, where inter-source boosting and intra-source boosting are performed alternately. The former improves the base learners within a given weak source, while the latter complements the base learners with additional models trained from other weak sources.
• Interactions between Weak and Clean Labels. We incorporate the interactions between weak and clean labels into the LocalBoost framework in two steps: (1) We compute a mapping between the small clean dataset and the large weakly labeled dataset to localize base learners in the data embedding space. We first identify the errors made by the current model ensemble, and then sample corresponding clusters from the large weak dataset to form the training set for the next base learner. (2) We propose a novel estimate-then-modify approach for computing base learner weights. Initially, the weights are estimated on the large weakly labeled dataset. Then, we refine these estimates by generating multiple perturbations of the model weights and selecting the one that results in the lowest error rate on the small clean dataset as the modified weights.
We evaluate LocalBoost on seven datasets including sentiment analysis, topic classification, and relation classification from WRENCH [4], the standard benchmark for weakly supervised learning. The results indicate that LocalBoost achieves superior performance compared with other state-of-the-art methods. Moreover, our analysis further confirms the effectiveness of boosting in two dimensions and of incorporating interactions between weak and clean labels. We summarize our key contributions as follows: (1) We present LocalBoost, a novel weakly-supervised boosting framework that implements progressive inter-source and intra-source boosting. (2) We incorporate explicit locality into the base learners of the boosting framework, allowing them to specialize in finer-grained data regions and perform well in specific feature regimes. (3) We leverage the interactions between weak and clean labels for effective base learner localization and weight estimation. (4) We conduct extensive experiments on seven benchmark datasets and demonstrate the superiority of LocalBoost over WSL and ensemble baselines.

RELATED WORK
Weakly Supervised Learning. Weakly supervised learning (WSL) focuses on training machine learning models with a variety of weaker, often programmatic supervision sources [4]. Specifically, in WSL, users provide weak supervision sources, e.g., heuristic rules [12,13], knowledge bases [14], and pre-trained models [15][16][17], in the form of labeling functions (LFs), which provide labels for some subset of the data to create the training set. The main challenge in WSL is incorrect or conflicting labels produced by labeling functions [18]. To address this, recent research has explored two solutions. The first is to create label models, which aggregate the noisy votes of labeling functions under different dependency assumptions. This approach has been used in various studies, including [19][20][21][22][23]. The second solution involves designing end models that use noise-robust learning techniques to prevent overfitting to label noise, as seen in studies by [24][25][26]. The current work belongs to the first category.
For label models, several recent works attempt to better aggregate the noisy labeling functions by better modeling the distribution between LFs and ground-truth labels [20,21]. Zhang et al. [22] incorporate instance features with probabilistic models, while [27][28][29] learn the label aggregation using supervision from the target task. Besides, [2,30,31] adapt WSL to a broader range of applications such as regression and structured prediction, and [32][33][34][35] design active learning approaches to strategically solicit human feedback to refine weak supervision models.
So far, only a few works have attempted to integrate boosting or ensemble methods with WSL. Guan et al. [36] learn an individual weight for each labeler and aggregate the labelers' results for prediction. Zhang et al. [17] leverage a boosting-style ensemble strategy to identify difficult instances and adaptively solicit new rules from human annotators, and Zhang et al. [3] leverage rules from multiple views to further enhance performance. Very recently, Zhao et al. [37] use mixture-of-experts (MoE) to route noisy labeling functions to different experts for specialized and scalable learning. However, these works directly combine existing ensembling techniques with noisy labels. Instead, we first identify the key drawback of adopting boosting for WSL (Fig. 1) and then design two-dimension boosting to resolve this issue, which provides a more effective and flexible way to learn with multi-source weak labels.
Boosting. Boosting methods are widely used to improve machine learning models via the combination of base learners. Research in this area began in the last century [5], and numerous variants have been proposed [7,8,38,39]. AdaBoost [5] incorporates base learners into the model ensemble via iterative data reweighting and model weight computation that minimizes the total error on the training set, improving the model ensemble for binary classification. XGBoost [8] presents an advanced implementation of boosting that uses a more regularized model formalization to control overfitting, and it has the virtue of being accurate and fast. However, these methods are discussed in the context of fully-supervised learning, and their principal derivations and implementations rely on clean data. This causes issues when adapting them to weakly-supervised settings, as it is challenging to obtain reliable computations from weakly labeled data.
Recent works explore multi-class boosting [40,41] and applications of boosting [42][43][44][45]. Brukhim et al. [40] study the resources required for boosting and show how the learning cost of boosting depends on the number of classes. Zhang et al. [42] explore boosting in the context of adversarial robustness and propose a robust ensemble approach via margin boosting. For applications with deep neural networks, Taherkhani et al. [46] integrate AdaBoost with a convolutional neural network to deal with data imbalance. Among these works, the most related to ours is MultiBoost [41], where the authors study boosting in the presence of multiple source domains. They put forward a Q-ensemble to compute the conditional probability of different domains given the input data. This formulation can be traced back to multi-source learning in fully supervised settings [47][48][49]. In this work, we draw inspiration from this formulation and present an adaptation to the WSL setting: we design a weighting function to compute the conditional probability of weak sources. In this way, we modulate the base learners toward highly relevant weak sources and present a two-dimension boosting accordingly.

PRELIMINARIES
Let $\mathcal{X}$ denote the input space and $\mathcal{Y} = \{1, \cdots, C\}$ represent the output space, where $C$ is the number of classes. We have a large weakly labeled dataset $D_w$ and a small clean dataset $D_c$. The weak labels of $D_w$ are generated by a set of weak sources $\mathcal{R}$, and the number of weak sources is $|\mathcal{R}| = m$.
Definition 3.1 (Weak sources). In the context of WSL, the weak sources refer to labeling functions (LFs), which are constructed via keywords or semantic rules. In this work, we use the terms "weak source" and "labeling function" interchangeably. Given an unlabeled data sample $x_i$, a weak source $r(\cdot)$ maps it into the label space: $r(x_i) \to y \in \mathcal{Y} \cup \{0\}$. Here $\mathcal{Y}$ is the original label set for the task and $\{0\}$ is a special label indicating that $x_i$ is unmatchable by $r(\cdot)$. Given a set of $n$ samples and $m$ labeling functions, we can obtain an LF matching matrix $M \in \{0,1\}^{n \times m}$, where each entry $M_{i,j} \in \{0, 1\}$ denotes whether the $i$-th sample is matched by the $j$-th LF.
Definition 3.2 (Base learner). Let $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ be $n$ i.i.d. samples drawn from $\mathbb{P}$, and let $\mathbb{P}_S$ be the empirical distribution of $S$. We let $h: \mathcal{X} \to \mathcal{Y}$ denote a base learner that predicts the label of $\mathbf{x}$ as $h(\mathbf{x})$. For a base learner $h$ and a distribution $S$, we denote the expected loss of $h$ as $\mathcal{L}(S, h) = \mathbb{E}_{(x,y) \sim S}[\ell(h(x), y)]$, where $\ell$ is the cross-entropy loss.
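As a minimal sketch of the LF matching matrix from Definition 3.1, the snippet below builds $M$ from a list of toy keyword labeling functions; the keywords and label ids are illustrative assumptions, not LFs from the paper's benchmarks.

```python
from typing import Callable, List

ABSTAIN = 0  # the special label {0}: the LF does not match the sample

def lf_matrix(samples: List[str], lfs: List[Callable[[str], int]]) -> List[List[int]]:
    """Build the n x m LF matching matrix M, where M[i][j] = 1
    iff the j-th labeling function matches the i-th sample."""
    return [[0 if lf(x) == ABSTAIN else 1 for lf in lfs] for x in samples]

# Toy keyword LFs for binary sentiment (1 = positive, 2 = negative).
lf_pos = lambda x: 1 if "good" in x else ABSTAIN
lf_neg = lambda x: 2 if "bad" in x else ABSTAIN

# The third sample is matched by neither LF, so its row is all zeros.
M = lf_matrix(["good movie", "bad plot", "average film"], [lf_pos, lf_neg])
```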
Given a hypothesis set $\mathcal{H}$ of base learners $h: \mathcal{X} \to \mathbb{R}^C$, $\forall h \in \mathcal{H}$, we can form a convex combination of the base learners via a set of real-valued weights $\alpha$ over $\mathcal{H}$, and define the weighted ensemble $f_\alpha$ as $f_\alpha(\mathbf{x}) = \sum_{h \in \mathcal{H}} \alpha(h)\, h(\mathbf{x})$, where $\alpha(h) \in \mathbb{R}$ is the weight of the base learner $h$.
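The convex combination above can be sketched in a few lines; the two constant toy learners are illustrative assumptions.

```python
def convex_combination(x, learners, alphas):
    """f_alpha(x) = sum over h of alpha(h) * h(x), where each base
    learner h returns a score vector over the C classes."""
    C = len(learners[0](x))
    return [sum(a * h(x)[c] for h, a in zip(learners, alphas))
            for c in range(C)]

# Two constant toy learners over C = 2 classes, weighted 0.75 / 0.25.
h1 = lambda x: [1.0, 0.0]
h2 = lambda x: [0.0, 1.0]
scores = convex_combination(None, [h1, h2], [0.75, 0.25])
```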
Problem Formulation. Given a large weakly labeled dataset $D_w$, a small clean dataset $D_c$, and $m$ weak sources, we aim to iteratively obtain base learners $h_{j,t}$ to boost the performance of the ensemble model $F(\mathbf{x})$.

METHODOLOGY

Learning Procedure Overview
LocalBoost iteratively implements inter-source boosting and intra-source boosting in the WSL setting. For weak sources $j: 1, \cdots, m$ and iterations $t: 1, \cdots, T$, we have a base learner $h_{j,t}(\cdot)$ with a model weight $\alpha_{j,t}$. As we will illustrate in Sec. 4.2, the straightforward convex combination may not work in the WSL setting. To this end, we introduce a conditional probability function $Q(j \mid x)$ to account for the fact that the weak labels are generated from different weak sources. For a given sample $\mathbf{x}$, the base learners from the weak sources where $\mathbf{x}$ is more likely to be labeled are allocated higher weights in the weighted combination.

Algorithm 1 Pseudo-code of LocalBoost Framework
Require: Large weakly-labeled dataset $D_w$, small clean dataset $D_c$
Given a large weakly labeled dataset $D_w$ and a small clean dataset $D_c$, the overall learning procedure runs as follows. In iteration $(j, t)$, the preceding ensemble model $F_{j-1,t}(\mathbf{x})$ or $F_{j,t-1}(\mathbf{x})$ has accumulated errors on $D_c$; based on the large-error instances $e_1, \cdots, e_K$, we form a local training set $S_{j,t}$ and fit $h_{j,t}(\mathbf{x})$ on $S_{j,t}$. Next, we estimate the model weight $\alpha_{j,t}$ on $D_w$ and modify it on $D_c$. After $T$ iterations, we obtain the final ensemble $F(\mathbf{x}) = \sum_{j=1}^{m} \sum_{t=1}^{T} \alpha_{j,t}\, Q(j \mid \mathbf{x})\, h_{j,t}(\mathbf{x})$. The key components presented in the above model ensemble, including the model weights $\alpha_{j,t}$, the conditional function $Q(j \mid x)$, and the local base learners $h_{j,t}(\cdot)$, are specially designed for the WSL setting. Different from the formulations of AdaBoost [5] and the Q-ensemble [41,49], which are solely based on clean labels, LocalBoost relieves such reliance on fully accurate supervision via the interaction between weak labels and clean labels, and localizes the base learners to adaptively complement the preceding ensemble. The learning algorithm is presented in Algorithm 1. In the initialization stage, we prepare a source-index dataset $D_s$ using the LF matching matrix (Sec. 4.2) to learn the conditional function, and fit the first base learner on the large weakly labeled dataset $D_w$. Then we iteratively perform the two-dimension boosting. The inner loop over $m$ weak sources corresponds to the inter-source boosting, while the outer loop over $T$ iterations corresponds to the intra-source boosting. In each iteration $(j, t)$, we start from the ensemble inference on $D_c$ to identify the accumulated large-error instances $e_1, \cdots, e_K$. These error instances, based on the ground-truth labels, can accurately reflect the feature regimes where the current ensemble performs poorly.
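The source-conditioned ensemble prediction can be sketched as follows; the two constant toy learners and the hard 0/1 conditional function are illustrative assumptions (the paper learns both from data).

```python
def ensemble(x, learners, alphas, Q, C):
    """Source-conditioned ensemble: sum over (j, t) of
    alpha_{j,t} * Q(j|x) * h_{j,t}(x). Each base learner's score vector
    is scaled by its boosting weight and by the probability that weak
    source j would label x. `learners` is a list of (j, h) pairs."""
    scores = [0.0] * C
    for (j, h), a in zip(learners, alphas):
        q = Q(j, x)
        hx = h(x)
        for c in range(C):
            scores[c] += a * q * hx[c]
    return max(range(C), key=scores.__getitem__)

# Toy setup: source 0 labels x < 0, source 1 labels x >= 0; each base
# learner is only reliable on its own source's region.
h0 = lambda x: [1.0, 0.0]   # always votes class 0
h1 = lambda x: [0.0, 1.0]   # always votes class 1
Q = lambda j, x: 1.0 if (x < 0) == (j == 0) else 0.0
learners, alphas = [(0, h0), (1, h1)], [1.0, 1.0]
```

With this hard conditional function, each input is routed to the learner of its own source, so the ensemble is correct on both regions even though neither base learner is globally correct.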
Fig. 2 presents an illustrative visualization of the base learner localization. To localize the subsequent base learners, we sample $K$ clusters from the large weakly labeled dataset $D_w$ based on the identified large-error instances $e_1, \cdots, e_K$. In other words, these instances from $D_c$ guide the localization of the base learner via a mapping between the instances in the small clean dataset and the regions in the large weakly labeled dataset. To complement the previous ensemble, we fit the new base learner $h_{j,t}(\cdot)$ on the dataset $D_{j,t}$ consisting of instances from the sampled clusters. For the model weight computation, we propose an estimate-then-modify paradigm (as shown in Fig. 3) to leverage both the weak labels and clean labels. On the large weak dataset $D_w$, we retain the AdaBoost principles to compute the weighted error and yield an estimate of $\alpha_{j,t}$. Since the labels are obtained from weak sources, the estimated $\alpha_{j,t}$ on $D_w$ can hardly guarantee the boosting progress, so we further modify it on the small clean validation dataset $D_c$. Specifically, we generate a group of perturbed weight vectors $V_{j,t} = \{v_{j,t}\}$ and compute the weighted error on $D_c$. Among the perturbations, we select the one that achieves the lowest error as the modified model weights.
In the following sections, we will first introduce the conditional function designed for the weak sources (Section 4.2), then illustrate how we introduce locality to base learners (Section 4.3), and finally discuss the estimate-then-modify paradigm for the computation of the model weights in WSL settings (Section 4.4).

Conditional Function for Weak Sources Localization
In the WSL setting, the weak labels are generated by weak sources such as labeling functions. We first pinpoint that the standard convex combination of the base learners can lead to a poor ensemble without taking the weak sources into account. Being aware of this, we propose to use a conditional function to account for the weak sources. Given an input instance, the conditional function represents the probability of the instance being labeled by each weak source. In this way, the ensemble model is modulated by this conditional function, while being weighted by a series of base learner weights, to deal with the weak sources in the WSL setting.
Proposition 1. There exist weak sources $r_1$ and $r_2$ with corresponding distributions $\mathcal{D}_1$ and $\mathcal{D}_2$, and base learners $h_1$ and $h_2$ with $\mathcal{L}(\mathcal{D}_1, h_1) = \mathcal{L}(\mathcal{D}_2, h_2) = 0$, such that any convex combination $f_\alpha$ of $h_1$ and $h_2$ incurs a nonzero loss on the mixture of $\mathcal{D}_1$ and $\mathcal{D}_2$, where $\alpha$ is the model weights in the ensemble and $\mathcal{L}(\cdot)$ quantifies the loss.
The proof is given in Appendix C. The above proposition states that even if the base learners fit well to the weakly labeled data provided by the weak sources, their convex combination can still perform poorly. Therefore, we consider a new ensemble form to account for the presence of weak sources. Inspired by [41,[47][48][49][50], we introduce a conditional function to compute the probability of a sample being labeled by each weak source. The major difference is that the mentioned works discuss the conditional probability in the context of domain adaptation or multi-source learning and are designed for fully supervised settings; we instead present a conditional function to account for the weak sources in the WSL setting to modulate the base learners, and naturally introduce the inter-source boosting.
Proposition 2. By plugging a conditional function regarding the weak sources into the standard convex combination, there exists an ensemble $f$ in the WSL setting such that $\mathcal{L}(\mathcal{D}, f) = 0$, where $\mathcal{D}$ is a mixture of the weak sources such that $\mathcal{D} = \sum_j \lambda_j \mathcal{D}_j$.
Proof. Consider the case of a conditional function indicating the matching between the samples and the weak sources: $Q(j \mid x) = 1$ if $x$ is labeled by source $r_j$, and $Q(j \mid x) = 0$ otherwise. Plugging it into the standard convex combination, we get the ensemble $f(x) = \sum_{j} Q(j \mid x)\, h_j(x)$. Then the ensemble admits no loss for the case mentioned in Prop. 1: $\mathcal{L}(\mathcal{D}, f) = \sum_j \lambda_j \mathcal{L}(\mathcal{D}_j, h_j) = 0$.

□
In practice, we use an MLP to learn the conditional function on a source-index dataset $D_s$ using the instance features. Given $m$ weak sources and $n_u$ unlabeled data samples, we can easily construct a matching matrix of shape $n_u \times m$ to represent the matching results, where each entry takes a binary value from $\{0, 1\}$ to indicate whether the sample is matched by a specific weak source. For the $m$ weak sources, the matched entries thus yield source indices that serve as labels for training the conditional function. In the inference stage, the advantage of the learned conditional function is more evident compared to direct LF matching. As a straightforward substitution for the learned conditional function, we could implement LF matching for each test sample to modulate the base learners toward specific weak sources. The major problem with such hard matching is the potential labeling conflict among multiple LFs, as it is often quite challenging to identify the correct labeling functions based on voting or aggregation [19,20] approaches with the matching matrix only. The learned conditional function, in contrast, can generalize better than hard matching after training on the source-index dataset. By outputting a probability vector, it assigns different weights to the base learners according to their source relevance and thus enables better modulation.
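The construction of the source-index dataset and the training of the conditional function can be sketched as below; a softmax regression over 1-D features stands in for the paper's MLP, and the toy data and hyperparameters are illustrative assumptions.

```python
import math

def build_source_index_data(X, M):
    """Each matched (sample, source) pair in the LF matching matrix
    becomes one training example with the source index as its label."""
    return [(x, j) for x, row in zip(X, M) for j, hit in enumerate(row) if hit]

def train_q(data, m, lr=0.5, epochs=200):
    """Softmax regression over scalar features: a minimal stand-in for
    the MLP conditional function Q(j|x)."""
    w, b = [0.0] * m, [0.0] * m

    def probs(x):
        z = [w[k] * x + b[k] for k in range(m)]
        mx = max(z)
        e = [math.exp(v - mx) for v in z]
        s = sum(e)
        return [v / s for v in e]

    for _ in range(epochs):
        for x, j in data:
            p = probs(x)
            for k in range(m):
                g = p[k] - (1.0 if k == j else 0.0)  # softmax CE gradient
                w[k] -= lr * g * x
                b[k] -= lr * g
    return lambda j, x: probs(x)[j]

# Source 0 matches negative inputs, source 1 positive inputs.
X = [-2.0, -1.0, 1.0, 2.0]
M = [[1, 0], [1, 0], [0, 1], [0, 1]]
Q = train_q(build_source_index_data(X, M), m=2)
```

After training, `Q(0, x)` is high for negative inputs and `Q(1, x)` for positive ones, i.e., the learned function softly routes unseen samples to their most relevant source.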

Base Learner Localization
The core idea of LocalBoost is to allocate the base learners to local regions in the input embedding space. To illustrate this point clearly, we present a 2D visualization in Fig. 2. Such locality differs both from the focus on large-error data in supervised boosting, which does not explicitly account for information from the feature space, and from the task-agnostic pre-clustering in the MoE approach. Traditional boosting approaches such as AdaBoost iteratively compute the weighted error and train each base learner on a reweighted training set. The issues are twofold in the WSL setting: 1) First, there are not enough clean data, and the noisy labels in the weakly labeled dataset cannot underpin an accurate estimation of the weighted error. 2) Second, the base learner is granted access to the entire training set. Although the training set has been reweighted to emphasize the accumulated errors, the imperfect weak labels from multiple feature regimes pose specific challenges for base learner fitting, as the boosted model still easily overfits the noise. For the MoE approach, although it constructs clusters in the embedding space and deploys expert models to learn specialized patterns, the clusters are often assigned in a static way based on instance features [51]. Such a rigid scheme fails to explicitly model the training dynamics of the existing base learners, often resulting in suboptimal performance.
To this end, we harness the small clean validation set to guide the localization of the base learners, introducing locality via a mapping between error instances and error regions. We first identify the large-error instances on the small clean dataset $D_c$, then sample clusters on $D_w$ based on the identified instances. Denote the model ensemble at iteration $(j, t)$ as $F_{j,t}(\cdot)$, and its prediction vector as $G(F_{j,t}(\cdot))$. We maintain an error matrix $E_{j,t}$ to record the accumulated error, initialized as $E_{j,t} \leftarrow [0]^{n_c}$ at the beginning, where $n_c = |D_c|$. By the ensemble inference on $D_c$, we update $E_{j,t}$ entrywise as $E_{j,t}[i] \leftarrow E_{j,t}[i] + \mathbb{1}[G(F_{j,t}(x_i)) \neq y_i]$, so the entries with accumulated errors get larger in the iterative process. Then we pick the top-$K$ error instances $e_1, \cdots, e_K$ with the largest accumulated errors. Based on these identified error instances, we go back to $D_w$ and sample $K$ clusters with a hyper-radius inversely proportional to the accumulated error, $\rho_k = c_1 \bar{d} / E_{j,t}[e_k]$, where $c_1$ is a parameter and $\bar{d}$ is the average distance of all data samples in $D_w$. In this way, we form a training set $D_{j,t}$, which is the local region for fitting the to-be-added base learner $h_{j,t}$. We deploy the base learner $h_{j,t}$ on the local region $D_{j,t}$ and fit it by optimizing $\min_{h_{j,t}} \sum_{x_i \in D_{j,t}} \ell_{\text{CE}}(h_{j,t}(x_i), \hat{y}_i)$, where $\hat{y}_i$ is the weak label for instance $x_i$ and $\ell_{\text{CE}}$ is the cross-entropy loss. The above process reflects the interactive nature of LocalBoost, i.e., the error instances identified on the clean dataset guide the base learner training on the weakly labeled dataset. If we used the clean dataset alone, the limited number of samples would be insufficient to support model fitting [40]. If we used the weakly labeled dataset alone, we could not distinguish false positives when updating the error matrix due to the presence of noisy labels. Instead, our approach first accurately identifies the large-error instances on the small clean dataset $D_c$, then samples regions on the large weakly labeled dataset $D_w$ based on these instances to gather sufficient supervision for base learner training. Compared to the data reweighting approach in
supervised boosting [5], LocalBoost targets only local regions in each iteration so that the base learners can be trained on more specific feature regimes, which is suitable for the WSL setting because it is more difficult to learn directly from imperfect noisy labels. Besides, LocalBoost explicitly inherits the boosting philosophy compared to the MoE approach: the local regions evolve iteratively and adaptively to reflect the weaknesses of the preceding ensemble, so that each new base learner serves as a complement.
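The localization step (top-$K$ error instances, then clusters with error-dependent radii) can be sketched as follows; 1-D scalar "embeddings" and the concrete $c_1$, $\bar{d}$ values are illustrative assumptions.

```python
def top_k_errors(E, k):
    """Indices of the k clean instances with the largest accumulated
    error counts."""
    return sorted(range(len(E)), key=lambda i: -E[i])[:k]

def sample_cluster(weak_feats, center, radius):
    """All weakly labeled points within `radius` of the error instance's
    embedding form one cluster of the local training region."""
    return [i for i, f in enumerate(weak_feats) if abs(f - center) <= radius]

# Accumulated errors on a 5-sample clean set; pick the top-2 instances.
E = [3, 0, 1, 5, 2]
centers = top_k_errors(E, 2)
# Radius inversely proportional to the accumulated error (c1 * d_bar / E[i]),
# so harder instances get tighter, more focused clusters.
c1, d_bar = 1.0, 4.0
radii = [c1 * d_bar / E[i] for i in centers]
# Gather one cluster of weakly labeled points around an error embedding.
weak_feats = [0.1, 0.5, 3.0, 3.2, -0.2]
cluster = sample_cluster(weak_feats, center=0.0, radius=radii[0])
```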

Estimate-then-Modify Weighting Scheme with Perturbation
For the weight computation, we present an adaptive design for boosting in the WSL setting, namely the estimate-then-modify paradigm. In supervised boosting, it is easy to compute the model weights by the principle of minimizing the total error. This implementation builds upon access to a fully clean dataset. In WSL, however, we only have a large weakly labeled dataset, along with a clean validation dataset containing only limited examples. Therefore, we propose an estimate-then-modify paradigm for the model weight computation, in which the large number of weak labels is leveraged for the weight estimate, and then the limited clean labels are used to rectify the estimated weights.
In particular, we first follow the AdaBoost procedure to estimate the weight $\alpha_{j,t}$ on the weakly labeled dataset $D_w$. This starts from the data weight initialization on $D_w$: $w_i = 1/n_w$ for $i = 1, \cdots, n_w$, where $n_w = |D_w|$. Then we calculate the weighted error rate of the current base learner $h_{j,t}$ by $\epsilon_{j,t} = \sum_{i=1}^{n_w} w_i\, \mathbb{1}[h_{j,t}(x_i) \neq \hat{y}_i]$. It follows the weight calculation of $\alpha_{j,t}$ for the base learner $h_{j,t}$: $\alpha_{j,t} = \frac{1}{2} \ln \frac{1 - \epsilon_{j,t}}{\epsilon_{j,t}}$. This expression is based on the principle that, given the to-be-added base learner, the desired weight should minimize the total error defined on the training set $D_w$. Finally, we update the data weights $w_i \leftarrow w_i \exp(\alpha_{j,t}\, \mathbb{1}[h_{j,t}(x_i) \neq \hat{y}_i])$ and normalize them such that $\sum_i w_i = 1$. Till now, we have obtained an estimated weight of the base learner, and the ensemble model can be updated by $F_{j,t}(x) = F_{j,t-1}(x) + \alpha_{j,t} Q(j \mid x) h_{j,t}(x)$ for $t > 1$, or $F_{j,t}(x) = F_{j-1,t}(x) + \alpha_{j,t} Q(j \mid x) h_{j,t}(x)$ for $j > 1$. Note that in the above derivation, we directly use the weak labels in $D_w$ for the error rate computation and the total error minimization. However, one key difference in weakly-supervised learning is the absence of a large amount of clean labels. Merely using the noisy weak labels can hardly guarantee the boosting progress, because calculating error rates involves weak labels, which are unreliable and can negatively affect the weight estimation.
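The AdaBoost-style estimation step can be sketched as below; the indicator-based weight update is the standard AdaBoost form (upweight mistakes, renormalize), and the tiny 4-sample example is an illustrative assumption.

```python
import math

def estimate_alpha(preds, weak_labels, w):
    """AdaBoost-style weight estimate on the weakly labeled set:
    weighted error eps, then alpha = 0.5 * ln((1 - eps) / eps),
    then upweight the mistakes and renormalize the data weights."""
    eps = sum(wi for p, y, wi in zip(preds, weak_labels, w) if p != y)
    eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard the log
    alpha = 0.5 * math.log((1 - eps) / eps)
    w = [wi * (math.exp(alpha) if p != y else math.exp(-alpha))
         for p, y, wi in zip(preds, weak_labels, w)]
    s = sum(w)
    return alpha, [wi / s for wi in w]

# Four samples, uniform initial weights; the learner errs on sample 2.
n = 4
w0 = [1.0 / n] * n
alpha, w1 = estimate_alpha([1, 1, 0, 1], [1, 1, 1, 1], w0)
```

Here the weighted error is 0.25, so `alpha = 0.5 * ln(3)`, and after renormalization the misclassified sample carries half of the total data weight.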
To address this issue, we further calibrate the estimated weights using the small clean dataset $D_c$ with a perturbation-based approach. Specifically, at iteration $(j, t)$, we have the weight vector $\boldsymbol{\alpha}_{j,t}$ collecting the weights of the base learners so far. We then add Gaussian perturbations to the weight vector, $v_{j,t} = \boldsymbol{\alpha}_{j,t} + \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)$, and normalize the sum of weights to 1 to obtain a group of perturbed weight vectors $V_{j,t} = \{v_{j,t}\}^{n_p}$, where $n_p$ is the number of perturbations. $V_{j,t}$ enables us to validate different combinations of the base learners. We define the clean error on the small clean dataset $D_c$ as $\text{err}(v) = \frac{1}{n_c} \sum_{i=1}^{n_c} \mathbb{1}[G(F_v(x_i)) \neq y_i]$, where $y_i$ is the clean label on $D_c$ and $n_c = |D_c|$. We select the weight vector with the lowest validation error [52], $v_{j,t}^* = \arg\min_{v \in V_{j,t}} \text{err}(v)$, as the modified base learner weights. The weight computation for the base learners manifests another aspect of the interaction between weak labels and clean labels. There are two natural alternatives for computing the weights: using the weak labels alone or the clean labels alone. We have shown that using only weak labels is suboptimal due to the unreliable computation caused by noise in the weak labels. On the other hand, if we use clean labels alone, the base learner can easily overfit the limited number of samples. Another alternative is to integrate both the weak labels and clean labels by fitting the base learner on the weakly labeled dataset while computing the weights using the clean labels. However, we argue that the error made on the clean dataset could be caused either by the distribution shifts between the weak labels and the clean labels due to the limited and biased labeling functions, or by the base learner overfitting to the label noise. If we decoupled the two steps of base learner fitting and weight computation, it would effectively force the boosting process onto the clean dataset only. This ignores the fact that an ensemble that performs well on a large dataset, though weakly labeled, can generalize better than an overfitted one. We empirically validate the above statement in Sec. 5.5.
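The perturb-and-select modification step can be sketched as follows; keeping the unperturbed estimate as one candidate is our simplification (so the step never selects something worse than the estimate), and the toy clean-error function is an illustrative assumption.

```python
import random

def modify_weights(alphas, clean_error, n_p=20, sigma=0.1, seed=0):
    """Estimate-then-modify: add Gaussian noise to the estimated weight
    vector, renormalize each candidate, and keep the one with the lowest
    error on the small clean set. The unperturbed estimate is retained
    as a candidate (our simplification)."""
    rng = random.Random(seed)
    candidates = [list(alphas)]
    for _ in range(n_p):
        v = [a + rng.gauss(0.0, sigma) for a in alphas]
        s = sum(v)
        if s > 0:
            candidates.append([x / s for x in v])
    return min(candidates, key=clean_error)

# Toy clean-error: distance of the weight vector from a target [0.7, 0.3]
# (standing in for the 0/1 error of the re-weighted ensemble on D_c).
clean_err = lambda v: abs(v[0] - 0.7) + abs(v[1] - 0.3)
best = modify_weights([0.5, 0.5], clean_err)
```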

Metrics.
We strictly follow the evaluation protocol of the WRENCH benchmark. Specifically, for text classification, we use Accuracy as the metric. For relation classification, we use F1 score as the metric.
Baselines. We compare LocalBoost with the following baselines:
• Majority voting: predicts the label of each data point using the most common prediction from the LFs.
• Weighted majority voting: extends majority voting by reweighting the final votes using the label prior.
• Dawid-Skene [59]: estimates the accuracy of each LF by assuming a naive Bayes distribution over the LFs' votes and models the ground truth as a latent variable.
• Data Programming [19]: models the distributions between labels and LFs as a factor graph to reflect the dependency between any subset of random variables, and uses Gibbs sampling for maximum likelihood optimization.
• MeTaL [20]: models the distribution via a Markov Network and estimates the parameters via matrix completion.
• FlyingSquid [21]: models the distribution between LFs and labels as a binary Ising model and uses a Triplet Method to recover the parameters.
• EBCC [60]: a method originally proposed for crowdsourcing, which models the relation between workers' annotations and the ground-truth label under an independence assumption.
• FABLE [22]: incorporates instance features into a statistical label model for better label aggregation.
• Denoise [28]: uses an attention-based mechanism for aggregating over weak labels and co-trains an additional neural classifier to encode the instance embeddings.
• WeaSEL [27]: an end-to-end approach for WSL that maximizes the agreement between the neural label model and the end model for better performance.
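As a minimal sketch of the simplest baseline above, majority voting over one sample's LF votes can be written as below; the tie-breaking rule (smaller label wins) is an illustrative assumption, as the benchmark's exact tie-breaking is not specified here.

```python
from collections import Counter

ABSTAIN = 0  # LFs that do not match a sample cast no vote

def majority_vote(lf_votes):
    """Predict the most common non-abstain LF vote for one sample;
    ties broken by the smaller label, abstain if no LF fires."""
    votes = [v for v in lf_votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)

# Two LFs vote label 1, one votes label 2, one abstains.
pred = majority_vote([1, 1, 2, ABSTAIN])
```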

Implementation Details.
We keep the number of iterations $T = 5$ for all the experiments; the number of weak sources $K$ varies based on the number of LFs, as shown in Table 4. Specifically, we set $K$ to the number of LFs for IMDb, Yelp, YouTube, and AGNews. For the other datasets, implementing inter-source boosting for every LF would yield a redundant ensemble. Instead, for TREC and SemEval, we group the LFs based on the labels, and for CDR we manually divide the LFs into 6 groups (3 groups for each label). The source-index datasets are constructed accordingly, using the indices of LFs or LF groups as labels. To learn the conditional function, we deploy a multi-layer perceptron with 2 hidden layers, whose output layer size varies according to the number of weak sources $K$. We use BERT-base [61] with 110M parameters as the backbone model and AdamW [62] as the optimizer. More details on hyperparameters are given in Appendix B.
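A minimal sketch of such a conditional-function network is given below. It illustrates only the forward pass and output shape over the $K$ weak sources; the layer widths, initialization, and names are our assumptions, and the actual model is trained with a classification objective on the source-index datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class ConditionalMLP:
    """2-hidden-layer MLP mapping an embedding x to a distribution over K weak sources."""

    def __init__(self, d_in, d_hidden, n_sources):
        self.W1 = rng.normal(0, 0.1, (d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(0, 0.1, (d_hidden, d_hidden))
        self.b2 = np.zeros(d_hidden)
        self.W3 = rng.normal(0, 0.1, (d_hidden, n_sources))  # output size = K
        self.b3 = np.zeros(n_sources)

    def __call__(self, x):
        h = np.maximum(0, x @ self.W1 + self.b1)  # ReLU hidden layer 1
        h = np.maximum(0, h @ self.W2 + self.b2)  # ReLU hidden layer 2
        return softmax(h @ self.W3 + self.b3)     # conditional distribution over sources
```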

Main Results
The results in Table 1 compare the performance of our method, LocalBoost, with the baselines. On all datasets, LocalBoost consistently outperforms these strong baselines, with performance improvements ranging from 1.08% to 3.48% over the strongest baseline. On average across the seven datasets, LocalBoost reaches 91.3% of the performance obtained with fully-supervised learning, significantly narrowing the gap between weakly-supervised and fully-supervised learning. Our method also performs significantly better than the voting methods. This demonstrates that our conditional function can effectively boost the performance of the base learners by computing the conditional probability of weak sources given an input sample. Unlike the voting methods, which simply average or weight-average the predictions from different sources, the conditional function allows the combination to emphasize the most reliable labeling functions.
When compared with the label aggregation methods, the significant improvement demonstrates the advantage of boosting over a single end model. Take the strongest baseline (Data Programming) for instance: on the IMDb and Yelp datasets, LocalBoost achieves performance gains of 4.92% and 8.60%, respectively. Although label aggregation methods can model the distributions between labels and weak sources, a single end model trained on the entire weakly labeled dataset fails to obtain locality to enhance itself. Instead, we introduce locality to the base learners while retaining the boosting nature, so the ensemble model in LocalBoost can strengthen itself by iteratively adding complementary base learners.

Two-dimension Boosting
Fig. 4 shows the iterative results of the two-dimension boosting process. For inter-source boosting, we plot the performance change at iterations $(t, k)$ with $t = 1$ and $k \in [K]$. For intra-source boosting, we plot the performance change at iterations $(t, k)$ with $k = K$ and $t \in [T]$. We observe a consistent improvement in performance through inter-source boosting, indicating that the base learners complement each other and build upon the models trained from other weak sources. In the early stages of inter-source boosting, there is a significant improvement in performance, demonstrating that the weak regions of the previous ensemble model are effectively learned by the subsequent base learners. The performance gains become relatively modest towards the end, however, as the ensemble has already combined sufficiently many base learners, and the remaining error regions may not be well learned due to the limitations of weak supervision, even with the addition of more base learners.
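The iteration order from which the two curves are read can be summarized by the following loop skeleton. It is purely schematic: `fit_base_learner` and `evaluate` are placeholders for the actual LocalBoost steps, and only the positions where the inter-source ($t = 1$) and intra-source ($k = K$) curves are recorded are shown.

```python
def local_boost(T, K, fit_base_learner, evaluate):
    """Schematic two-dimension boosting loop over T iterations and K weak sources."""
    ensemble = []
    inter_curve, intra_curve = [], []
    for t in range(1, T + 1):          # outer loop: intra-source refinement
        for k in range(1, K + 1):      # inner loop: inter-source boosting
            ensemble.append(fit_base_learner(t, k, ensemble))
            if t == 1:
                inter_curve.append(evaluate(ensemble))  # inter-source curve (t = 1)
        intra_curve.append(evaluate(ensemble))          # intra-source curve (k = K)
    return ensemble, inter_curve, intra_curve
```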

Study of the Conditional Function
In this set of experiments, we study the effect of the conditional function $g(k \mid x)$ described in Sec. 4.2. We consider two baselines: one where the conditional function is disabled, and another where the conditional function is replaced with labeling function (LF) matching. The first baseline simply ensembles the base learners without distinguishing between intra-source and inter-source boosting. For the second baseline, direct LF matching is a natural alternative to the learned conditional function: to represent the conditional probability of weak sources given a data sample, we replace $g(k \mid x)$ in the LocalBoost framework with the entries of a normalized LF-matching vector.
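The LF-matching alternative can be sketched as follows. This is an illustrative reading of the baseline, assuming abstains are encoded as -1 and that each base learner is modulated by the probability mass of its own source; the helper names are hypothetical.

```python
import numpy as np

def lf_match_vector(x_votes):
    """Normalized LF-matching vector: uniform over sources whose LF fires on x.

    x_votes: (K,) votes of the K weak sources on one sample, -1 meaning abstain.
    """
    match = (x_votes != -1).astype(float)
    if match.sum() == 0:
        return np.full_like(match, 1.0 / len(match))  # no LF fires: fall back to uniform
    return match / match.sum()

def combine(base_preds, source_probs, alphas):
    """Modulate base-learner outputs by per-source probabilities.

    base_preds:   (m,) scores of the m base learners on one sample
    source_probs: (m,) g(k|x) (or LF-matching entries) for each learner's source
    alphas:       (m,) boosting weights
    """
    return float(np.sum(alphas * source_probs * base_preds))
```

Swapping `source_probs` between the learned $g(k \mid x)$ and `lf_match_vector` output reproduces the two variants compared in Table 2.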
Table 2 compares the learning framework variants and demonstrates the advantages of LocalBoost, which is consistently better on all seven datasets, with average performance gains of 2.70% and 1.21%. The comparison with the baseline without $g(k \mid x)$ highlights the benefit of introducing a conditional function to account for weak sources: disabling $g(k \mid x)$ results in a standard convex combination, which is not suitable for the WSL scenario. The baseline using direct LF matching instead of the conditional function improves over the first baseline, supporting the need to modulate base learners towards weak sources. However,

Study of the Estimate-then-Modify Scheme
In this set of experiments, we examine the benefits of our estimate-then-modify scheme for calculating weights in the WSL setting and the importance of the interaction between weak labels and clean labels. We compare with two baselines, both variants of LocalBoost that use different weight calculation methods. The first baseline uses the AdaBoost approach to compute base learner weights on the weakly labeled dataset, while the second integrates both weak and clean labels: it uses the weak labels only for base learner training and calculates the weights solely on the clean dataset. Table 3 shows that LocalBoost outperforms the two variants by significant margins. LocalBoost improves by 5.18% over the baseline using only weak labels, supporting our hypothesis that weak labels are unreliable for weight computation. Despite the integration of both weak and clean labels in the second baseline, LocalBoost still leads by 2.46%. We believe that the disconnected training of base learners and weight calculation weakens the learning framework, as it cannot provide an appropriate combination of base learners when the weakly labeled data deviates from the clean data. On the other hand, LocalBoost

CONCLUSION
We presented LocalBoost, a novel iterative and adaptive learning framework that boosts the ensemble model in the setting of weakly supervised learning. While preserving the key concepts of traditional boosting methods, we introduced locality to the base learners and designed an interaction between weak labels and clean labels to adapt to the WSL setting. The adaptations included the use of local base learners, the incorporation of a conditional function to account for weak sources, and the application of the estimate-then-modify scheme for weight computation. Specifically, we localized the base learners to iteratively updated error regions in the embedding space, thereby overcoming the weight-domination issue of vanilla boosting in WSL and the hard-clustering challenges of the MoE approach. To handle weak sources such as labeling functions in WSL settings, we designed a conditional function that modulates the base learners towards weak sources with high relevance, addressing the poor ensemble performance of the standard convex combination in WSL settings. Finally, we proposed an estimate-then-modify scheme for weight computation. Our comprehensive empirical study on seven datasets demonstrated the effectiveness and advantages of LocalBoost compared to standard ensemble methods and WSL baselines.

A DATA STATISTICS
In Table 4, we provide detailed information about the datasets used in our experiments, including statistics of the labeling functions for each dataset.

B IMPLEMENTATION DETAILS
The hyperparameters involved in this study include those in Eq. (9) and Eq. (10). Their values are listed in Table 5.

B.1 Computational Environment
All of the experiments are conducted on an Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz and NVIDIA RTX A5000 GPUs with 24 GB memory, using Python 3.6 and PyTorch 1.10.

C PROOF
Here, we provide the proof of Proposition 1, which states that a good fit to the weakly labeled data does not guarantee strong performance of the convex combination of base learners.
Proof. Consider two weak sources $s_1$ and $s_2$, a dataset $X = \{x_1, x_2\}$, and the label space $Y = \{+1, -1\}$. The samples $x_1$ and $x_2$ are matched by $s_1$ and $s_2$, respectively, so the weak labels are $\tilde{y}_1 = s_1(x_1)$ and $\tilde{y}_2 = s_2(x_2)$. From each weak source, we obtain a base learner fitted to its weakly labeled sample.
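The construction can be completed as follows; this is a reconstructed instantiation consistent with the proposition, and the authors' original choice of labels and learners may differ.

```latex
Let the weak labels agree with the true labels,
\[
\tilde{y}_1 = y_1 = +1, \qquad \tilde{y}_2 = y_2 = -1,
\]
and take the constant base learners
\[
f_1(x) \equiv +1, \qquad f_2(x) \equiv -1,
\]
each of which fits its own weakly labeled sample perfectly. For any convex
combination $F(x) = \alpha f_1(x) + (1 - \alpha) f_2(x)$ with $\alpha \in [0, 1]$,
the sign of $F$ is the same on $x_1$ and $x_2$, so $F$ misclassifies at least
one of the two samples. Hence the ensemble error is at least $1/2$ even though
every base learner attains zero error on its own weak source. \qed
```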

Figure 1 :
Figure 1: Heatmap of base learner weights in model ensembles. The suffix "-FS" indicates fully-supervised settings using clean labels and the suffix "-WS" indicates weakly-supervised settings. Each model ensemble consists of 10 base learners, and their weights are shown by color in the heatmap.

Figure 2 :
Figure 2: An illustrative example of base learner localization on a 2D plane. To localize the base learner $f_{t,k}(x)$, we first run an ensemble inference on the clean dataset $D_c$ to identify $M$ large-error instances $e_1, \cdots, e_M$. Next, we sample $M$ clusters $C_1, \cdots, C_M$ on the weakly labeled dataset $D_w$. The base learner $f_{t,k}(x)$ is then trained on the local regions consisting of $C_1, \cdots, C_M$. We emphasize that the clean dataset $D_c$ is only used for validation: it guides the base learner localization but is not involved in the training. This figure takes $F_{t,k-1}(x)$ as the preceding ensemble; it could also be $F_{t-1,K}(x)$ when the loop over the $K$ weak sources is completed.

Figure 3 :
Figure 3: Illustration of the estimate-then-modify scheme for model weight calculation. We first estimate the base learner weight $\alpha_{t,k}$ on $D_w$ and form the new ensemble $F_{t,k}(x)$. Next, we generate a group of perturbed weight vectors $V_{t,k}$ by adding Gaussian noise and normalizing each of them. Finally, we select the weight vector that achieves the lowest total error on $D_c$.

Figure 4 :
Figure 4: Iterative performance of the two-dimension boosting. For the inter-source boosting, we show the performance change in the first loop, $t = 1$. For the intra-source boosting, we show the ensemble at the end of each outer iteration, where $k = K$.

Table 1 :
Main Results. *: Results are copied from the corresponding paper. We use the training sets shown in Table 4 as the weakly labeled datasets $D_w$, where the labels are generated by their LFs, and a subset of the validation set as the small clean dataset $D_c$. For YouTube and SemEval, we set $|D_c|$ to 120 and 200, respectively. For AGNews, we set $|D_c| = 1000$; for the other datasets, we keep $|D_c| = 500$. The clean data are used only for weight computation and error instance identification, and are not involved in the base learner training.

Table 2 :
Study of the conditional function

Table 3 :
Study of the estimate-then-modify weighting