To Aggregate or Not? Learning with Separate Noisy Labels

Raw training data often comes with separate noisy labels collected from multiple imperfect annotators (e.g., via crowdsourcing). A typical way of using these separate labels is to first aggregate them into a single label and then apply standard training methods. The literature has also extensively studied effective aggregation approaches. This paper revisits this choice and aims to answer the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. We theoretically analyze the performance of both approaches under the empirical risk minimization framework for a number of popular loss functions, including those designed specifically for the problem of learning with noisy labels. Our theorems conclude that label separation is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. Extensive empirical results validate our conclusions.


Introduction
Training high-quality deep neural networks for classification tasks typically requires a large quantity of annotated data. The raw training data often comes with separate noisy labels collected from multiple imperfect annotators. For example, the popular data collection paradigm of crowdsourcing [10,16,27] offers a platform to collect such annotations from an unverified crowd; medical records are often accompanied by diagnoses from multiple doctors [1,45]; news articles can receive multiple checks (on whether the article is fake) from different experts [34,37]. This leads to the situation considered in this paper: learning with multiple separate noisy labels.
The most popular approach to learning from multiple separate labels is to aggregate the given labels for each instance [40,58,44,42,30], e.g., through an expectation-maximization (EM) inference technique. Each instance is then provided with a single label, to which the standard training procedure is applied.
The primary goal of this paper is to revisit the choice of aggregating separate labels and to provide practitioners with an understanding of the following question: should the learner aggregate the separate noisy labels for one instance into a single label or not?
Our main contributions can be summarized as follows:
• We provide theoretical insights on how separation and aggregation methods result in different biases (Theorems 3.4, 4.2, 4.6) and variances (Theorems 3.6, 4.3, 4.7) of the classifier output by training. Our analysis covers the standard loss functions in use, as well as popular robust losses designed for the problem of learning with noisy labels.
• By comparing analytical proxies of the worst-case performance bounds, our theoretical results reveal that separating multiple noisy labels is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. The results are consistent for both the basic loss function ℓ and robust designs, including loss correction and peer loss.
• We carry out extensive experiments on both synthetic and real-world datasets to validate our theoretical findings.

Related Works
Label separation vs. label aggregation Existing works mainly compare separation with aggregation through empirical results. For example, it has been shown that label separation can be effective in improving model performance and may be preferable to labels aggregated via majority voting [17]. When training with the cross-entropy loss, Sheng et al. [46] observe that label separation reduces bias and roughness and outperforms majority-voting aggregated labels. However, it is unclear whether these results hold when robust treatments are employed. A similar question has been studied in corrupted label detection, with results leaning towards separation but without proof [64]. Another line of work concentrates on end-to-end training schemes or ensemble methods that take all the separate noisy labels as input during training [63,12,43,5,54], learning from separate noisy labels directly.

Formulation
Consider an M-class classification task and let X ∈ X and Y ∈ Y := {1, 2, ..., M} denote the input examples and their corresponding labels, respectively. We assume that (X, Y) ∼ D, where D is the joint data distribution. Samples (x, y) are generated according to the random variables (X, Y). In the clean and ideal scenario, the learner has access to N training data points D := {(x_n, y_n)}_{n∈[N]}. Instead of having access to the ground truth labels y_n, we only have access to sets of noisy labels. For ease of presentation, we adopt the decorator • to denote separate labels, and • for aggregated labels specified later. Noisy labels ỹ•_n are generated according to the random variable Ỹ•. We consider class-dependent label noise [24,35], where Ỹ• is generated according to a transition matrix T• with entries defined as T•_{k,l} := P(Ỹ• = l | Y = k). Most existing results on learning with noisy labels consider the setting where each x_n is paired with only one noisy label ỹ•_n. In practice, we often operate in a setting where each data point x_n is associated with multiple separate labels drawn from the same noisy label generation process [11,25]. We consider this setting and assume that for each x_n, there are K independent noisy labels ỹ•_{n,1}, ..., ỹ•_{n,K} obtained from K annotators. We are interested in two popular ways to leverage multiple separate noisy labels:
• Keep the separate labels as separate and apply standard learning-with-noisy-labels techniques to each of them.
• Aggregate the noisy labels into one label, and then apply standard learning-with-noisy-labels techniques.
We will look into each of the above two settings separately and then answer the question: "Should the learner aggregate multiple separate noisy labels or not?"
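The two treatments above can be sketched as loss computations on a single instance; a minimal illustration (the function names and the cross-entropy base loss are our own choices, not the paper's notation):

```python
import numpy as np

def cross_entropy(probs, label):
    # per-sample cross-entropy loss for a vector of predicted class probabilities
    return -np.log(probs[label] + 1e-12)

def separation_risk(probs, noisy_labels):
    # label separation: average the loss over all K separate noisy labels
    return np.mean([cross_entropy(probs, y) for y in noisy_labels])

def aggregation_risk(probs, noisy_labels):
    # label aggregation: majority-vote the K labels first, then one loss term
    y_agg = int(np.argmax(np.bincount(noisy_labels)))
    return cross_entropy(probs, y_agg)

probs = np.array([0.7, 0.2, 0.1])  # model prediction for one instance
noisy = np.array([0, 0, 1])        # K = 3 separate annotations
sep = separation_risk(probs, noisy)
agg = aggregation_risk(probs, noisy)
```

The separation risk keeps the minority label's loss term, while aggregation discards it after the vote.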

Denote by P_{Ỹ•} the column vector of the marginal distribution of Ỹ•. Accordingly, we can define P_Y for Y. Clearly, the two are related by P_{Ỹ•} = (T•)^⊤ P_Y. When M = 2, the noise transition matrix T• has the form of a 2 × 2 row-stochastic matrix, with T•_{i,0} + T•_{i,1} = 1 for each row i. For label separation, we define the per-sample loss function as the average over the K separate labels:

ℓ̃(f(x_n)) := (1/K) Σ_{k=1}^{K} ℓ(f(x_n), ỹ•_{n,k}).

For simplicity, we write ℓ(f) for the loss of the label separation method when there is no confusion.

Label Aggregation
The other way to leverage multiple separate noisy labels is to generate a single label from the K noisy ones via a label aggregation method, where the aggregated noisy labels ỹ•_n are generated according to the random variable Ỹ•. Denote the confusion matrix of this single, aggregated noisy label by T•. Popular aggregation methods include majority vote and EM inference, both of which are covered by our theoretical insights, since the analyses in later sections are built on a general label aggregation method. For a better understanding, we introduce majority vote as an example.
Example of Majority Vote Given the majority-voted label, we can compute the transition matrix between Ỹ• and the true label Y using knowledge of T•. The lemma below gives the closed form of T• in terms of T• when adopting majority vote.
Lemma 2.1. Assume K is odd and recall that in the binary classification task, T•_{i,j} = P(Ỹ• = j | Y = i). Writing e_p := T•_{p,q} (q ≠ p) for the per-annotator flip probability, the corresponding entry of the noise transition matrix of the (majority-voting) aggregated noisy labels becomes

T•_{p,q} = Σ_{k=(K+1)/2}^{K} (K choose k) e_p^k (1 − e_p)^{K−k}, for q ≠ p.

Note it still holds that T•_{p,q} + T•_{p,1−q} = 1. For the aggregation method, as illustrated in Figure 1 (x-axis: the number of labelers K; y-axis: the aggregated noise rate, for overall noise rates in {0.2, 0.4, 0.6, 0.8}), when the noise rate is small, both majority vote and EM label aggregation significantly reduce the noise rate even with a moderate number of labelers (e.g., K < 10). Although the expectation-maximization method consumes much more time when generating the aggregated label, it frequently results in a lower aggregated noise rate than majority vote.
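The closed form in Lemma 2.1 can be evaluated numerically; a small sketch for the binary case, assuming K i.i.d. annotators who each flip the label with probability eps (K odd, so no ties):

```python
from math import comb

def majority_vote_error(eps, K):
    """Aggregated noise rate under majority vote for binary labels:
    the probability that a strict majority of the K annotators is wrong."""
    return sum(comb(K, k) * eps**k * (1 - eps)**(K - k)
               for k in range((K + 1) // 2, K + 1))

# per-annotator noise 0.2: three voters already reduce the rate
rate = majority_vote_error(0.2, 3)  # ≈ 0.104
```

Note that for eps > 0.5 the same formula shows majority vote amplifies the noise, consistent with the paper's observation that aggregation helps only in low-noise regimes.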
For separation methods, the noisy training samples are obtained through the variables (X, Ỹ•_1, ..., Ỹ•_K). For aggregation methods such as majority vote, we assume the data points and aggregated noisy labels are generated by (X, Ỹ•). When we mention "noise rate," it usually refers to the average noise P(Ỹᵘ ≠ Y).
ℓ-risk under the noisy distribution Given the loss ℓ and the per-sample loss ℓ(f), we define the empirical ℓ-risk for learning with separated/aggregated noisy labels as R̂_{ℓ,D̃ᵘ}(f), where u ∈ {•, •} unifies the treatment, which is either separation (•) or aggregation (•). By increasing the sample size N, we expect R̂_{ℓ,D̃ᵘ}(f) to be close to the corresponding ℓ-risk under the noisy distribution.

Bias of a Given Classifier w.r.t. ℓ-Loss
We denote by f* ∈ F the optimal classifier obtained over the clean data distribution (X, Y) ∼ D within the hypothesis space F, and formally define the bias of a given classifier f. The Bias term quantifies the prediction bias (excess risk) of a given classifier f on the clean data distribution D w.r.t. the optimal achievable classifier f*, which can be decomposed into a distribution-shift term and an estimation-error term [66]. We now bound the distribution shift and the estimation error in the following two lemmas.
Lemma 3.3 (Estimation error). Suppose the loss function ℓ(f(x), y) is L-Lipschitz for any feasible y. Then ∀f ∈ F, with probability at least 1 − δ, the estimation error is upper bounded by a quantity involving the richness factor, where u ∈ {•, •} denotes either the separation or aggregation method, η^u_K indicates the richness factor (with η•_K ≡ 1 for aggregation), which characterizes the effect of the number of labelers, and R(F) is the Rademacher complexity of F.
Noting that the number of unique instances x_i is the same for both treatments, the duplicated copies of x_i should introduce no fewer effective samples, i.e., the richness factor satisfies η^u_K ≥ 1. Thus, we update η•_K := max{η•_K, 1}. Figure 2 visualizes the estimated η•_K for different numbers of labelers and values of δ: it is clear that when the number of labelers is larger, or δ is smaller, η•_K increases. Later we show how η^u_K influences the bias and variance of the classifier prediction.
To give a more intuitive comparison of the performance of both mechanisms, we adopt the worst-case bias upper bound as a proxy. Note that α_K and η•_K are non-decreasing in K; in Section 4.3 we explore how the LHS of Eqn. (3) is influenced by K. A short answer is that the LHS of Eqn. (3) is (generally) monotonically increasing in K when K is small, indicating that Eqn. (3) is easier to satisfy, for fixed δ and N, with a smaller K than a larger one.

Variance of a Given Classifier w.r.t. ℓ-Loss
We now move on to explore the variance of a given classifier when learning with the ℓ-loss. Prior to the discussion, we define the variance of a given classifier as follows.
Definition 3.5 (Classifier Prediction Variance of ℓ-Loss). The variance of a given classifier f when learned with separation (•) or aggregation (•) is defined accordingly. For g(x) = x − x², we derive the closed form of Var and the corresponding upper bound below.
Theorem 3.6 provides another view for deciding between separation and aggregation methods, i.e., the proxy of classifier prediction variance. To extend the theoretical conclusions w.r.t. the ℓ-loss to the multi-class setting, we only need to modify the upper bound of the distribution shift in Eqn. (2), as specified in the following corollary.
Corollary 3.7 (Multi-Class Extension (ℓ-Loss)). In the M-class classification case, the upper bound of the distribution shift in Eqn. (2) becomes:

Bias and Variance Analyses with Robust Treatments
Intuitively, the problem of learning with noisy labels can benefit from more robust loss functions built upon the generic ℓ-loss, e.g., backward correction (surrogate loss) [35,36] and peer loss functions [26]. We move on to explore the best way to learn with multiple copies of noisy labels when combined with these existing robust approaches.

Backward Loss Correction
When combined with the backward loss correction approach (ℓ → ℓ←), the empirical ℓ-risks become their corrected counterparts, where the corrected loss in the binary case is defined via the inverse noise transition matrix.
Bias of a given classifier w.r.t. ℓ← Suppose the loss function ℓ(f(x), y) is L-Lipschitz for any feasible y, and let d be the VC-dimension of F. For backward loss correction, the separation bias proxy Δ•_{R←} is smaller than the aggregation bias proxy Δ•_{R←} under the condition in Eqn. (6). We defer our empirical analysis of the monotonicity of the LHS of Eqn. (6) to Section 4.3 as well; it shares similar monotonicity behavior with learning w.r.t. ℓ.
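As an illustration of the backward correction recipe of [35,36] (the helper name and the cross-entropy base loss are our own choices), the corrected loss multiplies the vector of per-class losses by T⁻¹ and reads off the noisy-label entry, which makes the corrected loss an unbiased estimate of the clean loss in expectation over the noise:

```python
import numpy as np

def backward_corrected_loss(probs, noisy_label, T):
    """Backward-corrected loss sketch: the noisy-label-indexed entry of
    T^{-1} applied to the vector of base losses ell(f(x), j) over classes j."""
    base_losses = -np.log(probs + 1e-12)        # ell(f(x), j) for every class j
    corrected = np.linalg.inv(T) @ base_losses  # T^{-1} times the loss vector
    return corrected[noisy_label]

# binary example: symmetric noise with rate 0.2
T = np.array([[0.8, 0.2],
              [0.2, 0.8]])
probs = np.array([0.9, 0.1])
loss = backward_corrected_loss(probs, 0, T)
```

The unbiasedness can be checked directly: averaging the corrected losses with the noise probabilities T[y, :] recovers the clean-label loss.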
Variance of given classifiers with Backward Loss Correction Similar to the previous subsection, we now check how separation and aggregation methods result in different variances when training with loss correction.
The variance proxy of Var(f̂•←) in Eqn. (7) is smaller than that of Var(f̂•←) under the stated condition. Moving a bit further, when the noise transition matrix is symmetric for both methods, the requirement simplifies. For a fixed K, a more efficient aggregation method decreases ρ•_i, which makes it harder to satisfy this condition. Recalling Lᵘ← := Lᵘ←0 · L, the theoretical insights for ℓ← in the binary case can be bridged to the multi-class setting by replacing Lᵘ_0 with the multi-class constant specified in the following corollary.
Corollary 4.4 (Multi-Class Extension (ℓ←-Loss)). Given a diagonally-dominant transition matrix Tᵘ, we have a bound in terms of λ_min(Tᵘ), the minimal eigenvalue of the matrix Tᵘ. Particularly, if Tᵘ_ii < 0.5, ∀i ∈ [M], we further have a bound in terms of eᵘ := max_i (1 − Tᵘ_ii).

Peer Loss Functions
Peer loss [26] is a family of loss functions shown to be robust to label noise without requiring knowledge of the noise rates. Formally, ℓ_peer(f(x_i), ỹ_i) includes a second term that checks on mismatched data samples built from (x_{i1}, ỹ_{i1}) and (x_{i2}, ỹ_{i2}), which are randomly drawn from the same data distribution. When combined with the peer loss approach, i.e., ℓ → ℓ_peer, the two risks become their peer-loss counterparts.
Bias of a given classifier w.r.t. ℓ_peer Suppose the loss function ℓ(f(x), y) is L-Lipschitz for any feasible y.
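A rough sketch of the peer term over a batch (a simplified illustration of the pairing in [26], with our own function names; the base loss is cross-entropy):

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(probs, label):
    return -np.log(probs[label] + 1e-12)

def peer_loss_batch(pred_probs, noisy_labels):
    """Peer loss sketch: for each sample i, subtract the loss on a mismatched
    pair -- a randomly drawn peer feature x_{i1} scored against an
    independently drawn peer label y_{i2}."""
    n = len(noisy_labels)
    i1 = rng.integers(0, n, size=n)  # random peer features
    i2 = rng.integers(0, n, size=n)  # random peer labels (independent draw)
    losses = [cross_entropy(pred_probs[i], noisy_labels[i])
              - cross_entropy(pred_probs[i1[i]], noisy_labels[i2[i]])
              for i in range(n)]
    return float(np.mean(losses))

pred = np.array([[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]])
labels = np.array([0, 1, 0])
value = peer_loss_batch(pred, labels)
```

The subtracted peer term penalizes a classifier that blindly agrees with the (noisy) label marginal, which is the source of the noise robustness.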
To evaluate the performance of a classifier yielded by optimizing ℓ_peer, Lemma 4.5 provides the bias proxy for both treatments. Similarly, we adopt this proxy to analyze which treatment is preferable.
Theorem 4.6. Denote by α_K the stated constant, where d denotes the VC-dimension of F. For peer loss, the separation bias proxy is smaller than the aggregation one under the condition in Eqn. (8). Note that the condition in Eqn. (8) shares a similar pattern with those for the basic loss ℓ and for ℓ←. We will empirically illustrate the monotonicity of its LHS in Section 4.3.

Variance of given classifiers with Peer Loss
We now move on to check how separation and aggregation methods result in different variances when training with peer loss. Similarly, we can obtain: the variance proxy of Var(f̂•) in Eqn. (9) is smaller than that of Var(f̂•) under the stated condition. The theoretical insights for ℓ_peer also have multi-class extensions; we only need to generalize Lᵘ_0 to the multi-class setting along with the additional conditions specified below:

Corollary 4.8 (Multi-Class Extension (ℓ_peer-Loss)). Assume ℓ is classification-calibrated in the multi-class setting and the clean label Y has an equal prior.
For the uniform noise transition matrix [56], we have:

Analysis of the Theoretical Conditions
Recall that the established conditions in Theorems 3.4, 4.2, and 4.6 implicitly depend on the number of labelers K, and the RHS of Eqns. (3), (6), (8) are constants. We proceed to analyze the monotonicity of the corresponding LHS (of the form α_K · β_K) w.r.t. K, where β_K = 1 for ℓ and ℓ←. We visualize this order under different symmetric T• in Figure 3. It can be observed that when K is small (e.g., K ≤ 5), the LHS of these conditions increases with K, while it may decrease with K if K is sufficiently large. Recall that separation is better if the LHS is less than the constant value γ. Therefore, Figure 3 shows the trend that aggregation is generally better than separation when K is sufficiently large.
as proxies of the worst-case performance of the trained classifier. For the standard loss function ℓ, it has been proven [33,20] that under mild conditions on ℓ and F, the lower bound of the performance gap between a trained classifier f̂ and the optimal achievable one f*, i.e., R_{ℓ,D}(f̂) − R_{ℓ,D}(f*), is of order O(√(1/N)), which matches the order in Theorem 3.4. Noting that the behavior concluded from worst-case bounds may not always hold in each individual case, we further use experiments to validate our analyses in the next section.

Experimental Results
In this section, we empirically compare the performance of different treatments of the multiple noisy labels when learning with robust loss functions (CE loss, backward loss correction, and peer loss). We consider several treatments, including the label aggregation methods (majority vote and EM inference) and the label separation method. Assuming that multiple noisy labels carry different weights, EM inference solves the problem by treating the aggregated labels as hidden variables [8,47,40,39]. In the E-step, the probabilities of the aggregated labels are estimated by weighted aggregation based on the current (fixed) weights of the multiple noisy labels. In the M-step, the EM inference method re-estimates the weights of the multiple noisy labels based on the current aggregated labels. This iteration continues until all aggregated labels remain unchanged. As for label separation, we adopt the mini-batch separation method, i.e., each training sample x_n is paired with its K noisy labels in each batch.
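The E/M iteration described above can be sketched as follows; this is a simplified one-reliability-weight-per-annotator variant (not a full confusion-matrix model), with illustrative names:

```python
import numpy as np

def em_aggregate(labels, n_iter=20):
    """Simplified EM aggregation sketch. labels: (N, K) array of noisy labels
    from K annotators; returns hard aggregated labels."""
    N, K = labels.shape
    M = int(labels.max()) + 1
    w = np.ones(K)  # annotator reliability weights, start uniform
    for _ in range(n_iter):
        # E-step: weighted vote -> posterior over the aggregated label
        votes = np.zeros((N, M))
        for k in range(K):
            votes[np.arange(N), labels[:, k]] += w[k]
        agg = votes.argmax(axis=1)
        # M-step: weight = how often each annotator agrees with current labels
        w = np.array([(labels[:, k] == agg).mean() for k in range(K)]) + 1e-6
    return agg

labels = np.array([[0, 0, 1],
                   [1, 1, 1],
                   [0, 1, 0]])
agg = em_aggregate(labels)
```

With uniform initial weights the first E-step reduces to majority vote; subsequent iterations down-weight annotators who frequently disagree with the current consensus.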

Experiment on Synthetic Noisy Datasets
Experimental results on synthetic noisy UCI datasets [9] We adopt six UCI datasets to empirically compare the performance of the label separation and aggregation methods when learning with the CE loss, backward correction [35,36], and peer loss [26]. The noisy annotations from multiple annotators are simulated by symmetric label noise, which assumes T_{i,j} = ǫ/(M−1) for j ≠ i for each annotator, where ǫ quantifies the overall noise rate of the generated noisy labels. In Figure 4, we adopt two UCI datasets (StatLog, M = 6; Optical, M = 10) for illustration. The results in Figure 4 make it quite clear that the label separation method consistently outperforms both aggregation methods (majority vote and EM inference) and is more beneficial on such small-scale datasets. Results on additional datasets and more details are deferred to the Appendix.
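The symmetric noise model T_{i,j} = ǫ/(M−1) for j ≠ i can be simulated per annotator as follows (a sketch with our own helper name):

```python
import numpy as np

rng = np.random.default_rng(0)

def symmetric_noisy_labels(y, eps, M, K):
    """Simulate K annotators with symmetric noise: each keeps the true label
    w.p. 1 - eps and otherwise picks one of the other M - 1 classes uniformly."""
    N = len(y)
    noisy = np.repeat(y[:, None], K, axis=1)         # (N, K) copies of y
    flip = rng.random((N, K)) < eps                  # which entries to corrupt
    offsets = rng.integers(1, M, size=(N, K))        # cyclic shift of 1..M-1
    noisy[flip] = (noisy[flip] + offsets[flip]) % M  # never maps to itself
    return noisy

y = np.arange(10000) % 10                 # toy ground truth, M = 10
noisy = symmetric_noisy_labels(y, 0.4, 10, 3)
```

Adding a uniform offset in 1..M−1 modulo M guarantees a flipped entry never equals the original class, matching the transition matrix above.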
Experimental results on the synthetic noisy CIFAR-10 dataset [18] On CIFAR-10, we consider two types of simulation for the separate noisy labels: the symmetric label noise model and instance-dependent label noise [6,67], where ǫ is the average noise rate and different labelers follow different instance-dependent noise transition matrices. For a fair comparison, we adopt the ResNet-34 model [15] and use the same training procedure and batch size for all considered treatments of the separate noisy labels.
Figure 5 shares the following insights regarding the preferred treatment: in the low-noise regime, or when K is large, aggregating the separate noisy labels significantly reduces the noise rates and the aggregation methods tend to perform better; in the high-noise regime, or when K is small, the separation methods tend to be more promising. As K increases (or ǫ decreases), we observe a preference transition from the label separation method to the label aggregation methods.
Table 1: The performance of CE/BW/Peer Loss trained on two UCI datasets (Breast and German), with aggregated labels (majority vote, EM inference) and separated labels. We highlight results in green (separation method) and red (aggregation methods) if the performance gap is larger than 0.05. (K is the number of labels per training sample.)

Empirical Verification of the Theoretical Bounds
To empirically verify the comparison of bias proxies (i.e., Theorem 3.4), we adopt two binary classification UCI datasets for demonstration, Breast and German, as shown in Table 1. Clearly, on these two binary classification tasks, label aggregation methods tend to outperform label separation; we attribute this phenomenon to the fact that the denoising effect of label aggregation is more significant in the binary case. For Theorem 3.4 (CE loss), the condition requires the LHS (a function of α_K and η•_K) to stay below a constant. For the two binary UCI datasets (Breast and German), the information is summarized in Table 2, where the column (1 − δ, S_K) means: when the number of annotators belongs to the set S_K, the label separation method is likely to under-perform label aggregation (i.e., majority vote) with probability at least 1 − δ. For example, in the last row of Table 2, when training on the UCI German dataset with the CE loss under noise rate 0.4 (the noise rate of the separate noisy labels), Theorem 3.4 reveals that with probability at least 0.98, label aggregation (with majority vote) is better than label separation when K > 23, which aligns well with our empirical observations (label separation is better only when K < 15).

Experiments on realistic noisy datasets
Note that in real-world scenarios, the label noise pattern may differ with the expertise of each human annotator. We further compare the different treatments on two realistic noisy datasets: CIFAR-10N [57] and CIFAR-10H [38]. CIFAR-10N provides each CIFAR-10 training image with 3 independent human annotations, while CIFAR-10H gives ≈50 annotations for each CIFAR-10 test image.
In Table 3, we reproduce three robust loss functions with three different treatments of the separate noisy labels. We report the best achieved test accuracy for cross-entropy/backward correction/peer loss when learning with the label aggregation methods (majority vote and EM inference) and the separation method (soft label). We observe that the separation method tends to perform better than the aggregation ones. This may be attributed to the relatively high noise rate (ǫ ≈ 0.18) in CIFAR-10N and the insufficient number of labelers (K = 3). Note that since the noise level in CIFAR-10H is low (ǫ ≈ 0.07), the label aggregation methods can infer higher-quality labels and thus perform better than the separation method (red cells in Tables 3 and 4).

Hypothesis Testing
We adopt the paired t-test to determine which treatment of the separate noisy labels is better under certain conditions. In Table 5, we report the statistic and p-value given by the hypothesis tests. The column "Methods" indicates the two methods being compared (A and B); a positive statistic means that A is better than B in test accuracy. Given a specific setting, denote by Acc_method the list of test accuracies belonging to that setting (e.g., CIFAR-10N, K = 3) across the CE, BW, and PL loss functions. The hypotheses can be summarized as follows:
• Null hypothesis: there is zero mean difference between (1) Acc_MV and Acc_EM; (2) Acc_MV and Acc_Sep; or (3) Acc_EM and Acc_Sep.
• Alternative hypothesis: there is a non-zero mean difference between (1) Acc_MV and Acc_EM; (2) Acc_MV and Acc_Sep; or (3) Acc_EM and Acc_Sep.
To clarify, the three cases in the above hypotheses are tested independently. For the test accuracy comparisons of CIFAR-10N in Table 3, the setting of the hypothesis test is K = 3 with a relatively high label noise rate (18%). All p-values are smaller than 0.05, indicating that we should reject the null hypothesis, and we can conclude that the performance of these three methods on CIFAR-10N (high noise, small K) satisfies EM < MV < Sep.
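The paired t statistic used here can be computed directly; a minimal sketch in pure Python (the p-value would then come from a t-distribution with n − 1 degrees of freedom, e.g., via scipy.stats.t):

```python
import math

def paired_t(acc_a, acc_b):
    """Paired t statistic for the mean difference between two matched lists
    of test accuracies. A positive value favors method A."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)                 # t with n - 1 dof

# hypothetical accuracies of two treatments across three losses
t_stat = paired_t([0.90, 0.92, 0.88], [0.85, 0.86, 0.84])
```

The accuracy values above are illustrative only, not taken from the paper's tables.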
For CIFAR-10H in Tables 3 and 4, the label noise rate is relatively low in all settings. We consider two scenarios (K < 15: a small number of annotators; K ≥ 15: a large number of annotators). The p-values between MV and EM are always large, which means the denoising effect of the advanced label aggregation method (EM) is negligible on the CIFAR-10H dataset. However, the p-values of the remaining settings are smaller than 0.05, indicating that we should reject the null hypothesis, and we can conclude that the performance of these three methods on CIFAR-10H (low noise, small/large K) satisfies EM/MV > Sep.

Conclusions
When learning with separate noisy labels, we explore the answer to the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. Under the empirical risk minimization framework, we theoretically show that label separation can be more beneficial than label aggregation when the noise rates are high or the number of labelers is insufficient. These insights hold for a number of popular loss functions, including several robust treatments. Empirical results on synthetic and real-world datasets validate our conclusions.

A Proof Sketch of Core Theorems
We briefly introduce the proof sketch of Lemma 4.1 because it sets up the foundation for the analyses of backward loss correction and covers the proofs for the standard ℓ-loss in Section 3 as a special case.
A.1 Proof Sketch of Lemma 4.1 Proof. Our proof proceeds in six steps as follows.
Step 1: Apply Hoeffding's inequality to each group. We divide the noisy training samples into K groups, one per annotator, and apply Hoeffding's inequality within each group. Step 2: Adopt the union bound over all groups. Applying the above technique to the other groups and using the union bound, we know that w.p. at least 1 − Kδ₀, each group's empirical risk can be seen as a random variable within a bounded range; the randomness comes from the noisy labels ỹ_{n,k}.
Step 3: Apply Hoeffding's inequality across groups. These K random variables are i.i.d. when the feature set is fixed. By Hoeffding's inequality, w.p. at least 1 − Kδ₀ − δ₁, ∀f, the averaged risk concentrates. Step 4: Rademacher bound on the maximal deviation. For δ₀ = δ₁ = δ/(K+1), with the Rademacher bound on the maximal deviation between the risks and their empirical counterparts, for f* ∈ F and the separation method, with probability at least 1 − δ, we obtain the stated bound, where ℓ̄ and ℓ̲ denote the upper and lower bounds of the loss function ℓ, respectively, and Rᵘ(ℓ← ∘ F) is the Rademacher complexity.
Step 5: Adopt the Lipschitz composition property of Rademacher averages. If ℓ is L-Lipschitz, the composition property applies to both the separation and aggregation methods. Step 6: Conclude with the triangle inequality. Bounding with the triangle inequality, the conclusions then follow.
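The per-group concentration in Step 1 uses the standard Hoeffding bound for i.i.d. losses bounded in [ℓ̲, ℓ̄]; a sketch consistent with the notation above:

```latex
\Pr\left( \left| \frac{1}{N}\sum_{n=1}^{N} \ell\big(f(x_n), \tilde{y}_{n,k}\big) - \mathbb{E}\big[\ell(f(X), \widetilde{Y}_k)\big] \right| \ge t \right) \le 2\exp\left( -\frac{2Nt^2}{(\bar{\ell} - \underline{\ell})^2} \right)
```

Setting the right-hand side to δ₀ and solving for t yields the per-group deviation of order (ℓ̄ − ℓ̲)√(log(2/δ₀)/(2N)) used in the sketch.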

B Full Proofs
In this section, we provide all proofs omitted from the main paper.
We first give the proof of Lemma 4.1 because it is useful for the proofs in Section 3.

B.1 Proof of Lemma 4.1
Proof. To apply Hoeffding's inequality to the dataset of the separation method, we divide the noisy training samples into K groups. Thus, with one group {(x_n, ỹ_{n,1})}_{n∈[N]}, w.p. 1 − δ₀, the group's empirical risk concentrates. Applying the above technique to the other groups and using the union bound, we know that w.p. at least 1 − Kδ₀, each group's empirical risk can be seen as a random variable within a bounded range; the randomness comes from the noisy labels ỹ_{n,k}. Recall that the samples between different groups are i.i.d. given the feature set. Then the above K random variables are i.i.d. when the feature set is fixed. By Hoeffding's inequality, w.p. at least 1 − Kδ₀ − δ₁, ∀f, the averaged risk concentrates. For δ₀ = δ₁ = δ/(K+1), with the Rademacher bound on the maximal deviation between the risks and their empirical counterparts, for f* ∈ F and the separation method, with probability at least 1 − δ, we obtain the stated bound, where ℓ̄ and ℓ̲ denote the upper and lower bounds of the loss function ℓ, respectively. Note that we assume the noisy labels given by the K labelers follow the same noise transition matrix. If ℓ is L-Lipschitz, then for the separation and aggregation methods, by the Lipschitz composition property of Rademacher averages, the complexity terms compose with the Lipschitz constant. Assuming f* ∈ argmin_{f∈F} R_{ℓ,D}(f), we obtain the bound for the separation method and, similarly, for the aggregation method. With the richness factors η•_K and η•_K ≡ 1, the lemma follows. Proof of Theorem 4.2. The proof is straightforward if we proceed with the proof of Lemma 4.1 together with the following discussion. With knowledge of the noise rates for both methods, note that for any finite concept class F ⊂ {f : X → {0, 1}} and sample set S = {x_1, ..., x_N}, the Rademacher complexity is upper bounded by √(2d log(N)/N), where d is the VC dimension of F. To achieve Δ•_{R←} < Δ•_{R←}, we simply need to find the condition on K (or η•_K) that satisfies the corresponding inequality, whose LHS constant we denote by α.
The term of distribution shift can be upper bounded accordingly; combining similar terms then yields the claim. B.6 Proof of Lemma 3.3 Proof. For the estimation error term, the upper bound of Error 1 can be derived directly from the proof of Lemma 4.1: since the loss function makes no use of loss correction, the L-Lipschitz constant need not be multiplied by the correction constant, so Lᵘ← → L. Besides, the constant for the variance (square) term reduces to (ℓ̄ − ℓ̲). Thus, we can derive Var(f̂ᵘ) ≤ g(·) with the 2 log(1/δ) term. B.9 Proof of Corollary 3.7 Proof. In the multi-class extension, the only difference is the upper bound of the Distribution Shift term in Eqn. (12). B.10 Proof of Lemma 4.5 Proof. The proof of Lemma 4.5 builds on Theorem 7 in [26]: the performance bound for aggregation methods is a special case of Theorem 7 in [26] (adopting α* = 1 as defined in [26]). As for separation methods, the incurred difference lies in the appearance of the sample-complexity weight η•_K. B.11 Proof of Theorem 4.6 Proof. The requirement is equivalent to a reduced condition since the involved quantities are positive. Note that for any finite concept class F ⊂ {f : X → {0, 1}} and sample set S = {x_1, ..., x_N}, the Rademacher complexity is upper bounded by √(2d log(N)/N), where d is the VC dimension of F; a stricter sufficient condition then follows.

Denote by α_K the corresponding constant involving √(4d log(N)/N). The above condition is satisfied if and only if the stated inequality on K holds. B.12 Proof of Theorem 4.7 Proof. Similar to the proof of Theorem 3.6, for u ∈ {•, •}, a special case is the 0-1 loss, i.e., ℓ(·) = 1(·); we then have R_{ℓ,D̃ᵘ}(f̂ᵘ) ∈ [0, 1] and g(a) = a − a², which is monotonically increasing when a < 1/2. Noting that X_p and Ỹᵘ_p are independent, the second term in (16) factorizes. Comparing the resulting two terms and substituting Lᵘ_0, the proof of Corollary 4.8 is finished by repeating the corresponding proof for the binary task.

C Additional Results and Details D Experiment Details D.1 Experiment Details on UCI Datasets
Datasets In this paper, we conduct experiments on two binary (Breast and German) and two multi-class (StatLog and Optical) UCI classification datasets. For the train/test split, the original settings are used when training and testing files are provided. The remaining datasets come with a single data file, for which we adopt a 50/50 split so that more data is allotted to testing for statistical significance of the test results. More specifically, the numbers of (training, testing) samples in the Breast, German, StatLog, and Optical datasets are (285, 284), (500, 500), (4435, 2000), and (3823, 1797), respectively.
Generating the noisy labels on UCI datasets For each UCI dataset adopted in this paper, the label of each training sample is flipped to another class with probability ǫ (the noise rate). For the multi-class datasets, the target class of a flip is selected uniformly at random among the other classes. For the binary and multi-class datasets, we use the noise rate lists (0.1, 0.2, 0.3, 0.4) and (0.2, 0.4, 0.6, 0.8), respectively.
Implementation details We implement a simple two-layer ReLU multi-layer perceptron (MLP) for classification on these four UCI datasets. The Adam optimizer is used with a learning rate of 0.001 and a batch size of 128.
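A minimal numpy sketch of such a forward pass (the hidden width of 64 and the input/output dimensions below are illustrative assumptions; the paper does not specify implementation details beyond "two-layer ReLU MLP"):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a two-layer ReLU MLP with a softmax output."""
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    logits = h @ W2 + b2
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    return np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # softmax probs

d_in, d_hid, n_cls = 9, 64, 2  # illustrative sizes, not from the paper
W1 = rng.normal(0, 0.1, (d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.1, (d_hid, n_cls)); b2 = np.zeros(n_cls)
probs = mlp_forward(rng.normal(size=(128, d_in)), W1, b1, W2, b2)  # one batch
```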

D.3 Detailed Results on the CIFAR-10 Dataset
Table 6 includes all the detailed accuracy values that appear in Figure 5. The results on the synthetic noisy CIFAR-10 dataset align well with the theoretical observations: label separation is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient.

3 Bias and Variance Analyses w.r.t. ℓ-Loss
In this section, we provide theoretical insights on how the label separation and aggregation methods result in different biases and variances of the classifier prediction when learning with the standard loss function ℓ. Suppose the clean training samples {(x_n, y_n)}_{n∈[N]} are given by the variables (X, Y) with (X, Y) ∼ D. Recall that instead of having access to the set of clean training samples D = {(x_n, y_n)}_{n∈[N]}, the learner only observes K noisy labels ỹ•_{n,1}, ..., ỹ•_{n,K} for each x_n, denoted by D̃.

Lemma 3.2 (Distribution shift). Denote p_i := P(Y = i) and assume ℓ is upper bounded by ℓ̄ and lower bounded by ℓ̲. The distribution shift in Eqn. (1) is then upper bounded as stated. We combine Lemma 3.2 and Lemma 3.3 as a proxy and derive Theorem 3.4.
Theorem 3.4. Denote by α_K the stated constant.

Lemma 4.1. Denote by R_{ℓ,D}(f̂) the ℓ-risk of the classifier f̂ under the clean data distribution D, with f̂ = f̂ᵘ← = argmin_{f∈F} R̂_{ℓ←,D̃ᵘ}(f). Lemma 4.1 gives the upper bound on the classifier prediction bias when learning with ℓ← via separation or aggregation methods. With probability at least 1 − δ, we have:

Lemma 4.1 offers the upper bound on the performance gap of the given classifier f̂ w.r.t. the clean distribution D, compared to the minimum achievable risk. We consider the bound Δᵘ_{R←} as a proxy of the bias, and we are interested in the case where training the classifier separately yields a smaller bias proxy than the aggregation method, formally Δ•_{R←} < Δ•_{R←}. For any finite hypothesis class F ⊂ {f : X → {0, 1}} and sample set S = {x_1, ..., x_N}, denote by d the VC-dimension of F; we give conditions under which training separately yields a smaller bias proxy.
Theorem 4.2. Denote by α_K the stated constant.

Figure 4: The performance of cross-entropy, backward loss correction, and peer loss trained on synthetic noisy StatLog-6/Optical-10 with aggregated labels (we report the better result between majority vote and EM inference for each K and noise rate ǫ) and separated labels. X-axis: the number of labelers (√K scale); Y-axis: the best test accuracy achieved.

Table 3: Experimental results on the CIFAR-10N and CIFAR-10H datasets with K = 3. We highlight results in green (separation method) and red (aggregation methods) if the performance gap is larger than 0.05.

Table 4: Experimental results on CIFAR-10H with K ≥ 5. We highlight results in green (separation method) and red (aggregation methods) if the performance gap is larger than 0.05.

Table 5: Hypothesis testing results of the comparisons between the label aggregation methods and the label separation method on realistic noisy datasets.
Note that within each group, e.g., group {(x_n, ỹ•_{n,1})}_{n∈[N]}, all N training samples are i.i.d. Additionally, training samples between any two different groups are also i.i.d. given the feature set {x_n}_{n∈[N]}. Thus, with one group {(x_n, ỹ_{n,1})}_{n∈[N]}, w.p. 1 − δ₀, we have the stated concentration bound.

Table 6: The performance of CE/BW/Peer Loss trained on CIFAR-10 (left half: symmetric noise; right half: instance-dependent noise) with aggregated labels (majority vote, EM inference) and separated labels, for different numbers of labels per training image.