Abstract
Deep neural networks (DNNs) for social image classification are prone to performance degradation and overfitting when trained on datasets with noisy or imbalanced labels. Loss re-weighting methods tend to ignore the influence of noisy or frequent-category examples during training, which reduces final accuracy and, under extreme noise, can even cause the learning process to fail. A new advisor network is introduced to address both the imbalance and noise problems; it pilots the learning of a main network by adjusting its visual features and gradients through a meta-learning strategy. In a curriculum learning fashion, the impact of redundant data is reduced while recognizable noisy-label images are downplayed or redirected. Meta Feature Re-Weighting (MFRW) and Meta Equalization Softmax (MES) methods are introduced to let the main network focus only on the information in an image deemed relevant by the advisor network and to adjust the training gradient to reduce the adverse effects of frequent or noisy categories. The proposed method is first tested on synthetic versions of CIFAR10 and CIFAR100, and then on the more realistic ImageNet-LT, Places-LT, and Clothing1M datasets, reporting state-of-the-art results.
1 INTRODUCTION
Training deep neural networks on large numbers of labeled images is critical for social multimedia retrieval [30]. Several recent applications, such as micro-video recommendation [29] or image popularity prediction [37], depend on correctly retrieving many different concepts in images. Covering many concepts is very challenging, especially for rare ones, since a large number of images is required to train classifiers. Therefore, automatic methods for label generation have recently been investigated, for instance in the form of semi-supervised learning. These methods exploit images labeled by non-experts, as is typical in multimedia sources (e.g., social networks, textual descriptions of products, video captions, etc.), or even unlabeled images that are available in very large quantities at no cost. Due to their nature, these data contain mislabeled or unbalanced samples [1], which can follow a long-tail distribution [25]. The great adaptability of deep neural networks, provided by their large number of parameters, leads to highly discriminative models if the training data are balanced and correct. When this assumption does not hold and the data are unbalanced or their annotations are noisy, performance drops and overfitting may occur [9, 19]. Recent methods have attempted to address the label noise problem by measuring network confidence during training through curriculum learning [26], employing another co-trained network [14], or directly estimating noise in the set [16]. The general idea for dealing with this problem is to find samples that lie outside the correct distribution and reduce their impact on training. The unbalanced (long-tail) distribution of concepts, instead, is addressed through feature augmentation techniques [7] and, above all, through the design of ad-hoc loss functions [9, 43, 46]. 
In contrast, we propose a meta-learning approach to address the long-tailed and noisy-label problems. It is based on an advisor network trained to help the main classifier model (Figure 1) perform better at the image classification task. During the training of a standard classification model, the advisor network adjusts the feature activations and gradients of the main model by observing its feature activations and training loss. At test time, the advisor is discarded and only the main network is kept as the final model. Unlike in the teacher-learner paradigm, our advisor network is trained to help another model rather than to perform image classification itself. Our contributions are:
Fig. 1. General overview of our system. An advisor network assists an image classifier by exploiting an auxiliary meta-set to reduce noise and unbalance problems in annotations of the training set.
Following the principles of our advisor network, we present a new meta-model to solve concurrently both the imbalance and label noise problems for the image classification task.
We introduce a Meta Feature Re-Weighting (MFRW) method that automatically generates an attention mask on the visual features of the classifier so that it focuses only on the information in an image deemed relevant by the advisor network.
The Meta Equalization Softmax (MES) activation function has been formulated to automatically adjust the classifier's gradients so that its learning is not adversely affected by an image belonging to a frequent category or with a noisy label.
The effective performance of our method is shown numerically and qualitatively by experiments conducted on synthetic (long-tailed and noisy label corruption) and real-world datasets. We achieve the state-of-the-art result on Clothing1M.
The code will be released upon the acceptance of this paper.
2 RELATED WORKS
Noisy training labels
In literature, the problem of noisy labels in training data is well studied because machine learning systems are prone to performance degradation when noise is present in the training label [38, 41]. Loss correction was a well-treated technique to mitigate the effect of noisy samples on the classifier network. Works like Reed [42], F-correction [13], GLC [16], M-correction [2], and S-adaptation [40] made loss adjustments based on the estimated corruption probabilities matrix, changing the wrong labels to the correct ones. In [35, 47, 54] the noise distribution was modeled by linearly combining the noisy label with the output of the network. Different approaches assigned a weight to each example, avoiding the contribution of a noisy sample to the training by giving it a lower weight value. MentorNet [20] and MentorMix [19] found the latent weights through data-driven curriculum learning. Some works used augmentation strategies that encourage the main model to behave linearly in-between training examples like Mixup [56] and AdvAug [6]. DivideMix [27] dynamically separated training data into clean and noisy sets to optimize two diverged networks with a semi-supervised strategy. In contrast, our method takes advantage of an advisor network that alters activations and gradients of the main classifier and can increase its performance, without isolating noisy label samples from the clean ones.
Imbalanced training labels
Training on imbalanced (or Long-tailed) datasets is an active research field in computer vision [3, 7, 18, 43, 46, 57, 59]. A common solution present in the literature is re-sampling. While [4, 15, 36] sampled more (over-sampling) training data from the minority classes to balance the distribution of all classes, [12] removed (under-sampling) data from frequent classes to make the data distribution more uniform. Under-sampling is infeasible in extreme long-tailed datasets, where the imbalance ratio between the head and the tail class is high, because most of the examples would be excluded from training. Another solution is re-weighting, where weight was assigned to each different training sample, according to its importance. [17, 49] used the inverse of class frequency to determine the weight value. Re-weighting can be done even on a sample level. In [31] a modulating loss factor was introduced to make the neural network cost-sensitive, reducing the loss contribution from easy examples. Instead, [3, 9, 23, 57] manipulated the loss based on the category distribution. In [43] an unbiased softmax function was derived to explicitly model the class distribution shift and minimize the generalization error bound. [46] introduced a new loss that avoids discouraging gradients for the rare class. [7, 55] exploited the feature augmentation method to transfer the feature variance of common classes to the rare ones. The solution proposed by [33] adopted a memory module to augment the rare categories with semantic feature representation obtained from common ones. Instead, our method exploits a new layer of meta-attention to direct the classifier’s attention much more to the rare categories, while still not forgetting the common ones. Moreover, our advisor network automatically modifies the classifier’s gradients to avoid the negative impact of the common classes on the rare ones. Self-supervised learning approaches [22] handle severe class imbalances effectively. 
It has been shown that self-learned representations are also robust to label noise if adjusted with an unbalance- and noise-resistant loss function.
Meta-learning
Meta-learning was used to assist the training and optimization of learning models. The noisy labels problem was addressed in [28, 44, 50] with this approach. For example, L2R [44] weighted each sample, giving less importance to the noisy ones. MLNT [28] imitated regular training with fabricated noisy labels. MLC [50] estimated the corruption probabilities matrix to adjust the training loss values. Meta-learning was also applied to the long-tailed classification task. In [18, 28] the meta-model learned to assign higher weights to the examples of the rare classes. MW-Net [45] automatically determined an explicit weighting function that can easily be fitted to different types of tasks, and it works on both the noise and imbalance training problems. GDW [5] introduced class-level meta-weights for several gradient flows and adjusted them to make better use of class-level information. All these methods took advantage of a small clean validation dataset to apply the meta-learning scheme. Unlike them, our advisor network modifies the network activations using a meta-attention layer and simultaneously learns to weigh training gradients to increase the performance of the main classifier during its training. Our method addresses imbalance and label noise concurrently.
3 METHOD
3.1 Task
We developed a new advisor network that helps the deep neural network (DNN) to address both the noisy labels and long-tail image classification problems. Our method is composed of two main parts that can work jointly: Meta Feature Re-Weighting (MFRW) and Meta Equalization Softmax (MES). The first component, MFRW, makes use of an auxiliary advisor network that automatically learns how to weigh the features extracted from a DNN during its training. Our idea is to exploit an attention mechanism to enhance the useful parts of the visual information and attenuate the rest. If a network can concentrate only on the useful parts of an image, that information can help increase the overall generalization capacity of the network even if the annotation is wrong. This is also true for long-tailed data distributions, where information from common classes can be leveraged to improve performance on the rare ones. In the second component, MES, the advisor network automatically learns how to reduce the discouraging gradients of some images relative to others. In long-tailed distributions, the discouraging gradients of frequent-category samples can worsen the learning of the rare ones [46]. This can happen even with noisy labels, because the discouraging gradients of an example with the wrong label can affect the correct learning of the entire classifier. These two methods can be used simultaneously to help the learning of an image classifier, reducing the negative effect that unbalanced or noisy annotations in the training datasets can produce. Our advisor network is trained with the meta-learning paradigm, so it is aware of the current state of the classifier and learns how to help it at that moment.
We first introduce a basic meta-learning formulation for methods that learn robust deep neural networks from noisy and long-tailed category distributions. We then describe each part of our method in detail: Meta Feature Re-Weighting in Section 3.3, Meta Equalization Softmax in Section 3.4, and the meta-model architecture in Section 3.5. Finally, the learning process of the classifier together with the advisor network is described in Section 3.6.
3.2 Background Meta-learning
In general, meta-learning (ML), also called learning to learn, refers to the process of improving a learning algorithm over multiple learning episodes. ML involves two algorithms: an inner (or base) algorithm and an outer (or upper/meta) algorithm. The inner one solves the main task by minimizing an objective function; for image classification, these are a convolutional neural network and the cross-entropy loss, respectively. The outer algorithm, instead, updates the inner one so that it also improves on an outer objective function. When the objective functions are the same for both algorithms, the outer algorithm can help the inner one to work well on a new data distribution. If the new distribution is a smaller version of the training data that is balanced and free of errors, it is possible to train the outer algorithm to solve the problem of noisy or imbalanced labels inside the main training data. We refer to this distribution as the meta-set. As in [45], the outer algorithm can be a multilayer perceptron network, called meta-model, that automatically learns how to address these problems by helping the main image classifier during its learning. We introduce the symbols needed to understand ML in this particular setting and how the entire learning process is divided, describing the algorithm of [45] for simplicity.
Let \(D^{train} = \lbrace x_i^{tra}, y_i^{tra}\rbrace ^N_{i=1}\) be the training set with noisy or imbalanced annotations, where \(N\) is the total number of samples, each composed of an image \(x_i\) and the corresponding one-hot label \(y_i\) over \(C\) classes. The main DNN model is defined as \(\Phi (\cdot ; w)\), where \(w\) are its parameters. The prediction on an input image \(x\) is \(\hat{y} = \Phi (x; w)\) and the optimal parameters \(w^*\) are obtained by minimizing the softmax cross-entropy loss \(\ell (\hat{y}, y)\) on the training set. Let \(D^{meta} = \lbrace x_j^{meta}, y_j^{meta}\rbrace ^M_{j=1}\) be the meta-set, a well-verified and balanced but much smaller version of the training set, \(M \ll N\). The meta-model is defined as \(\Psi (\cdot ;\theta)\), parameterized by \(\theta\). In [45] the optimal parameter \(w^*\) is derived using a loss weighted by a value predicted by the meta-model. The meta-model is trained by minimizing the softmax cross-entropy loss of the previously updated \(\Phi (\cdot ;w^*(\theta))\) on the meta dataset \(D^{meta}\).
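The two nested problems described above can be written compactly as follows. This is a sketch of the formulation of [45] in the notation of this section; the shorthand \(\ell_i^{train}(w)\) and \(\ell_j^{meta}(w)\) for the loss of \(\Phi(\cdot; w)\) on the \(i\)-th training sample and the \(j\)-th meta sample is introduced here only for illustration.

```latex
\begin{align}
w^*(\theta) &= \arg\min_{w} \frac{1}{N} \sum_{i=1}^{N}
    \Psi\big(\ell_i^{train}(w); \theta\big)\, \ell_i^{train}(w), \\
\theta^* &= \arg\min_{\theta} \frac{1}{M} \sum_{j=1}^{M}
    \ell_j^{meta}\big(w^*(\theta)\big).
\end{align}
```

The inner problem depends on \(\theta\) only through the per-sample weights, while the outer problem depends on \(\theta\) only through \(w^*(\theta)\), which is why the two sets of parameters have to be optimized in an interleaved way.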
Both \(\Phi (\cdot ; w)\) and \(\Psi (\cdot ;\theta)\) can be updated by alternating optimization through gradient descent. An online strategy, divided into three main steps, can be adopted to update \(\theta\) and \(w\) within a single optimization loop. This guarantees the efficiency of the algorithm and its convergence [45]. In the first step, called Virtual-Train, the original DNN is not updated and the optimization is carried out on a virtual model that is a clone of the original one. In the next step, called Meta-Train, the meta-model is updated taking into account the virtual update just carried out. Actual-Train is the last step, where the base DNN model is optimized taking into account the already updated meta-model.
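The three steps can be illustrated on a deliberately tiny problem. The sketch below is not the setup of [45] or of this paper: the main model is a hypothetical one-parameter regressor `y_hat = w * x` with a squared-error loss, the meta-model is a two-parameter sigmoid that maps a loss value to a sample weight, and the meta-gradient is estimated with central finite differences instead of back-propagation through the virtual update. All function names and hyperparameters are assumptions made for illustration.

```python
import math

def sigmoid(u):
    # Numerically stable logistic function.
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def sample_losses(w, batch):
    """Per-sample squared error of the toy model y_hat = w * x."""
    return [(w * x - y) ** 2 for x, y in batch]

def weighted_grad(w, theta, batch):
    """d/dw of mean_i V(l_i; theta) * l_i, with the weights V held constant."""
    total = 0.0
    for (x, y), loss in zip(batch, sample_losses(w, batch)):
        v = sigmoid(theta[0] + theta[1] * loss)  # toy meta-model: loss -> weight in (0, 1)
        total += v * 2.0 * (w * x - y) * x
    return total / len(batch)

def meta_loss(w, theta, train, meta, lr_w):
    """Virtual-Train: update a clone of w, then evaluate it on the meta-set."""
    w_virtual = w - lr_w * weighted_grad(w, theta, train)
    return sum(sample_losses(w_virtual, meta)) / len(meta)

def train_step(w, theta, train, meta, lr_w=0.05, lr_theta=0.5, eps=1e-4):
    # Meta-Train: update theta with a finite-difference estimate of the meta-gradient.
    new_theta = list(theta)
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        g = (meta_loss(w, up, train, meta, lr_w)
             - meta_loss(w, down, train, meta, lr_w)) / (2.0 * eps)
        new_theta[k] = theta[k] - lr_theta * g
    # Actual-Train: update the real model using the refreshed meta-model.
    w_new = w - lr_w * weighted_grad(w, new_theta, train)
    return w_new, new_theta

if __name__ == "__main__":
    # Toy data: the true slope is 2; the last training pair has a corrupted target.
    train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (2.0, -4.0)]
    meta = [(1.0, 2.0), (2.0, 4.0)]
    w, theta = 0.0, [0.0, 0.0]
    for _ in range(50):
        w, theta = train_step(w, theta, train, meta)
    print(w, theta)
```

Finite differences are used here only to keep the example dependency-free; with an automatic differentiation framework the meta-gradient would normally be obtained by differentiating through the virtual update.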
3.3 Meta Feature Re-Weighting (MFRW)
Human attention is the ability of the brain to selectively concentrate on one aspect of the environment while ignoring other information. Attention for a DNN is a mechanism that tries to mimic the cognitive attention of the human brain, calculating a soft (or hard) mask which is then multiplied with the visual features of the network. The mask \(W\) is usually the output of a function \(g\) of some input \(x\) (1) \(\begin{equation} W = g(x) \end{equation}\) and \(W\) is element-wise multiplied with a feature \(f\) of the network (2) \(\begin{equation} f_{att} = W \odot f \end{equation}\) where \(\odot\) is the symbol for element-wise multiplication.
This intensifies the important parts of the feature and reduces the rest. We propose a meta-attention mechanism, called Meta Feature Re-Weighting (MFRW), that can be used to mitigate noisy or imbalanced label problems in the training data. In the first case, a mismatch between the content of an image and its associated annotation can degrade the classifier’s performance for the annotated class. With our method, the main network can use only parts of the erroneous visual information to improve performance in that class. In the case of imbalance, instead, MFRW can give the visual information of common classes less importance than that of the rare ones. Hand-designing a function \(g\) that generates the right masks for each of the two cases is challenging, so we use a meta-model to infer \(g\) automatically. This gives \(g\) two properties: it can change during the training of the main network, and it can adapt automatically to the problem present in the training data. The element-wise product is computed between the feature \(f\) extracted from a DNN and a vector of weights \(W_f\) learned by a meta-model (3) \(\begin{equation} f_{att} = W_f \odot f \end{equation}\) The meta-model can take into consideration important aspects of each training sample, so it can generate the proper activation weights based on them. Attention must be differentiated between the various categories, as each may have a different number of examples and a different noise level. This is done by giving the meta-model, as input, visual features extracted from the classifier’s backbone. In the case of unbalanced labels, examples of the more common categories are presented many more times during training and are therefore easier to learn than those of the rarer ones. 
In addition, mislabeled images have a different difficulty than cleanly labeled images, as they lie outside the correct distribution of each category and are often fewer in number than the correctly labeled ones. The attention should be adjusted according to the difficulty that a training example poses for the classifier. In this way, the classifier can focus differently on the information in the data and learn a better representation. The cost value typically used to express the difficulty of classification samples [26] is given to the meta-model in combination with the visual features.
3.4 Meta Equalization Softmax (MES)
The conventional loss function for image classification is the softmax cross-entropy. A multinomial distribution \(p\) over \(C\) categories is obtained from the network outputs score \(z\) with the softmax activation function. Then the cross-entropy is calculated between \(p\) and target distribution \(y\). The softmax cross-entropy loss can be formulated as: (4) \(\begin{equation} \mathcal {L}_{SCE} = - \sum _{j=1}^{C}y_j log(p_j) \end{equation}\) where the distribution \(p_j\) is described as follows: (5) \(\begin{equation} p_j = Softmax(z_j) = \frac{e^{z_j}}{\sum _{k=1}^{C}e^{z_k}} \end{equation}\)
When the distribution of categories in the training dataset is imbalanced, the softmax cross-entropy loss makes the learning of rare categories easily suppressed by the common ones. In [46] a softmax equalization loss is proposed to avoid discouraging gradients from samples of frequent categories for the rare ones. The difference with the softmax cross-entropy is the weighting of a term within the softmax activation function. The new distribution \(p_j\) is calculated as: (6) \(\begin{equation} p_j = EQ_{Softmax}(z_j, \tilde{w}_k) = \frac{e^{z_j}}{\sum _{k=1}^{C} \tilde{w}_k e^{z_k}} \end{equation}\) where (7) \(\begin{equation} \tilde{w}_k = 1 -\beta T_\lambda (f_k)(1-y_k) \end{equation}\)
The element \(T_\lambda (f_k)\) is a handcrafted threshold function that outputs a value \(\in \lbrace 0,1\rbrace\) based on the category frequency value \(f_k\). When \(T_\lambda\) output is 1 the gradient is ignored, otherwise it is taken into account. Instead, \(\beta\) is a Bernoulli random variable with a probability of \(\rho\) to be 1 and \(1 - \rho\) to be 0.
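Equations (6) and (7) can be sketched in a few lines of plain Python. The direction of the threshold (output 1 when the category frequency is below \(\lambda\), i.e., for rare categories) follows our reading of [46] and should be treated as an assumption, as should the choice of drawing \(\beta\) independently for each class. The function and argument names are illustrative.

```python
import math
import random

def threshold(freq, lam):
    """T_lambda: 1 for a rare category (frequency below lam), else 0 (assumed direction)."""
    return 1 if freq < lam else 0

def eq_softmax(z, y, freqs, lam, rho, rng=random):
    """Equations (6)-(7): softmax whose denominator terms are weighted by w_k."""
    # beta ~ Bernoulli(rho); drawn once per class here, which is an assumption.
    betas = [1 if rng.random() < rho else 0 for _ in z]
    w = [1 - b * threshold(f, lam) * (1 - yk) for b, f, yk in zip(betas, freqs, y)]
    m = max(z)  # shift the scores for numerical stability
    exps = [math.exp(zk - m) for zk in z]
    denom = sum(wk * ek for wk, ek in zip(w, exps))
    return [ek / denom for ek in exps], w

def eq_loss(z, y, freqs, lam, rho, rng=random):
    """Cross-entropy of Equation (4) computed on the equalized distribution."""
    p, _ = eq_softmax(z, y, freqs, lam, rho, rng)
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p) if yk > 0)
```

Because the ground-truth class always has \(1 - y_k = 0\), its weight is 1 and the denominator never vanishes for a one-hot target. With \(\rho = 0\) the function reduces to the standard softmax of Equation (5).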
The strategy of avoiding the discouraging gradients can be useful in other problems different from imbalance training, for example image classification with noisy labels. The discouraging gradients of a mislabeled image can be scaled to not harm the correct learning of the classifier model. This behavior can be obtained by modifying the weights \(\tilde{w}_k\) passed to the \(EQ_{Softmax}\) function in the Equation (6). The element that determines each category’s weight is \(T_\lambda (f_k)\), but it works only for the imbalance training problem. Writing a new function for the noisy annotation problem is hard because the noise can be really complex or completely unknown, for example when data is collected automatically [52].
Inspired by this, we propose a meta-learned equalization softmax (MES) that can adapt the weights to the task that needs to be solved. The new formulation of the weights \(\tilde{w}_k\) passed to \(EQ_{Softmax}\) is: (8) \(\begin{equation} \tilde{w}_k = 1 -\beta s_k (1-y_k) \end{equation}\) where \(s_k\) is a vector of values in \((0,1)\), with one entry per class \(k\). This vector \(s_k\) is the output of a meta-model trained to help the main model handle the noise and imbalance problems present in the training data. The visual feature and the cost of each training sample are given as input to the meta-model. This allows the model to generate output vectors \(s_k\) that are differentiated between classes and between “easy” and “hard” examples.
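To make the effect of the weights on the gradients explicit, the following short derivation may help. It assumes a one-hot target with ground-truth class \(c\) and treats the weights \(\tilde{w}_k\) as constants with respect to the scores \(z\); both are assumptions of this sketch rather than statements from the text above.

```latex
\begin{align}
\mathcal{L} &= -\log p_c
  = -z_c + \log \sum_{m=1}^{C} \tilde{w}_m e^{z_m},
  \qquad \tilde{w}_c = 1, \\
\frac{\partial \mathcal{L}}{\partial z_k}
  &= \frac{\tilde{w}_k e^{z_k}}{\sum_{m=1}^{C} \tilde{w}_m e^{z_m}} - y_k
  = \begin{cases}
      \dfrac{e^{z_c}}{\sum_{m} \tilde{w}_m e^{z_m}} - 1, & k = c, \\[2ex]
      \dfrac{(1 - \beta s_k)\, e^{z_k}}{\sum_{m} \tilde{w}_m e^{z_m}}, & k \ne c.
    \end{cases}
\end{align}
```

Under these assumptions, the discouraging gradient on a negative class \(k\) is scaled by \(1 - \beta s_k\): when \(\beta = 1\), a value of \(s_k\) close to 1 suppresses it, while a value close to 0 recovers the behavior of the standard softmax cross-entropy for that class.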
3.5 Meta-model Architecture
Since both MFRW and MES require the same input data, it was possible to combine the two methods in a single meta-model. Our meta-model \(\Psi\) is a neural network composed only of fully connected layers. The inputs are a feature \(f\) and a loss value \(\mathcal {L}_x\). Each input is projected into a fixed-size embedding space through a separate fully connected layer followed by a ReLU function. These embeddings are concatenated to form a larger common space, whose size is the sum of the dimensions of the two embeddings. The MFRW method requires a weight vector \(W_f\) with values in \((0,1)\) and the same size as the input feature \(f\). MES, instead, needs a weight vector with values in \((0,1)\) but with a length equal to the number of classes \(C\). From the common embedding space, we obtain each of the two outputs through a fully connected layer followed by a sigmoid activation function. In this way, MFRW and MES can learn a common internal representation from the inputs, obtaining their respective benefits.
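The architecture described above can be sketched as a forward pass in plain Python. This is only an illustration of the data flow: the class name `MetaModel`, the weight initialisation, and the toy dimensions are assumptions, and a practical implementation would rely on a deep learning framework with automatic differentiation.

```python
import math
import random

def linear(x, weight, bias):
    """Fully connected layer: weight is a list of rows, one per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def init(n_out, n_in, rng):
    """Small random weights and zero bias (the initialisation scheme is an assumption)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

class MetaModel:
    def __init__(self, feat_dim, num_classes, emb=100, seed=0):
        rng = random.Random(seed)
        self.feat_proj = init(emb, feat_dim, rng)         # feature -> embedding
        self.loss_proj = init(emb, 1, rng)                # loss value -> embedding
        self.mfrw_head = init(feat_dim, 2 * emb, rng)     # produces W_f
        self.mes_head = init(num_classes, 2 * emb, rng)   # produces s_k

    def forward(self, f, loss):
        # Two separate projections, each followed by ReLU, then concatenation.
        h = relu(linear(f, *self.feat_proj)) + relu(linear([loss], *self.loss_proj))
        w_f = sigmoid(linear(h, *self.mfrw_head))  # one weight per feature element
        s = sigmoid(linear(h, *self.mes_head))     # one weight per class
        return w_f, s

# Usage: weight a 4-dimensional toy feature for a 3-class problem (Equation (3)).
model = MetaModel(feat_dim=4, num_classes=3)
feature = [0.5, -1.0, 2.0, 0.0]
w_f, s = model.forward(feature, loss=1.3)
f_att = [a * b for a, b in zip(w_f, feature)]
```

Sharing the concatenated embedding between the two heads is what lets MFRW and MES learn a common internal representation, as stated above.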
3.6 Algorithm
In this section, we describe how the base classifier \(\Phi\) and our meta-model \(\Psi\) are trained jointly. Because the meta-model needs as input the visual feature, we separate the main model \(\Phi (\cdot ; w)\) into two different parts: the backbone \(\Phi _b(\cdot ; w_b)\) and the category predictor \(\Phi _c(\cdot ; w_c)\). The first one has an image \(x\) as input and gives out a feature vector \(f\). Instead, the second part has \(f\) as input and a probability score vector \(z\) as output. In this way, it is possible to manipulate the feature \(f\) directly with our meta-model \(\Psi\). The meta-model takes two different inputs \(\Psi (f,\mathcal {L})\) and gives back two vectors of weights \(W_f\) and \(s_k\). Our algorithm is divided into four main phases that are shown in Figure 2 and summarized in Algorithm 1. We describe our method in detail starting with the \(t\)-th iteration and moving forward each step until we reach the \((t+1)\)-th. Different from the meta-learning optimization strategy described in Section 3.2, we need an additional initial phase, called Loss Pre-Calculation (Figure 2(a)). The value of loss \(\mathcal {L}^{pre}\) related to the training batch \(X^{train}\) must be calculated at the beginning. This loss value must be dependent on the original feature \(f^{train}\) and not on the weighted one \(f^{att}\). In the second step Virtual-Train (Figure 2(b)), \(\Phi _b^t\) and \(\Phi _c^t\) are the virtual clones of the backbone \(\Phi _b(\cdot ; w_b)\) and the category predictor \(\Phi _c(\cdot ; w_c)\) at the beginning of the \(t\)-th iteration. We obtain the features \(f^{train}\) passing through \(\Phi _b^t\) the batch \(X^{train}\). Then the loss values \(\mathcal {L}^{pre}\) pre-calculated and its relative feature \(f^{train}\) are given to \(\Psi ^t\) (the meta-model at time \(t\)) to obtain the two vectors of weights \(W_f\) and \(s_k\). 
The feature \(f^{train}\) is multiplied element-wise with \(W_f\) to get a new feature vector with attention \(f^{att}\) as in Section 3.3. The modified feature is passed to the predictor \(\Phi _c^t\) obtaining the score \(z^{train}\). Now we calculate the \(\mathcal {L}^{train}\) with the equalization loss, described in Section 3.4, using the vector \(s_k\) in the Equation (8). Then \(\Phi _b^t\) and \(\Phi _c^t\) parameters are virtually updated to minimize \(\mathcal {L}^{train}\), excluding those of \(\Psi ^t\). For the third step Meta-Train (Figure 2(c)), we need a clean and balanced meta-dataset that will be used to train the meta-model \(\Psi\). We pass a meta batch \(X^{meta}\) through the virtually updated \(\Phi _b^{t+1}\) and \(\Phi _c^{t+1}\) in order to get a validation loss \(\mathcal {L}^{meta}\). In this step, the feature is not modified and the loss is the classic softmax cross-entropy loss. Then only \(\Psi ^t\) is updated minimizing \(\mathcal {L}^{meta}\). In this way, the meta-model is optimized to help the main model minimize its error on clean and balanced data. Here the optimization takes into consideration also the previous Virtual-Train. In the last phase, Actual-Train (Figure 2(d)) the original \(\Phi _b^t\) and \(\Phi _c^t\) are optimized taking into account the updated meta-model \(\Psi ^{t+1}\). Our meta-model is used only during the training of the main network \(\Phi\) when external help is needed to solve noisy or imbalance label problems. It is discarded at test time when only the main network is retained as the final model.
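The four phases can be condensed into the following update rules. This is a sketch rather than a restatement of Algorithm 1: it writes \(w = (w_b, w_c)\) for the parameters of the main model, uses plain gradient steps with hypothetical step sizes \(\alpha\) and \(\eta\), and introduces \(Y^{train}\) for the labels of the batch \(X^{train}\). Here \(\mathcal{L}^{train}(w; \theta)\) denotes the equalization loss computed on \(\Phi_c(f^{att})\) with the weights of Equation (8).

```latex
\begin{align}
\text{(a)}\quad & f^{train} = \Phi_b(X^{train}; w_b^{t}), \qquad
    \mathcal{L}^{pre} = \ell\big(\Phi_c(f^{train}; w_c^{t}),\, Y^{train}\big) \\
\text{(b)}\quad & (W_f, s_k) = \Psi(f^{train}, \mathcal{L}^{pre}; \theta^{t}), \qquad
    f^{att} = W_f \odot f^{train}, \\
  & \hat{w}(\theta^{t}) = w^{t} - \alpha \nabla_{w} \mathcal{L}^{train}(w^{t}; \theta^{t}) \\
\text{(c)}\quad & \theta^{t+1} = \theta^{t} - \eta \nabla_{\theta}
    \mathcal{L}^{meta}\big(\hat{w}(\theta^{t})\big) \\
\text{(d)}\quad & w^{t+1} = w^{t} - \alpha \nabla_{w} \mathcal{L}^{train}(w^{t}; \theta^{t+1})
\end{align}
```

In phase (c) the meta loss depends on \(\theta\) only through the virtual parameters \(\hat{w}(\theta^{t})\), which is why the gradients of the Virtual-Train step have to be retained until the meta-model has been updated.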
Fig. 2. Full training scheme, divided by steps, of our method reaching the \((t+1)\) -th iteration from the \((t)\) -th.
Computation and memory overhead
Excluding the Loss Pre-Calculation phase, the Virtual-Train, Meta-Train, and Actual-Train steps each need a backward operation in addition to a forward one. The Meta-Train backward step, in which the meta-gradient is computed from the loss on the meta-set, takes more than \(80\%\) of the total computation [53]. In this step, to update the meta-model parameters, the meta-gradient is back-propagated through each layer of the main network. Since normal training does not involve this step, the additional cost quickly becomes significant as the number of layers in deep networks increases. In addition, the amount of GPU memory required is doubled compared to traditional training, because the gradients obtained in the Virtual-Train step must be kept in memory so that the meta-gradient can be calculated during the Meta-Train step. These computation and memory problems are typical of many meta-learning approaches. However, methods such as [53], which computes the meta-gradient with a faster layer-wise approximation, provide strategies to overcome them. These overhead costs are present only during training. In the test phase, the meta-model is not used and only a forward pass on the classifier is needed to get a prediction on the input.

4 EXPERIMENTS
To demonstrate the effectiveness of our method, we conducted several experiments on synthetically generated datasets with a controlled level of noise and imbalance. We also tested it on real-world datasets to assess its ability to adapt to different contexts.
4.1 Datasets
CIFAR10 and CIFAR100 synthetic datasets
Following previous works [20, 44, 45], we used CIFAR-10 and CIFAR-100 as bases to generate synthetic datasets. Each is composed of \(50,\!000\) training images and \(10,\!000\) test images of size 32 \(\times\) 32. From the training set, we randomly selected 100 images per class for CIFAR-10 and 10 images per class for CIFAR-100 to create the meta-set for meta-training. The long-tailed versions of the datasets, CIFAR-LT-10 and CIFAR-LT-100, are created by randomly removing training examples [3]. Following the standard evaluation protocol for the long-tailed problem, we studied five different imbalance factors (IFs) of 200, 100, 50, 20, and 10, where IF = 1 coincides with the original datasets. These IFs are related to the parameter \(\mu \in (0, 1)\): the number of examples retained for the \(y\)-th class is \(n_y\mu ^y\), where \(n_y\) is the original number of training examples for that class.
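As a worked example of the long-tailed construction, the sketch below computes per-class sizes for a given imbalance factor. It takes \(n_y\mu^y\) as the size of class \(y\) after subsampling, indexes classes from 0, and assumes that the imbalance factor is the ratio between the largest and the smallest class, so that \(\mu = \mathrm{IF}^{-1/(C-1)}\). The function name and the rounding rule are illustrative choices.

```python
def long_tail_counts(n_per_class, num_classes, imbalance_factor):
    """Per-class sizes n_y * mu**y for classes y = 0, ..., C-1."""
    # mu is chosen so that the first and last classes differ by the imbalance factor.
    mu = (1.0 / imbalance_factor) ** (1.0 / (num_classes - 1))
    return [round(n_per_class * mu ** y) for y in range(num_classes)]

# CIFAR-10 has 5,000 training images per class.
counts = long_tail_counts(n_per_class=5000, num_classes=10, imbalance_factor=100)
```

With an imbalance factor of 1, \(\mu = 1\) and every class keeps all of its examples, which matches the statement that IF = 1 coincides with the original datasets.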
We tested our method also on a noisy labels version of CIFAR-10 and CIFAR-100, namely Flip CIFAR-10 and Flip CIFAR-100. In these variants, we chose the standard Flip (or asymmetric) noise because it is designed to mimic the structure where labels are only replaced by similar classes, e.g., dog\(\leftrightarrow\)cat. This type of noise usually happens when there is ambiguity between categories or visual similarity between images [52]. The noise ratio is controlled with a parameter \(p\), which represents the probability that a correct label is flipped with the corresponding similar one. In this way, we could test our method on different levels of noise, from \(p = 0.0\) (no noise) to \(p=0.6\) (heavy noise).
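The flip-noise injection can be sketched as follows. The mapping between similar classes is passed in as a dictionary; the pair used in the usage example is a hypothetical illustration based on the dog and cat example above, not the full mapping used in the experiments.

```python
import random

def flip_labels(labels, similar, p, seed=0):
    """Replace each label by its similar class with probability p (asymmetric noise)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        # Only classes that have a designated similar class can be flipped.
        if y in similar and rng.random() < p:
            noisy.append(similar[y])
        else:
            noisy.append(y)
    return noisy

# Usage with a hypothetical similarity mapping.
similar = {"dog": "cat", "cat": "dog"}
labels = ["dog", "cat", "ship", "dog"]
noisy = flip_labels(labels, similar, p=0.4)
```

Classes without an entry in the mapping are never corrupted, so the effective fraction of flipped labels in a dataset can be lower than \(p\).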
Merging the two strategies to inject data issues, we also introduced a new synthetic version of each dataset, named respectively LT Flip CIFAR-10 and LT Flip CIFAR-100, as a new evaluation protocol for the case of training data with simultaneously unbalanced and noisy labels.
ImageNet-LT
In [33] a long-tailed version of ImageNet-2012 [10] called ImageNet-LT, was introduced as standard evaluation protocol for the long-tailed problem. From a Pareto distribution with shape value \(\alpha = 6\), each class size is sampled to obtain the corresponding number of images for each one. ImageNet-LT has \(115,\!800\) training images in \(1,\!000\) classes with an imbalance factor of \(1280/5\). We randomly selected 10 images per class from the provided validation set to create our meta-set for meta-training. The test set is the original balanced ImageNet-2012 validation set with 50 images per class.
Places-LT
The Places-LT dataset is created by sampling from the Places-2 dataset [58] with the same strategy used for ImageNet-LT. The training set is composed of \(62{,}500\) images from 365 classes with an imbalance factor of \(4980/5\). The test set has 100 images per class. Our meta-set is created by randomly selecting 10 examples per class from a validation set of 20 images per class.
Clothing1M
The Clothing1M [52] is a dataset that is composed of 1 million images of clothing taken from online shopping websites. There are 14 categories like T-shirts, Shirts, Knitwear, and so on. The labels are obtained from the text of the images provided by the sellers and not from an expert annotator. This process introduces into the labels a real-world noise, which cannot be predicted in advance. A validation set of 14,313 manually well-annotated images is provided and it was used as the meta dataset in our experiments.
4.2 Meta-model Implementation Details
In every experiment, the meta-model was optimized with Adam [24] and a learning rate of 1e-4. The size of each embedding space was always set to 100. The probability \(\rho\) of the Bernoulli variable \(\beta\) for MES was set to 0.9.
4.3 Long-Tail Label Distribution Results
We conducted several experiments on the imbalance training problem related to image classification. We tried different settings and datasets to compare our method with the others in the literature. We tested MFRW and MES both disjointly and together (MFRW-MES). We consider as baseline method the direct training of the classifier with a standard softmax cross-entropy loss.
CIFAR-LT-10 and CIFAR-LT-100
The first set of experiments on CIFAR-LT-10 and CIFAR-LT-100 was conducted with a ResNet-32 network trained through SGD with a momentum of 0.9, a weight decay of 5e-4, a batch size of 128, and a starting learning rate of 0.1. The learning rate was reduced to \(1/10\) of its value at epochs 160 and 180, and training stopped after 200 epochs. The results of our methods and related works are shown in Table 1.
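The step schedule described above can be written as a small function. This is a sketch under two assumptions that the text does not settle: epochs are 0-indexed, and the decay takes effect at the start of the milestone epoch. The name `step_lr` and its parameters are chosen for illustration only.

```python
def step_lr(epoch, base_lr=0.1, milestones=(160, 180), gamma=0.1):
    """Learning rate for a 0-indexed epoch under a multi-step decay schedule."""
    # Count how many milestones have already been reached.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


for epoch in (0, 159, 160, 179, 180, 199):
    print(epoch, f"{step_lr(epoch):.4f}")
```

The same helper covers other step schedules by changing `milestones`, for example `step_lr(epoch, milestones=(50, 70))` for a 100-epoch run with decays at epochs 50 and 70.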
| Dataset | Long-Tailed CIFAR-10 | Long-Tailed CIFAR-100 | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| IF | 200 | 100 | 50 | 20 | 10 | 200 | 100 | 50 | 20 | 10 |
| Baseline (CE) [45] | 65.68 | 70.36 | 74.81 | 82.23 | 86.39 | 34.84 | 38.32 | 43.85 | 51.14 | 55.71 |
| Focal Loss [31] | 65.29 | 70.38 | 76.71 | 82.76 | 86.66 | 35.62 | 38.41 | 44.32 | 51.95 | 55.78 |
| Fine-tuning [45] | 66.08 | 71.33 | 77.42 | 83.37 | 86.42 | 38.22 | 41.83 | 46.4 | 52.11 | 57.44 |
| CB Loss [9] | 68.89 | 74.57 | 79.27 | 84.36 | 87.49 | 36.23 | 39.6 | 45.32 | 52.59 | 57.99 |
| L2RW [44] | 66.51 | 74.16 | 78.93 | 82.12 | 85.19 | 33.38 | 40.23 | 44.44 | 51.64 | 53.73 |
| MW-Net [45] | 68.91 | 75.21 | 80.06 | 84.94 | 87.84 | 37.91 | 42.09 | 46.74 | 54.37 | 58.46 |
| LDAM-DRW [3] | - | 77.03 | - | - | 88.16 | - | 42.04 | - | - | 58.71 |
| LDAM [18] | - | 80.00 | 82.34 | 84.37 | 87.40 | - | 44.08 | 49.16 | 52.38 | 58.00 |
| FaMUS CE [53] | - | 79.30 | 83.15 | 87.15 | 89.39 | - | 45.60 | 49.56 | 56.22 | 60.42 |
| FaMUS LDAM [53] | - | 80.96 | 83.32 | 86.24 | 87.90 | - | 46.03 | 49.93 | 55.95 | 59.03 |
| GDW [5] | - | 72.34 | - | - | 87.32 | - | 39.52 | - | - | 57.3 |
| BALMS\(^\dagger\) [43] | 74.76 | 80.42 | 83.56 | 89.19 | 47.21 | 57.43 | 61.61 | |||
| MFRW | 78.07 | 84.08 | 87.43 | 88.76 | 40.77 | 44.85 | 49.65 | 56.46 | 60.25 | |
| MES | 72.23 | 78.35 | 81.84 | 86.71 | 40.56 | 44.68 | 50.81 | |||
| MFRW-MES | 81.19 | 86.84 | 88.83 | 43.33 | 52.02 | 56.95 | 60.6 | |||
Table 1. Test Accuracy ( \(\%\) ) of ResNet-32 Architecture on CIFAR-LT-10 and CIFAR-LT-100 under Different Imbalance Factors (IFs)
On CIFAR-LT-10, our methods MFRW and MFRW-MES obtained the best and second-best accuracy values, especially at higher values of IF. On CIFAR-LT-100, instead, the highest accuracy values are shared between BALMS [43] and our method MFRW-MES.
When the number of categories increases from CIFAR-LT-10 to CIFAR-LT-100 with the same ResNet-32 backbone, the performance gain of MFRW was less pronounced than that of MES. Conversely, with the few classes of CIFAR-10, MES yielded only a modest improvement compared to MFRW. These behaviors might depend on the number of examples per class or on the classifier backbone.
To investigate this further, we conducted a second phase of experiments with stronger preprocessing (to increase the variety of training samples) and a different learning rate scheduler. A ResNet-32 classifier was trained for \(13,\!000\) iterations with a batch size of 512, applying AutoAugment [8] and Cutout [11] to the inputs. The initial learning rate was 0.1 and was then decreased to zero with a Cosine Annealing scheduler [34]. The optimizer was SGD with a momentum of 0.9 and a weight decay of 5e-4.
The results of this experiment are reported in Table 2. They show an overall improvement over the results in Table 1, due to the additional preprocessing applied to the inputs of the classifier. In this setting, our method MFRW-MES, which benefits jointly from the MFRW and MES strategies, obtained results comparable with BALMS [43]. Compared to Table 1, MES benefits more than MFRW from the stronger preprocessing on both CIFAR-LT-10 and CIFAR-LT-100, mainly at the highest imbalance factor, IF = 200. This suggests that the few classes of CIFAR-10 and the very small number of training samples in Table 1 (where there is no data augmentation) led MES to a confused estimation of the vector \(s_k\) used in Equation (8). Because MFRW acts on the visual features of the classifier, the small feature size and the small number of parameters of the ResNet-32 architecture could be a limitation for this method. For this reason, we also tested our methods on a ResNet-18 backbone, which has more parameters (11.17M trainable parameters) and a bigger visual feature size than ResNet-32 (0.48M trainable parameters), using the same settings as the experiments in Table 1. The visual feature size grows from 64 for ResNet-32 to 512 for ResNet-18.
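The cosine schedule used in this second phase can be sketched with the usual single-cycle formula \(\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))\). The sketch assumes one cycle without warm restarts and a per-iteration update; the function name `cosine_annealing_lr` and its arguments are illustrative, not taken from the paper.

```python
import math


def cosine_annealing_lr(step, total_steps=13_000, base_lr=0.1, min_lr=0.0):
    """Learning rate at `step` for one cosine cycle from base_lr down to min_lr."""
    # Clamp the step so the function is well defined outside [0, total_steps].
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 3_250, 6_500, 9_750, 13_000):
    print(step, f"{cosine_annealing_lr(step):.4f}")
```

The rate starts at 0.1, passes through 0.05 at the midpoint, and reaches zero at the last iteration.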
| Dataset | Long-Tailed CIFAR-10 | Long-Tailed CIFAR-100 | ||||
|---|---|---|---|---|---|---|
| IF | 200 | 100 | 10 | 200 | 100 | 10 |
| Baseline (CE) | 71.2 | 77.4 | 90.0 | 41.0 | 45.3 | 61.9 |
| CBW | 72.5 | 78.6 | 90.1 | 36.7 | 42.3 | 61.4 |
| CBS | 68.3 | 77.8 | 90.2 | 37.8 | 42.6 | 61.2 |
| Focal Loss [31] | 71.8 | 77.1 | 90.3 | 40.2 | 43.8 | 60.0 |
| CB Loss [9] | 72.6 | 78.2 | 89.9 | 39.9 | 44.6 | 59.8 |
| LDAM Loss [3] | 73.6 | 78.9 | 90.3 | 41.3 | 46.1 | 62.1 |
| Equalization Loss [46] | 74.6 | 78.5 | 90.2 | 43.3 | 47.4 | 60.5 |
| cRT [21] | 76.6 | 82.0 | 91.0 | 44.5 | 50.0 | 63.3 |
| LWS [21] | 78.1 | 83.7 | 45.3 | 63.4 | ||
| BALMS [43] | 81.5 | 91.3 | 50.8 | 63.0 | ||
| MFRW | 78.32 | 82.49 | 90.22 | 42.9 | 47.05 | 62.77 |
| MES | 77.12 | 81.19 | 91.03 | 43.52 | 48.44 | |
| MFRW-MES | 84.97 | 90.99 | 46.52 | 50.44 | 64.06 | |
AutoAugment and Cutout are additionally applied as preprocessing on the data. The results of the cited methods are reported directly from their original papers. Bold marks the best results and underline the second-best ones.
Table 2. Test Accuracy ( \(\%\) ) of ResNet-32 Architecture on CIFAR-LT-10 and CIFAR-LT-100 under Different Imbalance Factors (IFs)
The results in Table 3 support our intuition that MFRW benefits from a bigger classifier backbone. MES, instead, achieved only a small improvement from a more complex backbone, showing that its performance is related to the number of examples in the training set and to their preprocessing. With ResNet-18 as the backbone, MFRW-MES and MFRW obtained the best and second-best accuracy on almost every IF value. Figure 3 shows how the accuracy of MFRW-MES and BALMS [43] varies with the number of parameters of the classifier’s backbone on CIFAR-LT-10 (3(a)) and CIFAR-LT-100 (3(b)) with IF = 100.
| Dataset | Long-Tailed CIFAR-10 | Long-Tailed CIFAR-100 | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| IF | 200 | 100 | 50 | 20 | 10 | 200 | 100 | 50 | 20 | 10 |
| Baseline (CE) | 70.22 | 75.16 | 82.32 | 87.24 | 90.73 | 38.87 | 43.65 | 48.55 | 57.09 | 62.59 |
| CB Loss [9] | 69.16 | 75.16 | 81.9 | 86.61 | 90.79 | 38.58 | 43.51 | 48.15 | 57.02 | 63.1 |
| FSA [7] | 77.06 | 80.57 | 84.51 | 88.54 | 91.75 | 42.84 | 46.57 | 51.9 | 58.69 | 65.08 |
| BALMS\(^\dagger\) [43] | 76.86 | 85.28 | 89.27 | 90.86 | 42.19 | 59.87 | 64.13 | |||
| MFRW | 81.35 | 47.51 | 53.15 | 65.36 | ||||||
| MES | 73.11 | 77.96 | 83.33 | 88.69 | 90.63 | 40.28 | 44.49 | 50.1 | 58.24 | 63.99 |
| MFRW-MES | 79.94 | 83.43 | 86.8 | 89.13 | 91.02 | 43.85 | 50.04 | 54.12 | 61.37 | |
Table 3. Classification Accuracy ( \(\%\) ) of the Architecture ResNet-18, Trained on the Same Settings as Table 1
From Tables 1–3, it can be seen that our method exceeds or is in line with the state-of-the-art algorithms for long-tailed training. We obtained the best accuracy values for almost all IFs, especially when the dataset is extremely unbalanced (IF = 200, 100, 50). Table 3 shows the effectiveness of our method with the bigger ResNet-18 backbone. Both MFRW and MES obtained good results, even individually, and they can be used simultaneously without compromising the final performance of the classifier. We designed MFRW-MES to address even the more complex case of imbalance together with noisy labels.
The embedding space used by the meta-model of MFRW-MES is learned by taking into account the interaction between MFRW and MES. During the training of MFRW-MES, the embedding space is influenced first by MES, which acts directly on the loss of the classifier, and then by MFRW, which operates on the visual features. In Table 1, for CIFAR-LT-10, where MES performs poorly, applying confused predicted vectors \(s_k\) to the loss caused MFRW to receive a diminished gradient on the visual features, leaving it unable to learn the \(W_f\) masks correctly. In some cases, this gives MFRW-MES a lower accuracy than MFRW alone. In Table 3, instead, where the more capable ResNet-18 backbone (11.17M trainable parameters), with a bigger visual feature size than ResNet-32 (0.48M trainable parameters), was used, MFRW-MES obtained better accuracy than either MFRW or MES alone. In this case, when the two methods were applied jointly, MFRW could still act on the visual features despite the reduced gradient produced by MES, exploiting the larger number of learnable parameters of the backbone.
ImageNet-LT and Places-LT
Following the experimental setup of [43], we employed ResNet-10 and ResNet-152 networks for ImageNet-LT and Places-LT, respectively. For ImageNet-LT, we adopted an initial learning rate of 0.2, decayed with a Cosine Annealing scheduler over 180 epochs of training. For Places-LT, the learning rate started at 0.005 and was reduced in the same way; we trained ResNet-152 for a total of 60 epochs with a batch size of 64. In both cases, our method started from a baseline that had been pre-trained on the entire dataset. Unlike the decoupled training strategy of [21], we did not freeze the feature extractor of the pre-trained network. We pre-trained the backbone to shorten the total training time and to let our method start from a reasonably good feature extractor.
Table 4 shows the results of MFRW-MES on ImageNet-LT and Places-LT. On ImageNet-LT, our method achieved a Top-1 accuracy comparable to other methods that target only this task. On Places-LT, we obtained the second-best result.
| Dataset | ImageNet-LT | Places-LT | ||||
|---|---|---|---|---|---|---|
| Method | Top-1 | Top-3 | Top-5 | Top-1 | Top-3 | Top-5 |
| Baseline (CE) | 25.26 | 38.65 | 47.88 | 27 | 47.95 | 58.56 |
| RCB [18] | 29.9 | 54.82 | 30.8 | |||
| OLTR [33] | 35.6 | - | - | 35.9 | - | - |
| Equalization Loss [46] | 36.44 | - | - | - | - | |
| cRT [21] | - | - | 36.7 | - | - | |
| LWS [21] | 41.4 | - | - | 37.6 | - | - |
| BALMS [43] | 41.8 | - | - | 38.7 | - | - |
| MFRW-MES | 41.78 | 59.87 | 67.25 | 61.2 | 71.09 | |
The results of the cited methods are reported directly from their original papers. The best results are marked in bold and the second-best ones with an underline.
Table 4. Top-1, Top-3, and Top-5 Accuracy ( \(\%\) ) of ResNet-10 Classifier on ImageNet-LT and Places-LT
With these experiments, we showed how our algorithm can solve the long-tail data problem via a simple advisor network trained with the meta-learning paradigm.
4.4 Flip Label Noise Results
We trained our model under Flip (or asymmetric) label noise at various levels. To assess the performance of the advisor network, we compared it to other works that studied this type of noise. We trained a ResNet-32 through SGD with a starting learning rate of 0.1 and a batch size of 128. We multiplied the learning rate by 0.1 at epochs 50 and 70, and stopped the training after 100 epochs. We also reproduced the BALMS [43] algorithm under this experimental setting to observe how a method designed specifically for long-tailed distributions behaves under Flip noise. The baseline is the direct training of the classifier with a standard softmax cross-entropy loss.
Table 5 shows that our method obtained the best results under Flip noise on CIFAR10 and CIFAR100. The advisor network avoided the drastic accuracy drop suffered by the other methods, especially when the noise was very strong (\(p = 0.6\)). When there is no noise (\(p = 0.0\)), our method obtained lower accuracy than standard training with the softmax cross-entropy loss on both CIFAR10 and CIFAR100. This happens because the advisor network, in trying to help the classifier, introduces a bias toward the example distribution of the meta-set. If the training distribution already reflects the test distribution better than the meta-set does, introducing this meta bias makes the accuracy slightly worse. In some experiments, MES slightly outperforms MFRW-MES. This behavior is reasonable because MFRW-MES, which also targets the more complex case of combined imbalance and noisy labels, attempts to give the classifier a well-balanced training.
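As an illustration of class-dependent label corruption, the sketch below flips each label with probability \(p\) to a fixed target class. The mapping used here (class \(c\) goes to \((c+1) \bmod C\)) is an assumption made only for the example; the exact flip rule used in the experiments is not restated in this section, and the name `flip_labels` is hypothetical.

```python
import random


def flip_labels(labels, num_classes, p, seed=0):
    """Return a copy of `labels` where each label is flipped with probability p.

    Assumption for this sketch: class c is always flipped to (c + 1) % num_classes,
    so the corruption depends on the class (asymmetric) instead of being uniform.
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            noisy.append((y + 1) % num_classes)
        else:
            noisy.append(y)
    return noisy


clean = [i % 10 for i in range(10_000)]
for p in (0.0, 0.2, 0.4, 0.6):
    noisy = flip_labels(clean, num_classes=10, p=p)
    rate = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
    print(f"p={p:.1f} -> observed corruption rate {rate:.3f}")
```

With \(p = 0.6\), more than half of the labels of every class point to the wrong category, which is why this level is the hardest one reported.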
| Dataset | Flip CIFAR-10 | Flip CIFAR-100 | ||||||
|---|---|---|---|---|---|---|---|---|
| Noise \(p\) | 0.0 | 0.2 | 0.4 | 0.6 | 0.0 | 0.2 | 0.4 | 0.6 |
| Baseline (CE) [45] | 92.89 | 76.83 | 70.77 | - | 70.50 | 50.86 | 43.01 | - |
| Reed-Hard [42] | 92.31 | 88.28 | 81.06 | - | 69.02 | 60.27 | 50.40 | - |
| S-Model [13] | 83.61 | 79.25 | 75.73 | - | 51.46 | 45.45 | 43.8 | - |
| Self-paced [26] | 88.52 | 87.03 | 81.63 | - | 67.55 | 63.63 | 53.51 | - |
| Focal Loss [31] | 86.45 | 80.45 | - | 70.02 | 61.87 | 54.13 | - | |
| Co-teaching [14] | 89.87 | 82.83 | 75.41 | - | 63.31 | 54.13 | 44.85 | - |
| D2L [35] | 92.02 | 87.66 | 83.89 | - | 68.11 | 63.48 | 51.83 | - |
| Fine-tuning [45] | 93.23 | 82.47 | 74.07 | - | 70.72 | 56.98 | 46.37 | - |
| MentorNet [20] | 92.13 | 86.3 | 81.76 | - | 70.24 | 61.97 | 52.66 | - |
| L2RW [44] | 89.25 | 87.86 | 85.66 | - | 64.11 | 57.47 | 50.98 | - |
| GLC [16] | 91.02 | 89.68 | 88.92 | - | 65.42 | 63.07 | 62.22 | - |
| MW-Net [45] | 92.04 | 90.33 | 87.54 | - | 70.11 | 64.22 | 58.64 | - |
| GDW [5] | 92.94 | 91.05 | 87.70 | - | 52.44 | - | ||
| Baseline (CE)\(^\dagger\) | 92.33 | 90.56 | 86.25 | 26.67 | 70.18 | 65.02 | 50.25 | 18.67 |
| MW-Net\(^\dagger\) [45] | 92.19 | 90.74 | 87.63 | 42.41 | 70.57 | 64.13 | 51.23 | 19.89 |
| BALMS\(^\dagger\) [43] | 92.86 | 90.99 | 83.51 | 51.76 | 69.66 | 65.61 | 56.83 | 39.16 |
| MFRW | 91.87 | 91.09 | 90.26 | 89.34 | 68.93 | 63.54 | 59.07 | 56.13 |
| MES | 90.76 | 90.58 | 69.74 | 65.36 | 62.96 | 60.82 | ||
| MFRW-MES | 92.46 | 91.44 | 68.33 | 65.17 | ||||
The backbone used is a ResNet-32. \(p\) denotes the different levels of noise. The results for the cited methods are reported directly from their original papers. Instead, \(^\dagger\) indicates the results obtained by our implementation. The first and the second-best results are respectively marked in bold and underline.
Table 5. Test Accuracy on CIFAR10 and CIFAR100 Dataset with Flip (Asymmetric) Label Noise
4.5 Long-Tail & Flip Label Noise Results
We introduce a new synthetic dataset setting in which the unbalanced and noisy label problems are both present. We chose three values of IF (200, 100, 10) and two values of \(p\) (0.4, 0.6), and generated all possible combinations for both CIFAR10 and CIFAR100. All experiments were performed by training a ResNet-32 with the same settings and hyperparameters as those used to obtain the results listed in Table 1. This experiment is important to establish the ability of an algorithm to handle different types of dataset conditions at the same time.
We compared our method with BALMS [43], because it is designed for long-tailed distributions, and with MW-Net [45] and GDW [5], which, like our method, can deal with any type of problem in the data. The results in Table 6 indicate that our advisor network can handle noisy labels and long-tailed distributions at the same time better than the other methods. MFRW-MES can exploit the different properties of MFRW and MES together, achieving better results than either method used separately.
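The experimental grid described above can be enumerated explicitly. The sketch below lists the 12 settings and, for each, the class sizes of a long-tailed training set. It assumes the exponential class-size profile commonly used to build long-tailed CIFAR variants, with the largest class keeping all its images and the smallest keeping a fraction 1/IF; the helper name `long_tail_class_sizes` is illustrative and the profile is an assumption, since this section does not restate it.

```python
from itertools import product


def long_tail_class_sizes(num_classes, max_per_class, imbalance):
    """Class sizes decaying exponentially from max_per_class to max_per_class / IF."""
    sizes = []
    for i in range(num_classes):
        decay = (1.0 / imbalance) ** (i / (num_classes - 1))
        sizes.append(round(max_per_class * decay))
    return sizes


# Dataset name -> (number of classes, images per class in the balanced training set).
datasets = {"CIFAR10": (10, 5000), "CIFAR100": (100, 500)}
imbalance_factors = (200, 100, 10)
noise_levels = (0.4, 0.6)

settings = list(product(datasets, noise_levels, imbalance_factors))
print(len(settings))  # 12 combinations

for name, p, imb in settings:
    num_classes, max_per_class = datasets[name]
    sizes = long_tail_class_sizes(num_classes, max_per_class, imb)
    print(name, f"p={p}", f"IF={imb}",
          "largest:", sizes[0], "smallest:", sizes[-1], "total:", sum(sizes))
```

Label corruption with probability \(p\) would then be applied on top of each subsampled set; the order of the two operations is not specified in the text and is left out of the sketch.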
| Dataset | LT Flip CIFAR-10 | LT Flip CIFAR-100 | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Noise p | 0.4 | 0.6 | 0.4 | 0.6 | ||||||||
| IF | 200 | 100 | 10 | 200 | 100 | 10 | 200 | 100 | 10 | 200 | 100 | 10 |
| Baseline (CE)\(^\dagger\) | 49.64 | 56.98 | 76.58 | 31.78 | 31.96 | 31.78 | 22.03 | 23.81 | 39.48 | 12.08 | 13.65 | 19.6 |
| MW-Net\(^\dagger\) [45] | 45.74 | 52.43 | 82.22 | 32.06 | 33.22 | 46.5 | 24.34 | 25.24 | 39.05 | 12.96 | 14.96 | 20.01 |
| GDW\(^\dagger\) [5] | 36.28 | 49.73 | 81.3 | 27.9 | 34.39 | 60.83 | 21.98 | 23.62 | 34.92 | 13.47 | 14.12 | 20.06 |
| BALMS\(^\dagger\) [43] | 53.73 | 59.24 | 70.4 | 52.55 | 57.39 | 44.44 | 32.02 | |||||
| MFRW | 55.9 | 83.76 | 44.41 | 69.16 | 23.45 | 25.26 | 38.08 | 17.65 | 18.48 | 29.58 | ||
| MES | 53.33 | 64.51 | 85.8 | 44.45 | 52.46 | 25.41 | 26.76 | 17.12 | 18.79 | |||
| MFRW-MES | 61.67 | 50.19 | 59.38 | 90.26 | 31.73 | 34.2 | 53.89 | 21.79 | 24.3 | 39.99 | ||
Table 6. Test Accuracy on CIFAR10 and CIFAR100 Dataset with Two Levels of Flip Label Noise p (0.4, 0.6) and Three Different Imbalance Factors IFs (200, 100, 10)
4.6 Real-world Label Noise Results
To test real-world noise, we used Clothing1M with a ResNet-50 backbone pre-trained on ImageNet, trained through SGD with a momentum of 0.9, a weight decay of 1e-3, and a starting learning rate of 0.01. The batch size was 32, and each image was preprocessed by resizing it to 256 \(\times\) 256, randomly cropping a 224 \(\times\) 224 patch, and finally performing normalization. The training process consisted of 20 epochs, with the learning rate multiplied by 0.1 after 10 and 15 epochs.
The results reported in Table 7 show that our method obtains state-of-the-art accuracy on Clothing1M, with a relative improvement of \(3.10\%\) (77.44 vs. 75.11) over the best previously published algorithm [39]. With respect to the baseline accuracy (68.94), the relative increment is \(12.33\%\).
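The percentages quoted above are relative gains, not differences in accuracy points. A short check using the values listed in Table 7 makes the distinction explicit; the helper name `relative_gain` is chosen here for illustration.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


mfrw_mes = 77.44  # MFRW-MES, Table 7
augdesc = 75.11   # AugDesc [39], Table 7
baseline = 68.94  # Baseline (CE), Table 7

print(f"{relative_gain(mfrw_mes, augdesc):.2f}")   # 3.10
print(f"{relative_gain(mfrw_mes, baseline):.2f}")  # 12.33
print(f"{mfrw_mes - augdesc:.2f}")                 # 2.33 accuracy points
print(f"{mfrw_mes - baseline:.2f}")                # 8.50 accuracy points
```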
| Method | Accuracy (%) |
|---|---|
| Baseline (CE) [45] | 68.94 |
| F-correction [40] | 69.84 |
| JoCoR [51] | 70.30 |
| S-adaptation [13] | 70.36 |
| M-correction [2] | 71.00 |
| MLC [50] | 71.06 |
| Joint-Optim [47] | 72.16 |
| MLNT [28] | 73.47 |
| P-correction [54] | 73.49 |
| MW-Net [45] | 73.72 |
| MentorMix [19] | 74.30 |
| FaMUS [53] | 74.43 |
| DivideMix [27] | 74.76 |
| AugDesc [39] | 75.11 |
| MFRW | 75.35 |
| MES | |
| MFRW-MES | 77.44 |
Results for cited methods were copied from original papers.
Table 7. Comparison with State-of-the-Art Methods in Test Accuracy \((\%)\) on Clothing1M Dataset with Real-world Noise
4.7 Real-world Label Noise with Long-Tail Imbalance Results
To verify the effectiveness of our method in a mixed setting of real-world label noise and long-tail distribution, we applied three different levels of IF (1, 50, 100) to Clothing1M, as described in [22]. We trained a ResNet-18, pre-trained on ImageNet, with the SGD optimizer, a momentum of 0.9, a weight decay of 1e-4, and a starting learning rate of 0.01. We optimized the classifier for a total of 20 epochs, with the learning rate multiplied by 0.1 after 10 and 15 epochs. We used a batch size of 64, and each image was preprocessed by resizing it to 256 \(\times\) 256, randomly cropping a 224 \(\times\) 224 patch, and finally performing normalization.
From the results in Table 8, we can observe how our method obtains state-of-the-art accuracy values on the Clothing1M dataset with and without an artificially applied long-tail distribution.
Table 8. Accuracy \((\%)\) Values of State-of-the-Art Methods on the Clothing1M Dataset with Real-world Noise and Three IFs (1, 50, 100)
4.8 Variation of Meta-set Size
We verified how the size of the meta-set affects the performance of our method MFRW-MES. We increased the number of samples in the meta-set from 0, which corresponds to the baseline method, to a maximum of 1,000 for the CIFAR-10 and CIFAR-100 datasets and to the full validation set (\(14k\) images) for the Clothing1M dataset. We chose two specific settings for the CIFAR-10/100 datasets, one with Flip noise of intensity \(p=0.6\) and the other with a long-tail distribution generated with \(IF=100\). The results of each experiment are shown in Figure 4. Even with few examples per class, starting from a meta-set size that is \(0.2\%\) of the entire training dataset, our method performed well on the two artificial settings of CIFAR-10 and CIFAR-100. The plot in Figure 4(c), instead, shows that the meta-set size matters for reaching the state-of-the-art result on Clothing1M. This was to be expected, since the noise structure in the annotations of Clothing1M is much harder to model than artificially generated noise, so having more examples in the meta-set allows our method to learn a much more complex function to help the classifier. However, with a meta-set that is only \(1.38\%\) of all the training data, our method obtained a \(12.33\%\) relative increment over the baseline accuracy.
Fig. 4. Plot of accuracy results of MFRW-MES method at variations of meta-set size.
4.9 Qualitative Advisor Network Results
This section provides a qualitative analysis of various aspects of our advisor network. To better understand how our method is helping the main classifier, it is important to look at what and how the meta-model learns.
Distribution of learned attention weights
First, we checked how the weight masks predicted by the meta-model for the meta-activation part (MFRW) are distributed across the training examples. We extracted the first two principal components of a PCA on the predicted weights \(W_f\) of the meta-model after training the classifier on CIFAR10 with Flip noise. The two components are plotted in Figure 5 for four values of Flip noise strength, from \(p=0.0\) to \(p=0.6\). For every value of \(p\) except \(p=0\), where there is no noise, the predicted \(W_f\) separate into two large clusters, which indicates that the meta-model learned to weight the examples with label errors differently from those with correct labels. This is the effect of giving the advisor network the loss value of each training example: an out-of-distribution example has a larger loss than a correctly labeled one.
Fig. 5. Plot of the first two principal components of a PCA on the weights \(W_f\) obtained from the training on CIFAR10 with four different values of Flip noise: \(p=0.0\) (5(a)), \(0.2\) (5(b)), \(0.4\) (5(c)), \(0.6\) (5(d)). Pink dots indicate examples with the correct label, while light blue dots indicate examples with a noisy label. The clear separation between noisy and correct examples indicates that the weights are generated differently for these two categories.
Next, we applied t-SNE [48] to the predicted \(W_f\) to see whether there is also a per-class separation between them. From the t-SNE plot in Figure 6, it can be deduced that the \(W_f\) also show a per-class separation in addition to the noisy/correct one. This is due to having the visual features as input to the meta-model, which allows it to predict different weights based not only on the loss value but also on the image content. Figure 7(a) shows the weights \(W_f\) of the first 24 examples of the original class “airplane”, affected by Flip noise with \(p=0.6\). The information from noisy-label examples (light blue border) is manipulated differently from that of correct examples (pink border).
Fig. 6. t-SNE of the predicted weight vectors \(W_f\) learned on CIFAR-10 with four different values of Flip noise: \(p=0.0\) (6(a)), \(0.2\) (6(b)), \(0.4\) (6(c)), \(0.6\) (6(d)). Pink dots indicate examples with the correct label, while light blue dots indicate examples with a noisy label. Each category is denoted by a colored border. Besides the separation between noisy and correct examples, there is also one at the category level. This indicates distinct predicted weight vectors \(W_f\) for features belonging to different classes.
Fig. 7. Attention weights \(W_f\) of the first 24 examples of the class “airplane” (7(a)) learned by our meta-model at the end of training on CIFAR10 with Flip noise (\(p=0.6\)). Pink indicates examples with the correct label, while light blue indicates the noisy ones. Attention weights \(W_f\) of examples of the common class “apples” (7(b)) and of the rare class “roses” (7(c)) of CIFAR-LT-100 with IF 200.
We then analyzed the \(W_f\) learned on CIFAR-LT-100 with IF = 200, the most difficult setting. As shown in Figures 7(b) and 7(c), the predicted weights \(W_f\) differ both between classes and within the same class. The weights of the frequent class “apples” (Figure 7(b)) have more values close to zero (black), whereas those of the rare class “roses” (Figure 7(c)) have many values close to one (white). To better visualize the \(W_f\) distribution in the case of imbalance, we applied PCA and t-SNE to the attention masks learned on CIFAR-LT-10 with IF = 200 and IF = 100. The results of these two operations are plotted in Figure 8.
Fig. 8. PCA (top) and t-SNE (bottom) of the predicted weight vectors \(W_f\) learned on CIFAR-LT-10 with two different values of imbalance, \(IF=200\) (8(a)) and \(IF=100\) (8(b)). Each category is denoted by a different color, described in the legend. An indicator (bar) of each class size is shown under the category legend. The indicator does not express the exact number; it only shows which categories have fewer examples than others.
This means that the information from examples of common categories is ignored much more than that from the rare classes. Moreover, the examples belonging to the same class are not all weighted equally. This is shown in Figure 7(b), where some weight vectors have more values close to one (white) than others. It happens because those examples contain information that is still useful to the main classifier.
Softmax weights learned with MES
We investigated how our softmax weight \(s_k\), learned with MES, differs from the hand-crafted solution proposed by [46]. We measured the effectiveness of each solution by calculating the Mean Absolute Error (MAE) between the distribution of the class sizes, normalized between 0 and 1, and the vector of the weights passed to Equation (6). Figure 9 shows the MAE values obtained during the training of the main network on CIFAR-LT-100 with different IF values. MES fits the target distribution better than the simple threshold function applied in [46] for all IF values and does not need any extra hyperparameter tuning.
Fig. 9. Comparison of MES with the two functions defined in [46] on CIFAR-LT-100 with different values of IF: 10 (9(a)), 50 (9(b)), 100 (9(c)), 200 (9(d)). The graph reports the Mean Absolute Error (MAE) between the distribution of the class sizes (normalized between 0 and 1) and the vector of weights given to Equation (6). Lower values of MAE indicate a better fit of the target distribution. The graph also shows details of the predicted weight vectors \(s_k\) at various learning steps.
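The MAE measure described above can be sketched in a few lines. Two points are assumptions of this example and not statements about the paper: the class sizes are normalized with min-max scaling (the text only says "normalized between 0 and 1"), and both weight vectors below are made-up toy values, not outputs of MES or of the function in [46].

```python
def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mean_absolute_error(target, predicted):
    """Average absolute difference between two equally long vectors."""
    if len(target) != len(predicted):
        raise ValueError("vectors must have the same length")
    return sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)


class_sizes = [500, 300, 120, 40, 5]  # toy long-tailed class sizes
target = min_max_normalize(class_sizes)

smooth_weights = [0.95, 0.60, 0.25, 0.05, 0.0]  # hypothetical learned vector
threshold_weights = [1.0 if n >= 100 else 0.0 for n in class_sizes]  # hypothetical hard threshold

print(mean_absolute_error(target, smooth_weights))
print(mean_absolute_error(target, threshold_weights))
```

In this toy case the smooth vector tracks the normalized class sizes more closely than the binary one, which is the kind of gap the MAE curves are meant to expose.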
5 CONCLUSIONS
We introduced two new methods, Meta Feature Re-Weighting (MFRW) and Meta Equalization Softmax (MES), which make use of the novel concept of an advisor network to mitigate the problem of training DNNs on noisy labels and long-tailed class distributions. We empirically showed the effectiveness of our method on synthetically generated and real-world datasets for the classification task. Experimental results demonstrate that the advisor strategy can help the main classifier achieve better generalization performance for both training data problems. We also introduced a new synthetic dataset setting where the long-tailed distribution is mixed with the noisy label problem, and showed that our method handles both problems simultaneously, unlike other similar works. We obtained state-of-the-art performance on the Clothing1M dataset, which contains real-world label noise. Future research in this area may include adapting the advisor network to tasks more complex than classification, such as object detection or image segmentation.
- [1] . 2021. Image classification with deep learning in the presence of noisy labels: A survey. Knowledge-Based Systems 215 (2021), 106771.Google Scholar
Cross Ref
- [2] . 2019. Unsupervised label noise modeling and loss correction. In International Conference on Machine Learning. PMLR, 312–321.Google Scholar
- [3] . 2019. Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413 (2019).Google Scholar
- [4] . 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.Google Scholar
Cross Ref
- [5] . 2021. Generalized DataWeighting via class-level gradient manipulation. Advances in Neural Information Processing Systems 34 (2021), 14097–14109.Google Scholar
- [6] . 2020. AdvAug: Robust adversarial augmentation for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 5961–5970.Google Scholar
Cross Ref
- [7] . 2020. Feature space augmentation for long-tailed data. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16. Springer, 694–710.Google Scholar
Digital Library
- [8] . 2019. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 113–123.Google Scholar
Cross Ref
- [9] . 2019. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9268–9277.Google Scholar
Cross Ref
- [10] . 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.Google Scholar
Cross Ref
- [11] 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017).
- [12] 2003. Class imbalance and cost sensitivity: Why undersampling beats oversampling. In ICML-KDD 2003 Workshop: Learning from Imbalanced Datasets, Vol. 3.
- [13] 2016. Training deep neural-networks using a noise adaptation layer. (2016).
- [14] 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. arXiv preprint arXiv:1804.06872 (2018).
- [15] 2005. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing. Springer, 878–887.
- [16] 2018. Using trusted data to train deep networks on labels corrupted by severe noise. Advances in Neural Information Processing Systems 31 (2018), 10456–10465.
- [17] 2016. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5375–5384.
- [18] 2020. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7610–7619.
- [19] 2020. Beyond synthetic noise: Deep learning on controlled noisy labels. In International Conference on Machine Learning. PMLR, 4804–4815.
- [20] 2018. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning. PMLR, 2304–2313.
- [21] 2019. Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217 (2019).
- [22] 2021. Learning from long-tailed data with noisy labels. arXiv preprint arXiv:2108.11096 (2021).
- [23] 2017. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems 29, 8 (2017), 3573–3587.
- [24] 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- [25] 2016. Exploring the long tail of social media tags. In International Conference on Multimedia Modeling. Springer, 51–62.
- [26] 2010. Self-paced learning for latent variable models. Advances in Neural Information Processing Systems 23 (2010), 1189–1197.
- [27] 2020. DivideMix: Learning with noisy labels as semi-supervised learning. In International Conference on Learning Representations. https://openreview.net/forum?id=HJgExaVtwr.
- [28] 2019. Learning to learn from noisy labeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5051–5059.
- [29] 2019. Long-tail hashtag recommendation for micro-videos with graph convolutional network. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 509–518.
- [30] 2016. Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Computing Surveys (CSUR) 49, 1 (2016), 1–39.
- [31] 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
- [32] 2020. Early-learning regularization prevents memorization of noisy labels. Advances in Neural Information Processing Systems 33 (2020), 20331–20342.
- [33] 2019. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2537–2546.
- [34] 2016. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016).
- [35] 2018. Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning. PMLR, 3355–3364.
- [36] 2018. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV’18). 181–196.
- [37] 2018. A multimodal approach to predict social media popularity. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR’18). IEEE, 190–195.
- [38] 2010. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review 33, 4 (2010), 275–306.
- [39] 2021. Augmentation strategies for learning with noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8022–8031.
- [40] 2017. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1944–1952.
- [41] 2006. Class noise and supervised learning in medical domains: The effect of feature extraction. In 19th IEEE Symposium on Computer-based Medical Systems (CBMS’06). IEEE, 708–713.
- [42] 2014. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596 (2014).
- [43] 2020. Balanced meta-softmax for long-tailed visual recognition. arXiv preprint arXiv:2007.10740 (2020).
- [44] 2018. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning. PMLR, 4334–4343.
- [45] 2019. Meta-weight-net: Learning an explicit mapping for sample weighting. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. 1919–1930.
- [46] 2020. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11662–11671.
- [47] 2018. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5552–5560.
- [48] 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008).
- [49] 2017. Learning to model the tail. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 7032–7042.
- [50] 2020. Training noise-robust deep neural networks via meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4524–4533.
- [51] 2020. Combating noisy labels by agreement: A joint training method with co-regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13726–13735.
- [52] 2015. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2691–2699.
- [53] 2021. Faster meta update strategy for noise-robust deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 144–153.
- [54] 2019. Probabilistic end-to-end noise correction for learning with noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7017–7025.
- [55] 2019. Feature transfer learning for face recognition with under-represented data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5704–5713.
- [56] 2018. Mixup: Beyond empirical risk minimization. In International Conference on Learning Representations.
- [57] 2020. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9719–9728.
- [58] 2017. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2017), 1452–1464.
- [59] 2020. Inflated episodic memory with region self-attention for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4344–4353.