Semi-supervised Semantic Segmentation with Mutual Knowledge Distillation

Consistency regularization has been widely studied in recent semi-supervised semantic segmentation methods, and promising performance has been achieved. In this work, we propose a new consistency regularization framework, termed mutual knowledge distillation (MKD), which combines data and feature augmentation. We introduce two auxiliary mean-teacher models based on consistency regularization: the pseudo-labels generated by one mean teacher supervise the student network of the other branch, achieving mutual knowledge distillation between the two branches. In addition to image-level strong and weak augmentation, we also explore feature augmentation, which provides additional sources of knowledge for distilling the student networks and significantly increases the diversity of the training samples. Experiments on public benchmarks show that our framework outperforms previous state-of-the-art (SOTA) methods under various semi-supervised settings. Code is available at semi-mmseg.


INTRODUCTION
Semantic segmentation is a fundamental task in visual understanding that aims to classify each pixel of an image into a predefined set of categories. Recent works in semantic segmentation [4,32,36,40,44,45] have made significant progress using supervised learning on large-scale datasets [8,9,27,47]. However, labeling such datasets is labor-intensive and time-consuming for dense prediction problems, requiring up to 60 times more effort than image-level labeling [23]. To address this limitation, semi-supervised learning [2,29,41,42] attempts to learn a model from a limited set of labeled images and a large set of unlabeled images.
State-of-the-art semi-supervised semantic segmentation methods employ consistency regularization to enforce similarity between the outputs of teacher and student during training. Data augmentation is commonly applied to images [39,41,42], but it can also be applied to features [28]. Additionally, using multiple networks with distinct initialization parameters is common practice [5,20]. For example, CPS [5] feeds the same image into two differently initialized networks and uses the pseudo labels generated by one branch to supervise the other. However, this method does not preserve important historical information during training: the two branches are optimized with back-propagation and no moving average, so the model 'forgets' important historical information as training proceeds, as noted in previous research [13,15,31].
To further improve semi-supervised semantic segmentation, we propose a novel mutual knowledge distillation framework. It employs two co-training branches [5] with differently initialized parameters and two auxiliary mean teacher models that record information during training and provide extra supervision. The pseudo labels generated by one teacher network supervise the other student, and vice versa. Weak augmentation is applied to the teacher inputs to increase prediction confidence, while the student inputs are strongly augmented to diversify samples. Pseudo labels from the teachers therefore tend to be more reliable, while the students are trained on more diverse and challenging samples. We further explore feature-level augmentation in the student networks, drawing inspiration from the implicit semantic data augmentation technique of [33,34]. Our approach achieves state-of-the-art performance on the PASCAL VOC 2012 [9], Cityscapes [8], and COCO [23] datasets under various semi-supervised splits. Our main contributions are summarized as follows:

1) We propose the mutual knowledge distillation framework, a new consistency regularization approach for semi-supervised semantic segmentation. The framework involves two differently initialized student networks and two corresponding mean teacher networks; the knowledge from one teacher network is used to supervise the other student branch, and vice versa.
2) We investigate the efficacy of different data augmentation methods within the framework. Specifically, we discuss feature-level augmentation to enhance the diversity of the training data, and we apply strong and weak augmentations to the student and teacher networks, respectively.
3) We empirically demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on the PASCAL VOC 2012, Cityscapes, and COCO datasets under various semi-supervised settings. A detailed ablation study verifies the usefulness of each component of the proposed framework.

RELATED WORK

Semantic segmentation
A variety of methods have been proposed for this task [4,32,36,40,44,45], starting with the fully convolutional network (FCN) [25], which trains a pixel-level classifier. Our work is based on DeepLabV3Plus [4], which applies a spatial pyramid pooling structure and an encoder-decoder structure to refine object boundaries. The majority of existing approaches operate in the fully supervised regime, which requires a significant amount of labeled data.
Consistency-based methods enforce the model to generate the same prediction for augmented and original images. Temporal ensembling [22] ensembles multiple checkpoints of the student. In particular, mean teacher [1,31,32] maintains the teacher weights as an exponential moving average of the student's parameters, and the student model is supervised with the pseudo-labels generated by the teacher model.
Co-training for consistency [5,20] feeds the same image into two differently initialized networks and uses the pseudo labels generated by one branch to supervise the other. U2PL [35] selects reliable annotations from unreliable candidate pixels. Self-training methods [17,38,39,41] generate pseudo labels to enlarge the training set, using a teacher model together with suitable data augmentation and thresholds. Unlike these methods, our approach uses two student networks and two auxiliary mean teacher networks, and applies image augmentation and feature augmentation in the same framework, boosting the students' performance.

Data augmentation
We describe data augmentation in SSL from two perspectives: image-level augmentation [43] and feature-level augmentation [10,12]. For instance, FixMatch [29] treats weakly-augmented samples as more reliable anchors and constrains the output on strongly-augmented data to match them. Similarly, UDA [37] enforces similar outputs for weakly-augmented and complex-augmented data. CutMix [42] is a widely adopted technique that generates pseudo labels and implicitly implements entropy minimization by ensuring the decision boundary passes through a low-density region of the distribution. CCT [28] and GCT [20] use a similar idea to achieve feature augmentation through cross-confidence consistency. Consistency-based SSL methods with auxiliary networks can be considered network-level augmentation. In this paper, we propose applying complex augmentation (image-level and feature-level) to the students and weak augmentation to the teachers.

METHOD
We propose a novel consistency regularization framework based on mutual knowledge distillation, described in Sec. 3.1. The image augmentation method is discussed in Sec. 3.2, and the training procedure is introduced in Sec. 3.3. We aim to train an end-to-end segmentation model with a massive amount of unlabeled data and few labeled data in a semi-supervised learning manner.

Mutual Knowledge Distillation Framework
Overview. We first present the settings for a typical semi-supervised semantic segmentation task. The labeled and unlabeled datasets are denoted as $\mathcal{D}_l = \{(x_i^l, y_i^l)\}$ and $\mathcal{D}_u = \{x_i^u\}$, where $x \in \mathbb{R}^{H \times W \times 3}$ is the input RGB image of size $H \times W$ and $y \in \mathbb{R}^{H \times W \times C}$ represents the pixel-level one-hot label map for $C$ classes. The proposed MKD framework is illustrated in Figure 1 and consists of four branches: two baseline student networks and two auxiliary mean teacher networks. The labeled images are fed into the student networks and optimized with the standard cross-entropy loss $\mathcal{L}_{s}$ against the ground truth labels. The unlabeled images with strong (weak) augmentation are fed into the student (teacher) networks. Each student network is trained under the supervision of the pseudo labels generated by the other student network ($\mathcal{L}_{cps}$) and by the other teacher network ($\mathcal{L}_{mkd}$). The knowledge between the two branches is transferred through the proposed MKD framework. Details for each part are described below.

Baseline student networks. Co-training establishes two networks with identical structures but divergent initializations, and imposes consistency constraints between their outputs. Our baseline student networks follow the previous state-of-the-art co-training method, CPS [5]. The student networks are denoted $f_{\theta_1}$ and $f_{\theta_2}$; their structures are the same, but their parameters $\theta_1$ and $\theta_2$ are initialized differently. Given an input image $x$, the students produce predictions $p_1$ and $p_2$, respectively. Following the typical co-training baseline [5], each student is supervised by the pseudo labels generated by the other student, with loss $\mathcal{L}_{cps}$.

Auxiliary mean teacher networks. Prior research [13,15,31] has shown that mean teacher models can store and leverage historical information, thereby enhancing model performance. The mean teacher does not need to be optimized, so it adds relatively little computation. Building on this idea, we incorporate two auxiliary mean teacher networks, $f_{\tilde{\theta}_1}$ and $f_{\tilde{\theta}_2}$, into our MKD framework. Each teacher shares the network structure of its student but requires no back-propagation during training: the corresponding student updates the teacher's parameters with an exponential moving average (EMA) as in Eq. (1), where $\alpha$ controls the speed of updates and $i \in \{1, 2\}$ indexes the branch being updated:

$$\tilde{\theta}_i \leftarrow \alpha\,\tilde{\theta}_i + (1 - \alpha)\,\theta_i. \quad (1)$$

Mutual knowledge distillation. As illustrated in Figure 1, labeled samples are used to train the student models with the supervised loss. Unlabeled samples, after strong augmentation, are fed into the two student models to obtain outputs $p_{s_1}$ and $p_{s_2}$. Similarly, to generate more reliable supervision, samples after weak augmentation are fed into the two teacher models to obtain outputs $p_{t_1}$ and $p_{t_2}$. The proposed MKD framework has two main objectives. First, the teacher network should update smoothly and produce high-confidence predictions on easy samples. Second, the student network should learn from more challenging samples; we therefore apply all levels of augmentation to the samples fed into the students. The two auxiliary mean teachers provide reliable supervision that is not prone to collapse. To apply the network-level augmentation, the pseudo label $\mathbf{y}_{t_1}$ from teacher $f_{\tilde{\theta}_1}$ supervises the logits map $p_{s_2}$ from student $f_{\theta_2}$, and vice versa. Eq. (2) gives the consistency loss between teacher and student models, where $\mathbf{y}_{t_1}$ and $\mathbf{y}_{t_2}$ denote the one-hot labels of the teachers' outputs:

$$\mathcal{L}_{mkd} = \ell_{ce}(p_{s_2}, \mathbf{y}_{t_1}) + \ell_{ce}(p_{s_1}, \mathbf{y}_{t_2}). \quad (2)$$

Knowledge selection. We add a threshold on the teacher branch so that training is supervised only under high-confidence predictions. We call this process knowledge selection: the threshold ensures the teacher is confident about the knowledge transferred to the students during distillation. By selecting a threshold of 0.95, we filter out noisy signals and keep the supervision between teacher and student reliable. We found that if the same threshold is applied to the supervision between one student and the other, useful information is lost and worse results are obtained. Following Eq. (3), we apply a threshold $\tau$ to achieve knowledge selection; $\tau$ is set to 0.95 by default.
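To make the interaction concrete, here is a minimal NumPy sketch of the two mechanisms above: the EMA teacher update of Eq. (1) and a confidence-thresholded teacher-to-student distillation loss. The function names, shapes, and dictionary parameter representation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Exponential moving average teacher update (Eq. (1)):
    theta_t <- alpha * theta_t + (1 - alpha) * theta_s."""
    return {k: alpha * teacher[k] + (1 - alpha) * student[k] for k in teacher}

def knowledge_selection_mask(teacher_probs, tau=0.95):
    """Keep only pixels where the teacher's max class probability >= tau."""
    return teacher_probs.max(axis=-1) >= tau

def distill_loss(student_log_probs, teacher_probs, tau=0.95):
    """Cross-entropy between a student's log-probabilities and the other
    teacher's hard pseudo-labels, restricted to confident teacher pixels
    (knowledge selection)."""
    mask = knowledge_selection_mask(teacher_probs, tau)
    pseudo = teacher_probs.argmax(axis=-1)  # hard pseudo-labels per pixel
    ce = -np.take_along_axis(student_log_probs, pseudo[..., None], axis=-1)[..., 0]
    return (ce * mask).sum() / max(mask.sum(), 1)
```

In the full framework, `distill_loss` would be evaluated twice, crossing branches: student 2 against teacher 1 and student 1 against teacher 2, as in Eq. (2).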

Augmentation
Pseudo-labels are generated by the model itself and provide limited information. To increase the diversity of the samples for the student networks in our MKD framework, we apply image-level and feature-level augmentations.
Image augmentation. Image augmentation is based on weak and strong augmentation pairs. Weak data augmentation (WDA) (e.g., image flipping, cropping, and resizing) is applied to the images passed to the teacher models. Strong data augmentation (SDA) (e.g., image flipping, cropping, resizing, CutMix, and an operator randomly selected from color jitter, blur, gray-scale, equalize, and solarize) is applied to the same images fed to the student models to improve overall generalization. Motivated by the distribution shift that strong color transforms cause in batch normalization [41], we do not use many strong color augmentation operations. In particular, CutMix [42] applies a binary mask $M$ that combines two images via $\tilde{x} = (1 - M) \odot x_A + M \odot x_B$. We apply CutMix to pairs of input images in the batch for the student models, and apply the same binary mask to the teachers' logits via $\tilde{p} = (1 - M) \odot p_A + M \odot p_B$; the mixed teacher output $\tilde{p}$ is then used to supervise the students.

Feature augmentation. Feature data augmentation (FDA) uses unlimited meaningful semantic transformations to modify the feature space, tweaking the image semantics without the need for an auxiliary network. FDA finds suitable translation vectors in the feature space and generates an enhanced feature set. Category information is essential for this enhancement; for unlabeled data, where such information is absent, we use pseudo-labels as a replacement.
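A rough NumPy sketch of the CutMix mixing described above; the box-sampling heuristic and function names are our assumptions, and the returned mask can be reused on labels or teacher logits exactly as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_box(h, w, lam):
    """Sample a CutMix box whose area is roughly (1 - lam) of the image."""
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(h), rng.integers(w)
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
    return y1, y2, x1, x2

def cutmix(x_a, x_b, lam=0.5):
    """x_tilde = (1 - M) * x_a + M * x_b; the same binary mask M is
    returned so it can be applied to pseudo-labels or teacher logits."""
    h, w = x_a.shape[:2]
    mask = np.zeros((h, w, 1), dtype=x_a.dtype)
    y1, y2, x1, x2 = rand_box(h, w, lam)
    mask[y1:y2, x1:x2] = 1.0
    return (1 - mask) * x_a + mask * x_b, mask
```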
When applying FDA to semi-supervised semantic segmentation, we augment each feature to develop an improved feature set, and the network is refined by reducing the cross-entropy (CE) loss. If we let the number of augmentations $M$ tend towards infinity, we can compute the expected CE loss over all feasible augmented features; an upper bound of this loss is presented in Eq. (4).
To incorporate semantic augmentation for semi-supervised semantic segmentation, we analyze the structure of the ISDA loss. Eq. (4) adds only one term to the standard cross-entropy loss: the logit of class $c$ for each pixel of the student's features can be computed as $p_c = \mathbf{w}_c^\top \mathbf{f} + b_c + \frac{\lambda}{2}(\mathbf{w}_c - \mathbf{w}_y)^\top \Sigma_y (\mathbf{w}_c - \mathbf{w}_y)$. Pseudo-labels provide the category information required for feature augmentation. Further details can be found in the supplementary materials.
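For reference, the upper bound referred to as Eq. (4) presumably takes the standard ISDA form [33]; a sketch, assuming class-conditional Gaussian feature perturbations with covariance $\lambda\Sigma_{y}$:

```latex
% Expected CE loss over infinitely many augmented features
% f_i ~ N(f_i, \lambda \Sigma_{y_i}) is upper-bounded (via Jensen's
% inequality) by a cross-entropy with shifted logits:
\overline{\mathcal{L}}_{\infty}
  = \frac{1}{N}\sum_{i=1}^{N}
    -\log\frac{e^{\mathbf{w}_{y_i}^{\top}\mathbf{f}_i + b_{y_i}}}
              {\sum_{c=1}^{C} e^{\mathbf{w}_c^{\top}\mathbf{f}_i + b_c
                + \frac{\lambda}{2}(\mathbf{w}_c-\mathbf{w}_{y_i})^{\top}
                  \Sigma_{y_i}(\mathbf{w}_c-\mathbf{w}_{y_i})}}
```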

Optimization of the Framework
The full training loss for the whole framework is described in Eq. (5), where $\lambda$ and $\mu$ are the loss weights:

$$\mathcal{L} = \mathcal{L}_{s} + \lambda\,\mathcal{L}_{mkd} + \mu\,\mathcal{L}_{cps}. \quad (5)$$
The first loss in Eq. (5) is the supervised segmentation loss for the student models, defined in Eq. (6), where $\ell_{ce}$ is the cross-entropy loss function, $\mathbf{y}$ denotes the ground truth, and $\theta_1$ and $\theta_2$ are the parameters of the two students:

$$\mathcal{L}_{s} = \ell_{ce}(f_{\theta_1}(x^l), \mathbf{y}) + \ell_{ce}(f_{\theta_2}(x^l), \mathbf{y}). \quad (6)$$
The second term is the consistency loss between teachers and students based on cross-entropy, described in Eq. (2). The last term $\mathcal{L}_{cps}$ in Eq. (5) is the consistency loss between students, the same as in CPS [5].
We summarize the MKD framework in Algorithm 1. First, we initialize the student models with different random parameters and copy each student's parameters to its teacher network. Then, after obtaining augmented data from the input images with SDA and WDA, we use EMA to update the teachers' parameters. Finally, as described in Fig. 1, we follow Eq. (5) to train the model.
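The loop body of Algorithm 1 can be sketched end-to-end with toy linear "networks"; all shapes, names, and the linear-classifier stand-in are illustrative assumptions, with augmentation stubbed out.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce(p, labels):
    """Mean cross-entropy of probabilities p (N, C) against integer labels."""
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def mkd_step(W1, W2, T1, T2, x_l, y_l, x_u, lam=1.0, mu=1.0, alpha=0.99):
    """One training step of Eq. (5): L = L_s + lam*L_mkd + mu*L_cps."""
    # 1) supervised loss on labeled data for both students (Eq. (6))
    L_s = ce(softmax(x_l @ W1), y_l) + ce(softmax(x_l @ W2), y_l)
    # 2) EMA update of the teachers (Eq. (1))
    T1 = alpha * T1 + (1 - alpha) * W1
    T2 = alpha * T2 + (1 - alpha) * W2
    # 3) teacher pseudo-labels supervise the *other* student (Eq. (2))
    p_s1, p_s2 = softmax(x_u @ W1), softmax(x_u @ W2)
    y_t1, y_t2 = softmax(x_u @ T1).argmax(-1), softmax(x_u @ T2).argmax(-1)
    L_mkd = ce(p_s2, y_t1) + ce(p_s1, y_t2)
    # 4) cross-pseudo supervision between the two students (CPS)
    L_cps = ce(p_s2, p_s1.argmax(-1)) + ce(p_s1, p_s2.argmax(-1))
    return L_s + lam * L_mkd + mu * L_cps, T1, T2
```

A real implementation would back-propagate this scalar through the student networks only; the teachers receive no gradients.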

EXPERIMENTS

Implementation Details
Datasets. Following previous methods [5,35,41], experiments are performed on three widely used image segmentation datasets: PASCAL VOC 2012 (VOC) [9], Cityscapes [8], and COCO [23]. VOC [9] is a standard semantic segmentation benchmark with 21 classes, including the background. The standard VOC dataset has 1464 images for training, 1449 for validation, and 1456 for testing. Following previous works [4], we combine VOC with 9118 training images from the Segmentation Boundary Dataset (SBD) [14] as VOCAug. During training on VOCAug and VOC, we employ a crop size of 512 × 512.
Cityscapes [8] consists of 2975/500/1525 finely annotated urban scene images with resolution 2048 × 1024 for train/validation/test, respectively. The segmentation performance is evaluated over 19 challenging categories. We use a training crop size of 1024 × 512.

Table 1: Comparison with state-of-the-art methods on the PASCAL VOC 2012 val set under different partition protocols. Here '1/n' means that we use a '1/n' labeled dataset and the remaining images in the training set are used as the unlabeled dataset. † means we introduce the unlabeled dataset with a total of 10582 images. * denotes an enhanced training scheme, which is further discussed in Table 4. SupOnly stands for supervised training without using any unlabeled data. Blue text indicates the performance of our methods compared with the supervised-only method.

We also benchmark the proposed method on the COCO dataset [23], a challenging semantic segmentation benchmark with 118k/5k images for training/validation. We employ a crop size of 512 × 512.

Training. Our method is implemented on MMSegmentation [7]. Following DeepLabV3Plus [4], we use the "poly" learning rate policy, where the initial learning rate is multiplied by $(1 - \mathrm{iter}/\mathrm{max\_iter})^{0.9}$. For the VOC and COCO datasets, the initial learning rate is set to 0.0025, while for Cityscapes it is set to 0.01. The batch size is set to 16 for all datasets, and all training is performed on four NVIDIA A100 GPUs. We train the network with mini-batch stochastic gradient descent (SGD); the momentum is fixed at 0.9 and the weight decay is set to 0.0005.

Network architecture. We use DeepLabV3Plus [4] with ResNet [16] pre-trained on ImageNet [21] as our segmentation network for the VOC and Cityscapes datasets. The decoder head is composed of separable convolutions, the same as standard DeepLabV3Plus. It is worth noting that we do not use any tricks in the model structure. For the COCO dataset, we adopt Xception-65 [6] as our backbone network, following the same architectural design as other methods for a fair comparison.

Evaluation metrics. Following [4], we adopt the mean Intersection over Union (mIoU) as the evaluation metric. All results are estimated on the validation set, and we report results with single-scale testing only.
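The "poly" schedule is simple enough to state directly; a one-function sketch (the function name is ours):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning-rate policy used by DeepLabV3Plus:
    lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power
```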

Comparison with State-of-the-art Methods
We compare with state-of-the-art algorithms in Table 1, Table 2, and Table 3.

Results on the PASCAL VOC 2012 dataset. Table 1 shows comparison results on the PASCAL VOC 2012 dataset. Following previous settings, we sample labeled images from 1) the original VOC 1464 training images, and 2) VOCAug with a total of 10582 images. Note that the methods with and without † differ only in the unlabeled images; they share the same '1/n' labeled dataset and the validation set of 1449 images. On the first data split, with 1464 training images in total, the proposed framework gains 1.28%, 2.76%, and 2.73% with only 92, 183, and 366 labeled images, respectively, under ResNet-101 compared with CPS-R101 [5].
Note that we achieve performance similar to CPS [5] under the 1/16 partition, which is trained with only 92 labeled images, as the number of labeled images is too small to generate reliable labels for the teacher network. As shown in Table 1, our method improves by 30.35% based on ResNet-101 compared with the supervised-only method with only 92 labeled images on VOCAug. When additional unlabeled data are incorporated, we adopt the identical splits used in previous works such as [24,35,39].
Table 2: Comparison with state-of-the-art on the PASCAL VOCAug and Cityscapes val sets under different partition protocols. The VOCAug train set consists of 10,582 labeled samples in total. ‡ means the same split as U2PL; other methods use the same split as CPS. ⋆ denotes the approach reproduced by [35]. (-) means data is not available. Underline represents the best result on the CPS split; the best results on the U2PL split are shown in bold. SupOnly stands for supervised training without using any unlabeled data. Blue text indicates the performance of our methods compared with the supervised-only method.

Our method still gains remarkable performance based on ResNet-101, showing that more unlabeled images can bootstrap performance. In particular, compared with U2PL, our method improves by 8.14% and 8.38% under the 1/16 and 1/8 partition protocols, demonstrating that our method is more effective with less labeled data. In addition, we set a confidence threshold of 0.95 to select regions with higher confidence as useful knowledge; the results are reported in the last row of Table 1.
Table 2 compares our method with other state-of-the-art methods on VOCAug. For a fair comparison, we train our MKD framework under two different split lists following previous work. Using the same split as CPS, the proposed method performs favorably against the previous state-of-the-art methods. Furthermore, as the amount of labeled data increases, the performance gap between the various methods becomes smaller, suggesting that the segmentation task does not require a large amount of labeled data.
Figure 2 shows the qualitative results of different methods on the PASCAL VOC 2012 dataset. Co-training cannot reasonably separate objects completely (especially large objects such as cows, boats, sheep, and motorbikes), while ours corrects these errors. Compared to co-training, our method also performs well on complex examples such as potted plants and chairs.
Results on the Cityscapes dataset. The Cityscapes dataset consists of images of urban scenes. As shown in Table 2, our method achieves notable improvements under various partition protocols with the same split as CPS [5]. In addition, we improve by 5.01% under the 1/16 (186) partition protocol with the same split as U2PL [35]. Our method outperforms the existing state-of-the-art method by a notable margin; we report results with single-scale testing. We attribute this significant improvement to the fact that the Cityscapes dataset is relatively redundant, so the teacher model can provide more accurate pseudo-labels.

Results on the COCO dataset. The COCO dataset is quite challenging, with 118k training images and 81 classes in total. As shown in Table 3, our method achieves much better results than PC2Seg [46] based on Xception-65 [6] under the 1/512, 1/256, 1/128, 1/64, and 1/32 partition protocols, the same as PseudoSeg [48]. In addition, we improve by 6.2%-8.8% with the same split as PC2Seg [46]. Our method outperforms the existing state-of-the-art method by a notable margin.

Ablation Study
In this subsection, we conduct experiments on the VOC dataset under different semi-supervised settings to explore the effectiveness of each proposed module.

Effectiveness of mutual knowledge distillation. As illustrated in Table 4, we conduct a series of experiments to identify each module's contribution. We take co-training as our baseline, the same as CPS [5]. We first add a naive mean teacher (MT) and find that the results do not improve, and training can even become unstable, possibly because the teacher and student models are too similar, leading to collapse. Adding the mutual mean teachers resolves this issue.

Ablation study on feature augmentation. We conducted an ablation study on the effect of feature augmentation using the ResNet-101 backbone on the CPS splits. Our results indicate that the proposed approach yields a performance boost of 2.33% and 0.61% for the 1/16 and 1/8 labeled-data ratios, respectively. However, we also observed that this approach can harm training performance when labeled data is abundant; details are discussed in the supplementary material.
Ablation study on knowledge selection. Table 5 shows the effectiveness of knowledge selection applied to the different components $\mathcal{L}_{mkd}$ and $\mathcal{L}_{cps}$. Knowledge selection applied to $\mathcal{L}_{mkd}$ is superior to the alternatives, indicating that thresholding the teacher-to-student supervision filters out low-confidence noise. The simple form of knowledge selection with a 0.95 threshold is reliable, but other reasonable values are also acceptable.
Ablation study on heterogeneous network augmentation. To assess the effectiveness of our method, we compare using the same model architecture against heterogeneous network (HN) models, which employ PSPNet and DeepLabV3Plus as teachers. The results in Table 6 show that our method achieves a 1.23% improvement over the HN variant under the 1/16 (662) partition protocol, demonstrating its superiority.

CONCLUSION
We have proposed a new consistency learning scheme, called mutual knowledge distillation, for semi-supervised semantic segmentation. Our method utilizes two auxiliary mean-teacher models and a combination of strong-weak augmentation and feature augmentation to increase the diversity of training samples for the student networks. Experimental results show that our method outperforms recent state-of-the-art methods on several semantic segmentation benchmarks, including PASCAL VOC 2012, Cityscapes, and Microsoft COCO. Notably, our framework achieves significant performance improvements even when labeled data is limited.

Figure 1 :
Figure 1: For each image $x_i$, we apply weak augmentation (WeakAug) for the teacher network and strong augmentation (StrongAug) for the student network. Here $p$ denotes the logits, $\theta$ denotes the parameters of the model, and $\mathbf{y}$ denotes the one-hot labels generated from $p$. We train the model by minimizing the consistency losses $\ell_{mkd}$ and $\ell_{cps}$ on the unlabeled set and the cross-entropy loss $\ell_{ce}$ on the labeled set.

Figure 2 :
Figure 2: Qualitative results on the PASCAL VOC 2012 dataset using 1/16 (662) labeled samples and ResNet-50. The first row exhibits the input images; the second row shows the ground truth; the third row presents the co-training baseline; the fourth row displays our method.

Table 3 :
Comparison with state-of-the-art on the COCO [23] dataset based on Xception-65 [6] under different partition protocols. SupOnly stands for supervised training without using any unlabeled data. Blue text indicates the performance of our methods compared with the supervised-only method.

Table 4 :
Ablation study on the proposed semi-supervised learning framework. The model is DeepLabV3Plus with a ResNet-101 backbone. Co-training denotes the baseline, the same as CPS. Mutual MT denotes mutual knowledge distillation. SDA denotes strong data augmentation. KS denotes knowledge selection. FDA denotes feature data augmentation.

Table 5 :
Ablation study of knowledge selection with different components $\mathcal{L}_{mkd}$ and $\mathcal{L}_{cps}$, using ResNet-101 as the backbone with the PseudoSeg splits.

Table 6 :
Ablation study of heterogeneous network augmentation, using ResNet-101 as the backbone. SN means same network; HN means heterogeneous network.

Effectiveness of data augmentation. In Table 4, to introduce more augmentation, we also add strong data augmentation (SDA), which brings a 2.30% performance improvement under the 1/16 (662) partition protocol. Combining SDA and Mutual MT improves the original co-training from 72.18% to 78.00%, a 5.82% gain. The final combination of all methods obtains the best result, yielding a performance improvement of 6.47% under the 1/16 (662) partition protocol.