A Self-Supervised Semantic Segmentation Method for Identifying Barrett's Esophagus in Endoscopic Images

Barrett's esophagus is considered a precancerous condition that may lead to esophageal cancer. The condition is usually diagnosed through an endoscopy with biopsy. This careful imaging examination is quite labor-intensive and usually lacks diagnostic consistency. In this paper, a computer-aided diagnosis (CAD) method is proposed to assist pathologists in diagnosing Barrett's esophagus from endoscopic images. The proposed semantic segmentation model is built on the U-Net architecture, which captures detailed spatial information and context to produce accurate pixel-wise predictions. An autoencoder, which is capable of learning a compact and meaningful representation of the input data, is incorporated to achieve self-supervised learning. A two-stage pretraining of the encoder and the decoder is adopted for better performance and reduced data requirements. The proposed method can extract features and patterns from unlabeled data without requiring human annotations or labels, which is very suitable for many biomedical image analysis tasks. The experimental results show that the segmentation accuracy in terms of mean pixel accuracy, mean Dice coefficient, and mIoU reaches 98.20%, 87.96%, and 83.18%, respectively, indicating that the proposed method performs well on the identification of Barrett's esophagus in endoscopic images.


INTRODUCTION
Gastro-esophageal reflux disease (GERD) is a common condition in which acid from the stomach leaks up into the esophagus. Several factors can contribute to the development of GERD, including obesity, smoking, and drinking alcohol. GERD is often accompanied by symptoms such as heartburn or acid reflux, and may trigger a change in the cells lining the lower esophagus, causing Barrett's esophagus (BE) [1] [2].
BE does not cause symptoms on its own; however, it does raise the risk of developing esophageal cancer, the 8th most common cancer worldwide [3]. BE is usually diagnosed through an upper endoscopy with biopsy [4][5] [6]. Once such precancerous tissue is found, early treatment can be conducted to prevent esophageal cancer. However, during the endoscopy procedure, the careful imaging examination is quite labor-intensive and usually lacks diagnostic consistency, which might lead to delays in identification or even misdiagnosis. Therefore, enhancing the accuracy and efficiency of BE identification is an important topic.
In recent years, Computer-Aided Diagnosis (CAD) systems have played an immensely significant role in assisting medical professionals. These systems utilize advanced algorithms and machine learning techniques to analyze medical images, such as CT and MRI [7] [8], to provide additional insights and support to clinicians during the diagnostic process. They have the potential to reduce human errors, increase detection rates, and improve overall patient outcomes by helping medical professionals make more informed decisions.
More recently, there has been a trend of applying deep learning (DL) in endoscopic image analysis of the colon, stomach, intestine, etc. Several researchers have used convolutional neural networks (CNNs) to segment endoscopic images for identifying BE [9][10] [11]. Most of the above algorithms were built on supervised learning (SL) models, which require collecting and labeling many images for training purposes. However, the data labeling job is tedious and labor-intensive, and it also requires specific medical expertise or experience. Moreover, collecting many endoscopic images with annotations is difficult in practical applications, especially in medical image analysis, so researchers have paid more attention to model learning via self-supervised learning (SSL). Compared with SL, the advantage of SSL is the use of unlabeled images for model training. It is expected that a CAD method based on SSL is suitable and useful for applications in medical image analysis. Therefore, an SSL-based semantic segmentation method is proposed in this paper to identify BE in endoscopic images. Such a self-supervised learning model can reduce the labor-intensive load of annotating specific areas in a large dataset, making it more practical for CAD systems.
The rest of this paper is organized as follows. We describe related works and the proposed method in Sections 2 and 3, respectively. Section 4 demonstrates the experimental results. Finally, Section 5 gives conclusions.

RELATED WORKS
Semantic segmentation is a computer vision technique that classifies each pixel in an image into one of several predefined labels [12]. In recent years, deep learning-based semantic segmentation has been widely employed in various visual recognition tasks, such as autonomous driving, intelligent transportation, and biomedical image analysis [13]. Most semantic segmentation techniques are based on SL, including U-Net [14], the fully convolutional network (FCN) [15], DeepLab v3+ [16], etc. For example, U-Net consists of a specific encoder-decoder scheme, also known as a contracting-expanding path. The encoder captures the context of the input image through a series of convolutional and pooling layers, while the decoder uses upsampling and convolutional layers to reconstruct the segmented image. One unique feature of U-Net is that it includes skip connections between the encoder and the decoder, thus mitigating the information loss that can occur in traditional convolutional neural network architectures. This makes U-Net very suitable for biomedical image analysis. In [9], an SL-based method was developed based on FCN for the identification of Barrett's esophagus.
Though SL is a common machine learning approach for many applications, it relies heavily on a large amount of labeled training data. In SL, we have a set of input data with predefined labels or annotations, and the goal is to predict labels for new unlabeled data based on these annotations. Its major limitation is the requirement for a large amount of labeled data, which often must be annotated manually, a time-consuming and labor-intensive process. Additionally, for certain tasks, obtaining a large amount of labeled data can be challenging or impractical. In recent years, the combination of supervised and unsupervised learning has been widely researched, and self-supervised learning is one important approach in this context. It learns useful feature representations by utilizing self-generated signals in the data, without the need for manual labeling. Common self-supervised learning models include SimSiam [17], SimCLR [18], and BYOL [19]. For example, SimSiam, a variant of the Siamese neural network, is a self-supervised learning approach that can learn from unlabeled data without the need for human-labeled annotations. The approach employs two identical backbone networks to transform original images. Furthermore, it introduces a straightforward yet powerful contrastive loss function to encourage the alignment of the two feature representations. An autoencoder [20] is an unsupervised learning model used for acquiring efficient data representations. It is a neural network structure composed of an encoder and a decoder. The objective of an autoencoder is to encode input data into a lower-dimensional representation, and then decode it back such that the output of the decoder closely matches the original data. In this way, autoencoders can learn the structure and features of the data through data compression and reconstruction.
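To make the autoencoder idea concrete, the following is a minimal sketch of a convolutional autoencoder trained with a reconstruction objective on unlabeled images. It is not the implementation used in this paper; the layer widths, image resolution, and batch size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Minimal convolutional autoencoder: compress the input, then reconstruct it."""
    def __init__(self):
        super().__init__()
        # Encoder: progressively reduce spatial resolution to a compact representation
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: upsample back to the original resolution
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
x = torch.rand(4, 3, 128, 128)               # a batch of unlabeled images
loss = nn.functional.mse_loss(model(x), x)    # reconstruction objective, no labels needed
```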

PROPOSED SSL-BASED METHOD
As mentioned in Section 1, the goal of this paper is to develop an effective SSL-based semantic segmentation method for endoscopic images. Since the target of semantic segmentation in this research is to identify Barrett's esophagus, each pixel of the image should be categorized into one of the following predefined labels: 1) Barrett's esophagus, 2) Normal cell, 3) Gastric tissue, and 4) Background.

Semantic Segmentation Model
The semantic segmentation model was built based on the U-Net architecture, as shown in Fig. 1. The encoder part of the network was developed based on the ResNet network [21]. Since the encoder aims at extracting image features, only the residual blocks of ResNet were preserved. As shown in Fig. 1, there are five stages in the semantic segmentation network, and each stage contains several residual blocks. To reduce the computational complexity, a max pooling layer is used to down-sample the spatial resolution of the feature maps between stages. The decoder part uses transposed convolutions to up-sample the feature maps. Meanwhile, four skip connections are added to pass features from the encoder to the decoder in the first four stages in order to recover information lost during down-sampling. Lastly, a prediction head outputs the segmentation result.
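As an illustration of this architecture, the sketch below shows how a U-Net-style decoder with transposed convolutions and skip connections can be attached to a ResNet18 encoder. This is a minimal sketch under assumed channel widths and a torchvision ResNet18 backbone, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class UNetResNet18(nn.Module):
    """U-Net-like segmentation network with a ResNet18 encoder and skip connections."""
    def __init__(self, num_classes=4):
        super().__init__()
        r = resnet18(weights=None)
        # Five encoder stages; only the residual/feature blocks of ResNet are kept
        self.stage1 = nn.Sequential(r.conv1, r.bn1, r.relu)   # 1/2 resolution, 64 channels
        self.stage2 = nn.Sequential(r.maxpool, r.layer1)      # 1/4, 64 channels
        self.stage3 = r.layer2                                 # 1/8, 128 channels
        self.stage4 = r.layer3                                 # 1/16, 256 channels
        self.stage5 = r.layer4                                 # 1/32, 512 channels
        # Decoder: transposed convolutions for upsampling, fused with skip features
        self.up4, self.dec4 = nn.ConvTranspose2d(512, 256, 2, stride=2), nn.Conv2d(512, 256, 3, padding=1)
        self.up3, self.dec3 = nn.ConvTranspose2d(256, 128, 2, stride=2), nn.Conv2d(256, 128, 3, padding=1)
        self.up2, self.dec2 = nn.ConvTranspose2d(128, 64, 2, stride=2), nn.Conv2d(128, 64, 3, padding=1)
        self.up1, self.dec1 = nn.ConvTranspose2d(64, 64, 2, stride=2), nn.Conv2d(128, 64, 3, padding=1)
        # Prediction head: 1x1 convolution producing per-class scores for each pixel
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        s1 = self.stage1(x); s2 = self.stage2(s1)
        s3 = self.stage3(s2); s4 = self.stage4(s3); s5 = self.stage5(s4)
        # Skip connections: concatenate encoder features with upsampled decoder features
        d4 = torch.relu(self.dec4(torch.cat([self.up4(s5), s4], dim=1)))
        d3 = torch.relu(self.dec3(torch.cat([self.up3(d4), s3], dim=1)))
        d2 = torch.relu(self.dec2(torch.cat([self.up2(d3), s2], dim=1)))
        d1 = torch.relu(self.dec1(torch.cat([self.up1(d2), s1], dim=1)))
        out = self.head(d1)
        # Restore the original input resolution for pixel-wise prediction
        return nn.functional.interpolate(out, size=x.shape[2:], mode="bilinear", align_corners=False)
```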

Model training via SSL
The model training for the proposed semantic segmentation is performed by SSL. To train the proposed model via SSL, there are two major tasks: (1) pretext task - to pretrain both the encoder and the decoder to obtain model weights for later use; (2) downstream task - to fine-tune the semantic segmentation model, together with the segmentation head, based on the parameters of the pretrained network. Since the proposed semantic segmentation method is devised based on U-Net, a two-step approach is designed to train the proposed U-Net-like network in the pretext task.

Pretext task.
In the pretext task, the encoder is trained first, followed by the decoder. We elaborate each step in the following.

Encoder
In the first step, the encoder of the proposed model was trained by an SSL algorithm, SimSiam [17], as illustrated in Fig. 2. Since SimSiam employs a specific training recipe that makes it effective for self-supervised learning, it is adopted for model training here. It uses data augmentation techniques to create a pair of augmented views, Aug1 and Aug2, of the same input. Such a pair is considered a positive pair. The two views are processed by two identical networks separately. Within the training procedure, one view is adopted as the target for the other and vice versa. To measure the similarity of the two views, the negative cosine similarity D is defined as follows:

$$D(p, z) = -\frac{p}{\|p\|_2} \cdot \frac{z}{\|z\|_2}$$

where $p$ is the output representation of the predictor on one view, while $z$ is the output representation of the projector on the other view. Then the total loss $L$ of SimSiam can be calculated based on the similarities from the two views, which is defined as follows:

$$L = \frac{1}{2} D(p_1, \mathrm{stopgrad}(z_2)) + \frac{1}{2} D(p_2, \mathrm{stopgrad}(z_1))$$

where stopgrad denotes the stop-gradient operation applied to the projector outputs. It is expected that a pretrained encoder can be obtained when the total loss $L$ is minimized.

Decoder

In Step 2, the decoder is trained via SSL. In fact, image reconstruction is a common approach to achieve SSL, and it is adopted here to train the decoder of the proposed model. Fig. 4 illustrates image reconstruction for training the decoder. The input image is augmented by adding some white noise at random, and this noisy image becomes the input of the image reconstruction procedure. The loss function is the L2 loss, described below:

$$L_{rec} = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - \hat{x}_i \right\|_2^2$$

where $N$ is the number of images, $x_i$ is the $i$-th input image, and $\hat{x}_i$ is the reconstructed image. The decoder is pretrained by minimizing $L_{rec}$ between the original input image and the reconstructed image. It is expected that both the encoder and the decoder of the proposed semantic segmentation model can be trained after the pretext task.
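For concreteness, the two pretext objectives can be written compactly in code. The following is a minimal sketch assuming a PyTorch setting; the function names and the training-step comments are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def neg_cosine(p, z):
    """D(p, z): negative cosine similarity; z is detached (stop-gradient)."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def simsiam_loss(p1, p2, z1, z2):
    """Symmetric SimSiam loss over the two augmented views."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

def reconstruction_loss(decoder_out, clean_image):
    """L2 loss between the reconstructed image and the original (noise-free) input."""
    return F.mse_loss(decoder_out, clean_image)

# Step 1 (encoder): z = projector(encoder(aug)), p = predictor(z); minimize simsiam_loss.
# Step 2 (decoder): pass a noise-corrupted copy through encoder + decoder; minimize reconstruction_loss.
```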

Downstream task.
After the pretext task, we have the pretrained encoder and decoder. The downstream task aims at fine-tuning the entire network, including the encoder, the decoder, and the prediction head, in an end-to-end manner for semantic segmentation via SL. Both the encoder and the decoder are initialized with the weights learned in the pretext task, which makes the training process more efficient and robust. Fig. 3 illustrates the training process for the downstream task in this paper. The encoder and decoder are designed for feature extraction and image resolution restoration, respectively. The Seghead, a prediction head for segmentation, is formed by a 1x1 convolution layer combined with a sigmoid function, mapping the output of the 1x1 convolution layer to a value between 0 and 1. This value represents the probability of the pixel belonging to the corresponding class and is used to predict the final semantic segmentation image. During the overall network training process, we utilize the Categorical Cross-Entropy (CCE) loss as the loss function, which is commonly defined as:

$$L_{CCE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij}$$

where $N$ represents the total number of pixels in a batch, $C$ indicates the number of classes, $y_{ij}$ is the ground-truth value of the $j$-th channel at the $i$-th pixel position, and $\hat{y}_{ij}$ is the probability output by the network's sigmoid at the $i$-th pixel position in the $j$-th channel. Our objective is to minimize the categorical cross-entropy loss. We achieve this by calculating gradients using the backpropagation algorithm and then updating the network parameters with an optimization algorithm, making the model's predictions closer to the ground truth and thus achieving more accurate pixel-wise classification.
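The downstream fine-tuning step can be summarized by the sketch below. It assumes the pretrained U-Net-like network from the pretext task and a data loader yielding images with per-pixel class labels, and it uses PyTorch's CrossEntropyLoss (a softmax-based categorical cross-entropy) as a stand-in for the CCE loss described above; the optimizer and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# model: the U-Net-like network, with encoder/decoder initialized from the pretext task
# loader: yields (image, mask) pairs, where mask holds per-pixel class indices in {0..3}
def finetune(model, loader, epochs=50, lr=1e-4, device="cpu"):
    model.to(device)
    criterion = nn.CrossEntropyLoss()               # categorical cross-entropy over the 4 classes
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for image, mask in loader:
            image, mask = image.to(device), mask.to(device)
            logits = model(image)                   # (B, 4, H, W) class scores per pixel
            loss = criterion(logits, mask)          # averaged over all pixels in the batch
            optimizer.zero_grad()
            loss.backward()                         # gradients via backpropagation
            optimizer.step()                        # update encoder, decoder, and Seghead
```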

EXPERIMENT RESULTS
For performance evaluation, 86 endoscopic images were collected from the hospital. These images were divided into three sets: 68 images for training, 9 images for validation, and the remaining 9 images for model evaluation.

Evaluation measurements
Four metrics were used to evaluate the performance of the proposed model for semantic segmentation, which are defined in the following (a computational sketch is given after the definitions):
1. Pixel accuracy (PA): the proportion of correctly predicted pixels to the total number of pixels, $\mathrm{PA} = \frac{TP + TN}{TP + TN + FP + FN}$.
2. Intersection-over-union (IoU): the overlap between the predicted region and the actual region of a category, $\mathrm{IoU} = \frac{TP}{TP + FP + FN}$.
3. Dice coefficient (DC): $\mathrm{DC} = \frac{2\,TP}{2\,TP + FP + FN}$.
4. Mean IoU (mIoU): the average of the IoU values over all classes, $\mathrm{mIoU} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{IoU}_i$, where $n$ is the number of predefined classes.
Here TP, FP, FN, and TN represent true positive, false positive, false negative, and true negative, respectively, which are defined as follows:
True Positive (TP): the pixel belongs to the target class and is predicted to be the target class.
False Positive (FP): the pixel belongs to another class but is predicted to be the target class.
False Negative (FN): the pixel belongs to the target class but is predicted to be another class.
True Negative (TN): the pixel belongs to another class and is predicted to be another class.
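As referenced above, the sketch below computes PA, per-class IoU and Dice, and mIoU from predicted and ground-truth label maps. The exclusion of classes absent from both maps mirrors the treatment of Gastric tissue in Case 1 described later, and is an assumption about the exact averaging used by the authors.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes=4):
    """Pixel accuracy, mean IoU, and mean Dice from predicted and ground-truth label maps."""
    pred, gt = pred.ravel(), gt.ravel()
    pixel_accuracy = (pred == gt).mean()
    ious, dices = [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fp + fn == 0:        # class absent in both prediction and ground truth
            continue                 # excluded from the averages
        ious.append(tp / (tp + fp + fn))
        dices.append(2 * tp / (2 * tp + fp + fn))
    return pixel_accuracy, np.mean(ious), np.mean(dices)
```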

Ablation study
In this experiment, we evaluated the semantic segmentation models using three different backbone architectures. The experimental results are listed in Table 1.
With the VGG16 backbone [22], the model achieved a mean pixel accuracy of 97.62%, a mean Dice coefficient of 87.48%, and a mean IoU of 81.58%. With the ResNet18 backbone, it achieved a mean pixel accuracy of 98.20%, a mean Dice coefficient of 87.96%, and a mean IoU of 83.18%. The vanilla U-Net also achieved quite commendable performance, with an mPA of 97.90%, an mDC of 85.27%, and an mIoU of 80.22%. However, it still slightly lags behind ResNet18, which further validates the efficacy of the self-supervised learning proposed in this paper in enhancing model performance.
These outcomes demonstrate strong performance across different backbone selections in the semantic segmentation model. In particular, both the VGG16 and ResNet18 backbones exhibit commendable segmentation accuracy in terms of pixel accuracy and Dice coefficient, validating their suitability for semantic segmentation tasks. ResNet18 shows slightly superior results, underscoring its strong feature extraction and classification capabilities, which is why ResNet18 is selected as the backbone in the proposed semantic segmentation method.
In conclusion, these experimental results further validate the effectiveness of our proposed semantic segmentation model. Not only can it achieve accurate segmentation in medical images, but it also maintains consistent performance across different encoder architectures. This contributes substantially to automated analysis and diagnosis in medical imaging.

Subjective evaluation
Two endoscopic images, i.e., Case 1 and Case 8, were selected from the 9 test images for subjective evaluation purposes. Figs. 5(a) and 8(a) are the endoscopic images. As we can see in Fig. 4(a), it is a little difficult to observe the boundaries of Barrett's esophagus. Furthermore, as we can see in Figs. 4(a) and 7(a), the area of Barrett's esophagus in Fig. 5(a) looks obvious compared with that in Fig. 4(a).

Objective evaluation
Table 2 presents the experimental evaluation of the proposed method in the context of a semantic segmentation model using ResNet18 as the encoder. The model was thoroughly evaluated on the 9 test images. The experimental results demonstrate that our method achieved an impressive 98.20% mPA, indicating extremely high accuracy in pixel-wise classification. Furthermore, in terms of the evaluation metrics mDC and mIoU, our method also delivered strong performance, achieving 87.96% and 83.18%, respectively.
Table 3 lists the evaluation metrics of the Case 1 and Case 8 test images. Since the Gastric tissue is not involved in Case 1, it is excluded from calculating the mIoU for that case. However, the Gastric tissue is involved in Case 8. According to Table 3, the PA values of identifying Barrett's esophagus in both cases are higher than 97%. The results show that the proposed method can effectively identify the areas of Barrett's esophagus in endoscopic images.

CONCLUSIONS
In this paper, a self-supervised deep semantic segmentation method was proposed to identify Barrett's esophagus in endoscopic images. The deep network model was built based on the U-Net architecture and self-supervised learning. The encoder and decoder of the proposed method were trained in two steps in the pretext task, and the prediction head was trained with a small dataset of annotated endoscopic images in the downstream task. To evaluate the performance of the proposed scheme, endoscopic images were collected for subjective and objective testing. The mPA, mDC, and mIoU values of the proposed method are 98.20%, 87.96%, and 83.18%, respectively. The experimental results show that the proposed semantic segmentation method performs well on the identification of Barrett's esophagus in endoscopic images.

Figure 1: The network architecture of the proposed SSL-based semantic segmentation model

Figure 2: Illustration of the SimSiam structure for training the encoder
Figure 3: Illustration of the training process for the downstream task

Table 1: Experimental results with different encoders

Table 2: Experimental results using the ResNet18 encoder

Table 3: Performance of the proposed method for Case 1 and Case 8