Merging/Filtering/Voting to Improve Segmentation of Diabetic Retinopathy Eye Fundus Lesions

Diabetic Retinopathy (DR) is a fast-progressing disease affecting millions of people worldwide. An early diagnosis is very important to prevent further damage, and it can be aided by analysis of Eye Fundus Images (EFI). In this context, deep learning networks can help medical doctors, both by segmenting potential lesions automatically and by classifying the degree of the illness at a certain instant in time. The segmentation task classifies each individual pixel as belonging to the background (BK=non-lesion), a microaneurysm (MA), a soft or hard exudate (SE and HE), a hemorrhage (HM), the optic disk (OD) or the macula (M). Existing deep learning-based segmentation approaches can detect lesions, but a relevant number of pixel misclassifications remain and should be dealt with. Besides calling attention to the issue of correctly evaluating lesion segmentation quality using the most appropriate metrics, in this paper we investigate the possibility of bringing together the output labelmaps of different deep learning networks, together with hardcoded segmentation, to improve the end result by means of filtering/merging/voting. Using a publicly available dataset, we show that the approach improves quality significantly, as measured using Intersection over the Union (IoU, also known as the Jaccard Index, JI), from initial scores of 0.9 (BK), 0.09 (MA), 0.17 (HM), 0.29 (HE), 0.18 (SE) and 0.8 (OD) to final scores of 0.99 (BK), 0.143 (MA), 0.32 (HM), 0.39 (HE), 0.37 (SE) and 0.9 (OD). This corresponds to an improvement of around 10 percentage points on average. We end the work by delineating future work in this promising direction of research.


INTRODUCTION
Diabetic Retinopathy (DR) is a fast-progressing disease that often results in blindness. An early diagnosis is very important to prevent further damage. Eye Fundus Images (EFI) can be analyzed to detect lesions and the degree of DR. At the onset of the disease, the EFI typically exhibits only a small number of microaneurysms (dilated capillaries with the appearance of small red dots). Subsequent degrees of severity may reveal exudates (yellow deposits of proteins and lipids) and hemorrhages, and the number of microaneurysms may also increase. In the latest, proliferative stages, DR exhibits neovascularization and related affections. Datasets such as IDRiD [1], containing both eye fundus images and lesion masks created by expert medical doctors, can be used to assess the quality of automatic mechanisms that detect and segment those lesions. Deep learning networks can be used in this context to help medical doctors identify both the lesions and the degree of the illness at a certain instant in time. The deep learning network can learn to segment the lesions and to classify the degree of DR. Deep learning networks rely on backpropagation learning from a large corpus of training images. Most frequently these networks are pretrained on image datasets such as ImageNet [2] to gain generic imaging feature extraction capabilities, and then further trained on domain-specific images, the eye fundus in our case, to learn the specific task (transfer learning). In the case of semantic segmentation, the objective is to output a labelmap with the same size as the image, where each pixel contains the ID of a specific lesion or class, such as background (BK, non-lesion), microaneurysms (MA), soft and hard exudates (SE and HE) and hemorrhages (HM). The optic disk (OD) and the macula (M) can also be added to the structures that are automatically segmented. The training dataset must include at least the training images and the corresponding expected segmentations for each of those images. Prior to deep learning, segmentation resorted to hardcoded algorithms that isolate regions based on colour/intensity homogeneity, such as in [3]. A thorough review of deep learning tasks and methods related to the detection of diabetic retinopathy on eye fundus images and the segmentation of lesions is available in [4], [5].
Given the normal variability of different specimens and acquisition conditions, linked also to the colour and texture fuzziness that is normal in EFI images and between different structures in those images, we noticed that segmentations suffer from reasonably high rates of false positives for lesions and some misclassifications due to similar colours. In such a context, voting over more than one output has the potential to increase accuracy. Adding hardcoded segmentation also allows a certain degree of tuning of conditions to avoid those many false positives, applying the result as a post-filter to the output of the deep learning networks. In this work we investigate merging/filtering/voting over the outputs of two different deep learning networks and hardcoded segmentation to improve the quality of the result. Using publicly available data we were able to improve from a mean IoU (JI) of 0.4 (eye fundus BK 0.9; MA 0.087; HM 0.17; HE 0.29; SE 0.18; OD 0.8) to a best voting-based result of 0.52 (BK 0.998; MA 0.143; HM 0.32; HE 0.39; SE 0.37; OD 0.9), which is a significant improvement.

RELATED WORK
In this paper we focus on improving the quality of segmentation of lesions in eye fundus images. There have been a few original works and surveys on the subject of segmentation of lesions in eye fundus images; we therefore review some of the most relevant ones in this section, and call the attention of the reader to the issue of carefully evaluating the quality of such segmentation. Lesion detection and localization from eye fundus images using either classical Machine Learning (ML) or Deep Learning (DL) was surveyed in works such as [4], [5], [6], where it is common for all tasks related to the issue to score very high, mostly between 90% and 100%. However, if we look closer only at reviewed approaches that claim to detect the locations of lesions or similar, the list is already filtered to a smaller number, including Prentasic et al. [7], Gondal et al. [8], Quellec et al. [9] (exudates, hemorrhages and microaneurysms), Haloi et al. [10], van Grinsven et al. [11], Orlando et al. [12] and Shan et al. [13] (microaneurysms, hemorrhages or both). From those, Prentasic et al. [7], Haloi et al. [10], van Grinsven et al. [11] and Shan et al. [13] are classification CNNs that classify small square windows around potential lesions. This task involves no segmentation, as it assumes the squares are already given and the proposals are evaluated statically by previously extracting thousands of those windows. The authors also make no attempt at scaling this to (real-time) operation involving segmentation and classification of individual pixels to obtain the per-pixel classifications that enable semantic segmentation; therefore, these are in reality examples of small pixel-window classification networks that are not directly applicable to semantic segmentation. The remaining related works, Gondal et al. [8] and Quellec et al. [9], are two variations of DR classifiers (classification of EFI images as DR or not) that at the same time up-sample and extract heatmaps to get the positions of lesions. Orlando et al. [12] use a different approach that combines DL with image processing to find candidate regions.
Contrary to the 90 to 100% scores of simpler tasks, which are also reported by these authors, some evaluations there reveal the difficulty of finding each lesion in detail. For instance, the reported sensitivities at 1 false positive per image (FPI) in those works were (HA=hemorrhages, MA=microaneurysms, HE=hard exudates, SE=soft exudates): Quellec et al. [9] (HA=47%; HE=57%; SE=70%; MA=38%), Gondal et al. [8] (HA=50%; HE=40%; SE=64%; MA=7%) and Orlando et al. [12] (HA=50%; MA=30%). Instead of 90 to 100%, between 7% and 70% of individual lesions are detected in these results. But there is also another difference from those works when compared with evaluating the quality of semantic segmentation, which considers the correct classification of each individual pixel. Those works consider groundtruths where medical doctors marked large coarse segments around groups of lesions (e.g. the datasets Diaret [14] or e-ophtha [15]), and what is usually evaluated there is whether automatically found segments have at least a partial overlap with the coarse groundtruth segments (either any overlap or an overlap above a certain threshold). That is very different from evaluating classifications of individual pixels. In spite of an apparent consensus around the high performance of DL on EFI analysis, these results already show that semantic segmentation in EFI analysis is much harder. Metrics are also very important in machine learning in general and in medical imaging in particular. If we use global metrics to evaluate the result of segmentation of EFI images, and since the background is almost 95% of all pixels of the image, we will in fact be evaluating the quality of segmentation of the background. The importance of metrics was also a subject in [16] and [17], and it becomes clear that, in order to evaluate segmentation correctly, one has to analyze the degree of match of each class.

DEEP VOTING-BASED SEGMENTATION

Approach
We noted experimentally the difficulty of segmenting lesion contours precisely using either deep learning or handcrafted algorithms. Deep learning approaches based on deep convolutional neural networks (DCNN) are among the top-scoring state-of-the-art approaches; however, they still result in many misclassifications and, in particular, false positives for lesion classes. In this context they can be used as "weak" pixel classifiers in a voting scheme: each network gives its classification for each pixel, and the pixel classifications of more than one network or algorithm are merged by a voting scheme to maximize the quality of the final results. We can even add hardcoded segmentation to improve the result, for instance by filtering out false-positive lesions. Figure 1 shows n DCNN segmenters that are trained individually, plus hardcoded segmentation algorithms. The input image to be classified is fed to all those approaches, while a merging mechanism computes the final segmentation based on the outcomes of the various segmenters.
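As an illustration of the merging mechanism (the concrete filtering/merging/voting rules actually used in the experiments are described in the "Approaches tested" subsection), the following minimal MATLAB sketch performs a per-pixel majority vote over the numeric label maps produced by the individual segmenters; the variable names and the background class ID are assumptions, not the authors' code.

% Per-pixel majority vote over n label maps (numeric arrays of equal size); BK assumed = 1.
BK    = 1;
S     = cat(3, labelmapDCNN1, labelmapDCNN2, labelmapHardcoded);  % stack the n outputs
M     = mode(S, 3);                        % most frequent label at each pixel
votes = sum(S == M, 3);                    % how many segmenters agree with it
M(votes <= size(S, 3)/2) = BK;             % no strict majority -> eye fundus background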
In order to test this possibility, we created a setup that merges the pixel classifications of the two best performing DCNNs that we were able to achieve and a hardcoded segmentation algorithm:
• (1) DeepLabV3, a well-known deep learning network based on a residual encoder;
• (2) FCN, the fully convolutional network, using VGG-16 as the base encoder;
• (3) Hardcoded segmentation, based on a set of well-known steps that first extract/remove the vascular tree and optic disk, then segment each of the individual types of lesions.
In preliminary experiments we tested other DCNNs, including UNet; however, the best results were obtained with DeepLabV3 and FCN, so we picked these two best-performing networks for the voting scheme.
The individual approaches are briefly described in the next subsections; we then explain how we used the three of them for filtering, merging and voting in the "Approaches tested" subsection of the experimental section.

Fully Convolutional Network (FCN)
The FCN [18] uses a DCNN classification network (feature extraction or encoding stages), plus a sequence of up-sampling layers (decoding stages) that use interpolation to compute the full image-size pixelmap. In FCN, backpropagation learning adjusts the weights of the coding layers. Our FCN implementation had 51 layers in total, using VGG-16 as the encoding network (VGG-16 has 7 stages corresponding to 41 layers). Figure 2 shows a sketch of the architecture, where it is possible to see that most FCN layers are VGG-16 layers, but FCN also forwards feature maps: the pooled output of coding stage 4 is fused with the output of the first up-sampling layer, placed just after stage 7 of VGG-16, and the pooled output of coding stage 3 is fused with the output of the second up-sampling layer. Finally, the image input is also fused with the output of the third up-sampling layer, all this followed by the final pixel classification layer. The figure indicates the forwarding links and "fusing" layers.
Figure 2: The FCN network setup: FCN with 51 layers in total, using VGG-16 as the encoding network (7 stages, 41 layers), forwarding feature maps to be fused with the outputs of up-sampling layers, fusing layers and the final pixel classification layer.
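As a hedged sketch (not the authors' exact code), an FCN of this kind can be instantiated in MATLAB with the Computer Vision Toolbox helper fcnLayers, which builds an FCN-8s on top of VGG-16 with skip connections from pooling stages 3 and 4 as described above; the input size and class list below are illustrative assumptions.

% Sketch only: FCN-8s over a VGG-16 encoder (requires the VGG-16 support package).
classes   = ["BK" "MA" "HM" "HE" "SE" "OD"];        % classes used in our experiments
imageSize = [512 512];                              % assumed network input size
lgraph    = fcnLayers(imageSize, numel(classes));   % defaults to the FCN-8s variant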

DeepLabV3
DeepLabV3 [19] is the deepest network tested in this work, with 100 layers. Figure 3 shows a summary of its main layers. Our DeepLabV3 architecture used the ResNet-18 pre-trained network as feature extractor, plus forwarding connections to the Atrous Spatial Pyramid Pooling (ASPP) layers indicated in the figure, which enable segmentation of objects at multiple scales. The outputs of the final DCNN layer are combined with a fully connected Conditional Random Field (CRF) for improved localization of object boundaries, using mechanisms from probabilistic graphical models. The figure shows that the feature extraction part of our DeepLabV3 implementation uses ResNet-18 layers, with 8 stages totaling 71 layers, the remaining stages being the ASPP plus the final stages.
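For reference, MATLAB's Computer Vision Toolbox exposes the closely related DeepLabv3+ variant through deeplabv3plusLayers; the fragment below is therefore only an approximate, hedged sketch of a ResNet-18-based instance, not the authors' exact DeepLabV3 network, and the CRF post-processing mentioned above is not part of this helper.

% Sketch only: DeepLabv3+ with a ResNet-18 backbone (assumed input size and class count).
imageSize  = [512 512 3];
numClasses = 6;                                  % BK, MA, HM, HE, SE, OD
lgraph     = deeplabv3plusLayers(imageSize, numClasses, "resnet18");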

Hardcoded Segmentation
DCNN segmentation frequently results in many background regions that are erroneously classified as lesions. This problem occurred both with DeepLabV3 and FCN. For this reason, we decided to add to the merging/filtering/voting scheme a more traditional hardcoded segmentation algorithm that could be used to filter out those false positives. The hardcoded segmentation algorithm is inspired by the approach described in [3]. It uses channel intensities, morphological operators and geometry analysis together to distinguish regions as lesions, structures and other parts. A careful tuning of the approach allowed us to eliminate most false positives in the eye fundus background. The hardcoded segmentation involves separate procedures to isolate the optic disk (OD) and the vascular tree (VA), as well as detection of EFI background pixels. It uses histogram normalization (to account for different acquisition conditions), a set of intensity thresholds on each RGB channel, labeling of connected components (bwlabel in MATLAB), region filling (imfill), dilation, opening and closing operators, and computation of region properties (regionprops in MATLAB), followed by conditions to assign a probability that a specific region is of each type of lesion, the OD, the vascular tree or the remaining EFI background.
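The following MATLAB fragment is a minimal sketch of this kind of pipeline for one lesion type (bright, exudate-like candidates on the green channel); the channel choice, thresholds and geometric rules are illustrative assumptions, not the tuned values used in our experiments.

% Sketch only: bright-lesion (exudate-like) candidate regions from one EFI.
I  = imread('efi.jpg');                                 % hypothetical input file
G  = histeq(I(:, :, 2));                                % normalize the green channel
bw = imbinarize(G, 0.85);                               % assumed intensity threshold
bw = imclose(imopen(bw, strel('disk', 2)), strel('disk', 2));  % morphological clean-up
bw = imfill(bw, 'holes');
L  = bwlabel(bw);                                       % connected components
stats = regionprops(L, 'Area', 'Eccentricity');         % region properties
keep  = find([stats.Area] > 20 & [stats.Area] < 5000 & [stats.Eccentricity] < 0.95);
candidateMask = ismember(L, keep);                      % assumed geometric rule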

EXPERIMENTS
For these experiments we used a well-known dataset (IDRiD [1]) that is made of eye fundus images and corresponding lesion masks. All approaches were implemented in MATLAB R2020b. The set of basic training options for the DCNNs included stochastic gradient descent with momentum, an initial learning rate of 0.001, a maximum number of epochs set to 500, momentum 0.9, L2 regularization of 0.005, shuffling of training and validation data every epoch, and a minibatch size of 16 (we also tested other configurations, but report only the best). For the two deep learning approaches we applied transfer learning from networks previously trained on ImageNet [2], and data augmentation to increase the effective dataset size by introducing random modifications to the images (i.e. random translations of up to 10 pixels, random rotations of up to 10 degrees, shearing up to 10 pixels and scaling up to 10%).
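As a minimal sketch, these options map onto MATLAB's trainingOptions and imageDataAugmenter roughly as follows; note that imageDataAugmenter expresses shear ranges in degrees rather than pixels, so the shear values below are an assumption.

% Sketch of the training configuration and augmentation described above.
augmenter = imageDataAugmenter( ...
    'RandXTranslation', [-10 10], 'RandYTranslation', [-10 10], ...  % pixels
    'RandRotation',     [-10 10], ...                                % degrees
    'RandXShear',       [-10 10], 'RandYShear', [-10 10], ...        % degrees (see note above)
    'RandScale',        [0.9 1.1]);                                  % up to 10% scaling
opts = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-3, ...
    'Momentum',         0.9, ...
    'L2Regularization', 0.005, ...
    'MaxEpochs',        500, ...
    'MiniBatchSize',    16, ...
    'Shuffle',          'every-epoch');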

The dataset
The IDRiD dataset [1] consists of 81 color fundus images with signs of Diabetic Retinopathy (DR). The number of images with binary masks available for each particular lesion type (some images contain multiple lesions) is as follows: MA - 81, HE - 81, HM - 80, SE - 40; masks for the optic disk (OD) are also available for every EFI. The fundus images were captured by a retinal specialist at an eye clinic located in Nanded, Maharashtra, India. Images were acquired using a Kowa VX-10 alpha digital fundus camera with a 50-degree field of view (FOV), and all are centered near the macula. The images have a resolution of 4288x2848 pixels and are stored in jpg file format. The data used in our experiments consisted of dividing the dataset randomly into 55 training and 26 test Eye Fundus Images (EFI).
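The random 55/26 split can be reproduced with a simple index permutation, sketched below (the seed is arbitrary and shown only for illustration).

% Sketch of the random train/test split over the 81 EFIs.
rng(1);                              % hypothetical seed
idx      = randperm(81);
trainIdx = idx(1:55);
testIdx  = idx(56:81);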

Approaches tested
We defined three segmentation approaches based on the alternatives discussed in the previous section: Seg, the hardcoded segmentation approach; deepL, the DeepLabV3 deep learning network-based segmentation; and FCN, the FCN deep learning network-based segmentation.
Given those three approaches, we evaluated their quality and added voting among their segmentation outputs to obtain different segmentation outputs. Looking at the quality of segmentation of the different parts, using IoU, we could understand where each approach outperforms the others. Based on that, instead of simply voting on every pixel regardless of those results, we defined one DCNN approach (either deepL or FCN) as the base segmentation and filtered its results with merging operations, keeping the combinations that achieved the best results:
• deepLSeg: DeepLabV3 used as a base, post-filtered by the handcrafted segmentation to remove lesions that are not in both outputs simultaneously. This approach is quite effective at removing BK pixels that were erroneously classified as lesions (false positives);
• deepLFCN: DeepLabV3 as a base; pixels classified as a specific lesion in DeepLabV3 but classified as eye fundus in FCN are reclassified as eye fundus. Once again, the objective is to eliminate lesion false positives;
• voteDeepLFCNSeg_voteSegMA: vote lesions and optic disk based on a majority of the deepL, FCN and hardcoded segmentation outputs. This means that a pixel for which two out of the three approaches output the same label is labelled as such, otherwise it is labelled as eye fundus (BK). Then, MA lesions are filtered based on the handcrafted segmentation (pixels not classified as MA in the hardcoded segmentation are not considered MA).
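The three rules can be summarized by the following MATLAB sketch over per-pixel label maps; the class IDs and variable names are assumptions, not the exact implementation.

% L_deep, L_fcn, L_seg: numeric label maps of equal size from DeepLabV3, FCN and
% the hardcoded segmentation. Assumed IDs: 1=BK, 2=MA, 3=HM, 4=HE, 5=SE, 6=OD.
BK = 1;  MA = 2;  lesionIDs = 2:5;

% deepLSeg: keep a lesion label from DeepLabV3 only where the hardcoded
% segmentation assigns the same label; otherwise revert the pixel to BK.
deepLSeg = L_deep;
deepLSeg(ismember(L_deep, lesionIDs) & L_seg ~= L_deep) = BK;

% deepLFCN: a pixel marked as a lesion by DeepLabV3 but as eye fundus by FCN
% is reclassified as eye fundus.
deepLFCN = L_deep;
deepLFCN(ismember(L_deep, lesionIDs) & L_fcn == BK) = BK;

% voteDeepLFCNSeg_voteSegMA: 2-of-3 majority vote (default BK), then keep MA
% only where the hardcoded segmentation also outputs MA.
vote = BK * ones(size(L_deep));
vote(L_deep == L_fcn) = L_deep(L_deep == L_fcn);
vote(L_deep == L_seg) = L_deep(L_deep == L_seg);
vote(L_fcn  == L_seg) = L_fcn(L_fcn == L_seg);
vote(vote == MA & L_seg ~= MA) = BK;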

Metrics
Metrics used to evaluate the quality of segmentation influence the reported results significantly. Since in an EFI the background (BK=eye fundus that is not lesion) is frequently more than 95% of all pixels, average metrics over all pixels will be mostly influenced by the quality of segmentation of the background itself instead of lesions.
Hence, a global pixel accuracy of 90% may still correspond to a very bad segmentation of one or more types of lesions. A metric such as Intersection over the Union (IoU), a.k.a. Jaccard Index (JI), computed for each type of lesion as IoU = TP / (TP + FP + FN), is more appropriate to correctly evaluate the quality, not only because it is applied to each lesion separately, but also because, while per-class accuracy evaluates only against the pixels that actually belong to the class, acc = TP / (TP + FN), IoU also penalizes false-positive lesion pixels. Accordingly, we chose to report the IoU of each lesion type, and the IoU of the background (BK) and of the optic disk (OD). Two more IoU measurements were added. The first one is the mean IoU over all classes (IoUC), which is simply the mean of the previous IoU values over all classes, including the optic disk (OD) and eye fundus (BK). The second one is the mean IoU of lesions (IoUL), which is the mean of the previous IoU values over all lesions (therefore, BK and OD are excluded in this case).
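A minimal MATLAB sketch of these measurements is given below, where pred and gt stand for the predicted and groundtruth label maps and the numeric class IDs are assumptions.

% IoU for one class: TP / (TP + FP + FN).
function iou = classIoU(pred, gt, c)
    tp  = nnz(pred == c & gt == c);
    fp  = nnz(pred == c & gt ~= c);
    fn  = nnz(pred ~= c & gt == c);
    iou = tp / (tp + fp + fn);
end

% Per-class IoU plus the two aggregate measurements.
classes = 1:6;                                % assumed IDs: 1=BK,2=MA,3=HM,4=HE,5=SE,6=OD
ious    = arrayfun(@(c) classIoU(pred, gt, c), classes);
IoUC    = mean(ious);                         % mean IoU over all classes (incl. BK, OD)
IoUL    = mean(ious(2:5));                    % mean IoU over lesions only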

Results and analysis
Note that, in the results (Table 1), lesions typically have small IoU (0 to 30%), while large and constant structures such as the background (eye fundus) and the optical disk have 70% to 97% IoU. This is expected, since lesions are much smaller, more varied in size, shape and colour, and their colour and texture can easily be confounded by the algorithms (deep learning or otherwise) with the vascular tree, optic disk or other lesions. It should also be noted that IoU is measured as a percentage of the area of the lesions; as a consequence, a small IoU on small lesions means a significant error relative to the size of those lesions. From the results we can also see that the two DCNNs (DeepLabV3 and FCN) exhibit better IoU scores than the hardcoded algorithm. However, the hardcoded algorithm had a much better IoU for the background (BK) than the two DCNNs. This hints that merging results from the hardcoded algorithm and the DCNNs could improve results overall, which we did next.
DeepLabV3 post-filtered by the hardcoded segmentation (deepLSeg) improved the quality of segmentation further. The post-filtering helped reduce the amount of false-positive lesions that are part of the BK. An even better improvement was obtained by post-filtering deepL based on the FCN output (deepLFCN). Since the false-positive lesion misclassifications of deepL and FCN were not exactly the same, this step helped reduce the number of false-positive lesions by only accepting those that are classified similarly by both approaches. Finally, the last experiment, using the three outputs with MA post-filtering, achieved the best results.

CONCLUSIONS AND FUTURE WORK
Segmenting lesions from eye fundus images is a difficult task, which should be analyzed using appropriate metrics to evaluate the quality of segmentation. In this work we tested the hypothesis that merging the outputs of different segmentation networks using some voting scheme, and/or also adding hardcoded segmentation to the merging scheme, can significantly improve the quality of segmentation. We created an experimental setup to test this hypothesis using two well-known deep learning segmentation networks and a hardcoded algorithm. Experimental results with a public dataset have shown promising improvements using the voting/merging schemes. In the future we hope to investigate other alternatives. We hope to build and experiment with other network architectures, varying inner architecture blocks to create a forest of multiple networks, adding further attention mechanisms to the network architectures, and integrating the multiple networks into a single DCNN with multiple branches. Generically, the idea is to consider a DCNN output as a weak classifier, create a large number of such weak classifiers by changing the network architectures, and use voting over the resulting forest of networks to reach a more accurate final classification of each pixel.

Figure 1: Sketch of the proposed approach based on voting.

Table 1: Quality of each method against each type of structure and/or lesion, measured as IoU (Intersection over the Union), a.k.a. Jaccard Index (JI). The classes are: BK=eye fundus; MA=microaneurysms; HM=hemorrhages; HE=hard exudates; SE=soft exudates; OD=optical disk.