Self-Correlation and Cross-Correlation Learning for Few-Shot Remote Sensing Image Semantic Segmentation

Remote sensing image semantic segmentation is an important problem for remote sensing image interpretation. Although remarkable progress has been achieved, existing deep neural network methods suffer from a reliance on massive training data. Few-shot remote sensing semantic segmentation aims at learning to segment target objects from a query image using only a few annotated support images of the target class. The limitations of most existing few-shot learning methods stem primarily from their sole focus on extracting information from support images, which leaves them unable to effectively address the large variance in the appearance and scale of geographic objects. To tackle these challenges, we propose a Self-Correlation and Cross-Correlation Learning Network for few-shot remote sensing image semantic segmentation. Our model enhances generalization by considering both the self-correlation within query images and the cross-correlation between support and query images to make segmentation predictions. To further exploit the self-correlation within the query image, we adopt a classical spectral method to produce a class-agnostic segmentation mask based on the basic visual information of the image. Extensive experiments on two remote sensing image datasets demonstrate the effectiveness and superiority of our model in few-shot remote sensing image semantic segmentation. Code and models are available at https://github.com/linhanwang/SCCNet.


INTRODUCTION
Semantic segmentation of remote sensing images has become an essential task for various applications, such as land use analysis [11], urban management [35], environmental monitoring [13], and other areas of national economic development. Although deep neural networks for semantic segmentation [2, 4, 18, 43] have achieved remarkable progress, their reliance on large-scale datasets greatly restricts their application in low-resource domains. For example, collecting an adequate amount of remote sensing data is hard, and the expense associated with hiring domain experts to annotate the data is too costly to be feasible. To reduce this annotation burden, few-shot semantic segmentation has been proposed [24], which aims to learn a model that can perform segmentation on novel classes with only a few annotated images; typically, class prototypes extracted from the annotated support images are used to guide the segmentation process of the query image. Building upon this research direction, some studies [33, 36] strive to design more representative support prototypes to enhance segmentation performance. While significant advancements have been made for natural images, these methods encounter challenges when applied to remote sensing images, primarily due to the presence of large intra-class variance. Specifically, geographic objects of the same class can exhibit substantial variations in appearance and scale when viewed from different angles. Recently, SDM [38] proposed a scale-aware focal loss to focus training on tiny hard-parsed objects and performs detailed matching with multiple prototypes to provide more accurate parsing guidance. However, SDM only considers the cross-correlation between support and query images, ignoring the self-correlation between pixels within the query image. We find that self-correlations within the query image can provide extra knowledge to help segment tiny objects, which is very important for few-shot remote sensing image semantic segmentation, particularly when there is a significant discrepancy between the support and query images.
To address the aforementioned challenges, we propose a novel model, named SCCNet, to leverage knowledge from query images for few-shot remote sensing image semantic segmentation. As illustrated in Fig. 1, the proposed model consists of two key components. First, we incorporate the initial query mask prediction to collect query features in high-confidence regions and then use the generated masked query features to perform self-matching with the query features. Since pixels belonging to the same object are expected to exhibit higher similarity than those belonging to different objects, the Self-Matching Module can provide auxiliary support information to segment the query image. Second, we propose a novel Spectral Segmentation Module to further extract knowledge from query images with classical spectral methods. Specifically, we first construct affinity matrices using basic visual information (i.e., color and position information) and semantic information derived from the middle-layer features of the pretrained backbone. Then we decompose images using the eigenvectors of the Laplacian of the affinity matrices as soft segments and obtain class-agnostic eigensegments. Since this module operates solely on the query images without relying on the support annotations, it is naturally resilient to the significant discrepancies that may exist between the support and query images. The final prediction mask of the query image is obtained by fusing the optimized query mask and the eigensegment.
Our key contributions can be summarized as follows:

• We propose a Self-Correlation and Cross-Correlation Learning Network for few-shot remote sensing image semantic segmentation. Our model enhances generalization by considering both the self-correlation within query images and the cross-correlation between support and query images to make segmentation predictions.

• We propose a Self-Matching Module to extract more comprehensive query information. The correlation between the initial segmentation and the query image is introduced into the model to tackle the large discrepancy between support and query images.

• We propose a novel Spectral Segmentation Module with spectral analysis to produce class-agnostic segmentations of query images without the supervision of any annotations.

• We evaluate the proposed model on two remote sensing image datasets for few-shot semantic segmentation tasks. Comprehensive experiments demonstrate that our SCCNet consistently outperforms all baselines in both the 1-shot and 5-shot settings.

RELATED WORK

Remote Sensing Image Semantic Segmentation
Deep learning-based methods have gained significant popularity in the remote sensing community, showcasing remarkable progress in segmenting remote sensing images. Specifically, Maggiori et al. [20] introduced a multilayer perceptron (MLP) into the segmentation network to produce better segmentation results. Yu et al. [39] introduced the pyramid pooling module as a means to address semantic segmentation in remote sensing images, while Yue et al. [40] developed TreeUNet as the first adaptive Convolutional Neural Network (CNN) specifically tailored for semantic segmentation in this domain. Zhang et al. [42] adopted the multibranch parallel convolution structure in HRNet [29] to generate multiscale feature maps and designed an adaptive spatial pooling module to aggregate more local contexts. To tackle the challenge of small-scale object segmentation, Kampffmeyer et al. [12] combined patch-based pixel classification and pixel-to-pixel segmentation, introducing uncertainty mapping to achieve high performance on small-scale objects. FactSeg [19] proposed a symmetrical dual-branch decoder consisting of a foreground activation branch and a semantic refinement branch; the two branches perform multiscale feature fusion through skip connections, thereby improving the accuracy of small-scale object segmentation. Furthermore, with the emergence of various attention mechanisms, Ding et al. [7] designed an efficient local attention embedding to enhance segmentation performance.
Although existing methods effectively demonstrate the capabilities of deep learning in remote sensing image semantic segmentation, they typically require a large number of densely-annotated images for training and have difficulties in generalizing to unseen object categories.

Few-shot Semantic Segmentation
To address the generalization issue and reduce massive training data annotation, the Few-Shot Semantic Segmentation (FSS) task has been proposed, which aims to learn a model that can perform segmentation on novel classes with only a few pixel-level annotated images. Shaban et al. [24] first proposed one-shot semantic segmentation networks to address FSS. Their method uses global average pooling over the foreground region of the support features to generate class prototypes, which are then employed to guide the segmentation process of the query image. Building upon the concept of prototypical networks [28], utilizing prototype representations to guide mask prediction in query images has become a popular paradigm in the field of few-shot segmentation. Specifically, PANet [33] proposed a prototype alignment regularization between support and query images to generate high-quality prototypes. PMMs [36] employ the Expectation-Maximization algorithm to generate multiple prototypes corresponding to different parts of the objects. Recently, a group of matching-based methods has been proposed to leverage dense correspondences between query images and support annotations. HSNet [22] utilizes 4D convolutions to extract precise segmentation masks by compressing multilevel 4D correlation tensors. VAT [10] proposes a 4D Convolutional Swin Transformer to aggregate the correlation map. To fully harness the information within the support set, Yang et al. [37] employ clustering techniques to mine latent novel classes in the support set and subsequently treat them as pseudo labels during the training process.
Despite the remarkable progress achieved on natural images, Yao et al. [38] found that performance drops dramatically on unseen classes in remote sensing images. This limitation arises from the inability of these methods to effectively handle the significant variations in object appearance and scale prevalent in remote sensing images. To address this challenge, SDM [38] proposes a scale-aware focal loss, which enhances the focus on tiny objects. DMML-Net [32] uses an affinity-based fusion mechanism to adaptively calibrate the deviation of the prototype induced by intra-class variation.
It is worth noting that all existing methods primarily focus on extracting information solely from the support set to make segmentation predictions. However, we argue that this approach may not be sufficient for remote sensing images, where substantial discrepancies exist between the support and query images. In this study, we aim to pioneer a novel direction by exploiting the self-contained knowledge in the query images to boost performance for few-shot remote sensing image semantic segmentation.

Spectral Methods for Segmentation
Spectral analysis originally emerged from the study of continuous operators on Riemannian manifolds [3]. Subsequent research extended this line of work to the discrete setting of graphs, leading to numerous findings that connect global graph properties to the eigenvalues and eigenvectors of their associated Laplacian matrices. Lin et al. [17] demonstrate that the eigenvectors of graph Laplacians yield graph partitions with minimum energy. Building upon this insight, Shi et al. [26] view image segmentation as a graph partitioning problem and propose a novel global criterion, the normalized cut, for image segmentation. As presented by Aksoy et al. [1], soft segmentations are automatically generated by fusing high-level and low-level image features within a graph structure. The construction of this graph facilitates the utilization of the corresponding Laplacian matrix and its eigenvectors to reveal semantic objects and capture soft transitions between them.

Problem Formulation

To mitigate the risk of overfitting caused by limited training data, we adopt a commonly used meta-learning technique known as episodic training [31]. In the K-shot setting, we employ episodic sampling to select K + 1 annotated image pairs, denoted as {(I_s^1, M_s^1), (I_s^2, M_s^2), ..., (I_s^K, M_s^K), (I_q, M_q)}, with the same targeted class from the training dataset D_train. Here, {(I_s^k, M_s^k)}_{k=1}^K represents the support samples, and (I_q, M_q) denotes the query pair. During the training phase, the segmentation model takes both the support samples {(I_s^k, M_s^k)}_{k=1}^K and the query image I_q as inputs and generates a predicted mask M̂_q. This prediction is then supervised by the corresponding ground-truth mask M_q. Similarly, during the testing phase, we employ K annotated image pairs from D_test to infer the semantic objects present in the query images.
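As an illustration, the episodic sampling described above can be sketched in a few lines of Python. The `dataset` layout (a mapping from class name to a list of (image, mask) pairs) is a hypothetical convenience for this sketch, not the paper's actual data loader.

```python
import random

def sample_episode(dataset, k, rng=random):
    """Sample a K-shot episode: K support pairs plus one query pair
    sharing the same target class. `dataset` is a hypothetical mapping
    from class name -> list of (image, mask) pairs."""
    cls = rng.choice(sorted(dataset.keys()))   # pick a target class
    pairs = rng.sample(dataset[cls], k + 1)    # K + 1 distinct annotated pairs
    return cls, pairs[:k], pairs[k]            # class, support set, query pair
```

During training, each episode yields one optimization step; at test time the same sampler draws episodes from D_test.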

PROPOSED APPROACH
To solve the few-shot semantic segmentation problem in remote sensing images, we propose a novel model named SCCNet, as shown in Fig. 2. First, we use pre-trained CNNs (VGG [27] or ResNet [9]) as the feature extractor to generate the corresponding query and support features. In the Cross-Matching Module, pixel-wise multi-scale correlation tensors between masked support features and query features are built and squeezed to generate the initial predicted query mask M_init. To tackle the high intra-class variance problem in remote sensing images, the Self-Matching Module calculates the correlations between the query features masked by M_init and the other query features. These correlations are further squeezed and merged with the squeezed correlations between support and query features to generate the optimized query mask M_opt. To further mine knowledge from the query images, the Spectral Segmentation Module utilizes classic spectral analysis to exploit the proximity of local regions. Specifically, eigenvectors of the Laplacian of the affinity matrix are used as soft segments and afterward transformed into eigensegments by thresholding algorithms. In the end, the final prediction mask of the query image is obtained by fusing the optimized query mask and the eigensegment.

Cross-Matching Module
Different from encoding an annotated support image into a single feature vector to facilitate query image segmentation, our Cross-Matching Module adopts the pixel-wise correlation between the support and query images to make a segmentation.

Hypercorrelation pyramid construction. We extract features from query and support images and compute the correlation between them. Given a pair of query and support images, I_q and I_s, we adopt a pretrained backbone to produce a sequence of L feature maps, {(F_q^l, F_s^l)}_{l=1}^L, where F_q^l and F_s^l denote the query and support feature maps at the l-th level, respectively. The support mask M_s is used to encode segmentation information and filter out background information. We obtain a masked support feature as F̂_s^l = F_s^l ⊙ ζ_l(M_s), where ⊙ denotes the Hadamard product and ζ_l: R^{H×W} → R^{C_l×H_l×W_l} denotes a function that resizes the given tensor followed by expansion along the channel dimension of the l-th layer. The cosine similarities between query features and masked support features form a pyramid of 4D correlation tensors Ĉ.

4D-convolutional pyramid encoder. The encoder takes the hypercorrelation pyramid Ĉ and effectively squeezes it into a condensed feature map Z ∈ R^{128×H_1×W_1}. As shown in Figure 3, sequences of multi-channel 4D convolutions with large strides periodically squeeze the last two (support) spatial dimensions of Ĉ down to (H_s, W_s) while the first two (query) spatial dimensions remain (H_q, W_q). Similar to the FPN [16] structure, two outputs from adjacent pyramidal layers, l and l + 1, are merged by element-wise addition after upsampling the (query) spatial dimensions of the upper layer. After merging, the output tensor of the lowest block is further compressed by average-pooling its last two (support) spatial dimensions, which in turn provides a 2-dimensional feature map Z.
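To make the pixel-wise correlation concrete, the following sketch computes a single-level 4D correlation tensor of ReLU-clipped cosine similarities between query features and masked support features, in the spirit of the hypercorrelation construction above. It is a minimal NumPy illustration; the actual model builds a pyramid of such tensors from multiple backbone layers and processes them with 4D convolutions.

```python
import numpy as np

def correlation_tensor(fq, fs, support_mask):
    """fq, fs: (C, H, W) query/support feature maps at one level;
    support_mask: (H, W) binary mask resized to the feature resolution."""
    C, H, W = fq.shape
    fs = fs * support_mask[None]                      # filter background support features
    q = fq.reshape(C, -1)
    s = fs.reshape(C, -1)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + 1e-8)
    corr = np.clip(q.T @ s, 0.0, None)                # ReLU keeps positive correlations
    return corr.reshape(H, W, H, W)                   # 4D tensor: (query pos) x (support pos)
```

The first two dimensions index query positions and the last two index support positions, matching the (query, support) dimension split squeezed by the encoder.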

Self-Matching Module
While the Cross-Matching Module successfully captures intricate correlations between support and query images, it faces limitations when significant disparities exist between the support and query features. Consequently, the initial query mask M_init generated by the Cross-Matching Module may lack crucial details, which is a pain point for the segmentation task. To tackle this issue, the Self-Matching Module (SMM) is proposed to provide auxiliary support information for segmenting the query image. Suppose the query image is I_q and the initial query mask is M_init. In the Self-Matching Module, instead of calculating the correlation tensor between masked support features and query features, we calculate the correlation tensor between the initially masked query features and the query features, where the masked query feature is F̂_q^l = F_q^l ⊙ ζ_l(M_init). Following the procedure in the Cross-Matching Module, we obtain M_self ∈ [0, 1]^{2×H×W}. Then, we concatenate M_self with the two-channel output of the Cross-Matching Module and utilize a 1×1 convolution to reduce the channel dimension, obtaining M_opt ∈ [0, 1]^{2×H×W}.
In the Self-Matching Module, the loss function L_m for training the model is computed as L_m = BCE(M_opt, M_q), where BCE(•) is the binary cross-entropy loss and M_q is the ground-truth mask of the query image.
To further facilitate the self-matching procedure, we propose a query self-matching loss: L_self = BCE(M̃_self, M_q). Here, M̃_self is generated following the procedure of M_self, but with the ground-truth query mask used to compute the masked query feature, F̃_q^l = F_q^l ⊙ ζ_l(M_q). The motivation is that the quality of the initial predicted query mask directly influences the auxiliary information extracted during the self-matching stage. Finally, we train the model in an end-to-end manner by jointly optimizing L = L_m + λ L_self, where λ serves as the weight strength, and we set λ = 1.0 in our experiments.
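The joint objective L = L_m + λ L_self can be sketched as follows. Here `bce` is a plain NumPy binary cross-entropy over per-pixel foreground probabilities, a simplification of the two-channel formulation used in the model.

```python
import numpy as np

def bce(pred, gt, eps=1e-7):
    """Mean per-pixel binary cross-entropy; pred in (0, 1), gt in {0, 1}."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)).mean())

def joint_loss(m_pred, m_self, m_gt, lam=1.0):
    """L = L_m + lam * L_self, with lam = 1.0 as in our experiments."""
    return bce(m_pred, m_gt) + lam * bce(m_self, m_gt)
```

Setting `lam=0.0` recovers the plain segmentation loss, which is useful for ablating the self-matching term.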

Spectral Segmentation Module
The Self-Matching Module incorporates the proximity between the initial query mask M_init and the query image within the model, effectively addressing the challenge of large intra-class variance. However, the performance of this module is influenced by the quality of M_init. To overcome this limitation, we employ a spectral analysis method to extract valuable knowledge from an affinity matrix constructed solely from the query image.
The derivation of the affinity matrix is key to spectral decomposition. Inspired by Melas-Kyriazi et al. [21], we leverage the features f from the middle layer of the pretrained backbone to construct a semantic affinity matrix. Additionally, since the features are extracted for aggregating similar features rather than anti-correlated ones, we set the affinity threshold to 0:

W_sem = (f f^T) ⊙ (f f^T > 0).

While the affinities derived from embedding features are rich in semantic information, they lack low-level proximity cues such as color similarity and spatial distance. To address this, we adopt image matting [5, 14] to incorporate basic visual information into the Spectral Segmentation Module. Specifically, we first transform the input image into the HSV color space: g(i) = (cos(h), sin(h), s, v, x, y)^T, where h, s, v are the respective HSV coordinates and (x, y) denotes the spatial coordinates of pixel i. Here g contains color and position information, which can be seen as the 0-th layer feature of the network. Then, we construct a sparse affinity matrix from pixel-wise nearest neighbors based on g:

W_knn(i, j) = 1 − ||g(i) − g(j)|| if j ∈ KNN_g(i), and 0 otherwise,

where ||•|| denotes the 2-norm and KNN_g(i) are the k-nearest neighbors of i under the distance defined by g. The overall affinity matrix is defined as the weighted sum of the two:

W = W_sem + γ W_knn.

The residual ratio γ is a hyperparameter weighing the importance of the visual and semantic information. Empirically, we set γ = 5 in our experiments.
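A toy construction of the combined affinity matrix might look like the following. The brute-force neighbor search and the clipped `1 − distance` kNN affinity are illustrative simplifications of the matting-based affinities; the names `semantic_affinity`, `knn_affinity`, and `affinity` are ours, not the paper's.

```python
import numpy as np

def semantic_affinity(feats):
    """feats: (N, D) per-pixel embedding features; keep only positive correlations."""
    w = feats @ feats.T
    return w * (w > 0)

def knn_affinity(pix_feats, k=3):
    """pix_feats: (N, 6) rows (cos h, sin h, s, v, x, y); sparse kNN graph."""
    n = len(pix_feats)
    d = np.linalg.norm(pix_feats[:, None] - pix_feats[None], axis=-1)
    w = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d[i])[1:k + 1]               # skip self at distance 0
        w[i, nn] = np.maximum(1.0 - d[i, nn], 0.0)   # closer neighbors -> larger affinity
    return np.maximum(w, w.T)                        # symmetrize

def affinity(feats, pix_feats, gamma=5.0, k=3):
    """W = W_sem + gamma * W_knn, with gamma = 5 as in our experiments."""
    return semantic_affinity(feats) + gamma * knn_affinity(pix_feats, k)
```

The resulting matrix is symmetric and non-negative, as required for the Laplacian eigendecomposition that follows.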
With the affinity matrix W, we compute the top n eigenvectors {V_i}_{i=0}^{n−1} of its Laplacian L. As shown in Figure 4, after being resized to H × W, the eigenvectors are soft segments with continuous values. To convert the soft segments into hard mask predictions, we introduce two thresholding algorithms into the Spectral Segmentation Module. The pipeline of this combination process is illustrated in Fig. 5. Specifically, we first utilize the Multi-Otsu algorithm [15] to find salient objects and adopt the Adaptive Thresholding [8] algorithm to extract the sharp boundaries in the eigenvectors. Then we combine them with the Hadamard product to generate the final eigensegments Ẽ_i ∈ {0, 1}^{H×W}:

Ẽ_i = MO(V_i) ⊙ AT(V_i),  i ∈ {1, ..., n − 1},   (9)

where MO(•) and AT(•) denote Multi-Otsu and Adaptive Thresholding, respectively, and we exclude the zero-th constant eigenvector.
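The spectral step can be illustrated as below: we take eigenvectors of the unnormalized graph Laplacian and binarize them. A simple median threshold stands in for the Multi-Otsu and adaptive-thresholding combination, which is omitted here for brevity.

```python
import numpy as np

def eigensegments(W, n_segs=3):
    """Binarize the smallest non-trivial eigenvectors of L = D - W.
    The median threshold is a simplified stand-in for Eq. (9)."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                       # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)              # eigenvectors, ascending eigenvalues
    segs = []
    for i in range(1, n_segs + 1):           # skip the constant 0-th eigenvector
        v = vecs[:, i]
        segs.append((v > np.median(v)).astype(np.uint8))
    return segs
```

On a graph with two densely connected blocks and weak cross-links, the first eigensegment recovers the block structure, mirroring how the first eigenvector typically isolates the most salient object.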

Inference
Given a support pair (I_s, M_s) and a query image I_q, we first generate the predicted query mask M_opt through the Cross-Matching and Self-Matching Modules. Meanwhile, we calculate the top n − 1 spectral eigensegments {Ẽ_i}_{i=1}^{n−1} of each query image. Since the eigensegments are class-agnostic, we fuse the merged mask M_opt with the first eigensegment Ẽ_1, which has the highest confidence, to obtain the final prediction. In addition, to explore the full potential of spectral segmentation, we also present the result of selecting the best Ẽ* from {Ẽ_i}_{i=1}^{n−1} ranked by IoU with the ground-truth query mask. The final prediction of the query mask is the union of the eigensegment and the predicted mask, M_final = Φ(M_opt, Ẽ), where Φ(•) is the pixel-wise logical OR function.
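The fusion operator Φ is simply a pixel-wise logical OR of the two binary masks, e.g.:

```python
import numpy as np

def fuse(m_query, eigenseg):
    """Phi: pixel-wise logical OR of the optimized query mask
    and the class-agnostic eigensegment."""
    return np.logical_or(m_query, eigenseg).astype(np.uint8)
```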
Our model can be easily extended to the K-shot setting: Given K support image-mask pairs S = {(I_s^k, M_s^k)}_{k=1}^K and a query image I_q, the model performs K forward passes to produce a set of K mask predictions {M̂_q^k}_{k=1}^K. We perform voting at every pixel location by summing all K predictions and dividing each output score by the maximum voting score. We assign foreground labels to pixels whose values are larger than a threshold τ, whereas the others are classified as background. We set τ = 0.4 in our experiments.
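The K-shot voting scheme above can be sketched as:

```python
import numpy as np

def kshot_vote(preds, tau=0.4):
    """preds: list of K binary masks from K forward passes.
    Sum the votes, normalize by the maximum voting score,
    then threshold at tau (0.4 in our experiments)."""
    votes = np.sum(preds, axis=0).astype(float)
    votes /= max(votes.max(), 1e-8)          # avoid division by zero on empty masks
    return (votes > tau).astype(np.uint8)
```

With K = 3, a pixel voted by all three passes survives, while a pixel voted by only one (score 1/3 < 0.4) is discarded.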

EXPERIMENT
To demonstrate the effectiveness of the proposed method, the experiments are organized as follows. We first describe the adopted datasets, iSAID-5^i and DLRSD-5^i. Next, the evaluation metrics and implementation details are introduced. Then, the segmentation results and comparisons with state-of-the-art few-shot segmentation methods are presented. Finally, we conduct a series of ablation studies to analyze the impact of each component of our proposed method.

Datasets
iSAID-5^i. The iSAID dataset [34] contains 655,451 object instances for 15 categories across 2,806 high-resolution images, which exactly matches the requirements of the few-shot segmentation task. Based on this, Yao et al. [38] create the iSAID-5^i dataset following the setting in PASCAL-5^i [24], and the class details are shown in Table 1. Particularly, for the 15 object categories in the iSAID-5^i dataset, cross-validation is leveraged to evaluate the proposed model: the five classes in one fold serve as test categories D_test, while the ten classes in the remaining two folds serve as training categories D_train. The details of the class splits are shown in Table 3, where i is the fold number. For every fold, we use the same model with the same hyperparameter setup following the standard cross-validation protocol. The iSAID-5^i dataset contains 18,076 images for training and 6,363 images for validation, and the resolution of all images is fixed to 256 × 256. Furthermore, this dataset provides sufficient size diversity for the few-shot remote sensing image semantic segmentation task.

DLRSD-5^i. The Dense Labeling Remote Sensing Dataset (DLRSD) [25] is a publicly available dataset for evaluating multi-label remote sensing image retrieval and semantic segmentation algorithms. DLRSD contains 2,100 RGB images in total with 17 object classes, and the image size is fixed at 256 × 256 pixels. To balance the number of classes in each fold, we use 15 categories of DLRSD to build DLRSD-5^i. The details of the class splits are shown in Table 4.

Evaluations metrics
We adopt mean intersection over union (mIoU) as our evaluation metric. For each category, the IoU is calculated by IoU = TP / (TP + FN + FP), where TP, FN, and FP respectively denote the number of true positive, false negative, and false positive pixels of the predicted mask. The mIoU metric averages the IoU values of all classes in a fold: mIoU = (1/C) Σ_{c=1}^{C} IoU_c, where C is the number of classes in the target fold and IoU_c is the intersection over union of class c.
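In code, the metric reads:

```python
import numpy as np

def iou(pred, gt):
    """Per-class IoU = TP / (TP + FN + FP) over binary masks."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    return tp / (tp + fp + fn)

def miou(per_class_ious):
    """Average the IoU values over the C classes of a fold."""
    return sum(per_class_ious) / len(per_class_ious)
```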

Implementation details
For the backbone network, we employ VGG [27] and ResNet [9] families pre-trained on ImageNet [6], e.g., VGG16, ResNet50, and ResNet101. For the VGG16 backbone, we extract features after every conv layer in the last two building blocks, from conv4_x to conv5_x, and after the last max-pooling layer. For the ResNet backbones, we extract features at the end of each bottleneck before the ReLU activation, from conv3_x to conv5_x. This feature extraction scheme results in 3 pyramidal layers (L = 3) for each backbone. In the Spectral Segmentation Module, we pick the layer of size 64 × 64 as f to construct the affinity matrix W_sem, since it contains rich semantic information at high resolution. The image size in both iSAID-5^i and DLRSD-5^i is 256 × 256, i.e., H = W = 256. The network is implemented in PyTorch [23] and optimized with the SGD optimizer, where the learning rate is 9e-4, the weight decay is 5e-4, and the momentum is 0.9. The learning rate is scheduled with a polynomial strategy. The backbone is trained jointly with a 10 times smaller learning rate.
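The polynomial learning-rate schedule can be sketched as follows; the exponent 0.9 is a common default for polynomial decay and is our assumption, as the paper does not specify it.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - step / max_steps) ** power.
    The power of 0.9 is an assumed common default, not from the paper."""
    return base_lr * (1 - step / max_steps) ** power
```

The backbone would use `poly_lr(base_lr / 10, ...)` to realize the 10-times-smaller learning rate.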

Compared with SOTA
Dataset     Test classes
DLRSD-5^0   airplane, bare soil, buildings, cars, chaparral
DLRSD-5^1   court, dock, field, grass, mobile home
DLRSD-5^2   pavement, sand, sea, ship, tanks

To assess the efficacy of our model, we extensively compare it with state-of-the-art (SOTA) methods [22, 30, 33, 36, 38, 41] on the iSAID-5^i and DLRSD-5^i datasets, employing different backbone networks and few-shot settings.

iSAID-5^i. Table 5 presents a summary of the results on iSAID-5^i. When using Ẽ_1, our method outperforms other state-of-the-art methods in almost all experiment settings. Notably, with ResNet50 as the backbone, our method achieves 4.57% and 2.33% improvements in mIoU over the state of the art in the 1-shot and 5-shot settings, respectively. When Ẽ* is used, the improvements are further enlarged to 8.35% and 3.39%.

DLRSD-5^i. Table 6 presents a summary of the results on DLRSD-5^i, with ResNet50 as the backbone. When Ẽ_1 is used, our method achieves 2.73% and 1.26% improvements in the 1-shot and 5-shot settings, respectively. When Ẽ* rather than Ẽ_1 is used, our method achieves a 4.68% improvement over the state of the art in the 1-shot setting.
To conduct a more thorough analysis of performance across diverse classes in the few-shot setting, we gather detailed results for the one-shot scenario using the ResNet50 [9] backbone. The specific outcomes on iSAID-5^i and DLRSD-5^i are presented in Tables 7 and 8. On both datasets, our model demonstrates the highest performance compared to other state-of-the-art (SOTA) methods in 10 out of 15 categories, while in the remaining classes, our model achieves the second-best performance. This substantiates the effectiveness and versatility of our approach.
Notably, we observe an intriguing trend: the improvement in the 1-shot setting is more significant than that in the 5-shot setting across all three backbones. This observation aligns with our design choice, suggesting that our method effectively mitigates intra-class variation. Conversely, in the 5-shot setting, it is more likely that some support images closely resemble the query image.
Considering the extensive analysis conducted, we can confidently conclude that our proposed method effectively tackles the few-shot semantic segmentation task for remote sensing images. Qualitative results are shown in Fig. 6.

Ablation study
Ablation study on designed modules. To further demonstrate the effectiveness of our designed modules, we conduct ablation experiments on iSAID-5^i using the 1-shot setting and the ResNet50 backbone. Table 9 presents the results. The baseline model solely comprises the Cross-Matching Module, which is based on HSNet [22]. By introducing the Self-Matching Module, we observe a notable improvement of 3.95% in mIoU. This outcome highlights the significant benefit of the Self-Matching Module, which introduces proximity information between the initial query mask and the query image into the model.

Ablation study on fusion strategy of eigensegments. As shown in Table 9, when we fuse Ẽ_1 with the mask generated by the Cross-Matching Module, we achieve a notable improvement of 1.27% in mIoU, which proves the efficacy of the Spectral Segmentation Module. When we fuse Ẽ_1 with the mask generated by the Self-Matching Module, the total improvement reaches 4.57%, a large margin. In addition, our investigation reveals that the target object is not always contained in the first eigensegment, as it may not be the most salient foreground object. For instance, in the first image of Figure 4, the buses are the most salient objects and are present in the first eigenvector, while the small vehicle is present in the second eigenvector. To fully explore the capabilities of spectral segmentation, as discussed in Section 4.4, we fuse Ẽ* with the predicted mask. This operation yields a significant increase in improvement, with a difference of 8.13% from the baseline. This result demonstrates that the Spectral Segmentation Module, which solely mines knowledge from the query image, successfully tackles the large discrepancies between the support and query images observed in remote sensing images.

Ablation study on the design of the Self-Matching Module. In our model architecture, we employ a two-branch network, where the Cross-Matching Module and Self-Matching Module have separate weights. This choice doubles the number of learnable parameters in our model. To investigate the possibility of reducing memory consumption, we conduct an ablation study on a single-branch structure, where the Cross-Matching Module and Self-Matching Module share the same weights. However, as shown in Table 10, the performance of the single-branch structure is even inferior to that of HSNet [22], not to mention the two-branch network. This observation suggests that the Cross-Matching Module and Self-Matching Module have subtle differences, and sharing weights actually harms the performance of the Cross-Matching Module instead of enhancing it. Nevertheless, due to the sparse design of the center-pivot 4D convolution [22] we adopt, our model still has a relatively small number of learnable parameters compared to other methods [30, 33, 36, 38, 41].
Ablation study on γ of the Spectral Segmentation Module. In the Spectral Segmentation Module, γ is a key hyperparameter balancing the semantic affinity matrix W_sem and W_knn, which contains raw image information. To select the best value of γ, we conduct ablation studies on iSAID-5^i with the 1-shot setting and the ResNet50 backbone. As shown in Table 11, γ = 5 achieves the best performance.

CONCLUSION
In this work, we propose a novel SCCNet for the few-shot remote sensing image semantic segmentation task. Specifically, the Self-Matching Module is designed to incorporate the initial query mask prediction to collect query features in high-confidence regions and then use the generated query prototype to perform self-matching with query features. In addition, we propose the Spectral Segmentation Module with spectral analysis methods to produce class-agnostic segmentations of query images without the supervision of any annotations. The proposed model is evaluated on two commonly adopted benchmarks for few-shot remote sensing image semantic segmentation. Without any extra knowledge or data, our SCCNet outperforms previous work by a large margin.

Figure 1 :
Figure 1: Overview of our proposed model (SCCNet) for few-shot remote sensing image semantic segmentation. SCCNet incorporates self-correlation information into the model and decomposes an image using the eigenvectors of a Laplacian matrix constructed from visual features to extract self-contained knowledge from the query image.

Few-shot semantic segmentation aims to perform segmentation on novel classes with only a few annotated images. Suppose we are provided with images from two non-overlapping class sets: C_train and C_test. The training dataset D_train is constructed from the class set C_train, and the test dataset D_test is constructed from the class set C_test.

Figure 2 :
Figure 2: Overall pipeline of our proposed network. The pretrained backbone is utilized as the feature extractor to generate corresponding support and query features. In the Cross-Matching Module, the correlation between masked support features and query features is built and squeezed to generate the initial query mask. In the Self-Matching Module, the correlation between the query image and the initial query mask is further added into the model to generate a finer query mask. Finally, in the Spectral Segmentation Module, the query mask is fused with eigensegments obtained by non-learning-based spectral analysis.

Figure 3 :
Figure 3: Simplified illustration of the effect of 4D convolution kernels that squeeze the support spatial dimensions.
The resulting feature map Z ∈ R^{128×H_1×W_1} signifies a condensed representation of the hypercorrelation pyramid Ĉ. 2D-convolutional context decoder. The decoder network consists of a series of 2D convolutions, ReLU, and upsampling layers followed by a softmax function. The network takes the context representation Z and predicts a two-channel map M ∈ [0, 1]^{2×H×W}, where the two channel values indicate the probabilities of foreground and background. Then we take the maximum channel value at each pixel of M to obtain the initial query mask prediction M_init ∈ {0, 1}^{H×W}.

Figure 4 :
Figure 4: Visualization examples of the first four eigenvectors of our feature affinity matrix on the iSAID-5^i dataset. The eigenvectors correspond to semantic regions, with the first eigenvector usually identifying the most salient object in the image.

Figure 5 :
Figure 5: Pipeline of generating eigensegment from an image.

Table 3 :
Testing Classes for Threefold Cross-Validation

Table 4 :
Testing Classes for Threefold Cross-Validation

Table 5 :
Performance on iSAID-5^i in mIoU. Some results are reported in [38]. Numbers in bold indicate the best performance and underlined ones the second best. Superscript † denotes that Ẽ* is used instead of Ẽ_1.

Table 6 :
Performance on DLRSD-5^i in mIoU. ResNet50 is used as the backbone.

Table 10 :
Ablation study on the design of the Self-Matching Module. In the single-branch design, we use the same 4D conv kernels in both the Cross-Matching and Self-Matching Modules.