Image Segmentation with Vision-Language Models

Image segmentation traditionally relies on predefined object classes, which makes it difficult to accommodate new categories or complex queries and often necessitates model retraining. Methods that rely solely on visual information also depend heavily on annotated samples, and their segmentation performance declines significantly as the number of unknown classes increases. To address these challenges, this paper introduces ViLaSeg, an image segmentation model that generates binary segmentation maps for query images based on either free-text prompts or support images. Our model uses text prompts to establish comprehensive contextual and logical relationships, while visual prompts harness the GroupViT encoder to capture local features of multiple objects, enhancing segmentation precision. By employing selective attention and cross-modal interactions, our model fuses image and text features, which are further refined by a transformer-based decoder designed for dense prediction. ViLaSeg excels across a spectrum of segmentation tasks, including referring expression, zero-shot, and one-shot segmentation, surpassing prior state-of-the-art approaches.


INTRODUCTION
Image segmentation [12,24,27] is fundamental within computer vision, facilitating semantic-level processing, analysis, and comprehension of images. Achieving a high level of precision in this endeavor requires copious annotated data [19]. Nevertheless, manual annotation is labor-intensive and time-consuming, and working with a limited set of annotated data can introduce inaccuracies. Furthermore, accurately segmenting novel or unseen semantic categories with pre-trained models [3] designed for specific classes poses a substantial challenge. Consequently, researchers have been spurred to explore low-shot image segmentation.
Various methodologies have been developed to address the challenges posed by low-shot image segmentation [11,16,25]. Referring expression segmentation [26] entails training models using intricate textual queries and access to all available classes during the training phase. However, it exhibits limitations in generalizing to previously unseen classes. Zero-shot segmentation [16], on the other hand, deals with segmenting images belonging to classes not encountered during training, employing a repertoire of approaches including attribute-based, generative model-based, and transfer learning-based methods. One-shot segmentation [25] involves pixel classification with just a single labeled sample accessible within the training datasets. Notably, image generation-based techniques, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), alongside meta-learning methodologies, constitute the primary arsenal for one-shot segmentation.
Presently, low-shot segmentation strategies in computer vision predominantly center around the utilization of single images as input, underutilizing the latent potential of multimodal information. The integration of multimodal information into the segmentation process presents a substantial challenge. The recently introduced CLIPSeg model has achieved remarkable performance in tasks pertaining to image- and text-based segmentation. This model offers a unified framework accommodating referring expression segmentation, zero-shot segmentation, and one-shot segmentation [19]. Nonetheless, CLIPSeg adheres to a fixed format of text prompts (e.g., "a photo of ..."), potentially constraining its practical applicability for broader human language understanding. Additionally, the image encoder employed in CLIPSeg is designed to extract features for the entire image, which may not be optimal for segmentation tasks requiring the division of images into multiple regions, each necessitating an independent feature representation. Furthermore, the fusion of image and text is executed through interpolation, lacking the capacity for selective attention to salient features.
In this research paper, we introduce a new image segmentation model, ViLaSeg, which capitalizes on local features and attention mechanisms [13]. Our model offers a unified framework for learning both text and image feature representations for segmentation tasks, guided by text and image prompts. Diverging from prevailing methodologies, we employ the pre-trained CLIP model [22] as the backbone and enhance its visual encoder using GroupViT [32] to better capture local features within image regions. Simultaneously, we incorporate multiple forms of text prompts to augment its proficiency in text comprehension. Moreover, our model integrates multi-head self-attention modules to facilitate the fusion of image and text representations. This attention mechanism enables the adaptive selection of critical features, thereby bolstering recognition accuracy. Subsequently, we train a lightweight decoder atop the enhanced backbone, employing a binary prediction setting to discriminate between foreground and background regions aligned with the provided prompt.
This paper contributes to the field in three significant ways: (1) a unified prompt-driven framework that handles referring expression, zero-shot, and one-shot segmentation from either text or image prompts; (2) a GroupViT-enhanced visual encoder that captures local features of multiple image regions; and (3) a multi-head self-attention image-text fusion module that selectively attends to salient features of both modalities.

RELATED WORK

Vision-Language Pre-training.
Vision-language pre-training initially relied on curated datasets of limited scale, often necessitating fine-tuning for specific tasks. Recent studies [22] explored the benefits of using extensive noisy web data. CLIP, a representative work, excels in zero-shot image classification through a contrastive learning framework. GroupViT [32] builds upon CLIP by introducing a grouping mechanism along with learnable grouping vectors. We propose augmenting the CLIP model's segmentation performance by integrating GroupViT and employing text as a self-supervised signal for zero-shot segmentation tasks.

Referring Expression Segmentation.
Referring expression segmentation associates natural language referring expressions with image objects, necessitating precise language understanding and pixel-level segmentation. Previous works for this task include HULANet [30], a Mask R-CNN-based model that utilizes specialized modules for categories, attributes, and relationships to generate segmentation masks. Another approach, MDETR [11], an enhancement of DETR [4], is a CNN-Transformer model that predicts bounding boxes based on query prompts. Notably, referring expression segmentation does not require generalization to unseen object categories or understanding of visual support images.

Zero-shot Segmentation.
Zero-shot segmentation requires training the model with labeled seen classes and predicting unseen classes during testing by leveraging generalized image features. SPNet [31] and ZS3Net [3] achieve this by mapping visual features to semantic embeddings for knowledge transfer, while CSRL [10] preserves class relationships by merging visual and semantic features. CaGNet [9] introduces a context module using convolutional layers and attention mechanisms to capture diverse features. Recent methods adopt a joint embedding space approach using both visual and semantic encoders, with semantic prototypes serving as class centers [1].

One-Shot Semantic Segmentation.
One-shot segmentation trains a model to segment images with minimal annotated examples during training. Shaban et al. introduced the OSLSM model [25], using a Siamese network to extract features from labeled support and query images. Rakelly et al. [23] improved this model to enhance segmentation performance. Other models, such as PGNet [33], use attention maps to leverage support set information for segmentation enhancement, while CANet [34] optimizes segmentation results through a shared encoder and spatial pyramid pooling modules. Prototype modeling methods have also been explored: PFENet [28] utilizes high-level features to generate prior information, while MGNet [5] employs a pyramid structure to address scale invariance in image segmentation.

VILASEG METHOD
In this work, we crop and standardize the input image, and enhance the text prompts and the visual encoder based on CLIP (Figure 1). We fuse image and text features using multi-head self-attention and employ an efficient transformer decoder. Inspired by U-Net, we connect the decoder to the visual encoder by extracting activations from specific encoder layers (S = [3, 7, 9]) and projecting them to the decoder's token embedding size D. To guide the decoder toward the segmentation target, we use a FiLM module [8] to modulate the input activations with the conditional vector. The decoder produces a binary segmentation map by projecting the output of its final transformer block with a 64-dimensional projection. We train only the decoder for image segmentation while keeping the improved encoder frozen.
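A minimal PyTorch-style sketch of this decoder wiring is given below. Only the read-out layers S = [3, 7, 9], the 64-dimensional token size D, the FiLM conditioning, and the frozen-encoder setup follow the description above; the module names, the number of decoder blocks, and the way the skip activations are combined are illustrative assumptions, not the exact implementation.

import torch.nn as nn

class ViLaSegDecoder(nn.Module):
    """Lightweight transformer decoder sketch (hypothetical layer sizes)."""
    def __init__(self, enc_dim=768, d=64, cond_dim=512, n_heads=4, patches=22 * 22):
        super().__init__()
        # one read-out projection per tapped encoder layer (S = [3, 7, 9])
        self.read = nn.ModuleList([nn.Linear(enc_dim, d) for _ in range(3)])
        self.film = nn.Linear(cond_dim, 2 * d)          # FiLM: predicts scale and shift
        layer = nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(d, 1)                     # 64-d tokens -> per-token foreground logit
        self.patches = patches

    def forward(self, skips, cond):
        # skips: list of 3 frozen-encoder activations (B, N, enc_dim) from layers 3, 7, 9
        # cond: conditional vector (B, cond_dim) from the image-text fusion module
        x = sum(proj(a) for proj, a in zip(self.read, skips))   # U-Net-style aggregation
        gamma, beta = self.film(cond).chunk(2, dim=-1)          # FiLM modulation of the input
        x = gamma.unsqueeze(1) * x + beta.unsqueeze(1)
        x = self.blocks(x)
        logits = self.head(x).squeeze(-1)                        # (B, N) token-level logits
        s = int(self.patches ** 0.5)
        # keep patch tokens only and reshape to a coarse binary segmentation map
        return logits[:, :self.patches].reshape(-1, 1, s, s)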

Task-Driven Prompt
In terms of text prompts, the key lies in designing a well-structured template. CLIP utilizes manual templates to enable zero-shot inference. Given an input text x, a prompt function is employed to transform it into prompt form, x' = f_prompt(x), where f_prompt is a natural language sentence template with a blank position [x] that is filled by the input x. The blank can sit in the middle of the sentence (cloze prompt) or at its end (prefix prompt). CLIP adopts prefix prompts, and the position and number of prompts can affect the results. Thus, it is crucial to carefully consider the model's characteristics and task requirements, making suitable adjustments and optimizations as necessary.
In this study, we employ various prefix prompts for different tasks, such as "a photo of a [x].", "a photograph of a [x].", "an image of a [x].", "a cropped photo of a [x].", "a good photo of a [x].", "a photo of one [x].", "a bad photo of a [x].", "a photo of the [x].", and the plain "[x].". These additional task-specific descriptions before and after the input generally enhance task performance. Specifically, for the image segmentation task, we append "for segmentation task" to the input, establishing a complete logical relationship and providing the necessary context.
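The following small Python sketch illustrates how such a task-driven prompt can be assembled; the function name and signature are hypothetical, while the template strings and the "for segmentation task" suffix follow the text above.

# Hypothetical prompt builder: fills the blank position [x] of a prefix template
# and appends the segmentation-specific context.
TEMPLATES = [
    "a photo of a {x}.",
    "a photograph of a {x}.",
    "an image of a {x}.",
    "a cropped photo of a {x}.",
    "a good photo of a {x}.",
    "a photo of one {x}.",
    "a bad photo of a {x}.",
    "a photo of the {x}.",
    "{x}.",
]

def build_prompt(x: str, template: str = TEMPLATES[0],
                 task_suffix: str = "for segmentation task") -> str:
    """Fill the blank position [x] with the input and add the task context."""
    return f"{template.format(x=x)} {task_suffix}"

# Example: build_prompt("red umbrella") -> "a photo of a red umbrella. for segmentation task"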

Visual Encoder
CLIP uses the Vision Transformer, which is designed for image classification and relies on a single class token to capture global image features. For image segmentation, where multiple regions need independent feature representations, the grouping vision transformer (GroupViT) is introduced to represent features for multiple objects within an image. GroupViT extends the Vision Transformer with Grouping Blocks and learnable Group Tokens. The input consists of patch embeddings from the 352×352×3 image together with the learnable Group Tokens: each patch is 16×16, yielding a sequence of 484 tokens of dimension 768, while the Group Tokens have a dimension of 64×768. After 6 Transformer layers, the Group Tokens are well learned, denoted ĝ¹_i (i = 1, ..., 64), and ŝ¹_j (j = 1, ..., 484) denotes the patch tokens after passing through the Transformer layers. Grouping Blocks are then employed to merge existing groups into larger ones, which can be seen as performing a clustering assignment. The merged tokens ŝ², referred to as Segment Tokens, have a dimension of 64×768; merging shortens the sequence length, thereby reducing computational complexity and training time. In the second stage, 8 Group Tokens are used to merge similar categories, and the resulting tokens are passed through further Transformer layers, generating an output ŝ³ with a dimension of 8×768. At this stage, the text side still has a text feature of length 1, while the image side has a feature sequence of length 8, which cannot be aligned with the text feature. To resolve this, GroupViT applies average pooling to the image feature sequence, aggregating the output into a global image representation. In this manner, a 1×768 feature is obtained, aligning the image feature with the text feature.
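The shape flow of the two grouping stages can be sketched as follows. This is a simplified PyTorch illustration: the interleaved Transformer layers are omitted, and the actual GroupViT uses Gumbel-Softmax hard assignment rather than the plain softmax used here.

import torch
import torch.nn as nn

class GroupingBlock(nn.Module):
    """Simplified grouping block: distributes tokens over group slots and merges them."""
    def __init__(self, dim=768):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects group tokens (queries)
        self.k = nn.Linear(dim, dim)   # projects patch/segment tokens (keys)

    def forward(self, groups, tokens):
        attn = (self.q(groups) @ self.k(tokens).transpose(-2, -1)) / tokens.shape[-1] ** 0.5
        assign = attn.softmax(dim=1)   # each token is assigned across the group slots
        return assign @ tokens          # (B, n_groups, dim) merged tokens

# Shape walk-through with the dimensions given above (batch size 1):
patches = torch.randn(1, 484, 768)      # 22x22 patches of a 352x352 image
g1 = torch.randn(1, 64, 768)             # stage-1 learnable Group Tokens
s2 = GroupingBlock()(g1, patches)        # Segment Tokens, (1, 64, 768)
g2 = torch.randn(1, 8, 768)              # stage-2 learnable Group Tokens
s3 = GroupingBlock()(g2, s2)             # (1, 8, 768)
z = s3.mean(dim=1, keepdim=True)         # average pooling -> (1, 1, 768) global image feature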

Image-Text Fusion Module
The model receives the segmentation target information via a conditional vector. The text and image are first encoded using the Transformer and GroupViT, respectively, yielding the hidden representation T = (t_1, t_2, ..., t_l) of the text, T ∈ R^(l×512), and the hidden representation I' of the image. I' denotes the image feature map from the last layer of the GroupViT image encoder and has a dimension of 768, while the text's hidden representation has a dimension of 512. To match the dimensions, I' is converted to the same dimension as T by a learned linear projection, I = Linear(I'; θ_I).
In this case, I = (i_1, i_2, ..., i_h), I ∈ R^(h×512). To align and fuse the text and image features, T and I are merged into a new sequence C ∈ R^(h×2×512) along a new stacking axis. ViLaSeg employs multi-head self-attention for text-image fusion, mirroring human vision's selective focus on key features while disregarding irrelevant information, and fostering cross-modal interactions between image regions and text. The combined features are then projected into query, key, and value features following [7], denoted Q, K, and V, respectively:
C = stack(T, I),  Q = Linear(C; θ_Q),  K = Linear(C; θ_K),  V = Linear(C; θ_V)   (6)

Here, stack denotes the stacking of the two sequences along the new axis, and Linear denotes a fully connected layer with parameters θ. Since the number of attention heads is 8, the shared per-head dimension of the transformed features for both modalities is 64.
Taking the inner product QKᵀ between the transformed features Q and K gives the raw attention weights for the aggregated sequence features. These raw weights are then normalized by the square root of the per-head dimension d_k = 64 and the non-linear softmax(·) function, as in formula (7):

Z = softmax(QKᵀ / √d_k)   (7)
The matrix Z is used to capture the key features shared between each image region and the text. We denote the updated features as F = ZV, F ∈ R^(h×2×512), where V is the unweighted value feature that is weighted by the matrix Z. After this update, the sequence features are still not sufficient for segmentation, so we apply a mean function along the stacking axis to derive the conditional vector R, which forms the multi-modal representation: R = mean(F).
Here, mean(·) denotes the mean function and R ∈ R^(h×512) is the resulting conditional vector.
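Under the reading that h is the batch dimension (each sample contributes one 512-d text feature and one pooled 768-d image feature), the fusion module can be sketched in PyTorch as below. nn.MultiheadAttention internally performs the Q/K/V projections of Eq. (6) and the scaled softmax weighting of Eq. (7); the class name and the exact projection are illustrative assumptions.

import torch
import torch.nn as nn

class ImageTextFusion(nn.Module):
    """Sketch of the fusion module: the pooled GroupViT image feature (768-d) is
    projected to the 512-d text space, stacked with the text feature along a new
    axis, fused by 8-head self-attention, and averaged into a conditional vector."""
    def __init__(self, img_dim=768, dim=512, heads=8):
        super().__init__()
        self.proj = nn.Linear(img_dim, dim)                               # align I' with T
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # per-head dim 64

    def forward(self, t, i_prime):
        # t: (B, 512) text feature; i_prime: (B, 768) pooled image feature
        i = self.proj(i_prime)
        c = torch.stack([t, i], dim=1)   # C in R^(B x 2 x 512)
        f, _ = self.attn(c, c, c)        # F = ZV with Z = softmax(QK^T / sqrt(d_k))
        return f.mean(dim=1)             # conditional vector R in R^(B x 512)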

EXPERIMENTS

Datasets and Evaluation Metrics
Datasets. We use public computer vision datasets covering a diverse range of training and evaluation settings. Training uses the enriched PhraseCut+ dataset, an extension of PhraseCut from Visual Genome [4] augmented with visual support and negative samples [19]. Referring expression segmentation is tested on the same PhraseCut+ dataset. Zero-shot segmentation is evaluated on the Pascal-VOC 2012 [21] and PASCAL-Context [16] datasets. One-shot segmentation is assessed on the Pascal-5i and COCO-20i [17] datasets, offering comprehensive evaluations with 20 and 80 object classes, respectively.
Evaluation Metrics. We use mIoU, foreground IoU (IoU_FG), binary IoU (IoU_BIN), and Average Precision (AP) as evaluation metrics. mIoU [25] measures the average foreground IoU across different classes, reflecting the model's generalization ability and prediction quality. IoU_FG calculates the IoU only on foreground pixels. IoU_BIN [7] calculates the average IoU over foreground and background without considering image classes. AP measures the area under the Recall-Precision curve and evaluates the system's ability to distinguish between matches and non-matches.
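A small sketch of the two IoU variants described above; the metric names IoU_FG and IoU_BIN and the helper names are our notation rather than benchmark code.

import numpy as np

def foreground_iou(pred, gt):
    """IoU computed on foreground pixels only (pred, gt: boolean masks)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0   # convention: empty union counts as a perfect match

def binary_iou(pred, gt):
    """Average of foreground and background IoU, ignoring the image's class."""
    return 0.5 * (foreground_iou(pred, gt) + foreground_iou(~pred, ~gt))

# mIoU then averages foreground_iou over the classes (or folds) of the benchmark,
# and AP is the area under the precision-recall curve of the raw foreground scores.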

Models and Baselines
As a comprehensive study, we evaluate our model across referring expression, zero-shot, and one-shot segmentation tasks. Additionally, to better understand our proposed techniques, we introduce two baselines: 1) NoGroupViT, which shares the architecture of ViLaSeg but uses an ImageNet-trained vision transformer as the backbone; text encoding relies on the same text transformer as GroupViT, allowing us to assess the importance of the GroupViT weights. 2) NoFusion, which incorporates GroupViT but removes the image-text fusion layer, instead using interpolation for feature fusion; this helps evaluate the significance of the multi-modal fusion module.

Comparison to State-of-the-art Methods
Table 1 and Table 2 assess the models' zero-shot segmentation performance on the Pascal-VOC [21] and Pascal-Context [16] datasets. While OSR excels on seen classes, ViLaSeg and CLIPSeg perform better on unseen classes. Additionally, ViLaSeg outperforms CLIPSeg, NoGroupViT, and NoFusion on both seen and unseen classes, emphasizing the importance of our techniques that exploit local features and attention mechanisms.
In contrast to zero-shot segmentation, one-shot segmentation involves comprehending text prompts and annotated support images. We use the same visual prompt methods as CLIPSeg, including object cropping, background blur, and darkening, to aid comprehension (sketched below). Figure 2 illustrates the one-shot segmentation results of ViLaSeg and the two baselines on the PASCAL-5i and COCO-20i datasets, using mIoU as the evaluation metric. ViLaSeg performs competitively on both datasets, NoGroupViT exhibits a significant performance drop, and NoFusion, which uses interpolation instead of multi-head attention for image-text fusion, significantly underperforms ViLaSeg.
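A rough sketch of this kind of visual prompt engineering is given below; the blur radius, darkening factor, and loose cropping strategy are illustrative choices, not the exact values used in our experiments.

import numpy as np
from PIL import Image, ImageFilter

def visual_prompt(img, mask, blur_radius=10, darken=0.3):
    """Build a support-image prompt: blur and darken the background and crop
    loosely around the annotated object (mask: non-empty binary numpy array)."""
    bg = img.filter(ImageFilter.GaussianBlur(blur_radius))          # blurred background
    bg = Image.fromarray((np.asarray(bg) * darken).astype(np.uint8))  # darkened background
    m = Image.fromarray((mask * 255).astype(np.uint8)).convert("L")
    out = Image.composite(img, bg, m)    # keep the object pixels, suppress the rest
    return out.crop(m.getbbox())         # crop to the object's bounding box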
ViLaSeg is also compared with several advanced one-shot and few-shot models (e.g., PPNet, RePRI, PFENet, and HSNet). Table 3 and Table 4 present one-shot segmentation results on the Pascal-5i and COCO-20i datasets. ViLaSeg demonstrates competitive performance on Pascal-5i compared with state-of-the-art methods, and the results on COCO-20i indicate that the model also works well on datasets other than PhraseCut+, although HSNet and PFENet demonstrate superior performance there. This difference can be attributed to HSNet's explicit design for one-shot segmentation and its reliance on pre-trained convolutional neural networks [14,29]. When PFENet is used for zero-shot segmentation following the one-shot segmentation protocol, ViLaSeg performs well and outperforms CLIPSeg trained with the same method, as shown in Table 5.

Referring Expression Segmentation Comparative Experiment.
The evaluation results of referring expression segmentation on the PhraseCut+ dataset are presented in Table 6. Our method demonstrates superior performance compared to the two-stage HULANet method and the CLIPSeg method. However, it performs slightly worse than MDETR in terms of mIoU. It is important to note that MDETR operates at the full image resolution and undergoes two rounds of fine-tuning on the PhraseCut dataset. Furthermore, the performance of the NoGroupViT baseline is consistently lower than that of ViLaSeg, highlighting the effectiveness of GroupViT pre-training.

CONCLUSION
In our paper, we have introduced ViLaSeg, a highly adaptable image segmentation model capable of seamlessly adjusting to new tasks through the utilization of text or image prompts during inference.
Our approach goes further by enhancing both the text prompts and the visual encoder, providing critical contextual information for segmentation and improving feature expressiveness. Moreover, we have introduced an image-text fusion layer designed to attend selectively to pivotal features in both text and images, further elevating overall segmentation performance. A significant advantage of our method is its flexibility to accommodate new tasks with the aid of prompts, eliminating the need for costly and time-consuming retraining on fresh data. This streamlines model adaptation and opens up new possibilities for prompt-based segmentation models within the domain of image segmentation.
Limitations. Our experiments are limited to a few benchmarks; in future work, more modalities such as sound and touch could be incorporated. Additionally, our model is primarily designed for images, and applying it to video may result in a lack of temporal consistency. The input image size should be around 350×350 pixels, as going significantly larger or smaller can negatively impact accuracy.

Figure 1: The architecture of ViLaSeg: We extend a frozen CLIP model (yellow and green) with a transformer that segments the query image based on either a support image or a support text prompt, fusing features through multi-head self-attention (blue).
[x]. "; "a photograph of a [x]."; "an image of a [x]."; "[x].";"a cropped photo of a [x]."; "a good photo of a [x]."; "a photo of one [x].";"a bad photo of a [x]."; "a photo of the [x] CLIP uses the Vision Transformer, designed for image classification with a single class token for global image features.In image segmentation, where multiple regions need independent feature representations, grouping vision transformer(GroupViT) is introduced to represent features for multiple objects within an image.The GroupViT enhances the Vision Transformer model with Grouping Blocks and learnable Group Tokens.The input includes Patch Embeddings from the 352×352×3 original image and learnable Group Tokens.Each patch is 16×16, yielding 484 sequences of size 768.The Group Tokens have dimensions of 64×768.After 6 Transformer layers, they are well-learned and represented as ĝ1  64 =1 .Following this, ŝ1  484

Table 1: Zero-shot segmentation performance on Pascal-VOC. Performance is reported separately for seen and unseen classes. Our model is trained on PhraseCut+ with the Pascal classes removed, but it utilizes a pre-trained GroupViT backbone. Unseen classes are not encountered during training, with the following number indicating the count of unseen classes. IN-seen indicates ImageNet pre-training with the unseen classes removed.


Table 3: One-shot performance on Pascal-5i.

Table 5: Zero-shot performance on Pascal-5i. The scores were obtained by following the evaluation protocol of one-shot segmentation but using text input.