Research Article (Open Access)

Disentangle Saliency Detection into Cascaded Detail Modeling and Body Filling

Published: 05 January 2023


Abstract

Salient object detection has long been studied to identify the most visually attractive objects in images/videos. Recently, a growing number of approaches have been proposed that rely on contour/edge information to improve detection performance. The edge labels are either put into the loss directly or used as extra supervision. The edge and body can also be learned separately and then fused afterward. These methods either lead to high prediction errors near the edge or cannot be trained in an end-to-end manner. Another problem is that existing methods may fail to detect objects of various sizes due to the lack of efficient and effective feature fusion mechanisms. In this work, we propose to decompose the saliency detection task into two cascaded sub-tasks, i.e., detail modeling and body filling. Specifically, detail modeling focuses on capturing the object edges under the supervision of an explicitly decomposed detail label that consists of the pixels on and near the edge. Body filling then learns the body part, which is filled into the detail map to generate a more accurate saliency map. To effectively fuse the features and handle objects at different scales, we also propose two novel multi-scale detail attention and body attention blocks for precise detail and body modeling. Experimental results show that our method achieves state-of-the-art performance on six public datasets.


1 INTRODUCTION

The Human Visual System (HVS) has the innate ability to capture salient objects from visual scenes rapidly and without training [31]. Salient Object Detection (SOD) aims to simulate the HVS by detecting the distinctive regions or objects on which people tend to focus [2, 34]. In the past decades, it has attracted much interest from research communities, mainly because it can efficiently find objects or regions that represent a scene, a useful step in downstream computer vision tasks. Saliency detection models have evolved from traditional hand-engineered approaches based on different saliency cues (e.g., global contrast [6], background prior [42], and spectral analysis [16]) to methods based on Fully Convolutional Neural Networks (FCNs) [21].

Although FCN-based solutions [15, 20, 25, 27, 29, 36, 37, 43, 47, 50] have made remarkable progress so far, two main challenges remain: (i) the pixels near the object edge have a very imbalanced distribution, which makes them harder to predict than non-edge pixels. Existing saliency detection models usually produce large prediction errors for pixels close to the object boundary [37]; (ii) most saliency detection methods build on the encoder-decoder framework and develop different strategies to aggregate multi-scale features for better representation. However, due to the lack of effective mechanisms to fuse multi-scale or multi-level features, the generated saliency maps may fail to accurately capture objects at different scales. Because of these two issues, existing methods might fail to generate accurate saliency maps with sharp boundaries and coherent details (see Figure 1).

Fig. 1.

Fig. 1. (Top) We disentangle the task of salient object detection into cascaded detail modeling and body filling. The proposed multi-scale attention blocks polish the features passed via short connections and help the network attend to the salient regions. (Bottom) Qualitative comparison between our method and two recent state-of-the-art methods LDF [37] and GateNet [50]. Our approach can precisely segment objects of various sizes with subtle details.

For the first problem, many methods introduce boundary information as extra supervision to improve prediction performance [12, 19, 27, 29, 40, 47]. However, the introduced edge label only indicates the pixels on the edge. Its direct use as supervision can decrease the global saliency prediction error but degrades the prediction performance near the edge [37]. More recently, Wei et al. [37] proposed to explicitly decompose the ground truth saliency label into a body label and a detail label. The decomposed detail label consists of both edge pixels and nearby pixels, which makes full use of pixels near the edge and thus has a more balanced pixel distribution. The decoupled body and detail maps are used to train two separate network branches, and a feature fusion branch is needed to combine the two streams into the final saliency map. Their architecture involves two training stages, i.e., training the detail and body branches until they output good body/detail maps, and then training the fusion module. It is non-trivial to control the two stages and train the model in an end-to-end manner.

For the second problem, some methods pass the features at the corresponding level in the encoder to the decoder via different connection pathways to leverage multi-level context information [4, 12, 22, 27, 34, 38, 39, 43, 46, 51]. Without a proper processing mechanism, the representation power of details in a single shallow layer may be weakened or disturbed by deeper features carrying high-level semantic information. To utilize multi-scale features more effectively, some methods pass multi-layer features to a single decoder layer in a fully connected or heuristic manner [15, 33, 44]. However, this kind of solution suffers from a huge computational burden and fusion difficulties caused by the excessive number of features and their resolution gap. A mechanism that can effectively and efficiently fuse multi-level features without losing representation power at different scales is still lacking.

To address both issues, we propose a novel solution: a cascaded framework that disentangles the traditional SOD task into two sub-tasks. The first sub-network is forced to generate the detail map under the supervision of the decomposed detail label proposed in Reference [37]; the second sub-network then takes in the detail map, the detail feature, and the image to generate the body map and fuse it into the final saliency map. The proposed framework explicitly divides the original task into two cascaded sub-tasks, each of which has its own specific target. This reduces the difficulty of directly predicting the whole saliency map. Besides the framework, we also propose two novel multi-scale attention blocks that fuse features at different levels and detect objects of various sizes. The proposed blocks enrich the fused feature with multi-scale attention, which can effectively fuse two or three multi-level features and ease the detection of multi-scale objects. In addition, we suggest a hybrid loss setting whose terms target the accurate generation of each map and complement each other well. Our model is trained in an end-to-end fashion and has a reasonable inference speed of 20 FPS on a single GPU. The proposed model is thoroughly validated under four metrics across six public benchmark datasets to demonstrate its superior performance.

In summary, the main contributions of the article are as follows:

  • We propose a novel cascaded saliency detection framework that first produces the detail map of the object and then generates an accurate saliency map by filling the detail map with the body map. The proposed framework reduces the difficulty of directly predicting the whole saliency map and can be trained efficiently in an end-to-end manner.

  • We propose two novel multi-scale attention blocks that can attentively fuse multiple features at multiple scales for precise detail and body map generation. We also suggest a hybrid loss setting whose terms specifically target the detail and body maps and complement each other.

  • Our proposed model achieves state-of-the-art performances against 10 most recent state-of-the-art methods on six benchmark datasets under four widely used metrics. Extensive ablation studies are also conducted to demonstrate the effectiveness of each proposed module.


2 RELATED WORK

Early saliency detection methods in the hand-engineered era mainly rely on various saliency cues, including global or local contrast [6, 11], background prior [42], and spectral analysis [13, 16]. Due to page limits, readers are kindly referred to Reference [2] for a detailed review. Here, we recap modern approaches from the deep learning era. These FCN-based methods can be broadly divided into two families as follows:

Aggregation-based Models. Most modern saliency detection models are based on the encoder-decoder framework to integrate multi-level features and leverage contextual information across different layers [3, 4, 12, 15, 22, 25, 27, 30, 33, 36, 38, 39, 43, 44, 45, 46]. The encoder is often used to extract multi-level features from the image, and the decoder is designed to effectively combine the features and predict the saliency map. Over the past years, researchers have developed numerous feature fusion mechanisms and feature connection pathways for better representation. Liu et al. [20] proposed a hierarchical pixel-wise contextual attention network to learn the local and global context for each pixel. Zhao et al. [49] utilized channel attention for high-level representations and spatial attention for low-level feature maps to improve detection performance. Our two proposed attention blocks share some similarities with Reference [49] but are fundamentally different in three aspects. First, we do not distinguish channel attention or spatial attention for high-level or low-level features. Instead, we consistently use the combination of global and local attention for all the feature maps to be fused, regardless of their layer. Second, our attention blocks work at multiple scales, and the fused representation selects the information needed at a certain scale. Last, we have a different architecture design that can take in two or three feature streams.

Edge-guided Models. In recent years, increasingly more approaches incorporate edge/contour information to assist the SOD task and improve detection performance [12, 19, 27, 29, 37, 40, 44, 47, 49]. Zhao et al. [47] used the edge label to supervise low-level feature maps to give the network the capacity to model edge information. More recently, Wei et al. [37] proposed to explicitly decompose the ground truth label into a detail label, which consists of pixels on and near the edge, and a body label, which concentrates on the pixels far from the edge. The two decoupled labels are used to supervise two branches in the first iteration, and a second iteration is still needed to train the fusion module for combining the results. Although we also decompose the original label, only the detail label is used to supervise intermediate results in our method. Another key difference is that our model is cascaded and can be trained end-to-end efficiently.


3 METHODOLOGY

We start by introducing how the detail label is decomposed, then describe each part of the model in detail, and end with the loss function.

Detail Label Generation. As stated before, the pixels near the edge are hard to predict and prone to misclassification. In the saliency detection task, the ground truth label is often binary, and all the pixels in the salient regions share a unique value. Inspired by Reference [37], we explicitly decompose the detail label from the ground truth label and use it for our first sub-task, detail modeling. More specifically, given the ground truth label \( G \), we use the distance transformation to convert the original label into the detail label, in which each pixel in the original salient regions is defined by its minimum distance to the object boundary. This distance transformation process can be described as (1) \( \begin{equation} G_{\rm detail}(p,q)={\left\lbrace \begin{array}{ll} |G(p,q)-E(p,q)|, & G(p,q)=1,\\ 0, & G(p,q)=0, \end{array}\right.} \end{equation} \)

where \( G_{\rm detail} \) represents the detail label, and \( E(p,q) \) denotes the salient edge point that has the minimum Euclidean distance to saliency pixel \( G(p,q) \). Figure 3 displays two examples of decomposed detail labels. After decoupling the detail label, it will be used to supervise the left sub-network in Figure 2 to detect detail points close to the edge.
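To make the decomposition concrete, the following is a minimal NumPy sketch of one plausible way to build such a detail label from a binary ground truth mask. The brute-force distance computation, the normalization to [0, 1], and the function name `detail_label` are our own assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def detail_label(gt):
    """Decompose a binary GT mask into a detail label (sketch).

    Each foreground pixel's value decays with its distance to the object
    boundary: largest on the edge, near zero deep inside, background 0.
    Brute-force distance computation; fine for small illustrative masks.
    """
    gt = gt.astype(np.float32)
    fg = np.argwhere(gt > 0)
    bg = np.argwhere(gt == 0)
    if len(fg) == 0 or len(bg) == 0:
        return np.zeros_like(gt)
    # min Euclidean distance from each foreground pixel to the background
    d = np.sqrt(((fg[:, None, :] - bg[None, :, :]) ** 2).sum(-1)).min(1)
    body = np.zeros_like(gt)
    body[tuple(fg.T)] = d / d.max()  # normalized body weights
    return gt - body                 # detail = GT minus (normalized) body
```

As in Figure 3, the resulting map peaks on the boundary and vanishes both in the background and deep inside the object, giving a more balanced pixel distribution than a pure edge label.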

Fig. 2.

Fig. 2. Overview of the proposed framework. The left sub-network consists of a backbone network and the proposed multi-scale detail attention decoder. The detail decoder takes in the feature flow transmitted from the backbone and outputs the detail map. Subsequently, the detail encoder absorbs the input image, the generated detail map, and the detail flow from the last MDAB. The fused feature is exploited by the detail encoder to produce a new detail flow for the body decoder. Then, we feed the proposed multi-scale body attention decoder with the detail and feature streams, where the contextual information is leveraged to predict the body map. The final saliency map is obtained by summing up the detail and body maps.

Fig. 3.

Fig. 3. Some examples of decoupled detail labels. The pixels in the decomposed label have larger values closer to the edge and smaller (even zero) values farther from the edge, which yields a more balanced distribution than a pure edge label.

Feature Extractor. Similar to existing models [36, 37, 47], we use ResNet-50 [14] as our backbone network. We choose ResNet-50 because it has a moderate model size and reasonable feature extraction power. Larger models can lead to better performance but slow down training and inference. The last fully connected layer and the average pooling layer are removed, and we keep only the convolutional blocks. These blocks generate feature maps at five different scales \( \lbrace F_{i}|i{=}1,\dots ,5\rbrace \), where the resolution is down-sampled by two between subsequent blocks. As pointed out in Reference [39], the representation \( F_{5} \) at the shallowest layer contains too much coarse and redundant information. This increases the computational burden dramatically but brings little performance improvement. Hence, we abandon this layer and use only the remaining finer features \( \lbrace F_{i}|i{=}1,\dots ,4\rbrace \). Two sets of the features will be passed to the detail decoder and the body decoder, respectively, to assist their tasks.

3.1 Multi-scale Detail Attention Modeling

The detail decoder takes the image features as input to fulfill the task of detail generation. It consists of three Multi-scale Detail Attention Blocks (MDABs). As shown in Figure 4 (left), each MDAB absorbs both the detail flow from the preceding block and the feature flow from the backbone encoder at the corresponding scale, which can be denoted as (2) \( \begin{equation} D_{i+1}={\rm MDAB}_{i}(D_{i},F_{i}),\ i=2,3,4, \end{equation} \) where \( F_{i} \) denotes the feature flow at the same level, \( D_{i} \) stands for the current detail flow, and \( D_{i+1} \) represents the new detail flow to be passed to the next block. At the last MDAB, we use a \( 3\times 3 \) convolution layer and a sigmoid gate to extract the detail map from the final detail flow. This operation can be denoted as \( S_{\rm detail} = \sigma ({\rm Conv}(D_{5})) \), where \( \sigma (\cdot) \) denotes the sigmoid gate, and \( {\rm Conv}(\cdot) \) represents the convolution operation. The MDAB mainly comprises three attention units, each of which calculates the combination of local and global attention at one scale. The detailed architecture of the proposed attention unit is illustrated in Figure 5. Inside the attention unit, the detail flow and feature flow first go through a representation sampler to filter out useless noise and keep only the informative features, (3) \( \begin{equation} F^{att}_{i}={\rm ReLU}({\rm Conv}(D_{i})+{\rm Conv}(F_{i})), \end{equation} \) where the ReLU gate actively polishes the representation summed from the detail flow and feature flow. Then the raw feature attention \( F^{att}_{i} \) is fed to two branches to obtain the corresponding local and global attention, (4) \( \begin{equation} F^{att}_{i}=\sigma \left({\rm Conv}\left(F^{att}_{i}\right)\right)+\sigma \left({\rm GAP}\left({\rm Conv}\left(F^{att}_{i}\right)\right)\right), \end{equation} \) where \( {\rm GAP} \) denotes the global average pooling module.
Assuming the feature \( F_{i} \) has size \( B\times C\times H \times W \), the first term calculates the local spatial attention of size \( B\times 1 \times H \times W \), and the second term computes the global channel-wise attention of size \( B\times C\times 1\times 1 \). As can be seen from the left part of Figure 4, the feature and detail flows pass through three attention units after different pooling layers. We argue that the strategy of pooling with different kernels before the attention unit enables the feature to automatically search for useful information at a certain scale, which benefits the identification of salient objects of various sizes. Notice that the representation samplers of the different attention units share weights, as we expect the sampler to filter out the noise of the fused flow, a function that should be robust against scale variation. In the end, we obtain a new attentive detail flow that can identify both the object and its sharp boundaries, (5) \( \begin{equation} D_{i+1}={\rm Conv}\left(D_{i}+\sum _{t=1}^{3}\left(F_{i(t)}\odot F_{i(t)}^{att}\right)\right), \end{equation} \) where \( \odot \) represents element-wise multiplication, and \( t \) indexes the scales used in the MDAB.
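The attention unit of Eqs. (3) and (4) can be sketched in PyTorch as follows. The channel width, kernel sizes, and module names are our assumptions; this illustrates the local-plus-global attention pattern described above, not the authors' code:

```python
import torch
import torch.nn as nn

class AttentionUnit(nn.Module):
    """Local + global attention of Eqs. (3)-(4), sketched for C channels."""
    def __init__(self, c):
        super().__init__()
        # representation sampler: one conv per incoming stream (Eq. (3))
        self.conv_d = nn.Conv2d(c, c, 3, padding=1)
        self.conv_f = nn.Conv2d(c, c, 3, padding=1)
        # local branch -> spatial attention of size B x 1 x H x W
        self.local = nn.Conv2d(c, 1, 3, padding=1)
        # global branch -> channel attention of size B x C x 1 x 1
        self.glob = nn.Conv2d(c, c, 1)
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, d, f):
        x = torch.relu(self.conv_d(d) + self.conv_f(f))   # Eq. (3)
        att = torch.sigmoid(self.local(x)) \
            + torch.sigmoid(self.gap(self.glob(x)))       # Eq. (4)
        return att  # broadcasts to B x C x H x W when multiplied
```

Multiplying `att` with the (pooled) feature flow gives one attentive term of Eq. (5); the MDAB sums three such terms computed at different pooling scales.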

Fig. 4.

Fig. 4. (Left) Workflow of our MDAB. The feature and detail streams are processed by three attention units to leverage contextual information and produce multi-scale attention for feature flow. Then the attentive feature flow is combined with detail flow for the generation of new detail flow. (Right) Pipeline of the proposed MBAB. We use three more attention units at different scales to generate attentive feature flow. The new body flow comes out of the summation of attentive detail flow, attentive feature flow, and original body flow.

Fig. 5.

Fig. 5. Architecture of the attention unit. The attention unit serves as a basic element for our MDAB and MBAB. Depending on the number of input streams, the representation sampler may have a different number of convolution blocks. The unit generates the combination of global and local attention for either the feature or the detail flow.

3.2 Multi-scale Body Attention Filling

After the left detail decoder generates the detail map, the detail encoder in the right part of Figure 2 is fed with the input image, the detail flow, and the detail map to extract a new detail flow \( \lbrace D^{^{\prime }}_{i}|i = 1,2,3,4\rbrace \) for the task of body filling. Similar to the backbone network, the detail encoder passes the detail flows to the body decoder via both the normal path and short connections. Afterward, the body decoder takes in the image feature and detail feature to generate the body map. The body decoding task is completed by three consecutive Multi-scale Body Attention Blocks (MBABs). Each MBAB absorbs three streams: the feature flow from the backbone network, the detail flow from the detail encoder, and the body flow from the previous block. This procedure can be represented as (6) \( \begin{equation} B_{i+1}={\rm MBAB}(B_{i},F_{i},D^{^{\prime }}_{i}),\ i=2,3,4, \end{equation} \) where \( B_{i} \) is the current body flow, and \( B_{i+1} \) is the new body flow passed to the next block. After the three consecutive MBABs, we extract the body map and fill it into the detail map to generate the final saliency prediction \( S=S_{\rm detail}{+}S_{\rm body} \). As we can see in Figure 4 (right), our proposed MBAB resembles the MDAB, with the main difference being three additional attention units dedicated to the detail flow. The detail and feature flows with multi-scale attention are fused with the body flow to create the new body stream. We can simply denote the MBAB workflow as (7) \( \begin{equation} B_{i+1}={\rm Conv}\left(B_{i}+\sum _{t=1}^{3}\left(F_{i(t)}\odot F_{i(t)}^{att}+D^{^{\prime }}_{i(t)}\odot D_{i(t)}^{^{\prime }att}\right)\right). \end{equation} \)
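The multi-scale fusion inside Eq. (7) (without the outer convolution) might be sketched as below. The pooling kernel sizes, the bilinear upsampling back to the body resolution, and the generic `att` callable are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def mbab_fuse(body, feat, detail, att, scales=(1, 2, 4)):
    """Sum of Eq. (7) inside the Conv: at each scale t, weight the pooled
    feature and detail flows by their attention maps, upsample back, and
    accumulate onto the body flow. `att(a, b)` returns an attention map
    broadcastable to the shape of `b`.
    """
    h, w = body.shape[-2:]
    out = body
    for s in scales:
        f_s = F.avg_pool2d(feat, s) if s > 1 else feat
        d_s = F.avg_pool2d(detail, s) if s > 1 else detail
        f_att = f_s * att(d_s, f_s)   # attentive feature flow
        d_att = d_s * att(f_s, d_s)   # attentive detail flow
        out = out + F.interpolate(f_att + d_att, size=(h, w),
                                  mode='bilinear', align_corners=False)
    return out
```

The MDAB of Eq. (5) follows the same pattern with only the feature stream attended and the detail flow as the accumulator.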

3.3 Hybrid Loss Function

As we use both the ground truth label and our decomposed detail label to train the network, we design different loss settings for the two tasks. For the detail output, the loss function is defined as (8) \( \begin{equation} l_{\rm detail}=l_{\rm CE}(S_{\rm detail},G_{\rm detail}) + l_{\rm SSIM}(S_{\rm detail},G_{\rm detail}), \end{equation} \) where the first term is the commonly used cross-entropy loss, and the second term is the structural similarity loss, which enforces the detail decoder to focus on the edges. The Structural Similarity Index Measure (SSIM) [35] was originally used to calculate the similarity of a pair of images by assessing their structural information. Motivated by Reference [27], we integrate this loss to let the detail output learn the structural information of the image and keep precise edges. This loss is calculated as (9) \( \begin{equation} l_{\rm SSIM}=1-\frac{(2\mu _{x}\mu _{y}+C_{1})(2\sigma _{xy}+C_{2})}{(\mu _{x}^2+\mu _{y}^{2}+C_{1})(\sigma _{x}^{2}+\sigma _{y}^{2}+C_{2})}, \end{equation} \) where \( \mu _{x} \), \( \mu _{y} \) and \( \sigma _{x} \), \( \sigma _{y} \) are the means and standard deviations of the two compared maps \( x \) and \( y \), respectively, and \( \sigma _{xy} \) is their covariance. \( C_{1} \) and \( C_{2} \) are small positive constants, which we set to \( 0.01^2 \) and \( 0.03^2 \) to avoid division by zero. As for the body mask prediction, we have the following loss configuration: (10) \( \begin{equation} l_{\rm body}=l_{\rm CE}(S,G) + l_{\rm IoU}(S,G)+l_{\rm F}(S,G), \end{equation} \) where the second term is the IoU loss, which helps the body decoder quickly attend to the main body of the object, as adopted in References [24, 27, 28]. This loss can be computed as (11) \( \begin{equation} l_{\rm IoU}=1-\frac{\sum _{i=1}^{H}\sum _{j=1}^{W}S(i,j)G(i,j)}{\sum _{i=1}^{H}\sum _{j=1}^{W}(S(i,j)+G(i,j)-S(i,j)G(i,j))}. \end{equation} \)
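Eqs. (9) and (11) can be written down directly in NumPy. The sketch below computes SSIM globally over the whole map (practical implementations usually use a sliding window instead) and uses the soft IoU of Eq. (11); both function names are our own:

```python
import numpy as np

def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Eq. (9): 1 - SSIM, computed globally over the two maps."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                  # sigma_x^2, sigma_y^2
    cov = ((x - mx) * (y - my)).mean()         # sigma_xy
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
    return 1.0 - ssim

def iou_loss(s, g):
    """Eq. (11): 1 - soft intersection over union."""
    inter = (s * g).sum()
    union = (s + g - s * g).sum()
    return 1.0 - inter / union
```

Both losses vanish for a perfect prediction and grow as the structural or regional overlap degrades, which is what lets them complement the pixel-wise cross-entropy term.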

The third term in Equation (10) is the so-called F-loss [48], which directly optimizes the F-measure metric and is defined as \( l_{\rm F} = 1{-}F(S,G) \). We expect that adopting this loss balances the detail and body maps so that they complement each other's information, by pushing their fused mask toward a high F-measure score. In total, our model is trained end-to-end using the hybrid loss function: (12) \( \begin{equation} l=\frac{1}{2}(l_{\rm detail}+l_{\rm body}). \end{equation} \)


4 EXPERIMENTS

Datasets. Following References [36, 37, 39, 40], we conduct extensive experiments on six widely used benchmark datasets to evaluate the effectiveness of the proposed method, i.e., ECSSD [41], PASCAL-S [18], DUT-OMRON [42], HKU-IS [17], THUR15K [5], and DUTS [32].

Specifically, ECSSD [41] contains 1,000 structurally complex natural images. PASCAL-S [18] consists of 850 images with cluttered backgrounds chosen from the validation set of the PASCAL-VOC segmentation dataset [8]. DUT-OMRON [42] has 5,168 images with high content variety. HKU-IS [17] has 4,447 images containing mostly multiple disconnected objects. THUR15K [5] consists of 6,232 diverse and heterogeneous images categorized into several groups. Among these datasets, DUTS [32] is currently the largest saliency detection dataset consisting of two subsets: DUTS-TR contains 10,553 images for training and DUTS-TE has 5,019 images for testing.

Implementation Details. In line with most existing methods [25, 27, 36, 37, 50], we use the DUTS-TR dataset for training and the rest of the datasets as test sets for evaluation. A ResNet-50 [14] classifier pre-trained on ImageNet [7] is used as the backbone to initialize the model, and the other parameters are randomly initialized. Our network is trained end-to-end for 50 epochs with a mini-batch size of 32 by stochastic gradient descent (SGD). The momentum and weight decay are set to 0.9 and 0.0005, respectively. We set the maximum learning rate to 0.005 for the ResNet-50 backbone and 0.05 for the other parts. Warm-up and linear decay strategies are also used.

During training, we use random horizontal flip, random crop, and multi-scale input images for data augmentation. The images are resized to the resolution of \( 352\, \times \, 352 \) during testing and fed into the network to generate the saliency prediction without any post-processing step. Resizing with bilinear interpolation is consistently used throughout all the experiments. The proposed model achieves the inference time of 20 FPS on a single Quadro RTX 6000 GPU.

Evaluation Metrics. We use four widely used metrics to evaluate the proposed method, i.e., Mean Absolute Error (MAE) [26], mean F-measure (m \( F_{\beta } \)) [1], weighted F-measure (\( F_{\beta }^{\omega } \)) [23], and the precision-recall curve. Specifically, \( MAE \) [26] calculates the absolute per-pixel difference between the saliency prediction and its ground truth, (13) \( \begin{equation} MAE=\frac{1}{W\times H}\sum _{i=1}^{W}\sum _{j=1}^{H}|{S(i,j)-G(i,j)}|, \end{equation} \) where \( W \) and \( H \) are the width and height of the mask, respectively, \( S \) denotes the predicted saliency map, and \( G \) represents the ground truth. As the most fundamental and direct measure, \( MAE \) has been widely applied to evaluate the quality of saliency maps [25, 27, 36, 37, 50]. The generated saliency map \( S \) is first converted into a binary map using a threshold and is compared with the ground truth \( G \) to compute the precision and recall scores: (14) \( \begin{equation} \begin{gathered}precision=\frac{|S\cap G|}{|S|},\ recall=\frac{|S\cap G|}{|G|}. \end{gathered} \end{equation} \)

The precision-recall curve is plotted by varying the binarization threshold from 0 to 255 to obtain a sequence of precision-recall pairs. The larger the area under the PR curve, the better the performance of the model. The F-measure \( F_{\beta } \) and its weighted variant \( F_{\beta }^{\omega } \) jointly assess the saliency prediction by taking both precision and recall into consideration. The basic F-measure can be formulated as (15) \( \begin{equation} F_{\beta }=\frac{(1+\beta ^2) { precision}\times { recall}}{\beta ^{2} { precision} + { recall}}, \end{equation} \) where \( \beta \) is the relative weight that controls the importance of precision and recall. \( \beta ^{2} \) is usually set to 0.3 to give a larger weight to precision, as suggested in Reference [1]. The mean F-measure (m \( F_{\beta } \)) is computed by taking the mean value of the F-measure over the PR curve. The weighted F-measure [23] is an intuitive generalization of the F-measure for non-binary maps, which is defined as (16) \( \begin{equation} F_{\beta }^{\omega }=\frac{(1+\beta ^2){ precision}^{\omega }\times { recall}^{\omega }}{\beta ^{2}{ precision}^{\omega }+{ recall}^{\omega }}. \end{equation} \)
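As a quick illustration of Eqs. (13)-(15), here is a NumPy sketch of MAE and the F-measure at a single threshold (the function names, the 0.5 threshold, and the epsilon guards are our choices):

```python
import numpy as np

def mae(s, g):
    """Eq. (13): mean absolute per-pixel difference."""
    return np.abs(s - g).mean()

def f_beta(s, g, thresh=0.5, beta2=0.3):
    """Eq. (15) at one binarization threshold, with beta^2 = 0.3 as in [1]."""
    b = (s >= thresh).astype(np.float32)
    tp = (b * g).sum()                       # |S ∩ G|
    precision = tp / max(b.sum(), 1e-8)      # |S ∩ G| / |S|
    recall = tp / max(g.sum(), 1e-8)         # |S ∩ G| / |G|
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / max(denom, 1e-8)
```

Sweeping `thresh` over 0 to 255 (for 8-bit maps) and collecting the precision-recall pairs produces the PR and F-measure curves discussed above.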

The basic quantities \( precision \) and \( recall \) are extended to non-binary values, and different weights are assigned to different errors according to location and neighborhood information. Besides these four measures, the Max F-measure, which selects the maximum value of the F-measure from the PR curve, the E-measure [10], and the S-measure [9] are also used in the literature. The S-measure computes region-aware and object-aware similarities, and the E-measure combines local pixel values with image-level mean values for joint assessment.

4.1 State-of-the-Art Comparisons

Quantitative Evaluation. We demonstrate the efficacy of our model by comparing it with 10 recent state-of-the-art models, including BMPM [43], CPD [39], EGNet [47], BANet [29], BASNet [27], SCRN [40], F3Net [36], GateNet [50], MINet [25], and LDF [37]. To ensure a fair comparison, the saliency maps are either provided by the authors or generated using officially released pre-trained models. Table 1 displays the performances of the aforementioned methods on six datasets. Our method consistently outperforms the other models and achieves the best performance across all six datasets. In particular, we significantly improve the best F-score (\( F_{\beta } \)) on all datasets, with a 1.5% increase on DUT-OMRON, 1.1% on HKU-IS, and 0.9% on DUTS-TE. It is also worth mentioning that our method surpasses the others by a larger margin on large datasets, while the difference on small-scale datasets (\( \lt \)1,000 images) is less obvious. Due to the limited number of images, small datasets may not well reflect the actual performance of a model.

Table 1. Quantitative Results Compared with State-of-the-art Methods on Six Datasets

| Method (year) | ECSSD (#1,000) | DUTS-TE (#5,019) | DUT-OMRON (#5,168) | PASCAL-S (#850) | HKU-IS (#4,447) | THUR15K (#6,232) |
|---|---|---|---|---|---|---|
| BMPM 2018 [43] | 0.045 / 0.868 / 0.871 | 0.049 / 0.745 / 0.761 | 0.064 / 0.692 / 0.681 | 0.076 / 0.769 / 0.782 | 0.039 / 0.871 / 0.859 | 0.079 / 0.704 / – |
| CPD-R 2019 [39] | 0.037 / 0.917 / 0.898 | 0.043 / 0.805 / 0.795 | 0.056 / 0.747 / 0.719 | 0.074 / 0.829 / 0.800 | 0.034 / 0.891 / 0.875 | 0.068 / 0.738 / 0.730 |
| EGNet-R 2019 [47] | 0.037 / 0.920 / 0.903 | 0.039 / 0.815 / 0.816 | 0.053 / 0.756 / 0.738 | 0.075 / 0.831 / 0.807 | 0.031 / 0.901 / 0.887 | 0.067 / 0.739 / 0.733 |
| BANet 2019 [29] | 0.035 / 0.923 / 0.908 | 0.040 / 0.815 / 0.811 | 0.059 / 0.746 / 0.736 | 0.070 / 0.838 / 0.817 | 0.032 / 0.899 / 0.887 | 0.068 / 0.741 / – |
| BASNet 2019 [27] | 0.037 / 0.880 / 0.904 | 0.048 / 0.791 / 0.803 | 0.056 / 0.756 / 0.751 | 0.079 / 0.777 / 0.797 | 0.032 / 0.895 / 0.889 | 0.073 / 0.733 / 0.721 |
| SCRN 2019 [40] | 0.037 / 0.918 / 0.899 | 0.040 / 0.809 / 0.803 | 0.056 / 0.746 / 0.720 | 0.065 / 0.839 / 0.816 | 0.033 / 0.897 / 0.878 | 0.066 / 0.741 / 0.734 |
| F3Net 2020 [36] | 0.033 / 0.925 / 0.912 | 0.035 / 0.791 / 0.835 | 0.053 / 0.766 / 0.747 | 0.064 / 0.844 / 0.823 | 0.028 / 0.910 / 0.900 | 0.065 / 0.756 / 0.744 |
| GateNet 2020 [50] | 0.035 / 0.917 / 0.906 | 0.035 / 0.816 / 0.828 | 0.051 / 0.761 / 0.749 | 0.065 / 0.827 / 0.821 | 0.029 / 0.903 / 0.893 | – / – / – |
| MINet 2020 [25] | 0.033 / 0.924 / 0.911 | 0.037 / 0.828 / 0.825 | 0.055 / 0.756 / 0.738 | 0.064 / 0.842 / 0.821 | 0.028 / 0.908 / 0.899 | – / – / – |
| LDF 2020 [37] | 0.034 / 0.930 / 0.915 | 0.034 / 0.855 / 0.845 | 0.051 / 0.773 / 0.752 | 0.062 / 0.853 / 0.828 | 0.027 / 0.914 / 0.904 | 0.064 / 0.763 / 0.752 |
| Ours | 0.035 / 0.931 / 0.911 | 0.033 / 0.863 / 0.847 | 0.048 / 0.785 / 0.758 | 0.062 / 0.855 / 0.829 | 0.027 / 0.924 / 0.907 | 0.062 / 0.769 / 0.755 |

Each cell reports \( MAE \) / m\( F_{\beta } \) / \( F_{\beta }^{\omega } \). '–' means the results could not be obtained. For all metrics except \( MAE \), higher is better. In the original article, the best three results are highlighted in red, blue, and green, respectively.

Figure 6 shows the precision-recall curves (first row) and the F-measure curves (second row) of all the methods. Our PR curve consistently lies above those of the other methods, achieving the best performance on ECSSD, PASCAL-S, DUTS-TE, and HKU-IS and very competitive results on DUT-OMRON. Moreover, our PR curve is significantly shorter than those of the other methods and covers a larger range of recall values, which indicates that our saliency maps contain fewer false negative predictions. Across all datasets, our F-measure curve has the flattest slope and the largest area under the curve, demonstrating that our generated saliency maps maintain good quality across varying thresholds.

Fig. 6.

Fig. 6. The precision-recall curve (first row) and F-measure versus different thresholds (second row) of all the methods.

Notice that on ECSSD and DUT-OMRON, GateNet [50] seems to have the highest F-score at certain thresholds, but the score drops very quickly as the threshold increases. This implies that their saliency maps contain many non-binary predictions (i.e., values in (0, 1)). Only when an appropriate threshold is carefully chosen can their method achieve reasonable performance.

Qualitative Evaluation. Some representative visual examples are shown in Figure 7. We select images from challenging scenarios, including low color contrast (first row), high inter-object contrast (second row), low contrast near the object boundary (third row), multiple objects with low background contrast (fourth row), a partly occluded object (fifth row), an object in a cluttered background (sixth row), a small object near the image border (seventh row), and an object with irregular and complex edges (last row). Our method suppresses background noise well and accurately segments salient objects of various sizes with coherent details.

Fig. 7.

Fig. 7. Visual comparison of our method with other state-of-the-art methods in different challenging scenarios. Our method can well distinguish salient objects and suppress background noise, giving better visual appeal than others.

4.2 Ablation Study

Baseline Models. To investigate the effect of each proposed module, we conduct ablation studies on several baselines: (1) B1 uses a ResNet backbone and a decoder network for direct saliency prediction; (2) B2 first generates an intermediate body map and then produces a detail map to refine the boundary; (3) B3 models the detail first and then fills the body map into the detail mask; (4) B4 adds the structural similarity loss to enforce the detail decoder to learn structural information; (5) B5 adopts the IoU loss to help the body decoder quickly attend to the body, and the F-loss to balance the body and detail information; (6) B6 additionally employs MDAB for better detail modeling; and (7) B7 applies MBAB to fuse the detail and feature maps for body generation. Table 2 shows the results of the ablation study on THUR15K. As we deploy more of the proposed modules, the performance improves step-wise, demonstrating the effectiveness of each module. We analyze the impact of each module in the following paragraphs.

Table 2.
Setting | Description | \( MAE \downarrow \) | \( F_{\beta } \uparrow \) | \( F_{\beta }^{\omega } \uparrow \)
B1 | Baseline | 0.071 | 0.726 | 0.721
B2 | B1 + Body Map \( \rightarrow \) Detail Map | 0.070 | 0.732 | 0.727
B3 | B1 + Detail Map \( \rightarrow \) Body Map | 0.068 | 0.737 | 0.732
B4 | B3 + \( l_{\rm SSIM} \) on Detail Map | 0.066 | 0.749 | 0.744
B5 | B4 + \( l_{\rm IoU} \) and \( l_{\rm F} \) on Body Map | 0.065 | 0.753 | 0.750
B6 | B5 + MDAB | 0.064 | 0.760 | 0.752
B7 | B6 + MBAB | 0.062 | 0.769 | 0.755

Table 2. Ablation Studies on THUR15K

Effect of Detail Modeling. We evaluate the impact of detail modeling by comparing baseline B3, which captures the detail first, with baseline B1, which directly outputs the saliency map. As can be seen from Table 2, detail modeling brings about a 1.5% increase in both \( F_{\beta } \) and \( F_{\beta }^{\omega } \). Figure 8 (left) illustrates the visual impact: the detail map first identifies the informative edges of the object and thus helps the body filling stage generate a saliency map with more accurate boundaries.
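The detail label is described as the pixels on and near the object edge. One possible way to derive such a band from a binary ground-truth mask is a morphological dilation-minus-erosion; this is only a sketch under that assumption (the `width` parameter and the 4-connected structuring element are illustrative choices, not the paper's exact decomposition):

```python
import numpy as np

def shift(mask, dy, dx):
    """Shift a 2D boolean mask by (dy, dx), padding with False."""
    out = np.zeros_like(mask)
    h, w = mask.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        mask[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def detail_label(mask, width=1):
    """Band of pixels on and within `width` pixels of the object boundary:
    the dilation of the binary mask minus its erosion (4-connected)."""
    dil = mask.astype(bool)
    ero = mask.astype(bool)
    for _ in range(width):
        nd = [shift(dil, dy, dx) for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        ne = [shift(ero, dy, dx) for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        dil = dil | nd[0] | nd[1] | nd[2] | nd[3]
        ero = ero & ne[0] & ne[1] & ne[2] & ne[3]
    return dil & ~ero
```

Applied to a solid square, the band covers the square's boundary ring plus a one-pixel margin outside it, while the interior is excluded — matching the intuition of a label that supervises only edge-adjacent pixels.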

Fig. 8.

Fig. 8. (Left) Visual illustration of detail modeling. The first stage of B3 aims at identifying the informative details of the object, which benefits the downstream body filling task for more accurate saliency map generation. (Middle) Effect of multi-scale attention blocks. Equipped with the two proposed blocks, the network learns to attend to the crucial regions and remove noise to refine the mask. (Right) Visual effect of the hybrid loss function. \( l_{\rm SSIM} \) improves the mask by enriching the representation with the structural information, while \( l_{\rm IoU} \) and \( l_{\rm F} \) helps the network to concentrate on the body and complements the body and detail map by ensuring the fused mask has a high F-score, respectively.

Impact of Generation Order. We design baseline B2, which generates an intermediate body map first, to investigate the impact of generation order: an “easier first” strategy lets the network focus on the easier body map task and then refine the boundary, whereas a “harder first” strategy makes the model learn the harder detail first and then fill in the body. From Table 2, B3 surpasses B2, supporting the effectiveness of the “harder first” strategy. The reason may be that neural networks naturally focus on low-level information such as edges first and then gradually shift to high-level semantics.

Effect of Multi-scale Attention Blocks. Building on B5, we demonstrate the effect of the proposed MDAB and MBAB with baselines B6 and B7. Table 2 shows that the successive deployment of the two attention blocks improves the baseline by about 1% in \( F_{\beta } \) and \( F_{\beta }^{\omega } \). As the visual illustration in Figure 8 (middle) shows, the two attention blocks encourage the model to concentrate on the salient object and refine the mask by removing misclassified background regions.

Effect of the Hybrid Loss Function. To validate the effect of the hybrid loss function, we conduct a set of experiments with different loss configurations on our model. The evaluation results in Table 2 show that combining the structural similarity loss on the detail map with the IoU and F losses on the body map works best. Figure 8 (right) displays visual examples that further demonstrate this: \( l_{\rm SSIM} \) effectively complements the structural information, yielding saliency maps with sharp and clear boundaries, while \( l_{\rm IoU} \) and \( l_{\rm F} \) ensure that the network focuses on the salient object and combines the two maps well to achieve a high F-score.
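The three loss terms can be sketched on continuous maps as follows. This is a simplified illustration, not the paper's implementation: \( l_{\rm SSIM} \) is computed here globally over the whole map (SSIM is normally windowed, cf. [35]), and the soft IoU and F formulations follow the common conventions of [28] and [48].

```python
import numpy as np

def ssim_loss(pred, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM, computed globally over the whole map (a simplification
    of the usual windowed SSIM)."""
    mx, my = pred.mean(), gt.mean()
    vx, vy = pred.var(), gt.var()
    cov = ((pred - mx) * (gt - my)).mean()
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
    return float(1.0 - ssim)

def iou_loss(pred, gt, eps=1e-8):
    """Soft IoU loss on continuous maps: penalizes missing body regions."""
    inter = (pred * gt).sum()
    union = pred.sum() + gt.sum() - inter
    return float(1.0 - (inter + eps) / (union + eps))

def f_loss(pred, gt, beta2=0.3, eps=1e-8):
    """Soft F-measure loss: encourages a high F-score on the fused map."""
    tp = (pred * gt).sum()
    p = tp / (pred.sum() + eps)
    r = tp / (gt.sum() + eps)
    return float(1.0 - (1 + beta2) * p * r / (beta2 * p + r + eps))
```

All three losses vanish when the prediction matches the ground truth, and each stresses a different aspect: structure for the detail map, region overlap for the body map, and the final F-score for the fused output.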

Visualizing Attention, Detail, and Body Maps. To understand how the multi-scale attention works, we visualize the local spatial-wise attention maps in the last MBAB and MDAB, the detail and body maps, and the fused saliency map in Figure 9 (left). The first detail attention map, used by the detail decoder, highlights the edges of the object, pushing the network to concentrate on the boundary. The second detail attention map, used by the body decoder, emphasizes crucial details and reveals regions that the previous attention map may neglect. The feature attention map attends to the important features of the image and encourages the network to segment more accurate saliency maps.

Fig. 9.

Fig. 9. (Left) Visual examples of the attention and output maps. (Right) Visualization of the deviation between predicted detail map and decomposed detail label. The detail map may also assign values to features that can characterize the object.

Deviation of the Detail Map. To study the concrete effect of the detail label, we measure the deviation between the generated detail map and the explicitly decoupled detail label, and present some examples in Figure 9 (right). The generated detail maps do not strictly follow the distribution of the decoupled detail label, which only assigns larger values to pixels near the edge. Instead, the produced detail map may also assign values to crucial features that characterize the object (e.g., the neck of the eagle). We expect the decoupled detail label to guide, rather than strictly constrain, the detail map in distinguishing crucial pixels.

Failure Cases. Figure 10 presents two failure cases on ECSSD where our model has the narrowest margin over other methods. The left example contains a “smiling” ball in the center. The ground truth marks only the “smiling” ball, whereas our prediction includes all the balls with high color contrast. The reason may be that our model focuses too much on the background contrast but fails to handle the inter-object contrast in this specific instance. The right example is a flower: the ground truth covers the whole flower, but ours predicts only its stamen. The stamen naturally pops out of the image but is only a sub-part of the flower. We attribute this to the lack of image-level class label supervision that would let the network learn the category of an object.

Fig. 10.

Fig. 10. Examples of two failure cases of our model.


5 CONCLUSION

We propose a novel end-to-end SOD framework that disentangles the original task into cascaded detail modeling and body filling. This framework can effectively reduce the difficulty of direct saliency detection. Moreover, we propose two multi-scale attention blocks that target feature fusion and help the network to generate more accurate detail and body maps. Extensive experiments have demonstrated that our method achieves state-of-the-art performances on different metrics across six datasets.


Supplemental Material

REFERENCES

  [1] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Süsstrunk. 2009. Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [2] Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. 2015. Salient object detection: A benchmark. IEEE Trans. Image Process. 24, 12 (2015), 5706–5722.
  [3] Alessandro Bruno, Francesco Gugliuzza, Roberto Pirrone, and Edoardo Ardizzone. 2020. A multi-scale colour and keypoint density-based approach for visual saliency detection. IEEE Access 8 (2020), 121330–121343.
  [4] Shuhan Chen, Xiuli Tan, Ben Wang, and Xuelong Hu. 2018. Reverse attention for salient object detection. In Proceedings of the European Conference on Computer Vision.
  [5] Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, and Shi-Min Hu. 2014. SalientShape: Group saliency in image collections. Visual Comput. 30, 4 (2014), 443–453.
  [6] Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Philip H. S. Torr, and Shi-Min Hu. 2014. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 37, 3 (2014), 569–582.
  [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [8] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vision 88, 2 (2010), 303–338.
  [9] Deng-Ping Fan, Ming-Ming Cheng, Yun Liu, Tao Li, and Ali Borji. 2017. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  [10] Deng-Ping Fan, Cheng Gong, Yang Cao, Bo Ren, Ming-Ming Cheng, and Ali Borji. 2018. Enhanced-alignment measure for binary foreground map evaluation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'18).
  [11] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. 2004. Efficient graph-based image segmentation. Int. J. Comput. Vision 59, 2 (2004), 167–181.
  [12] Mengyang Feng, Huchuan Lu, and Errui Ding. 2019. Attentive feedback network for boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [13] Chenlei Guo, Qi Ma, and Liming Zhang. 2008. Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
  [15] Qibin Hou, Ming-Ming Cheng, Xiaowei Hu, Ali Borji, Zhuowen Tu, and Philip H. S. Torr. 2017. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [16] Xiaodi Hou and Liqing Zhang. 2007. Saliency detection: A spectral residual approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [17] Guanbin Li and Yizhou Yu. 2016. Visual saliency detection based on multiscale deep CNN features. IEEE Trans. Image Process. 25, 11 (2016), 5012–5024.
  [18] Yin Li, Xiaodi Hou, Christof Koch, James M. Rehg, and Alan L. Yuille. 2014. The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [19] Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Jiashi Feng, and Jianmin Jiang. 2019. A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [20] Nian Liu, Junwei Han, and Ming-Hsuan Yang. 2018. PiCANet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [21] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [22] Zhiming Luo, Akshaya Mishra, Andrew Achkar, Justin Eichel, Shaozi Li, and Pierre-Marc Jodoin. 2017. Non-local deep features for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [23] Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal. 2014. How to evaluate foreground maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [24] Gellért Máttyus, Wenjie Luo, and Raquel Urtasun. 2017. DeepRoadMapper: Extracting road topology from aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  [25] Youwei Pang, Xiaoqi Zhao, Lihe Zhang, and Huchuan Lu. 2020. Multi-scale interactive network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [26] Federico Perazzi, Philipp Krähenbühl, Yael Pritch, and Alexander Hornung. 2012. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [27] Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, and Martin Jagersand. 2019. BASNet: Boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [28] Md Atiqur Rahman and Yang Wang. 2016. Optimizing intersection-over-union in deep neural networks for image segmentation. In Proceedings of the International Symposium on Visual Computing.
  [29] Jinming Su, Jia Li, Yu Zhang, Changqun Xia, and Yonghong Tian. 2019. Selectivity or invariance: Boundary-aware salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  [30] Na Tong, Huchuan Lu, Ying Zhang, and Xiang Ruan. 2015. Salient object detection via global and local cues. Pattern Recogn. 48, 10 (2015), 3258–3267.
  [31] Anne M. Treisman and Garry Gelade. 1980. A feature-integration theory of attention. Cogn. Psychol. 12, 1 (1980), 97–136.
  [32] Lijun Wang, Huchuan Lu, Yifan Wang, Mengyang Feng, Dong Wang, Baocai Yin, and Xiang Ruan. 2017. Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [33] Tiantian Wang, Lihe Zhang, Shuo Wang, Huchuan Lu, Gang Yang, Xiang Ruan, and Ali Borji. 2018. Detect globally, refine locally: A novel approach to saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [34] Wenguan Wang, Qiuxia Lai, Huazhu Fu, Jianbing Shen, Haibin Ling, and Ruigang Yang. 2021. Salient object detection in the deep learning era: An in-depth survey. IEEE Trans. Pattern Anal. Mach. Intell. 44, 6 (2021), 3239–3259.
  [35] Zhou Wang, Eero P. Simoncelli, and Alan C. Bovik. 2003. Multiscale structural similarity for image quality assessment. In Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers.
  [36] Jun Wei, Shuhui Wang, and Qingming Huang. 2020. F\( ^3 \)Net: Fusion, feedback and focus for salient object detection. In Proceedings of the AAAI Conference on Artificial Intelligence.
  [37] Jun Wei, Shuhui Wang, Zhe Wu, Chi Su, Qingming Huang, and Qi Tian. 2020. Label decoupling framework for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [38] Runmin Wu, Mengyang Feng, Wenlong Guan, Dong Wang, Huchuan Lu, and Errui Ding. 2019. A mutual learning method for salient object detection with intertwined multi-supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [39] Zhe Wu, Li Su, and Qingming Huang. 2019. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [40] Zhe Wu, Li Su, and Qingming Huang. 2019. Stacked cross refinement network for edge-aware salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  [41] Qiong Yan, Li Xu, Jianping Shi, and Jiaya Jia. 2013. Hierarchical saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [42] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. 2013. Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [43] Lu Zhang, Ju Dai, Huchuan Lu, You He, and Gang Wang. 2018. A bi-directional message passing model for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [44] Pingping Zhang, Dong Wang, Huchuan Lu, Hongyu Wang, and Xiang Ruan. 2017. Amulet: Aggregating multi-level convolutional features for salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  [45] Qing Zhang, Jiajun Lin, Yanyun Tao, Wenju Li, and Yanjiao Shi. 2017. Salient object detection via color and texture cues. Neurocomputing 243 (2017), 35–48.
  [46] Xiaoning Zhang, Tiantian Wang, Jinqing Qi, Huchuan Lu, and Gang Wang. 2018. Progressive attention guided recurrent network for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [47] Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao, Jufeng Yang, and Ming-Ming Cheng. 2019. EGNet: Edge guidance network for salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  [48] Kai Zhao, Shanghua Gao, Wenguan Wang, and Ming-Ming Cheng. 2019. Optimizing the F-measure for threshold-free salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  [49] Ting Zhao and Xiangqian Wu. 2019. Pyramid feature attention network for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  [50] Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu, and Lei Zhang. 2020. Suppress and balance: A simple gated network for salient object detection. In Proceedings of the European Conference on Computer Vision.
  [51] Yijie Zhong, Bo Li, Lv Tang, Hao Tang, and Shouhong Ding. 2021. Highly efficient natural image matting. In Proceedings of the British Machine Vision Conference.

• Published in ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1 (January 2023), 505 pages. ISSN: 1551-6857; EISSN: 1551-6865; DOI: 10.1145/3572858. Editor: Abdulmotaleb El Saddik.

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 5 January 2023
          • Online AM: 3 March 2022
          • Accepted: 21 January 2022
          • Revised: 15 January 2022
          • Received: 20 September 2021
          Published in TOMM Volume 19, Issue 1

          Qualifiers

          • research-article
          • Refereed
