research-article

Y-Net: Dual-branch Joint Network for Semantic Segmentation

Published: 12 November 2021

Abstract

Most existing segmentation networks are built on a “U-shaped” encoder–decoder structure, in which the decoder gradually aggregates the multi-level features extracted by the encoder. Although this structure has proven effective at improving segmentation performance, it has two main drawbacks. First, introducing low-level features significantly increases computation without an obvious performance gain. Second, common feature-aggregation strategies such as addition and concatenation fuse features without considering the usefulness of each feature vector, which mixes useful information with a large amount of noise. In this article, we abandon the traditional “U-shaped” architecture and propose Y-Net, a dual-branch joint network for accurate semantic segmentation. Specifically, it aggregates only the low-resolution, high-level features and uses the global context guidance generated by the first branch to refine the second branch. The two branches are connected through a Semantic Enhancing Module, which can be regarded as a combination of spatial attention and channel attention. We also design a novel Channel-Selective Decoder (CSD) that adaptively integrates features from different receptive fields by assigning input-dependent channel-wise weights. Y-Net breaks through the limits of single-branch networks and attains higher performance at lower computational cost than the “U-shaped” structure, while the proposed CSD better integrates useful information and suppresses interfering noise. Comprehensive experiments on three public datasets evaluate the effectiveness of our method. Y-Net achieves state-of-the-art performance on the PASCAL VOC 2012, PASCAL Person-Part, and ADE20K datasets without pre-training on extra datasets.
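To make the idea of input-dependent channel-wise weighting concrete, the following is a minimal sketch in plain Python. It is not the authors' Channel-Selective Decoder: the function names (`global_avg_pool`, `channel_selective_fuse`) are hypothetical, and the weights are taken directly from pooled activations through a softmax, whereas a trained network would presumably compute them with learned layers. It only illustrates how per-channel weights that depend on the input differ from plain addition, which weights every branch equally.

```python
import math


def global_avg_pool(feature):
    """Reduce a C x H x W feature (nested lists) to one scalar per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_selective_fuse(branches):
    """Fuse B features of shape C x H x W with per-channel weights.

    Assumption for illustration: the weights come straight from the pooled
    activations of each branch, with no learned parameters.
    """
    descriptors = [global_avg_pool(f) for f in branches]  # B x C
    num_branches = len(branches)
    num_channels = len(descriptors[0])
    # For each channel, a softmax over branches, so the weights sum to 1.
    weights = [
        softmax([descriptors[b][c] for b in range(num_branches)])
        for c in range(num_channels)
    ]  # C x B
    fused = []
    for c in range(num_channels):
        height = len(branches[0][c])
        width = len(branches[0][c][0])
        fused.append([
            [
                sum(weights[c][b] * branches[b][c][i][j]
                    for b in range(num_branches))
                for j in range(width)
            ]
            for i in range(height)
        ])
    return fused, weights


# Two toy branches (standing in for different receptive fields),
# each with 2 channels of size 2 x 2.
branch_a = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
branch_b = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
fused, weights = channel_selective_fuse([branch_a, branch_b])
```

In this toy input, channel 0 is active only in `branch_a` and channel 1 only in `branch_b`, so each channel's weights lean toward the branch that carries its signal; changing the input changes the weights.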

    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 4
      November 2021
      529 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3492437

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 12 November 2021
      • Accepted: 1 April 2021
      • Revised: 1 December 2020
      • Received: 1 August 2020

      Qualifiers

      • research-article
      • Refereed
