research-article

Spatial Structure Preserving Feature Pyramid Network for Semantic Image Segmentation

Published: 31 August 2019

Abstract

Semantic image segmentation has recently made substantial progress, benefiting from the rapid development of Convolutional Neural Networks. Most recently proposed approaches are based on Fully Convolutional Networks (FCNs). However, these FCN-based methods rely on large receptive fields and many pooling layers to capture the discriminative semantic information of an image. Specifically, on the one hand, convolutional kernels with large receptive fields smooth out detailed edges, since too much contextual information is used to describe the “center pixel.” On the other hand, pooling layers enlarge the receptive field by downsampling the latest feature maps, which loses much of the image's detailed information, especially in the deeper layers of the network. These operations often cause low spatial resolution in deep layers, which leads to spatially fragmented predictions. To address this problem, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to extract feature maps at different resolutions and take full advantage of them through gradual stacked fusion. Specifically, for two adjacent convolutional layers, we upsample the features from the deeper layer with a stride of 2 and then stack them on the features from the shallower layer. A convolutional layer with 1×1 kernels then fuses the stacked features. The fused feature preserves the spatial structure information of the image while retaining strong discriminative capability for pixel classification. Additionally, to further preserve the spatial structure information and regional connectivity of the predicted category label map, we propose a novel loss term for the network.
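The stacked fusion step described above (upsample the deeper features, stack them on the shallower ones, fuse with a 1×1 convolution) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: nearest-neighbour upsampling, reading "stack" as channel-wise concatenation, the channel counts, and the random weights are all assumptions made for illustration, and the helper names `upsample2x`, `conv1x1`, and `fuse` are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling by a factor of 2 (assumed); x is (C, H, W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weight, bias):
    """1x1 convolution, i.e. a per-pixel linear map over channels.

    x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,).
    """
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def fuse(shallow, deep, weight, bias):
    """Upsample `deep`, stack it on `shallow` along channels, fuse with a 1x1 conv."""
    up = upsample2x(deep)
    if up.shape[1:] != shallow.shape[1:]:
        raise ValueError("deep features must have half the spatial size of shallow features")
    stacked = np.concatenate([shallow, up], axis=0)  # (C_shallow + C_deep, H, W)
    return conv1x1(stacked, weight, bias)

# Hypothetical sizes: a 16x16 shallow map with 8 channels, an 8x8 deep map with 16.
rng = np.random.default_rng(0)
shallow = rng.standard_normal((8, 16, 16))
deep = rng.standard_normal((16, 8, 8))
weight = rng.standard_normal((8, 8 + 16)) * 0.1  # random stand-in for learned weights
bias = np.zeros(8)

fused = fuse(shallow, deep, weight, bias)
print(fused.shape)  # (8, 16, 16)
```

The fused map keeps the spatial size of the shallower layer, which is how the fusion retains resolution; applying it repeatedly from the deepest pair of layers upward would give the gradual stacking the abstract describes.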
In detail, we propose two graph-model-based spatial affinity matrices, which describe the pixel-level relationships in the input image and in the predicted category label map, respectively; their cosine distance is then back-propagated through the network. The proposed architecture, called the spatial structure preserving feature pyramid network, significantly improves the spatial resolution of the predicted category label map for semantic image segmentation. The proposed method achieves state-of-the-art results on three challenging public datasets for semantic image segmentation.
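As a rough illustration of the affinity-based loss term, the sketch below builds one affinity matrix from the input image and one from the predicted class probabilities, then measures their cosine distance. The abstract does not say how the affinities are defined, so the Gaussian colour kernel, the inner product of class distributions, the dense all-pairs graph, and the function names are assumptions, not the paper's formulation.

```python
import numpy as np

def image_affinity(image, sigma=0.5):
    """Assumed image affinity: Gaussian kernel on colour differences of all pixel pairs.

    image: (H, W, C) floats in [0, 1]. Returns an (H*W, H*W) matrix.
    """
    feats = image.reshape(-1, image.shape[-1])
    sq_dist = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def prediction_affinity(probs):
    """Assumed label-map affinity: inner product of per-pixel class distributions.

    probs: (H, W, K) softmax outputs. Returns an (H*W, H*W) matrix.
    """
    p = probs.reshape(-1, probs.shape[-1])
    return p @ p.T

def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity between two matrices, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

# Toy 4x4 grey image: left half dark, right half bright.
image = np.zeros((4, 4, 1))
image[:, 2:, 0] = 1.0

# Prediction that follows the image regions.
coherent = np.empty((4, 4, 2))
coherent[:, :2] = [0.9, 0.1]
coherent[:, 2:] = [0.1, 0.9]

# Spatially fragmented (checkerboard) prediction.
rows, cols = np.indices((4, 4))
checker = (rows + cols) % 2 == 0
fragmented = np.where(checker[..., None], [0.9, 0.1], [0.1, 0.9])

a_img = image_affinity(image)
loss_coherent = cosine_distance(a_img, prediction_affinity(coherent))
loss_fragmented = cosine_distance(a_img, prediction_affinity(fragmented))
print(loss_coherent < loss_fragmented)  # True
```

Under these assumptions the fragmented prediction is penalised more than the coherent one. In training, the same computation would have to be written in an automatic-differentiation framework so that the distance can be back-propagated; NumPy is used here only to show the arithmetic.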

