Abstract
Semantic image segmentation has progressed substantially in recent years, benefiting from the rapid development of convolutional neural networks. Most recently proposed approaches are based on fully convolutional networks (FCNs). However, these FCN-based methods rely on large receptive fields and too many pooling layers to capture the discriminative semantic information of an image. On one hand, convolutional kernels with large receptive fields smooth away detailed edges, since too much contextual information is used to depict the “center pixel.” On the other hand, pooling layers enlarge the receptive field by downsampling the preceding feature maps, which discards much of the image's detail, especially in the deeper layers of the network. Together, these operations often cause low spatial resolution in the deep layers, which leads to spatially fragmented predictions. To address this problem, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to extract feature maps at different resolutions and take full advantage of them through a gradual stack-and-fuse scheme. Specifically, for two adjacent convolutional layers, we upsample the features from the deeper layer with a stride of 2 and then stack them on the features from the shallower layer. A convolutional layer with 1×1 kernels then fuses the stacked features. The fused features preserve the spatial structure of the image while retaining strong discriminative capability for pixel classification. Additionally, to further preserve the spatial structure and regional connectivity of the predicted category label map, we propose a novel loss term for the network.
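The stack-and-fuse step for two adjacent layers can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the upsampling is performed, so nearest-neighbour repetition is assumed here, and the function names, channel counts, and random weights are purely illustrative.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map (assumed method)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def fuse(deep, shallow, weight, bias):
    """Upsample `deep`, stack it on `shallow`, then fuse with a 1x1 convolution.

    deep:    (C_d, H, W)   features from the deeper layer
    shallow: (C_s, 2H, 2W) features from the shallower layer
    weight:  (C_out, C_d + C_s) kernel of the 1x1 convolution
    bias:    (C_out,)
    """
    # Stacking is channel-wise concatenation at the shallower layer's resolution.
    stacked = np.concatenate([upsample2x(deep), shallow], axis=0)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    fused = np.einsum("oc,chw->ohw", weight, stacked)
    return fused + bias[:, None, None]


# Illustrative shapes only; real layers would supply learned weights.
rng = np.random.default_rng(0)
deep = rng.standard_normal((8, 4, 4))
shallow = rng.standard_normal((4, 8, 8))
weight = rng.standard_normal((6, 12))
bias = np.zeros(6)

fused = fuse(deep, shallow, weight, bias)
print(fused.shape)  # (6, 8, 8)
```

Applying `fuse` repeatedly from the deepest layer towards the shallowest gives the gradual, pyramid-style fusion the abstract describes, with each output serving as the "deeper" input of the next step.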
In detail, the loss term uses two graph-model-based spatial affinity matrices, which depict the pixel-level relationships in the input image and in the predicted category label map, respectively; the cosine distance between them is backpropagated through the network. The proposed architecture, called the spatial structure preserving feature pyramid network, significantly improves the spatial resolution of the predicted category label map for semantic image segmentation. The proposed method achieves state-of-the-art results on three challenging public datasets for semantic image segmentation.
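The affinity-based loss term can be sketched in the same spirit. The abstract does not define how the affinity matrices are built, so the Gaussian kernel, the `sigma` parameter, and all function names below are hypothetical; only the idea of comparing an image affinity matrix with a prediction affinity matrix by cosine distance comes from the text. The sketch computes the forward value only; training would need an automatic-differentiation framework to backpropagate it.

```python
import numpy as np


def gaussian_affinity(features, sigma=1.0):
    """Dense pixel-to-pixel affinity (N, N) from per-pixel features (N, D).

    Hypothetical choice: a Gaussian kernel on squared Euclidean distance.
    """
    diff = features[:, None, :] - features[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity of two matrices, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)


def structure_loss(image, probs, sigma=1.0):
    """image: (H, W, 3) colours; probs: (H, W, K) predicted class probabilities."""
    a_img = gaussian_affinity(image.reshape(-1, image.shape[-1]), sigma)
    a_pred = gaussian_affinity(probs.reshape(-1, probs.shape[-1]), sigma)
    return cosine_distance(a_img, a_pred)


# Tiny illustrative inputs; a dense (N, N) matrix is only practical for small N.
rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))
logits = rng.standard_normal((4, 4, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

loss = structure_loss(image, probs)
print(round(loss, 4))
```

The loss is zero when the two affinity matrices are proportional, that is, when pixels that look alike in the image also receive similar predictions, and it grows as the two structures diverge.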
Index Terms
Spatial Structure Preserving Feature Pyramid Network for Semantic Image Segmentation