Abstract
Because of densely packed body parts, nonrigid clothing items, and severe overlap in crowd scenes, human parsing depends more heavily on multilevel feature representations than general scene parsing does. Based on this observation, we propose to introduce human mask prediction and edge detection as auxiliary tasks to facilitate human parsing. Unlike human parsing, which exploits the discriminative features of each category, human mask and edge detection emphasize the boundaries of semantic parsing regions and the difference between foreground humans and background clutter, which benefits parsing predictions for crowd scenes and small human parts. Specifically, we extract human mask and edge labels from the human parsing annotations and train a shared encoder with three independent decoders for the three mutually beneficial tasks. The decoder feature maps of the human mask prediction branch are additionally used as attention maps that indicate human regions, guiding the decoding process of human parsing and human edge detection. Beyond these auxiliary tasks, we alleviate the problem of deformed clothing items under various human poses by tracking the deformation patterns with deformable convolution. Extensive experiments show that the proposed method achieves superior performance compared with state-of-the-art methods on both single and multiple human parsing datasets. Code and trained models are available at https://github.com/ViktorLiang/MGDAN.
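As an illustration of how mask and edge labels could be derived from a parsing annotation, the following minimal Python sketch works on a toy label grid. The abstract does not specify the extraction rule, so everything here is an assumption made for illustration: background is taken to be label 0, the mask is every non-background pixel, and an edge pixel is any pixel whose label differs from its right or lower neighbour. The function names `mask_from_parsing` and `edge_from_parsing` are hypothetical and are not taken from the released code.

```python
from typing import List

Grid = List[List[int]]

BACKGROUND = 0  # assumed label id for background pixels


def mask_from_parsing(parsing: Grid) -> Grid:
    """Binary human mask: 1 wherever a non-background part label is present."""
    return [[1 if label != BACKGROUND else 0 for label in row] for row in parsing]


def edge_from_parsing(parsing: Grid) -> Grid:
    """Binary edge map: 1 on both sides of every horizontal or vertical label change."""
    height, width = len(parsing), len(parsing[0])
    edge = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            here = parsing[y][x]
            # Compare with the right neighbour.
            if x + 1 < width and parsing[y][x + 1] != here:
                edge[y][x] = 1
                edge[y][x + 1] = 1
            # Compare with the lower neighbour.
            if y + 1 < height and parsing[y + 1][x] != here:
                edge[y][x] = 1
                edge[y + 1][x] = 1
    return edge


# Toy annotation: 0 = background, 1 and 2 = two hypothetical part labels.
toy = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 2, 2, 0],
    [0, 0, 0, 0],
]
mask = mask_from_parsing(toy)
edge = edge_from_parsing(toy)
```

Under these assumptions, the mask label merges all parts into a single foreground region, while the edge label responds both to the human/background boundary and to boundaries between adjacent parts, which matches the two cues the abstract attributes to the auxiliary tasks.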
Mask-Guided Deformation Adaptive Network for Human Parsing