Abstract
Human parsing is an important task in human-centric image understanding for computer vision and multimedia systems. However, most existing work on human parsing tackles the single-person scenario, which deviates from real-world applications where multiple persons appear simultaneously, interacting with and occluding one another. To address this challenging multi-human parsing problem, we introduce a novel model named MH-Parser, which uses a graph-based generative adversarial model to handle close-person interaction and occlusion. To validate the model, we collect a new dataset named Multi-Human Parsing (MHP), which contains images of multiple persons with intensive interaction and entanglement. Experiments on the new MHP dataset and on existing datasets demonstrate that the proposed method is effective at multi-human parsing compared with existing solutions in the literature.
Index Terms
Multi-human Parsing with a Graph-based Generative Adversarial Model