Abstract
Human pose estimation has an important impact on a wide range of applications, from human-computer interaction to surveillance and content-based video retrieval. In human pose estimation, occluded joints and overlapping human bodies lead to inaccurate pose estimates. To address these problems, we present a novel structure-aware network that integrates priors on the structure of human bodies and implicitly takes such priors into account during training. Learning such constraints explicitly is typically a challenging task. Instead, we adopt generative adversarial networks as our learning model, in which we design two residual Multiple-Instance Learning (MIL) models with identical architecture: one is used as the generator, and the other as the discriminator. The discriminator's task is to distinguish real poses from fake ones. If the pose generator produces results that the discriminator cannot distinguish from real ones, then the model has successfully learned the priors. In the proposed model, the discriminator differentiates the ground-truth heatmaps from the generated ones, and the adversarial loss is then back-propagated to the generator. This procedure helps the generator learn reasonable body configurations and proves advantageous for improving pose estimation accuracy. We also propose a novel function for MIL: an adjustable structure for both instance selection and modeling that appropriately passes information between instances in a single bag. In the proposed residual MIL neural network, the pooling operation adequately updates the contribution of each instance to its bag. The proposed pooling-based adversarial residual multi-instance neural network has been validated on two datasets for the human pose estimation task and outperforms other state-of-the-art models. The code will be made available at https://github.com/pshams55/AMIL.
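The adversarial objective described above can be illustrated with a minimal sketch. It is not the released AMIL code: the function names, the toy 2x2 heatmaps, the discriminator scores, and the weight `adv_weight` are all assumptions chosen for illustration, and the standard binary cross-entropy GAN loss is used here as a stand-in for whatever loss the paper actually adopts.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for a single probability, clamped away from 0 and 1."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def mse(pred, truth):
    """Mean squared error between two flattened heatmaps of equal length."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)

def discriminator_loss(d_real, d_fake):
    """The discriminator should score ground-truth heatmaps as 1 and generated ones as 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(pred, truth, d_fake, adv_weight=0.01):
    """Heatmap regression plus an adversarial term that rewards fooling the discriminator.

    adv_weight is a hypothetical value, not one reported in the paper.
    """
    return mse(pred, truth) + adv_weight * bce(d_fake, 1.0)

# Toy flattened 2x2 heatmaps for a single joint (illustrative values only).
truth = [0.0, 1.0, 0.0, 0.0]
pred = [0.1, 0.7, 0.1, 0.1]

# Hypothetical discriminator outputs: probability that the input heatmap is real.
d_real, d_fake = 0.9, 0.2

loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(pred, truth, d_fake)
```

In this sketch the generator loss falls as `d_fake` rises, so gradients from the adversarial term would push the generator toward heatmaps that the discriminator accepts as plausible body configurations.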
Index Terms
AMIL: Adversarial Multi-instance Learning for Human Pose Estimation