research-article

AMIL: Adversarial Multi-instance Learning for Human Pose Estimation

Published: 17 April 2020

Abstract

Human pose estimation matters for a wide range of applications, from human-computer interfaces to surveillance and content-based video retrieval. In this task, occluded joints and overlapping bodies degrade estimation accuracy. To address these problems, we present a novel structure-aware network that integrates priors on the structure of human bodies and implicitly takes them into account during training. Learning such constraints explicitly is a challenging task. Instead, we adopt a generative adversarial network as our learning model and design two residual Multiple-Instance Learning (MIL) models with identical architecture: one serves as the generator and the other as the discriminator. The discriminator's task is to distinguish real poses from fake ones. If the pose generator produces results that the discriminator cannot distinguish from real ones, then the model has successfully learned the priors. In the proposed model, the discriminator differentiates the ground-truth heatmaps from the generated ones, and the adversarial loss is then back-propagated to the generator. This procedure helps the generator learn reasonable body configurations and proves advantageous for pose estimation accuracy. We also propose a novel function for MIL: an adjustable structure for both instance selection and modeling that passes information between the instances in a single bag. In the proposed residual MIL neural network, the pooling operation updates each instance's contribution to its bag. The proposed pooling-based adversarial residual multi-instance neural network has been validated on two datasets for the human pose estimation task, where it outperforms other state-of-the-art models. The code will be made available at https://github.com/pshams55/AMIL.
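
The two ideas in the abstract, adversarial training on heatmaps and pooling instances into a bag score, can be illustrated with a minimal sketch. It is not the authors' implementation. The loss forms (binary cross-entropy for the discriminator, mean squared error plus a weighted adversarial term for the generator), the weight `adv_weight`, and all function names are assumptions chosen for illustration. Log-sum-exp pooling is a standard adjustable MIL pooling used here only as a stand-in; the abstract does not specify the proposed function.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for one probability, clipped to avoid log(0)."""
    eps = 1e-7
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def mse(pred, truth):
    """Mean squared error between two flattened heatmaps of equal length."""
    return sum((a - b) ** 2 for a, b in zip(pred, truth)) / len(pred)


def discriminator_loss(d_real, d_fake):
    """Push the score for ground-truth heatmaps toward 1 and for generated ones toward 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(pred, truth, d_fake, adv_weight=0.01):
    """Heatmap regression plus an adversarial term that rewards fooling the discriminator.

    adv_weight is a hypothetical value, not taken from the paper.
    """
    return mse(pred, truth) + adv_weight * bce(d_fake, 1.0)


def lse_pool(instance_scores, r=5.0):
    """Log-sum-exp pooling of instance scores into one bag score.

    Small r behaves like the mean, large r like the max. Stand-in only:
    this is not the pooling function proposed in the paper.
    """
    m = max(instance_scores)
    total = sum(math.exp(r * (s - m)) for s in instance_scores)
    return m + math.log(total / len(instance_scores)) / r


if __name__ == "__main__":
    truth = [0.0, 1.0, 0.0, 0.0]   # toy flattened ground-truth heatmap
    pred = [0.1, 0.7, 0.1, 0.1]    # toy generated heatmap
    print(discriminator_loss(d_real=0.9, d_fake=0.1))
    print(generator_loss(pred, truth, d_fake=0.1))
    print(lse_pool([0.1, 0.9], r=5.0))
```

In this sketch the generator's loss falls as the discriminator's score for a generated heatmap rises, which is the mechanism by which the adversarial signal would steer the generator toward plausible body configurations. The parameter `r` makes the pooling adjustable between averaging all instances and selecting the strongest one.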

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 1s
Special Issue on Multimodal Machine Learning for Human Behavior Analysis and Special Issue on Computational Intelligence for Biomedical Data and Imaging
January 2020
376 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3388236

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 17 April 2020
• Accepted: 1 August 2019
• Revised: 1 July 2019
• Received: 1 March 2019

Qualifiers

• research-article
• Research
• Refereed