Part-wise Spatio-temporal Attention Driven CNN-based 3D Human Action Recognition

Published: 22 July 2021

Abstract

Recently, human activity recognition using skeleton data has grown in popularity due to its ease of acquisition and finer shape details. Still, it suffers from a wide range of intra-class variation, inter-class similarity among actions, and view variation, which makes the extraction of discriminative spatial and temporal features a challenging problem. In this regard, we present a novel Residual Inception Attention Driven CNN (RIAC-Net), which visualizes the dynamics of the action in a part-wise manner. The complete skeleton is partitioned into five key parts: Head to Spine, Left Leg, Right Leg, Left Hand, and Right Hand. For each part, a Compact Action Skeleton Sequence (CASS) is defined. Part-wise skeleton-based motion dynamics highlight discriminative local features of the skeleton, which helps to overcome the challenges of inter-class similarity and intra-class variation and improves recognition performance. The RIAC-Net architecture is inspired by the concept of inception-residual representation, unifying Attention Driven Residues (ADR) with inception-based Spatio-Temporal Convolution Features (STCF) to learn salient action features efficiently. An ablation study is also carried out to analyze the effect of ADR over a simple residue-based action representation. The robustness of the proposed framework is evaluated through extensive experiments on four challenging datasets: UT Kinect Action 3D, Florence 3D Action, MSR Daily Action 3D, and NTU RGB-D, which consistently demonstrate the superiority of the proposed method over other state-of-the-art methods.
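The part-wise decomposition described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the joint-index groups in `PARTS` are hypothetical (a real mapping depends on the capture device's joint layout), and `compact_part_sequences` merely slices a skeleton sequence into the five per-part sequences from which a CASS would be built.

```python
import numpy as np

# Hypothetical 20-joint layout; these index groups are illustrative
# assumptions, not the paper's exact joint-to-part mapping.
PARTS = {
    "head_to_spine": [0, 1, 2, 3],
    "left_hand":     [4, 5, 6, 7],
    "right_hand":    [8, 9, 10, 11],
    "left_leg":      [12, 13, 14, 15],
    "right_leg":     [16, 17, 18, 19],
}

def compact_part_sequences(skeleton_seq):
    """Split a (T, J, 3) skeleton sequence into five part-wise sequences.

    Each returned array has shape (T, len(part_joints), 3), i.e. the
    per-part sub-sequence from which a Compact Action Skeleton
    Sequence (CASS) can be derived.
    """
    return {name: skeleton_seq[:, idx, :] for name, idx in PARTS.items()}

# Example: a random 30-frame, 20-joint, 3D skeleton sequence.
seq = np.random.rand(30, 20, 3)
parts = compact_part_sequences(seq)
print(parts["left_leg"].shape)  # (30, 4, 3)
```

Each part-wise sequence would then feed its own spatio-temporal convolution branch, so that local motion (e.g. of one leg) is modeled independently before fusion.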

