Abstract
Recently, human activity recognition using skeleton data has been gaining attention owing to its ease of acquisition and finer shape details. Still, it suffers from wide intra-class variation, inter-class similarity among actions, and view variation, which makes the extraction of discriminative spatial and temporal features a challenging problem. In this regard, we present a novel Residual Inception Attention Driven CNN (RIAC-Net), which visualizes the dynamics of an action in a part-wise manner. The complete skeleton is partitioned into five key parts: Head to Spine, Left Leg, Right Leg, Left Hand, and Right Hand. For each part, a Compact Action Skeleton Sequence (CASS) is defined. Part-wise skeleton-based motion dynamics highlight discriminative local features of the skeleton, which helps to overcome the challenges of inter-class similarity and intra-class variation and improves recognition performance. The RIAC-Net architecture is inspired by the concept of inception-residual representation and unifies Attention Driven Residues (ADR) with inception-based Spatio-Temporal Convolution Features (STCF) to learn salient action features efficiently. An ablation study is also carried out to analyze the effect of ADR over a simple residue-based action representation. The robustness of the proposed framework is evaluated through extensive experiments on four challenging datasets: UT Kinect Action 3D, Florence 3D Action, MSR Daily Action 3D, and NTU RGB+D, on which the proposed method consistently outperforms other state-of-the-art methods.
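The part-wise decomposition described above can be sketched in a few lines. This is a minimal illustration only: the joint indices below assume a 25-joint layout (as in NTU RGB+D) and are hypothetical groupings, since the abstract does not specify the exact joint-to-part assignment used by RIAC-Net.

```python
import numpy as np

# Hypothetical joint indices for a 25-joint skeleton (NTU RGB+D-style layout).
# These groupings are illustrative assumptions, not the paper's exact partition.
PARTS = {
    "head_to_spine": [3, 2, 20, 1, 0],
    "left_hand":     [4, 5, 6, 7, 21, 22],
    "right_hand":    [8, 9, 10, 11, 23, 24],
    "left_leg":      [12, 13, 14, 15],
    "right_leg":     [16, 17, 18, 19],
}

def compact_action_skeleton_sequence(skeleton, part):
    """Extract the per-part joint trajectory (a CASS-style sub-sequence).

    skeleton: array of shape (frames, joints, 3) holding 3D joint positions.
    Returns an array of shape (frames, len(PARTS[part]), 3).
    """
    return skeleton[:, PARTS[part], :]

# Example: a 30-frame sequence of a 25-joint skeleton.
seq = np.random.rand(30, 25, 3)
cass = {name: compact_action_skeleton_sequence(seq, name) for name in PARTS}
print(cass["left_leg"].shape)  # (30, 4, 3)
```

Each of the five sub-sequences would then be fed to its own spatio-temporal feature extractor before fusion, which is how the part-wise dynamics isolate discriminative local motion.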
Index Terms
Part-wise Spatio-temporal Attention Driven CNN-based 3D Human Action Recognition