Abstract
In this work, we propose a dual-stream structured graph convolution network (DS-SGCN) for skeleton-based action recognition. The spatio-temporal coordinates and appearance contexts of the skeletal joints are jointly integrated into the graph convolution learning process on both the video and skeleton modalities. To effectively represent the skeletal graph of discrete joints, we design a structured graph convolution module that encodes partitioned body parts along with their dynamic interactions in the spatio-temporal sequence. Specifically, we build a set of structured intra-part graphs, each of which represents a distinctive body part (e.g., left arm, right leg, head). An inter-part graph is then constructed to model the dynamic interactions across different body parts: each node corresponds to one of the intra-part graphs, and an edge between two nodes expresses the relationship between the corresponding body parts during human movement. We apply graph convolution learning to both the intra- and inter-part graphs to capture the inherent characteristics and the dynamic interactions of human action, respectively. After integrating the intra- and inter-part levels of spatial context/coordinate cues, a convolution filtering process is conducted over time slices to capture the temporal dynamics of human motion. Finally, we fuse the two streams of graph convolution responses to predict the action category in an end-to-end fashion. Comprehensive experiments on five single/multi-modal benchmark datasets (NTU RGB+D 60, NTU RGB+D 120, MSR-Daily 3D, N-UCLA, and HDM05) demonstrate that the proposed DS-SGCN framework achieves encouraging performance on the skeleton-based action recognition task.
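The pipeline described in the abstract (intra-part graph convolution, inter-part graph convolution, temporal filtering, then two-stream fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 7-joint skeleton, the chain topology inside each part, the fully connected inter-part graph, mean pooling, the box temporal kernel, the score-averaging fusion, and every name below (`PARTS`, `structured_stream`, `init_params`, `predict`) are assumptions made for illustration. Weights are random and no training is performed.

```python
import numpy as np

# Hypothetical toy skeleton: 7 joints grouped into 3 body parts (not the paper's layout).
PARTS = {"head": [0], "left_arm": [1, 2, 3], "right_arm": [4, 5, 6]}

def normalize(adj):
    """Symmetrically normalize an adjacency matrix after adding self-loops."""
    a = adj + np.eye(adj.shape[0])
    d = a.sum(axis=1) ** -0.5
    return a * d[:, None] * d[None, :]

def chain_adjacency(n):
    """Adjacency of a simple chain of n joints (assumed intra-part topology)."""
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = adj[idx + 1, idx] = 1.0
    return adj

def graph_conv(x, adj, w):
    """One graph convolution layer, ReLU(A_hat @ X @ W), applied per time slice."""
    return np.maximum(normalize(adj) @ x @ w, 0.0)

def structured_stream(x, w_intra, w_inter, kernel):
    """x: (T, N, C) per-joint features of one stream; returns a (D,) descriptor."""
    part_feats = []
    for joints in PARTS.values():
        # Intra-part graph convolution on the joints of a single body part.
        h = graph_conv(x[:, joints, :], chain_adjacency(len(joints)), w_intra)
        part_feats.append(h.mean(axis=1))        # pool joints into one part node
    parts = np.stack(part_feats, axis=1)         # (T, P, D)
    p = parts.shape[1]
    inter_adj = np.ones((p, p)) - np.eye(p)      # assumed: every part linked to every other
    h = graph_conv(parts, inter_adj, w_inter)    # inter-part graph convolution
    # Temporal filtering: slide a 1D kernel along the time axis.
    k = len(kernel)
    out = np.stack([(h[i:i + k] * kernel[:, None, None]).sum(axis=0)
                    for i in range(h.shape[0] - k + 1)])
    return out.mean(axis=(0, 1))                 # global average over time and parts

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init_params(rng, c_in, d=8, n_cls=5):
    """Random, untrained weights for one stream (illustrative only)."""
    return {"w_intra": rng.normal(size=(c_in, d)),
            "w_inter": rng.normal(size=(d, d)),
            "kernel": np.ones(3) / 3.0,
            "w_cls": rng.normal(size=(d, n_cls))}

def predict(coords, appearance, params):
    """Fuse the coordinate and appearance streams by averaging class probabilities."""
    scores = []
    for x, p in zip((coords, appearance), params):
        feat = structured_stream(x, p["w_intra"], p["w_inter"], p["kernel"])
        scores.append(softmax(feat @ p["w_cls"]))
    return np.mean(scores, axis=0)

rng = np.random.default_rng(0)
coords = rng.normal(size=(8, 7, 3))        # 8 frames, 7 joints, 3D coordinates
appearance = rng.normal(size=(8, 7, 16))   # 8 frames, 7 joints, 16-dim appearance cue
params = [init_params(rng, 3), init_params(rng, 16)]
probs = predict(coords, appearance, params)
print(probs.shape, probs.sum())
```

Late fusion by averaging class probabilities is used here only because it is the simplest way to combine two streams. The abstract states that the streams are fused and trained end to end but does not say how, so this choice should not be read as the paper's fusion scheme.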
Index Terms
Dual-Stream Structured Graph Convolution Network for Skeleton-Based Action Recognition