Abstract
Action recognition is an active topic in computer vision because of its wide application in vision systems. Previous approaches improve accuracy by fusing the skeleton sequence with RGB video. However, the high complexity of the RGB video network forces such methods to trade accuracy against efficiency. To solve this problem, we propose a multi-modality feature fusion network that combines the skeleton sequence with a single RGB frame instead of the RGB video, since the key information carried by the skeleton sequence together with one RGB frame is close to that carried by the skeleton sequence together with the RGB video. In this way, complementary information is retained while complexity is reduced by a large margin. To better exploit the correspondence between the two modalities, the network uses a two-stage fusion framework. In the early fusion stage, a skeleton attention module projects the skeleton sequence onto the single RGB frame so that the RGB branch focuses on the regions of limb movement. In the late fusion stage, a cross-attention module fuses the skeleton feature and the RGB feature by exploiting their correlation. Experiments on two benchmarks, NTU RGB+D and SYSU, show that the proposed model achieves performance competitive with state-of-the-art methods while reducing the complexity of the network.
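The two fusion stages can be illustrated with a minimal sketch. The abstract does not specify how either module is built, so everything below is an assumption made for illustration: the function names, the Gaussian heat map weighted by per-joint motion, the choice of the middle frame as the projection target, and the use of plain scaled dot-product attention with RGB features as queries over skeleton features. It is not the authors' implementation.

```python
import math


def skeleton_attention_mask(joints, height, width, sigma=2.0):
    """Early fusion (hypothetical): project a 2D skeleton sequence onto one frame.

    joints: list over frames, each a list of (x, y) pixel coordinates.
    Each joint adds a Gaussian bump at its middle-frame position, weighted by
    the distance the joint travels over the sequence (an assumed heuristic).
    """
    num_joints = len(joints[0])
    # Total path length of each joint across consecutive frames.
    motion = [
        sum(math.dist(joints[t][j], joints[t + 1][j]) for t in range(len(joints) - 1))
        for j in range(num_joints)
    ]
    peak = max(motion) or 1.0  # avoid dividing by zero for a static skeleton
    weights = [m / peak for m in motion]
    centers = joints[len(joints) // 2]
    mask = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            mask[y][x] = max(
                w * math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
                for w, (cx, cy) in zip(weights, centers)
            )
    return mask


def apply_mask(frame, mask):
    """Weight a single-channel H x W frame by the attention mask."""
    return [[p * m for p, m in zip(row, mrow)] for row, mrow in zip(frame, mask)]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Late fusion (hypothetical): scaled dot-product attention.

    Each query vector (e.g. an RGB feature) attends over key/value vectors
    (e.g. skeleton features) and returns a correlation-weighted mixture.
    """
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        fused.append(
            [sum(a * v[d] for a, v in zip(attn, values)) for d in range(len(values[0]))]
        )
    return fused


# Toy data: joint 0 is static, joint 1 moves two pixels to the right.
toy_joints = [[(1, 1), (4, 4)], [(1, 1), (5, 4)], [(1, 1), (6, 4)]]
toy_mask = skeleton_attention_mask(toy_joints, height=8, width=8)
toy_frame = [[1.0] * 8 for _ in range(8)]
toy_focused = apply_mask(toy_frame, toy_mask)
```

In this toy setup the mask peaks at the moving joint and stays close to zero around the static one, which is the behaviour the early fusion stage is described as encouraging. A real model would learn these weights and operate on feature maps rather than raw pixels.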
Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition