Abstract
Local key features in video are important for improving the accuracy of human action recognition. However, most end-to-end methods focus on learning global features from videos, and few consider enhancing the local information within a feature. In this article, we discuss how to automatically enhance the discriminability of local information in an action feature and thereby improve action recognition accuracy. To address this problem, we assume that the critical level of each region for the recognition task differs across regions and does not change when the region locations are shuffled. We therefore propose a novel action recognition method called the shuffle-invariant network. In the proposed method, shuffled videos are generated by cutting frames into regular regions and randomly permuting them to augment the input data. The network adopts a multitask framework comprising one feature backbone network and three task branches: local critical feature shuffle-invariant learning, adversarial learning, and an action classification network. To enhance the local features, the feature response of each region is predicted by a local critical feature learning network. To train this network, an L1-based critical feature shuffle-invariant loss is defined to ensure that the ranked list of region feature responses remains unchanged after the region locations are shuffled. Adversarial learning is then applied to eliminate the noise introduced by the region shuffle. Finally, the action classification network combines these two tasks to jointly guide the training of the feature backbone network and obtain more effective action features. In the testing phase, only the action classification network is applied to identify the action category of the input video. We evaluate the proposed method on the HMDB51 and UCF101 action datasets, and conduct several ablation experiments to verify the effectiveness of each module. The experimental results show that our approach achieves state-of-the-art performance.
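As a concrete illustration of the core idea, the region shuffle and the L1-based shuffle-invariant loss described above can be sketched in plain Python. This is a minimal sketch with hypothetical helper names; in the actual method, the responses would be predicted by the local critical feature learning network from learned feature maps, not computed on raw pixel grids.

```python
import random

def shuffle_regions(frame, grid, rng):
    """Cut a square frame (list of lists) into grid x grid equal regions
    and randomly permute their locations.
    Returns the shuffled frame and the permutation applied:
    perm[j] is the original index of the region placed at slot j."""
    n = len(frame)
    s = n // grid  # region side length; assumes n is divisible by grid
    regions = []
    for gy in range(grid):
        for gx in range(grid):
            regions.append([row[gx * s:(gx + 1) * s]
                            for row in frame[gy * s:(gy + 1) * s]])
    perm = list(range(grid * grid))
    rng.shuffle(perm)
    shuffled = [[0] * n for _ in range(n)]
    for slot, orig in enumerate(perm):
        gy, gx = divmod(slot, grid)
        for r in range(s):
            for c in range(s):
                shuffled[gy * s + r][gx * s + c] = regions[orig][r][c]
    return shuffled, perm

def shuffle_invariant_l1(resp_orig, resp_shuf, perm):
    """L1 shuffle-invariant loss: the response predicted for shuffled
    slot j should equal the response of the region that originally
    occupied index perm[j], so the region ranking is location-invariant."""
    k = len(perm)
    return sum(abs(resp_shuf[j] - resp_orig[perm[j]]) for j in range(k)) / k
```

If the network is perfectly shuffle-invariant, each region keeps its predicted response wherever it is placed and the loss is zero; any location-dependent drift in a region's response is penalized linearly.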
Index Terms
Shuffle-invariant Network for Action Recognition in Videos