Abstract
Many existing action recognition methods rely mainly on information from the spatial stream. We propose a new perspective, inspired by the human visual system, that combines the spatial and temporal streams and measures their attention consistency. Specifically, we develop a branch-independent convolutional neural network (CNN) with a novel attention-consistency loss, which encourages the temporal stream to attend to the same discriminative regions as the spatial stream over the same period. This consistency loss is combined with the cross-entropy loss to enforce visual attention consistency. We evaluate the proposed method on two benchmark action recognition datasets: Kinetics400 and UCF101. Despite its apparent simplicity, our framework with attention consistency outperforms most two-stream networks, achieving 75.7% top-1 accuracy on Kinetics400 and 95.7% on UCF101, while reducing computational cost by 7.1% compared with our baseline. In particular, our method attains remarkable improvements on complex action classes, suggesting that the proposed network can serve as a potential benchmark for handling complicated scenarios in Industry 4.0 applications.
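The combined objective described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the choice of mean-squared distance between normalized attention maps, the weighting factor `lam`, and the single-example cross-entropy are all assumptions made for the sketch.

```python
import numpy as np

def cross_entropy(logits, label):
    # Standard softmax cross-entropy for one example (numerically stabilized).
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

def attention_consistency(att_spatial, att_temporal, eps=1e-8):
    # Mean-squared difference between the two streams' attention maps,
    # each normalized to sum to one so the comparison is scale-invariant.
    a = att_spatial / (att_spatial.sum() + eps)
    b = att_temporal / (att_temporal.sum() + eps)
    return float(((a - b) ** 2).mean())

def total_loss(logits, label, att_spatial, att_temporal, lam=1.0):
    # Cross-entropy plus the attention-consistency penalty, weighted by lam.
    return cross_entropy(logits, label) + lam * attention_consistency(
        att_spatial, att_temporal
    )
```

If the two streams attend to identical regions, the consistency term vanishes and the loss reduces to plain cross-entropy; diverging attention maps add a penalty that pushes the temporal stream toward the spatial stream's discriminative regions.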
Exploiting Attention-Consistency Loss For Spatial-Temporal Stream Action Recognition