Moving Foreground-Aware Visual Attention and Key Volume Mining for Human Action Recognition

Abstract
Many deep learning approaches have recently shown remarkable progress on human action recognition. However, it remains unclear how to extract useful information from videos, since only video-level labels are available during training. To address this limitation, many efforts have incorporated visual attention mechanisms into deep models to improve action recognition. In this article, we propose a novel deep model, Moving Foreground Attention (MFA), that improves action recognition by guiding the model to focus on discriminative foreground targets. MFA detects the moving foreground through a proposed variance-based algorithm. Meanwhile, an unsupervised proposal is used to mine action-related key volumes and to generate corresponding correlation scores. Based on these scores, a newly proposed stochastic-out scheme is used to train the MFA. Experimental results show that our proposed techniques significantly improve action recognition performance, and our model achieves state-of-the-art results on UCF101 and HMDB51.
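The abstract does not spell out the variance-based detector, but the general idea it names can be illustrated with a minimal sketch: pixels of a static background show low intensity variance across a short temporal window, while moving-foreground pixels vary strongly. The function name and the adaptive threshold below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def moving_foreground_mask(frames, threshold=None):
    """Flag pixels whose intensity varies strongly over time.

    frames: float array of shape (T, H, W), a short grayscale clip.
    Returns a boolean (H, W) mask of candidate moving-foreground pixels.
    """
    variance = frames.var(axis=0)  # per-pixel temporal variance over T frames
    if threshold is None:
        # Adaptive cut-off; a hypothetical choice, not taken from the paper.
        threshold = variance.mean() + variance.std()
    return variance > threshold
```

Similarly, one plausible reading of the stochastic-out scheme is that, at each training step, candidate key volumes are kept or dropped at random with probability driven by their correlation scores, so the network trains mostly on action-related volumes. Again, this is a hedged sketch of that reading, not the authors' published procedure:

```python
def stochastic_out(scores, keep_k, rng=None):
    """Sample keep_k candidate volumes, favoring high correlation scores.

    scores: 1-D array of non-negative scores from the key-volume mining step.
    Returns the indices of the volumes retained for this training step.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = scores / scores.sum()  # turn scores into a sampling distribution
    return rng.choice(len(scores), size=keep_k, replace=False, p=p)
```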