Moving Foreground-Aware Visual Attention and Key Volume Mining for Human Action Recognition

Published: 08 August 2019

Abstract

Recently, many deep learning approaches have shown remarkable progress on human action recognition. However, it remains unclear how to extract useful information from videos, since only video-level labels are available during training. To address this limitation, many efforts have been made to improve action recognition by applying visual attention mechanisms in deep learning models. In this article, we propose a novel deep model called Moving Foreground Attention (MFA) that enhances action recognition by guiding the model to focus on discriminative foreground targets. In our work, MFA detects the moving foreground through a proposed variance-based algorithm. Meanwhile, an unsupervised proposal method is used to mine action-related key volumes and generate corresponding correlation scores. Based on these scores, a newly proposed stochastic-out scheme is exploited to train the MFA. Experimental results show that action recognition performance can be significantly improved by our proposed techniques, and our model achieves state-of-the-art performance on UCF101 and HMDB51.
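The abstract does not spell out the variance-based foreground detector, but the general idea can be illustrated with a minimal sketch, assuming a simple formulation: mark pixels whose intensity varies strongly across a short clip as moving foreground, and use the resulting mask to restrict attention. The function and parameter names (moving_foreground_mask, var_threshold) and the thresholding rule are hypothetical illustrations, not the paper's actual algorithm.

import numpy as np

def moving_foreground_mask(clip: np.ndarray, var_threshold: float = 15.0) -> np.ndarray:
    """Illustrative sketch (not the paper's exact method): flag pixels whose
    intensity varies strongly over time as moving foreground.

    clip: array of shape (T, H, W) holding grayscale frames of one snippet.
    Returns a binary mask of shape (H, W).
    """
    # Per-pixel variance along the temporal axis: static background pixels
    # have low variance, moving foreground pixels have high variance.
    temporal_var = clip.astype(np.float32).var(axis=0)
    return temporal_var > var_threshold

# Toy usage: a synthetic clip in which a bright square moves across the frame.
T, H, W = 8, 64, 64
clip = np.zeros((T, H, W), dtype=np.float32)
for t in range(T):
    clip[t, 20:30, 5 + 6 * t : 15 + 6 * t] = 255.0
mask = moving_foreground_mask(clip)
print("foreground pixels:", int(mask.sum()))

In a full pipeline, such a mask would typically be resized to the spatial resolution of a convolutional feature map and used to weight or gate the attention over that map; how MFA actually fuses the mask with attention is described in the article itself.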

