research-article

Action Recognition Using Form and Motion Modalities

Published: 17 April 2020

Abstract

Action recognition has attracted increasing interest in computer vision due to its potential applications in many vision systems. One of the main challenges in action recognition is extracting powerful features from videos. Most existing approaches exploit either hand-crafted techniques or learning-based methods to extract features, but these methods focus mainly on dynamic motion features and ignore static form features, and therefore cannot fully and accurately capture the underlying information in videos. In this article, we propose a novel feature representation method for action recognition that exploits hierarchical sparse coding to learn the underlying features from videos. The learned features characterize form and motion simultaneously and therefore provide a more accurate and complete feature representation. The learned form and motion features are treated as two modalities, representing the static and dynamic content of a video, respectively. These modalities are further encoded into a global representation via pairwise dictionary learning and then fed to an SVM classifier for action classification. Experimental results on several challenging datasets validate that the proposed method is superior to several state-of-the-art methods.
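The pipeline the abstract describes (modality-specific feature learning, fusion of the form and motion codes into a global representation, and a final SVM stage) can be sketched with off-the-shelf tools. This is a minimal illustration, not the authors' implementation: the data is synthetic, the dictionary sizes are arbitrary, and learning one dictionary per modality is a simplified stand-in for the paper's pairwise dictionary learning.

```python
# Hedged sketch of the form/motion pipeline using scikit-learn.
# All features, sizes, and labels below are synthetic placeholders.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic per-video descriptors: "form" (static appearance) and "motion".
n_videos, d_form, d_motion = 60, 32, 32
form = rng.standard_normal((n_videos, d_form))
motion = rng.standard_normal((n_videos, d_motion))
labels = rng.integers(0, 3, size=n_videos)  # 3 hypothetical action classes


def encode(features, n_atoms=16):
    """Learn a dictionary for one modality and return the sparse codes."""
    dl = DictionaryLearning(n_components=n_atoms, alpha=1.0,
                            max_iter=20, random_state=0)
    return dl.fit_transform(features)  # shape: (n_videos, n_atoms)


# Fuse the two modalities into one global video representation.
codes = np.hstack([encode(form), encode(motion)])

# Final stage: a linear SVM classifier, as in the paper.
clf = SVC(kernel="linear").fit(codes, labels)
print(clf.score(codes, labels))  # training accuracy of this toy sketch
```

On real data, the descriptors would come from the paper's hierarchical sparse coding applied to video frames and frame differences rather than from random draws, and the classifier would be evaluated on held-out videos.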



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 1s
  Special Issue on Multimodal Machine Learning for Human Behavior Analysis and Special Issue on Computational Intelligence for Biomedical Data and Imaging
  January 2020, 376 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3388236

      Copyright © 2020 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 17 April 2020
      • Accepted: 1 July 2019
      • Revised: 1 June 2019
      • Received: 1 March 2019
Published in TOMM Volume 16, Issue 1s


      Qualifiers

      • research-article
      • Research
      • Refereed
