Abstract
Action recognition has attracted increasing interest in computer vision due to its potential applications in many vision systems. One of the main challenges in action recognition is extracting powerful features from videos. Most existing approaches rely on either hand-crafted techniques or learning-based methods, but both mainly extract dynamic motion features while ignoring static form features, and therefore cannot fully capture the underlying information in videos. In this article, we propose a novel feature representation method for action recognition that exploits hierarchical sparse coding to learn the underlying features from videos. The learned features characterize form and motion simultaneously and thus provide a more accurate and complete representation. The learned form and motion features are treated as two modalities, which are further encoded into a global representation via pairwise dictionary learning and then fed to an SVM classifier for action classification. Experimental results on several challenging datasets validate that the proposed method is superior to several state-of-the-art methods.
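The pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: it stands in for the hierarchical sparse coding and pairwise dictionary learning with a single learned dictionary over synthetic frame-level descriptors, max-pools the sparse codes into a video-level vector, and classifies with a linear SVM. All data shapes and hyperparameters here are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's method): sparse-code local descriptors
# over a learned dictionary, max-pool codes per video, classify with an SVM.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

# Toy stand-ins for local form/motion descriptors: 40 "videos",
# each with 30 frame-level descriptors of dimension 16.
n_videos, n_frames, dim = 40, 30, 16
videos = rng.randn(n_videos, n_frames, dim)
labels = rng.randint(0, 2, size=n_videos)

# Learn a sparse-coding dictionary on all descriptors pooled together.
dico = DictionaryLearning(n_components=24, alpha=1.0, max_iter=10,
                          transform_algorithm="lasso_lars", random_state=0)
dico.fit(videos.reshape(-1, dim))

def video_feature(frames):
    """Encode frame descriptors sparsely, then max-pool over time."""
    codes = dico.transform(frames)      # (n_frames, n_components)
    return np.abs(codes).max(axis=0)    # one video-level vector

X = np.stack([video_feature(v) for v in videos])  # (40, 24)
clf = LinearSVC(dual=False).fit(X, labels)
```

In the paper, separate form and motion dictionaries would replace the single dictionary here, and their codes would be fused via pairwise dictionary learning before the SVM stage.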
Action Recognition Using Form and Motion Modalities