Abstract
This article addresses the problem of recognizing partially observed human actions. Videos of actions acquired in the real world often contain corrupt frames caused by various factors. These frames may appear irregularly, leaving the actions only partially observed; they alter the appearance of the actions and degrade the performance of pretrained recognition systems. We propose an approach that handles corrupt frames without prior knowledge of their locations or durations. The approach comprises two key components: outlier filtering and observation completion. The former identifies and filters out unobserved frames; the latter fills in the filtered-out segments by retrieving coherent alternatives from the training data. Hidden Conditional Random Fields (HCRFs) are then used to recognize the filtered and completed actions. Our approach has been evaluated on three datasets containing both fully observed actions and partially observed actions with either real or synthetic corrupt frames. The experimental results show that our approach performs favorably against state-of-the-art methods, especially when corrupt frames are present.
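The two components can be illustrated with a minimal sketch over per-frame feature vectors. Everything here is an assumption for illustration: the function name, the distance-based outlier rule, and the median-scaled threshold are hypothetical, and the paper's actual filtering and completion models are more involved than a nearest-neighbor lookup.

```python
import numpy as np

def filter_and_complete(seq, train_frames, thresh=2.0):
    """Illustrative sketch of outlier filtering and observation completion.

    seq          -- (T, d) per-frame features of a test video
    train_frames -- (N, d) pooled per-frame features from training videos
    thresh       -- hypothetical cutoff, in multiples of the median
                    nearest-neighbor distance
    """
    # Squared distance from each test frame to every training frame.
    d2 = ((seq[:, None, :] - train_frames[None, :, :]) ** 2).sum(axis=-1)
    nn_idx = d2.argmin(axis=1)
    nn_dist = np.sqrt(d2.min(axis=1))

    # Outlier filtering: frames far from every training frame are
    # treated as corrupt / unobserved.
    cutoff = thresh * np.median(nn_dist)
    corrupt = nn_dist > cutoff

    # Observation completion: fill each filtered frame with its
    # nearest coherent alternative retrieved from the training data.
    completed = seq.copy()
    completed[corrupt] = train_frames[nn_idx[corrupt]]
    return completed, corrupt
```

In the pipeline described above, the completed sequence would then be passed to an HCRF classifier rather than used directly.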
Recognizing Human Actions with Outlier Frames by Observation Filtering and Completion