Research Article

Recognizing Human Actions with Outlier Frames by Observation Filtering and Completion

Published: 28 June 2017

Abstract

This article addresses the problem of recognizing partially observed human actions. Videos of actions acquired in the real world often contain corrupt frames caused by various factors. These frames may appear irregularly and leave the actions only partially observed. They change the appearance of actions and degrade the performance of pretrained recognition systems. In this article, we propose an approach that addresses the corrupt-frame problem without knowing the locations and durations of the corrupt frames in advance. The proposed approach includes two key components: outlier filtering and observation completion. The former identifies and filters out unobserved frames, and the latter fills in the filtered parts by retrieving coherent alternatives from the training data. Hidden Conditional Random Fields (HCRFs) are then used to recognize the filtered and completed actions. Our approach has been evaluated on three datasets, which contain both fully observed actions and partially observed actions with either real or synthetic corrupt frames. The experimental results show that our approach performs favorably against other state-of-the-art methods, especially when corrupt frames are present.
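To make the filter-then-complete idea above concrete, here is a minimal Python sketch, offered only as an illustration and not as the authors' algorithm: it flags outlier frames by their nearest-neighbor distance to a pooled bank of per-frame training features and replaces them with training frames that are coherent with the surviving temporal neighbors. The feature representation, the z-score thresholding rule, and all function and variable names are assumptions, and the HCRF classification stage described in the abstract is omitted.

```python
import numpy as np

def filter_and_complete(seq, train_bank, outlier_thresh=2.0):
    """Toy sketch of outlier filtering and observation completion.

    seq        : (T, D) array of per-frame features for one test video.
    train_bank : (N, D) array of per-frame features pooled from training videos.
    Frames whose nearest training frame is unusually far away are treated as
    outliers and replaced by a training frame coherent with the clean neighbors.
    """
    # Distance from every test frame to its nearest training frame.
    d = np.linalg.norm(seq[:, None, :] - train_bank[None, :, :], axis=2)
    nearest = d.min(axis=1)

    # Flag frames whose nearest-neighbor distance is an outlier (z-score test).
    z = (nearest - nearest.mean()) / (nearest.std() + 1e-8)
    is_outlier = z > outlier_thresh

    completed = seq.copy()
    for t in np.where(is_outlier)[0]:
        # Build a local context from the closest clean neighbors in time.
        left = next((i for i in range(t - 1, -1, -1) if not is_outlier[i]), None)
        right = next((i for i in range(t + 1, len(seq)) if not is_outlier[i]), None)
        ctx = [seq[i] for i in (left, right) if i is not None]
        target = np.mean(ctx, axis=0) if ctx else train_bank.mean(axis=0)
        # Retrieve the training frame most coherent with that context.
        completed[t] = train_bank[np.argmin(np.linalg.norm(train_bank - target, axis=1))]
    return completed, is_outlier

# Example on synthetic data (shapes only, no real features):
# seq = np.random.rand(40, 16); bank = np.random.rand(500, 16)
# clean_seq, outlier_mask = filter_and_complete(seq, bank)
```

In the paper's pipeline, the completed sequence would then be passed to a pretrained HCRF for action classification; the sketch stops at the completion step.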

