Abstract
In this article, we present a novel approach to segmenting discriminative patches in human activity videos. First, we adopt spatio-temporal interest points (STIPs) to represent significant motion patterns in the video sequence. Then, nonnegative sparse coding is exploited to generate a sparse representation of each STIP descriptor. We construct the feature vector for each video by applying a two-stage sum-pooling and l2-normalization operation. After training a multi-class classifier with the error-correcting-code SVM, the discriminative portion of each video is identified as the patch that attains the highest classification confidence while also being correctly assigned to the video's category. Experimental results show that the video patches extracted by our method are more separable, while preserving the perceptually relevant portion of each activity.
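The two-stage sum-pooling and l2-normalization step can be sketched as below. This is an illustrative reconstruction only, not the authors' implementation: the function names, array shapes, and dictionary size are assumptions, and the nonnegative sparse codes are taken as given (e.g., produced by a nonnegative sparse coder such as SPAMS).

```python
import numpy as np

def pool_and_normalize(codes):
    """Sum-pool a set of nonnegative sparse codes into a single
    vector, then l2-normalize it.

    codes: array of shape (n_items, dict_size), entries >= 0.
    Returns a unit-l2-norm vector of shape (dict_size,).
    """
    pooled = codes.sum(axis=0)          # stage pooling: sum over items
    norm = np.linalg.norm(pooled)       # l2 norm of the pooled vector
    return pooled / norm if norm > 0 else pooled

def video_feature(patch_code_sets):
    """Two-stage variant: first pool the STIP codes inside each
    patch, then pool the per-patch vectors over the whole video
    and normalize again.

    patch_code_sets: list of (n_stips_in_patch, dict_size) arrays.
    """
    patch_vecs = np.stack([pool_and_normalize(c) for c in patch_code_sets])
    return pool_and_normalize(patch_vecs)
```

The resulting per-video vectors are unit-length and nonnegative, so they can be fed directly to a linear multi-class SVM.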
Segmentation of Discriminative Patches in Human Activity Video