Segmentation of Discriminative Patches in Human Activity Video

Published: 24 August 2015

Abstract

In this article, we present a novel approach to segmenting discriminative patches in human activity videos. First, we adopt spatio-temporal interest points (STIPs) to represent significant motion patterns in the video sequence. Then, nonnegative sparse coding is used to obtain a sparse representation of each STIP descriptor. We construct the feature vector for each video by applying a two-stage sum-pooling and ℓ2-normalization operation. After training a multiclass classifier with an error-correcting output code (ECOC) SVM, the discriminative portion of each video is selected as the patch that receives the highest classification confidence while also being correctly classified into the video's category. Experimental results show that the video patches extracted by our method are more separable, while preserving the perceptually relevant portion of each activity.
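The core of the pipeline described above, sparse coding of local descriptors followed by sum-pooling and ℓ2-normalization, can be sketched in a few lines of numpy. This is a simplified illustration, not the authors' implementation: it uses a plain projected-gradient solver for the nonnegative lasso and a single pooling stage (the paper applies two-stage sum-pooling), and the dictionary `D` is assumed to be pre-learned with normalized columns.

```python
import numpy as np

def nonneg_sparse_code(x, D, lam=0.1, n_iter=200):
    """Nonnegative sparse code of descriptor x over dictionary D
    (columns = atoms), via projected gradient on the lasso objective
        min_{a >= 0}  0.5 * ||x - D a||^2 + lam * sum(a).
    """
    a = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D.T @ D, 2)  # 1/Lipschitz constant
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x) + lam       # gradient of smooth + l1 part
        a = np.maximum(a - step * grad, 0.0) # project back onto a >= 0
    return a

def video_feature(codes):
    """Sum-pool the sparse codes of all STIPs in a video, then
    l2-normalize to obtain the video-level feature vector."""
    pooled = codes.sum(axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled
```

The resulting feature vectors would then be fed to a multiclass classifier (e.g., an ECOC SVM, as in the paper); the per-patch classifier confidence is what drives the discriminative-patch selection.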

