research-article

Structured Streaming Skeleton -- A New Feature for Online Human Gesture Recognition

Published: 01 October 2014

Abstract

Online human gesture recognition has a wide range of applications in computer vision, especially in human-computer interaction. The recent introduction of cost-effective depth cameras has opened a new line of research on body-movement gesture recognition. However, two major challenges remain: (i) how to continuously detect gestures in unsegmented streams, and (ii) how to distinguish different styles of the same gesture from other types of gestures. In this article, we solve these two problems with a new effective and efficient feature extraction method, Structured Streaming Skeleton (SSS), which uses a dynamic matching approach to construct a feature vector for each frame. Our comprehensive experiments on the MSRC-12 Kinect Gesture, Huawei/3DLife-2013, and MSR-Action3D datasets demonstrate performance superior to that of state-of-the-art approaches. We also demonstrate model selection based on the proposed SSS feature, and recommend squared-loss regression with ℓ2,1-norm regularization as the classifier for best performance.
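The idea of building one feature value per frame by dynamic matching can be illustrated with a small sketch. The abstract does not spell out the matching procedure, so the code below is an assumption: it uses a generic streaming subsequence dynamic time warping recurrence, in which each incoming frame updates the best warping distance between a gesture template and any stream subsequence ending at that frame. The names `StreamingMatcher`, `frame_cost`, and `frame_features` are hypothetical, frames are reduced to short coordinate lists, and the squared Euclidean cost is chosen only for simplicity.

```python
import math


def frame_cost(a, b):
    """Squared Euclidean distance between two frames (an assumed cost)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class StreamingMatcher:
    """Tracks, frame by frame, the smallest DTW distance between one
    template and any stream subsequence ending at the current frame."""

    def __init__(self, template):
        self.template = template
        # Before any frame arrives, no template prefix has been matched.
        self.prev = [math.inf] * len(template)

    def update(self, frame):
        cur = []
        for i, q in enumerate(self.template):
            cost = frame_cost(frame, q)
            if i == 0:
                # A matching subsequence may start at any frame, so the
                # first template element never pays for earlier frames.
                best = 0.0
            else:
                # Standard DTW moves: stay on this frame, stay on this
                # template element, or advance both.
                best = min(cur[i - 1], self.prev[i], self.prev[i - 1])
            cur.append(cost + best)
        self.prev = cur
        # Distance for the full template ending at this frame.
        return cur[-1]


def frame_features(matchers, frame):
    """One distance per template: a per-frame feature vector."""
    return [m.update(frame) for m in matchers]


if __name__ == "__main__":
    # Toy one-dimensional "skeleton" stream and two hypothetical templates.
    matchers = [
        StreamingMatcher([[0.0], [1.0], [2.0]]),
        StreamingMatcher([[2.0], [1.0], [0.0]]),
    ]
    for frame in [[5.0], [0.0], [1.0], [2.0], [5.0]]:
        print(frame, frame_features(matchers, frame))
```

In this toy stream the first template's distance falls to zero at the frame where the rising pattern 0, 1, 2 completes, without the stream having been segmented beforehand. Each update costs time proportional to the template length, which is what makes a recurrence of this kind usable online.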

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 11, Issue 1s
Special Issue on Multiple Sensorial (MulSeMedia) Multimodal Media: Advances and Applications
September 2014
260 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/2675060

Copyright © 2014 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 October 2014
• Revised: 1 June 2014
• Accepted: 1 June 2014
• Received: 1 January 2014

Qualifiers

• research-article
• Research
• Refereed
