GraSP: Local Grassmannian Spatio-Temporal Patterns for Unsupervised Pose Sequence Recognition

Published: 04 March 2022

Abstract

Many applications of action recognition, especially in broad domains such as surveillance and anomaly detection, favor unsupervised methods because exhaustive labeling of actions is infeasible. However, very limited work exists in this direction, and the existing self-supervised approaches still depend on labeled data for fine-tuning. To this end, this paper puts forward a manifold-based unsupervised pose-sequence recognition approach that leverages only the natural biases present in the data. It works by clustering the projections of temporal derivatives of the fragmented data on the Grassmann manifold, where the temporal derivatives are formed from inter-frame gradients under local and global metrics. To complement this, a dynamic view-invariant pose representation is proposed, and a variable-aggregation step is introduced for better feature-vector quantization. Extensive empirical evaluation and ablations on several challenging datasets across three categories confirm the superiority of the proposed approach over current methods.
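The core idea sketched in the abstract, mapping pose-sequence fragments to subspaces on a Grassmann manifold via their temporal derivatives and then clustering with a geodesic metric, can be illustrated with a minimal numerical example. This is not the paper's exact pipeline (the fragment size, subspace dimension `k`, and the use of agglomerative clustering here are assumptions for illustration); it only shows the generic Grassmann machinery: inter-frame differencing, SVD-based subspace extraction, and the principal-angle distance.

```python
# Illustrative sketch (assumptions noted above, not GraSP itself):
# pose fragments -> Grassmann points -> geodesic distances -> clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def grassmann_point(fragment, k=3):
    """Map a fragment (T x D matrix of flattened poses) to a k-dim subspace.

    Temporal derivatives are approximated by inter-frame differences; the
    returned D x k orthonormal basis spans the top-k left singular vectors.
    """
    grads = np.diff(fragment, axis=0)           # (T-1) x D inter-frame gradients
    u, _, _ = np.linalg.svd(grads.T, full_matrices=False)
    return u[:, :k]

def geodesic_dist(u1, u2):
    """Grassmann geodesic distance via principal angles between subspaces."""
    s = np.linalg.svd(u1.T @ u2, compute_uv=False)
    theta = np.arccos(np.clip(s, -1.0, 1.0))    # principal angles
    return np.linalg.norm(theta)

rng = np.random.default_rng(0)
# Two synthetic "actions", 20 fragments each: 10 frames of a 45-dim pose
# (e.g. 15 joints x 3 coords), with motion energy in different coordinate blocks.
scales_a = np.diag(np.r_[np.ones(3) * 5.0, np.ones(42) * 0.1])
scales_b = np.diag(np.r_[np.ones(42) * 0.1, np.ones(3) * 5.0])
frags = [rng.standard_normal((10, 45)) @ scales_a for _ in range(20)]
frags += [rng.standard_normal((10, 45)) @ scales_b for _ in range(20)]

points = [grassmann_point(f) for f in frags]
n = len(points)
dmat = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dmat[i, j] = dmat[j, i] = geodesic_dist(points[i], points[j])

# Agglomerative clustering on the geodesic distance matrix.
labels = fcluster(linkage(squareform(dmat), method="average"),
                  t=2, criterion="maxclust")
```

Because the two synthetic groups concentrate their temporal gradients in nearly orthogonal subspaces, their principal angles approach π/2 across groups and stay near zero within a group, so the clustering recovers the two "actions" cleanly.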


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 3
August 2022, 478 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3505208


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 April 2021
• Revised: 1 October 2021
• Accepted: 1 October 2021
• Published: 4 March 2022

        Qualifiers

        • research-article
        • Refereed