Abstract
Many applications of action recognition, especially in broad domains such as surveillance or anomaly detection, favor unsupervised methods because exhaustive labeling of actions is not feasible. However, very little work has been done in this area. Moreover, existing self-supervised approaches still depend on labeled data for fine-tuning. To address this, this paper puts forward a manifold-based unsupervised pose-sequence recognition approach that leverages only the natural biases present in the data. It works by clustering the projections of temporal derivatives of the fragmented data on the Grassmann manifold. The temporal derivatives are formed from inter-frame gradients computed with local and global metrics. To complement this, a dynamic view-invariant pose representation is proposed. Additionally, a variable aggregation step is introduced for better feature vector quantization. Extensive empirical evaluation and ablations on several challenging datasets under three categories confirm the superiority of the proposed approach over current methods.
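The abstract does not spell out the computation, so the following is only a rough sketch of the general idea of comparing fragments of a pose sequence as points on a Grassmann manifold. It assumes that each pose is a flattened joint-coordinate vector, that plain first differences stand in for the temporal derivatives, that fragments are fixed-length and non-overlapping, and that the projection metric is used as the distance. The names `fragment_subspaces` and `projection_distance` and the parameters `frag_len` and `dim` are hypothetical and are not taken from the paper.

```python
import numpy as np


def fragment_subspaces(seq, frag_len=8, dim=2):
    """Map a pose sequence to a list of subspace bases, one per fragment.

    seq: array of shape (T, D), one flattened pose per frame (assumed layout).
    Returns a list of (D, dim) matrices with orthonormal columns.
    """
    # First differences between consecutive frames (stand-in for the
    # temporal derivatives mentioned in the abstract).
    diffs = np.diff(seq, axis=0)  # shape (T - 1, D)
    bases = []
    for start in range(0, len(diffs) - frag_len + 1, frag_len):
        frag = diffs[start:start + frag_len]  # shape (frag_len, D)
        # Left singular vectors of frag.T are orthonormal directions in
        # pose space; the leading `dim` of them span the fragment's subspace.
        u, _, _ = np.linalg.svd(frag.T, full_matrices=False)
        bases.append(u[:, :dim])
    return bases


def projection_distance(a, b):
    """Projection metric between two subspaces with orthonormal bases a, b.

    Equals sqrt(sum of sin^2 of the principal angles).
    """
    dim = a.shape[1]
    # Singular values of a.T @ b are the cosines of the principal angles.
    cosines = np.clip(np.linalg.svd(a.T @ b, compute_uv=False), 0.0, 1.0)
    return float(np.sqrt(max(dim - np.sum(cosines ** 2), 0.0)))
```

A pairwise distance matrix built with `projection_distance` could then be passed to any clustering routine that accepts precomputed distances; which clustering algorithm the paper actually uses is not stated in the abstract.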
GraSP: Local Grassmannian Spatio-Temporal Patterns for Unsupervised Pose Sequence Recognition