Abstract
Skeleton-based action recognition has recently seen rapid progress and strong performance. In this article, we investigate the problem under a cross-dataset setting, a new, pragmatic, and challenging task in real-world scenarios. Following the unsupervised domain adaptation (UDA) paradigm, action labels are available only on a source dataset and are unavailable on a target dataset during training. Unlike conventional adversarial learning-based approaches to UDA, we use a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets. Our inspiration is drawn from Cubism, an art genre from the early 20th century that breaks objects apart and reassembles them to convey a greater context. By segmenting and permuting temporal segments or human body parts, we design two self-supervised classification tasks that explore the temporal and spatial dependencies of a skeleton-based action and improve the generalization ability of the model. We conduct experiments on six datasets for skeleton-based action recognition, including three large-scale datasets (NTU RGB+D, PKU-MMD, and Kinetics), for which we establish new cross-dataset settings and benchmarks. Extensive results demonstrate that our method outperforms state-of-the-art approaches. The source code of our model and all compared methods is available at https://github.com/shanice-l/st-cubism.
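As a rough illustration of the segment-and-permute idea described in the abstract, the sketch below builds one training sample for each of the two self-supervised tasks. It is not the authors' implementation: the function names, the number of segments, the body-part grouping, and the choice to use the permutation index as the class label are all assumptions made for this example.

```python
import itertools
import random


def temporal_permutation_sample(frames, num_segments=3, rng=random):
    """Cut a frame sequence into segments, reorder them, and return
    (shuffled_frames, label), where label indexes the permutation used.
    Hypothetical helper; the segment count is an assumption."""
    perms = list(itertools.permutations(range(num_segments)))
    label = rng.randrange(len(perms))
    seg_len = len(frames) // num_segments
    # The first segments have equal length; the last one absorbs leftovers.
    segments = [frames[i * seg_len:(i + 1) * seg_len]
                for i in range(num_segments - 1)]
    segments.append(frames[(num_segments - 1) * seg_len:])
    shuffled = [f for idx in perms[label] for f in segments[idx]]
    return shuffled, label


def spatial_permutation_sample(frame, parts, rng=random):
    """Reorder the body-part blocks of a single frame and return
    (reordered_joints, label). `parts` is a list of joint-index lists;
    this grouping is an assumption, not taken from the paper."""
    perms = list(itertools.permutations(range(len(parts))))
    label = rng.randrange(len(perms))
    reordered = [frame[j] for idx in perms[label] for j in parts[idx]]
    return reordered, label


if __name__ == "__main__":
    rng = random.Random(0)
    frames = list(range(9))  # stand-in for 9 skeleton frames
    print(temporal_permutation_sample(frames, 3, rng))
    frame = ["head", "torso", "l_arm", "r_arm"]  # stand-in joints
    print(spatial_permutation_sample(frame, [[0, 1], [2, 3]], rng))
```

A classifier trained to predict `label` from the permuted input, on both source and target data, is one way such auxiliary tasks could be used alongside the supervised action loss on the source dataset; how the losses are combined here is left open.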
Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition