Abstract
Existing video self-supervised learning methods rely mainly on trimmed videos for model training, applying and validating their approaches on trimmed datasets such as UCF101 and Kinetics-400. However, trimmed datasets are manually annotated from untrimmed videos, so these methods are not truly unsupervised. In this article, we propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV), which can be applied directly to untrimmed (genuinely unlabeled) videos to learn spatio-temporal features. ERUV first generates single-shot videos by shot change detection. Designed sampling strategies are then used to model relations between video clips, and these strategies serve as our self-supervision signals. Finally, the network learns representations by predicting the category of the relation between video clips. ERUV is thus able to compare the differences and similarities of video clips, which is itself an essential procedure for video-related tasks. We validate the learned models on action recognition, video retrieval, and action similarity labeling tasks with four kinds of 3D convolutional neural networks. Experimental results show that ERUV learns richer representations from untrimmed videos and outperforms state-of-the-art self-supervised methods by significant margins.
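The first stage of the pipeline, splitting an untrimmed video into single-shot segments via shot change detection, can be sketched with a simple histogram-difference detector. This is a minimal illustration, not the paper's actual detector; the bin count and threshold are illustrative assumptions.

```python
import numpy as np

def detect_shot_boundaries(frames, threshold=0.5):
    """Detect shot changes via the L1 distance between normalized
    intensity histograms of consecutive frames. `frames` is a list of
    HxW uint8 grayscale arrays. Returns indices i where a new shot
    starts at frames[i]."""
    boundaries = []
    for i in range(1, len(frames)):
        h_prev, _ = np.histogram(frames[i - 1], bins=32, range=(0, 256))
        h_curr, _ = np.histogram(frames[i], bins=32, range=(0, 256))
        # L1 distance between normalized histograms lies in [0, 2]
        diff = np.abs(h_prev / h_prev.sum() - h_curr / h_curr.sum()).sum()
        if diff > threshold:
            boundaries.append(i)
    return boundaries

def split_into_shots(frames, threshold=0.5):
    """Split an untrimmed video (a list of frames) into single-shot
    segments at the detected boundaries."""
    cuts = [0] + detect_shot_boundaries(frames, threshold) + [len(frames)]
    return [frames[a:b] for a, b in zip(cuts[:-1], cuts[1:])]

# Toy example: 10 dark frames followed by 10 bright frames
# yield a single cut, i.e., two single-shot segments.
dark = [np.zeros((8, 8), dtype=np.uint8) for _ in range(10)]
bright = [np.full((8, 8), 255, dtype=np.uint8) for _ in range(10)]
shots = split_into_shots(dark + bright)
```

Each resulting single-shot segment can then be sampled into clips whose relation category (as defined by the sampling strategy) becomes the prediction target for the network.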
Exploring Relations in Untrimmed Videos for Self-Supervised Learning