Abstract
Video processing and analysis have become urgent tasks, as huge numbers of videos are uploaded to platforms such as YouTube and Hulu every day. Extracting representative key frames from videos is important in video processing and analysis because it greatly reduces computing resources and time. Although great progress has been made recently, large-scale video classification remains an open problem, since existing methods have not balanced performance and efficiency well. To tackle this problem, this work presents an unsupervised method for key frame extraction that combines a convolutional neural network with temporal segment density peaks clustering. The proposed temporal segment density peaks clustering is a generic and powerful framework with two advantages over previous work: it determines the number of key frames automatically, and it preserves the temporal information of the video, thereby improving the efficiency of video classification. Furthermore, a long short-term memory network is added on top of the convolutional neural network to further improve classification performance, and a weighted fusion strategy over networks with different inputs is presented to boost performance further. By optimizing video classification and key frame extraction jointly, we achieve better classification performance and higher efficiency. We evaluate our method on two popular datasets (HMDB51 and UCF101), and the experimental results consistently show that our approach achieves competitive performance and efficiency compared with state-of-the-art approaches.
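To make the clustering step concrete, the following is a minimal sketch of key frame selection via density peaks clustering (Rodriguez and Laio, 2014), which the abstract builds on. The function name, the Gaussian-kernel density estimate, and the quantile threshold used to pick cluster centers automatically are illustrative assumptions, not the paper's exact formulation; in particular, the paper's temporal segment variant (clustering within temporal segments to preserve frame order) is omitted here for brevity.

```python
import numpy as np

def density_peaks_keyframes(features, d_c=1.0, gamma_quantile=0.95):
    """Pick key-frame indices from per-frame feature vectors (one row
    per frame) using density peaks clustering. `d_c` is the cutoff
    distance; `gamma_quantile` controls how many centers are kept
    (an illustrative choice, not the paper's criterion)."""
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    # Pairwise Euclidean distances between frame features.
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    # Local density rho_i: Gaussian kernel sum, excluding the frame itself.
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    # delta_i: distance to the nearest frame of strictly higher density.
    delta = np.zeros(n)
    order = np.argsort(-rho)                  # indices by decreasing density
    delta[order[0]] = dist[order[0]].max()    # densest frame: use max distance
    for i, idx in enumerate(order[1:], start=1):
        delta[idx] = dist[idx, order[:i]].min()
    # Frames with both high density and high delta are cluster centers,
    # i.e. candidate key frames; the count emerges from the threshold.
    gamma = rho * delta
    thresh = np.quantile(gamma, gamma_quantile)
    return np.nonzero(gamma >= thresh)[0]
```

In practice the per-frame features would come from the CNN described in the paper; here any (n_frames, feature_dim) array works, e.g. `density_peaks_keyframes(cnn_features, d_c=2.0)`.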
Deep Unsupervised Key Frame Extraction for Efficient Video Classification