Deep Unsupervised Key Frame Extraction for Efficient Video Classification

Research article. Published: 25 February 2023
Abstract

Video processing and analysis have become urgent tasks, as a huge number of videos are uploaded online every day (e.g., to YouTube and Hulu). Extracting representative key frames from videos is important in video processing and analysis because it greatly reduces computing resources and time. Although great progress has been made recently, large-scale video classification remains an open problem, as existing methods do not balance performance and efficiency well. To tackle this problem, this work presents an unsupervised method for retrieving key frames that combines a convolutional neural network with temporal segment density peaks clustering. The proposed temporal segment density peaks clustering is a generic and powerful framework with two advantages over previous work: it determines the number of key frames automatically, and it preserves the temporal information of the video, thereby improving the efficiency of video classification. Furthermore, a long short-term memory network is added on top of the convolutional neural network to further improve classification performance, and a weight fusion strategy for networks with different inputs is presented to boost performance. By optimizing video classification and key frame extraction jointly, we achieve better classification performance and higher efficiency. We evaluate our method on two popular datasets (HMDB51 and UCF101), and the experimental results consistently demonstrate that our strategy achieves competitive performance and efficiency compared with state-of-the-art approaches.
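The density peaks idea underlying the proposed clustering can be sketched as follows. This is only a toy illustration of Rodriguez and Laio's rho-delta scoring applied to per-frame feature vectors, not the paper's method: the temporal segment structure and the automatic selection of the number of key frames are omitted, the Gaussian density kernel is a common variant chosen here to avoid ties, and all names and parameters are illustrative.

```python
import numpy as np

def density_peaks_keyframes(features, d_c=1.0, top_k=2):
    """Rank frames by the density-peaks score rho * delta.

    features: (n_frames, dim) array of per-frame feature vectors
    d_c:      cutoff distance for the local density estimate
    top_k:    number of key frames to return (the paper determines
              this automatically; fixed here for simplicity)
    """
    n = len(features)
    # Pairwise Euclidean distances between frame features.
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    # Local density (Gaussian kernel variant); subtract the self-term.
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0
    # Delta: distance to the nearest frame of higher density;
    # the globally densest frame gets the maximum distance instead.
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist[i, higher].min() if len(higher) else dist[i].max()
    # Frames that are both dense and far from any denser frame are
    # cluster centers, i.e., candidate key frames.
    score = rho * delta
    return np.argsort(score)[::-1][:top_k]

# Toy example: two well-separated groups of similar "frames" should
# yield one key frame per group.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (5, 2)),
                   rng.normal(5, 0.1, (5, 2))])
keyframes = density_peaks_keyframes(feats, d_c=1.0, top_k=2)
```

Because the score multiplies density by isolation, ordinary frames inside a shot score low (dense but close to a denser neighbor), while one representative frame per shot scores high.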



Published in ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3 (May 2023), 514 pages.
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3582886
Editor: Abdulmotaleb El Saddik


Publisher: Association for Computing Machinery, New York, NY, United States

Publication History
• Received: 18 April 2022
• Revised: 17 September 2022
• Accepted: 6 November 2022
• Online AM: 12 December 2022
• Published: 25 February 2023
