research-article

A Differentiable Parallel Sampler for Efficient Video Classification

Published: 25 February 2023

Abstract

Sampling a small portion of relevant frames is crucial for efficient video classification. Existing methods mainly rely on hand-designed sampling strategies or learn sequential selection policies. However, two challenges remain. First, hand-designed sampling strategies are intrinsically non-adaptive to different video backbones. Second, sequential frame selection policies ignore the temporal relations among all video frames, and the sequential selection process hinders the application of these video samplers in speed-critical systems. In this article, we propose a differentiable parallel video sampling network (PSN) to tackle these challenges. First, we optimize the video sampler with a differentiable surrogate loss, allowing the sampler to be learned dynamically in cooperation with the video classification model. Our sampler considers the feedback from all frames jointly, eliminating the learning difficulties of sequential decision making. The learning process is fully gradient-based, so the sampler can be learned efficiently. Our video sampler can assess a set of frames swiftly and determine the importance of each frame in parallel. Second, we propose to model the inter-relations among contextual frames, which encourages the sampler to select frames based on a comprehensive inspection of the entire video. We observe that even a simple instantiation of context relation mining significantly improves classification performance. Experimental results on three standard video recognition benchmarks demonstrate the efficacy and efficiency of our framework.
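The two ideas in the abstract — mining context relations among all candidate frames and scoring their importance jointly rather than sequentially — can be sketched in a few lines. Note this is an illustrative sketch only: the paper's actual PSN architecture and surrogate loss are not specified here, so the dot-product attention step, the learned projection `w`, and the softmax relaxation below are all assumptions standing in for the real components.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_relation(frame_feats):
    """Mix each frame's feature with all others via dot-product attention
    (a simple stand-in for the paper's context relation mining)."""
    # frame_feats: (T, D) features for T candidate frames
    attn = softmax(frame_feats @ frame_feats.T / np.sqrt(frame_feats.shape[1]))
    return attn @ frame_feats            # (T, D) context-aware features

def parallel_importance(frame_feats, w):
    """Score every frame jointly; w is a hypothetical learned projection."""
    ctx = context_relation(frame_feats)  # inspect the whole video at once
    logits = ctx @ w                     # (T,) one importance score per frame
    return softmax(logits)               # relaxed, differentiable selection

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))         # 8 frames, 16-dim features
weights = rng.normal(size=16)
probs = parallel_importance(feats, weights)
print(probs.shape, np.isclose(probs.sum(), 1.0))
```

Because every step is a differentiable tensor operation over all frames at once, gradients from the classification loss can flow back into the scoring parameters — the contrast with sequential, decision-by-decision samplers that the abstract draws.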



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3
  May 2023, 514 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3582886
  Editor: Abdulmotaleb El Saddik


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 25 February 2023
      • Online AM: 27 October 2022
      • Accepted: 22 October 2022
      • Revised: 5 September 2022
      • Received: 18 May 2022
Published in TOMM, Volume 19, Issue 3
