Abstract
It is crucial to sample a small portion of relevant frames for efficient video classification. Existing methods mainly develop hand-designed sampling strategies or learn sequential selection policies. However, two challenges remain. First, hand-designed sampling strategies are intrinsically non-adaptive to different video backbones. Second, sequential frame selection policies ignore temporal relations among all video frames, and the sequential selection process hinders the application of these samplers in speed-critical systems. In this article, we propose a differentiable parallel video sampling network (PSN) to tackle these challenges. First, we optimize the video sampler with a differentiable surrogate loss, allowing the sampler to be learned dynamically in cooperation with the video classification model. Our sampler considers the feedback from all frames jointly, eliminating the learning difficulties of sequential decision making. The learning process is fully gradient-based, so the sampler can be learned efficiently; it assesses a set of frames swiftly and determines the importance of each frame in parallel. Second, we propose to model the inter-relations among contextual frames, which encourages the sampler to select frames based on a comprehensive inspection of the entire video. We observe that even a simple instantiation of context relation mining significantly improves classification performance. Experimental results on three standard video recognition benchmarks demonstrate the efficacy and efficiency of our framework.
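The two ideas in the abstract, scoring all frames in a single parallel pass and mining pairwise context relations among frames, can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's actual architecture: context mining is instantiated here as one self-attention-style step, and the soft importance weights stand in for the differentiable surrogate used during training (hard top-k indices would be used at inference).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_sample(frame_feats, k):
    """Score all T frames jointly and pick the top-k in one pass.

    frame_feats: (T, D) lightweight per-frame features.
    Returns (sorted top-k frame indices, soft importance weights).
    """
    T, D = frame_feats.shape
    # Context relation mining: pairwise relations among all frames,
    # sketched as a single scaled dot-product attention step.
    attn = softmax(frame_feats @ frame_feats.T / np.sqrt(D), axis=-1)
    context = attn @ frame_feats                      # (T, D)
    # Per-frame importance from each frame and its aggregated
    # context, computed for all frames in parallel.
    scores = (frame_feats * context).sum(axis=-1)     # (T,)
    # Soft weights keep the pipeline differentiable during training;
    # at inference, the hard top-k indices select the frames.
    weights = softmax(scores)
    topk = np.sort(np.argsort(-weights)[:k])
    return topk, weights

feats = np.random.default_rng(0).normal(size=(16, 32))
idx, w = parallel_sample(feats, k=4)
```

Because every frame's score depends on all other frames through the attention step, the selection reflects the entire video rather than a frame-by-frame sequential decision, which is what enables the parallel, single-pass evaluation the abstract describes.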
A Differentiable Parallel Sampler for Efficient Video Classification