Abstract
Recent action localization methods learn in a weakly supervised manner to avoid the expensive cost of human labeling. Most are built on the Multiple Instance Learning (MIL) framework, where temporal pooling is an indispensable component that usually relies on the guidance of snippet-level Class Activation Sequences (CAS). However, we observe that previous works generate the CAS with only a simple convolutional neural network, which struggles to separate weakly discriminative foreground action segments from background ones; meanwhile, the relationships between different actions have not been considered. To address this problem, we propose Multiple Temporal Pooling mechanisms (MTP) to exploit the available information more fully. Specifically, through the design of a Foreground Variance Branch, a Dual Foreground Attention Branch, and a Hybrid Attention Fine-tuning Branch, MTP leverages effective information from different aspects and generates distinct CASs to guide the learning of temporal pooling. Moreover, dedicated loss functions are designed to better optimize the individual branches, aiming to effectively distinguish actions from the background. Our method achieves excellent results on the THUMOS14 and ActivityNet1.2 datasets.
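To make the MIL setting concrete, the following is a minimal sketch of CAS-guided temporal pooling: a snippet-level Class Activation Sequence is aggregated into video-level class scores by averaging each class's top-k activations, which is a common pooling choice in this line of work. The function name, the top-k strategy, and the value of k are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def topk_temporal_pooling(cas, k=8):
    """Aggregate a snippet-level Class Activation Sequence (CAS) into
    video-level class probabilities, as used for the video-level
    classification loss in MIL-style weakly supervised localization.

    cas: array of shape (T, C) -- T snippets, C action classes.
    k:   number of top snippets averaged per class (an assumed choice).
    """
    T, C = cas.shape
    k = min(k, T)
    # For each class, average its k highest snippet activations.
    topk = np.sort(cas, axis=0)[-k:]      # shape (k, C)
    video_scores = topk.mean(axis=0)      # shape (C,)
    # Softmax over classes yields video-level class probabilities.
    exp = np.exp(video_scores - video_scores.max())
    return exp / exp.sum()
```

Under this scheme, only the most confident snippets per class drive the video-level prediction, which is why the quality of the CAS (and hence the branches that refine it) directly determines localization performance.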
Multiple Temporal Pooling Mechanisms for Weakly Supervised Temporal Action Localization