Abstract
Weakly supervised action localization is a challenging problem in video understanding and action recognition. Existing models usually formulate training as direct classification under video-level supervision. As a result, they tend to locate only the most discriminative parts of action instances and produce temporally incomplete detections. A natural remedy is the adversarial erasing strategy, which removes such parts during training so that the model attends to complementary parts. Previous works apply it in an offline, heuristic way: they adopt a multi-stage pipeline in which discriminative regions are identified and erased under the guidance of detection results from the previous stage. Such a pipeline can be both ineffective and inefficient, which may hinder overall performance. In contrast, we combine adversarial erasing with the dropout mechanism and propose a Temporal Dropout Module that learns which regions to remove in a data-driven, online manner. This plug-and-play module is trained without iterative stages, which not only simplifies the pipeline but also makes regularization during training easier and more adaptive. Experiments show that the proposed method outperforms previous erasing-based methods by a large margin. More importantly, it yields consistent improvements when plugged into various direct classification methods and achieves state-of-the-art performance.
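To make the idea of erasing along the temporal axis concrete, the following is a minimal forward-pass sketch and not the module proposed in the paper. It assumes a video is represented as a `(T, D)` array of snippet features and that some scoring network has already produced one drop logit per snippet; the function name `temporal_dropout`, the `drop_logits` input, and the use of the Gumbel-max trick for sampling are all illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def temporal_dropout(features, drop_logits, rng, training=True):
    """Zero out whole snippets according to per-snippet drop scores.

    features:    (T, D) snippet-level features of one video.
    drop_logits: (T,) scores, assumed to come from a learned scorer;
                 a larger value makes erasing that snippet more likely.
    Returns the masked features and the boolean keep mask.
    """
    num_snippets = features.shape[0]
    if not training:
        # No erasing at inference time: every snippet is kept.
        return features, np.ones(num_snippets, dtype=bool)

    # Gumbel-max trick over the two classes {keep, drop} (an assumption):
    # a snippet is dropped with probability sigmoid(drop_logit).
    logits = np.stack([np.zeros(num_snippets), drop_logits], axis=1)
    noise = rng.gumbel(size=logits.shape)
    keep = np.argmax(logits + noise, axis=1) == 0

    # Erase the selected snippets so a downstream classifier has to rely
    # on the remaining, less discriminative ones.
    return features * keep[:, None], keep


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4))
# Hypothetical scores marking snippets 2 and 3 as the most discriminative.
drop_logits = np.array([-20.0, -20.0, 20.0, 20.0, -20.0, -20.0, -20.0, -20.0])
erased, keep = temporal_dropout(features, drop_logits, rng)
```

In an offline erasing pipeline, the mask would instead be computed from the detections of a previous training stage. Here the mask is sampled on the fly from scores that could, in principle, be trained jointly with the classifier; how the paper trains those scores is not described in the abstract.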
Temporal Dropout for Weakly Supervised Action Localization