research-article

Temporal Dropout for Weakly Supervised Action Localization

Published: 25 February 2023

Abstract

Weakly supervised action localization is a challenging problem in video understanding and action recognition. Existing models usually formulate training as direct classification under video-level supervision. As a result, they tend to locate only the most discriminative parts of action instances and produce temporally incomplete detections. A natural remedy, the adversarial erasing strategy, removes such parts during training so that models attend to complementary parts. Previous works do this in an offline, heuristic way: they adopt a multi-stage pipeline in which discriminative regions are determined and erased under the guidance of detection results from the previous stage. Such a pipeline can be both ineffective and inefficient, possibly hindering overall performance. In contrast, we combine adversarial erasing with the dropout mechanism and propose a Temporal Dropout Module that learns where to remove in a data-driven, online manner. This plug-and-play module is trained without iterative stages, which not only simplifies the pipeline but also makes regularization during training easier and more adaptive. Experiments show that the proposed method outperforms previous erasing-based methods by a large margin. More importantly, it yields consistent improvements when plugged into various direct classification methods and achieves state-of-the-art performance.
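The erasing idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's Temporal Dropout Module, which learns where to remove. It is a fixed top-k heuristic of the kind the abstract contrasts with, shown only to make "removing the most discriminative parts" concrete. The function name `erase_top_k`, the per-snippet `scores`, and the parameter `k` are assumptions introduced for illustration and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def erase_top_k(
    features: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int,
) -> Tuple[List[List[float]], List[int]]:
    """Zero out the feature vectors of the k highest-scoring snippets.

    features: one feature vector per temporal snippet (hypothetical input).
    scores:   one discriminativeness score per snippet, e.g. a class
              activation value (hypothetical input).
    k:        number of snippets to erase.

    Returns the masked features and the sorted indices that were erased.
    """
    if len(features) != len(scores):
        raise ValueError("features and scores must have the same length")
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and the number of snippets")

    # Rank snippet indices from most to least discriminative.
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    erased = set(ranked[:k])

    # Replace erased snippets with zero vectors; copy the rest unchanged.
    masked = [
        [0.0] * len(f) if t in erased else list(f)
        for t, f in enumerate(features)
    ]
    return masked, sorted(erased)


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    acts = [0.1, 0.9, 0.3, 0.8]
    masked, erased = erase_top_k(feats, acts, k=2)
    print(erased)   # indices of the erased snippets
    print(masked)   # features with those snippets zeroed
```

A classifier trained on the masked features can no longer rely on the erased snippets, which is the motivation the abstract gives for erasing. The fixed `k` and the hard ranking are the heuristic choices that a learned, online module would replace.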

    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3
      May 2023
      514 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3582886
      Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 25 February 2023
      • Online AM: 7 November 2022
      • Accepted: 25 September 2022
      • Revised: 27 July 2022
      • Received: 20 March 2022
