research-article

Multiple Temporal Pooling Mechanisms for Weakly Supervised Temporal Action Localization

Published: 25 February 2023

Abstract

Recent action localization methods learn in a weakly supervised manner to avoid the expensive cost of human labeling. These methods are mostly based on the Multiple Instance Learning (MIL) framework, in which temporal pooling is an indispensable component that usually relies on the guidance of snippet-level Class Activation Sequences (CAS). However, we observe that previous works use only a simple convolutional neural network to generate the CAS, which fails to separate weakly discriminative foreground action segments from background segments; moreover, the relationships between different actions are not considered. To address this problem, we propose multiple temporal pooling mechanisms (MTP) for fuller information utilization. Specifically, through the design of a Foreground Variance Branch, a Dual Foreground Attention Branch, and a Hybrid Attention Fine-tuning Branch, MTP leverages effective information from different aspects and generates distinct CASs to guide the learning of temporal pooling. In addition, dedicated loss functions are designed to better optimize the individual branches, effectively distinguishing actions from the background. Our method achieves excellent results on the THUMOS14 and ActivityNet1.2 datasets.
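The MIL-style temporal pooling the abstract refers to can be sketched as follows. This is a minimal illustration of top-k pooling over a CAS, not the authors' MTP implementation: the CAS shape, the top-k averaging rule, and the value of k are illustrative assumptions.

```python
import numpy as np

def topk_temporal_pooling(cas, k):
    """MIL-style pooling: average the k highest snippet scores per class.

    cas: (T, C) array of snippet-level class activations (the CAS),
         T snippets, C action classes.
    Returns a (C,) vector of video-level class logits.
    """
    # Sort each class's scores over time and keep its top-k snippets.
    topk = np.sort(cas, axis=0)[-k:]   # shape (k, C)
    return topk.mean(axis=0)           # shape (C,)

def video_class_probs(cas, k):
    """Softmax over the pooled logits gives video-level class probabilities,
    which a weakly supervised loss compares against the video-level label."""
    logits = topk_temporal_pooling(cas, k)
    e = np.exp(logits - logits.max())  # stabilized softmax
    return e / e.sum()

# Toy CAS: 6 snippets, 3 classes; class 1 fires strongly in two snippets.
cas = np.array([[0.1, 0.2, 0.0],
                [0.0, 2.5, 0.1],
                [0.2, 0.1, 0.0],
                [0.1, 2.8, 0.2],
                [0.0, 0.3, 0.1],
                [0.3, 0.2, 0.0]])
probs = video_class_probs(cas, k=2)
```

Because only the video-level label supervises the pooled score, the quality of the CAS determines which snippets dominate the pooling; this is the dependency MTP targets by generating several complementary CASs.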


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3
May 2023, 514 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3582886
Editor: Abdulmotaleb El Saddik


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 27 April 2022
• Revised: 16 August 2022
• Accepted: 3 October 2022
• Online AM: 13 October 2022
• Published: 25 February 2023
