Weakly Supervised Text-based Actor-Action Video Segmentation by Clip-level Multi-instance Learning

Published: 05 January 2023

Abstract

In real-world scenarios, a video commonly contains multiple actors and their activities. Selectively localizing one specific actor and its action, both spatially and temporally, via a language query is therefore a vital and challenging task. Existing fully supervised methods require extensive, elaborately annotated data and are sensitive to the class labels, which cannot satisfy the needs of real-world applications. Thus, in this work we introduce the task of weakly supervised actor-action video segmentation from a sentence query (AAVSS), where only video-sentence pairs are provided. To the best of our knowledge, our work is the first to perform AAVSS in a weakly supervised setting. This task is extremely challenging, not only because it requires learning the complex interactions between two heterogeneous modalities, but also because it requires fine-grained analysis of video content without pixel-level annotations. To overcome these challenges, we propose a two-stage network that first follows the sentence guidance to localize a candidate region and then performs segmentation on that region to achieve selective segmentation. Specifically, we propose a novel tracker-based clip-level multiple instance learning paradigm to learn the matching between regions and sentences, which makes our two-stage network robust to errors of the region proposal network. Furthermore, two intrinsic characteristics of video, temporal consistency and motion information, are exploited in conjunction with the weak supervision to facilitate region-query matching. Through extensive experiments, the proposed method achieves performance comparable to state-of-the-art fully supervised approaches on two large-scale benchmarks, A2D Sentences and J-HMDB Sentences.
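To make the clip-level multiple instance learning idea concrete, the sketch below shows one common way such a region-sentence matching objective can be set up in PyTorch: the tracklets produced by a tracker act as the instances in a per-clip bag, max-pooling over instances realizes the MIL assumption (a clip matches a sentence if its best tracklet does), and a hinge ranking loss uses mismatched video-sentence pairs within a batch as the only negatives, so no pixel-level labels are needed. This is a minimal sketch under stated assumptions, not the authors' exact formulation: the function name clip_mil_loss, the tensor shapes, the margin value, and the in-batch negative-sampling strategy are all illustrative.

    # Minimal sketch of clip-level MIL region-sentence matching (PyTorch).
    # Assumptions (not from the paper): feature shapes, margin, and the use
    # of in-batch mismatched pairs as weak negatives.
    import torch
    import torch.nn.functional as F

    def clip_mil_loss(tracklet_feats, sent_feats, margin=0.2):
        """tracklet_feats: (B, T, D) features of T tracked region proposals
        per clip; sent_feats: (B, D) sentence embeddings, paired so that
        clip i matches sentence i within the batch."""
        tracklets = F.normalize(tracklet_feats, dim=-1)
        sents = F.normalize(sent_feats, dim=-1)
        # Region-sentence cosine similarities: (B_clips, B_sents, T).
        sims = torch.einsum('btd,sd->bst', tracklets, sents)
        # MIL pooling: a clip's score for a sentence is its BEST tracklet's
        # score, i.e., max over the instance (tracklet) dimension.
        clip_scores = sims.max(dim=-1).values          # (B, B)
        pos = clip_scores.diagonal()                   # matched pairs
        # Hinge ranking loss: each clip should score its own sentence higher
        # than other sentences (cost_sent), and each sentence should score
        # its own clip higher than other clips (cost_clip).
        cost_sent = (margin + clip_scores - pos.unsqueeze(1)).clamp(min=0)
        cost_clip = (margin + clip_scores - pos.unsqueeze(0)).clamp(min=0)
        mask = torch.eye(clip_scores.size(0), dtype=torch.bool,
                         device=clip_scores.device)
        return (cost_sent.masked_fill(mask, 0).mean()
                + cost_clip.masked_fill(mask, 0).mean())

In this sketch, weak supervision enters only through which clip-sentence pairs are treated as matched; the model never sees masks or boxes, which is what makes the MIL max-pooling step, rather than dense supervision, responsible for picking out the correct tracklet.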

• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1
  January 2023, 505 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3572858
  • Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 5 January 2023
      • Online AM: 18 July 2022
      • Accepted: 28 January 2022
      • Revised: 25 January 2022
      • Received: 5 November 2021
Published in TOMM Volume 19, Issue 1

      Qualifiers

      • research-article
      • Refereed
