Abstract
In real-world scenarios, a video commonly contains multiple actors and their activities. Selectively localizing one specific actor and its action, spatially and temporally, via a language query is therefore a vital and challenging task. Existing fully supervised methods require large amounts of elaborately annotated data and are sensitive to the class labels, so they cannot meet the needs of real-world applications. Thus, in this work we introduce the task of weakly supervised actor-action video segmentation from a sentence query (AAVSS), where only video-sentence pairs are provided. To the best of our knowledge, our work is the first to perform AAVSS in a weakly supervised setting. This task is extremely challenging, not only because it must learn the complex interactions between two heterogeneous modalities but also because it requires fine-grained analysis of video content without pixel-level annotations. To overcome these challenges, we propose a two-stage network. The network first follows the sentence guidance to localize the candidate region and then segments within that region, which yields selective segmentation. Specifically, we propose a novel tracker-based clip-level multiple instance learning paradigm to learn the matches between regions and sentences, which makes our two-stage network robust to the region proposal network. Furthermore, two intrinsic characteristics of video, temporal consistency and motion information, are used in conjunction with the weak supervision to facilitate region-query matching. In extensive experiments, the proposed method achieves performance comparable to state-of-the-art fully supervised approaches on two large-scale benchmarks, A2D Sentences and J-HMDB Sentences.
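To make the clip-level multiple instance learning idea more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each clip has already been turned into a bag of tracked region features ("tubes") and that the sentence has been embedded into the same space; the abstract specifies neither the features nor the loss. The names `bag_score` and `mil_ranking_loss`, the cosine similarity, and the margin value are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def bag_score(tube_features, sentence_embedding):
    """Score a clip (the bag) by its best-matching tube (the instance).

    Returns (score, index_of_best_tube). Taking the maximum over instances
    is the usual MIL assumption: at least one tube in the clip matches the
    sentence, but which one is unknown during training.
    """
    scores = [cosine(t, sentence_embedding) for t in tube_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


def mil_ranking_loss(tube_features, matched_sentence, mismatched_sentence, margin=0.2):
    """Hinge ranking loss (an assumed loss, chosen for illustration).

    The clip should score higher with its paired sentence than with a
    sentence taken from a different video, by at least `margin`.
    """
    pos, _ = bag_score(tube_features, matched_sentence)
    neg, _ = bag_score(tube_features, mismatched_sentence)
    return max(0.0, margin - pos + neg)


# Toy clip with three tubes in a 2-D feature space (illustrative values only).
tubes = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
paired_query = [0.0, 1.0]      # sentence paired with this clip
other_query = [-1.0, 0.0]      # sentence from a different video

score, selected = bag_score(tubes, paired_query)
loss = mil_ranking_loss(tubes, paired_query, other_query)
```

In this toy case the second tube is selected as the candidate region, which is the role of the first (localization) stage; a segmentation stage would then operate on that region. Because supervision acts only on the best-scoring tube per clip, no pixel-level or box-level labels are needed for the matching step.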
Weakly Supervised Text-based Actor-Action Video Segmentation by Clip-level Multi-instance Learning