Abstract
Text-to-video temporal grounding aims to locate the video moment that semantically corresponds to a given sentence query in an untrimmed video. Fully supervised methods for this task require a text description for each event together with its temporal segment coordinates, which is labor-intensive to annotate. Existing weakly supervised methods require only video-sentence pairs but cannot achieve satisfactory performance. However, many readily available annotations in the form of coarse temporal boundaries for sentences are ignored and unexploited. Such coarse boundaries are common on streaming media platforms and can be collected mechanically. We propose a novel approach that performs fine-grained text-to-video temporal grounding from these coarse boundaries. We take dense video captioning as the base task and leverage the trained captioning model to identify the relevance of each video frame to the sentence query according to the frame's participation in event captioning. To quantify this participation, we propose the event activation sequence, a simple method that highlights the temporal regions of a video that correlate strongly with the text modality. Experiments on a modified ActivityNet Captions dataset and a use case demonstrate the promising fine-grained performance of our approach.
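To make the idea of an activation sequence concrete, here is a minimal sketch, not the paper's actual formulation: it assumes we already have a matrix of per-word cross-attention weights over frames from a trained captioning model (the function name `event_activation_sequence`, the smoothing window, and the relative threshold are all illustrative choices, not values from the paper).

```python
import numpy as np

def event_activation_sequence(attn, smooth=2, thresh_ratio=0.5):
    """Aggregate per-word cross-attention over frames into a single
    per-frame activation curve, then threshold it into one segment.

    attn: (num_words, num_frames) array of decoder cross-attention
          weights from a (hypothetical) trained captioning model.
    Returns (activation, (start, end)) with inclusive frame indices.
    """
    # Average attention over the generated words -> per-frame activation.
    act = attn.mean(axis=0)
    # Simple moving-average smoothing to suppress frame-level noise.
    if smooth > 0:
        kernel = np.ones(2 * smooth + 1) / (2 * smooth + 1)
        act = np.convolve(act, kernel, mode="same")
    # Normalize to [0, 1] and threshold relative to the peak.
    act = (act - act.min()) / (act.max() - act.min() + 1e-8)
    idx = np.flatnonzero(act >= thresh_ratio)
    return act, (int(idx[0]), int(idx[-1]))
```

For example, a 5-word caption attending strongly to frames 8-12 of a 20-frame clip yields a segment around those frames; the thresholding here is a deliberately simple stand-in for however the actual method converts activations into boundaries.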
Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary