Abstract
This article addresses the task of language-based video moment localization. The language-based setting allows for an open set of target activities, which leads to large variation in the temporal lengths of video moments. Most existing methods first sample a large pool of candidate moments of various temporal lengths and then match them against the given query to determine the target moment. However, candidate moments generated at a fixed temporal granularity can be suboptimal for handling this large variation in moment length. To this end, we propose a novel multi-stage Progressive Localization Network (PLN) that localizes the target moment progressively, in a coarse-to-fine manner. Specifically, each stage of PLN has its own localization branch and focuses on candidate moments generated at a stage-specific temporal granularity; the granularities differ across stages. Moreover, we devise a conditional feature manipulation module and an upsampling connection to bridge the multiple localization branches. In this fashion, later stages can absorb information learned in earlier ones, facilitating more fine-grained localization. Extensive experiments on three public datasets demonstrate the effectiveness of the proposed PLN for language-based moment localization, especially for localizing short moments in long videos.
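To make the coarse-to-fine idea concrete, the sketch below enumerates candidate moments over a clip-level video timeline at several temporal granularities, with coarse stages proposing few, long candidates and fine stages proposing many, short ones. This is a minimal illustration of the general multi-granularity candidate scheme, not the paper's actual architecture; the function names and the granularity schedule are assumptions.

```python
# Hypothetical sketch of stage-specific candidate-moment generation.
# A video is treated as a sequence of `num_clips` equal-length clips;
# a "moment" is a (start, end) span over clip indices.

def candidate_moments(num_clips, granularity):
    """Enumerate (start, end) spans aligned to a grid of `granularity`
    clips, whose lengths are multiples of that granularity."""
    moments = []
    for start in range(0, num_clips, granularity):
        for end in range(start + granularity, num_clips + 1, granularity):
            moments.append((start, end))
    return moments

def progressive_candidates(num_clips, granularities=(8, 4, 2)):
    """One candidate set per stage, ordered coarse to fine.
    The (8, 4, 2) schedule is illustrative."""
    return [candidate_moments(num_clips, g) for g in granularities]

# For a 16-clip video: the coarse stage yields 3 candidates, the
# middle stage 10, and the fine stage 36 — the fine stage densely
# covers short spans that a fixed coarse granularity would miss.
stages = progressive_candidates(16)
for g, cands in zip((8, 4, 2), stages):
    print(f"granularity={g}: {len(cands)} candidates")
```

In a progressive pipeline, each stage would score only its own candidate set against the query, and the finer stages would additionally condition on features passed up from the coarser ones.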
Progressive Localization Networks for Language-Based Moment Localization