Abstract
The emerging task of language-based video moment retrieval aims to retrieve a target moment from an untrimmed video given a natural language query. Compared with traditional whole-video retrieval, it is more applicable in practice because it localizes a specific moment within the video. In this work, we propose a novel solution that investigates language-based video moment retrieval under adversarial learning. The key idea is to formulate the task as an adversarial learning problem with two tightly connected components. Specifically, a reinforcement learning model serves as the generator and produces a set of candidate video moments. Meanwhile, a multi-task learning model serves as the discriminator; it integrates inter-modal and intra-modal learning in a unified framework by employing a sequential update strategy. Finally, the generator and the discriminator are mutually reinforced through adversarial learning, which jointly optimizes the performance of both video moment ranking and video moment localization. Extensive experiments on two challenging benchmarks, Charades-STA and TACoS, demonstrate the effectiveness and rationality of the proposed solution. In addition, on the larger and unbiased datasets ActivityNet Captions and ActivityNet-CD, the proposed framework exhibits excellent robustness.
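The generator side of this formulation can be illustrated with a toy sketch. It is not the model described above: the abstract gives no architectural details, so every name here (`temporal_iou`, `toy_discriminator`, `train_generator`) is hypothetical. The sketch assumes a fixed list of candidate moments, a categorical policy trained with plain REINFORCE, and a fixed stand-in discriminator that scores a moment by its temporal overlap with a known target. The discriminator's own multi-task training and the sequential update strategy are omitted.

```python
import math
import random


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_discriminator(moment, target):
    """ASSUMPTION: stand-in for a learned query-moment relevance score.

    A real discriminator would be trained; here it is simply the IoU
    between the candidate and a known target moment.
    """
    return temporal_iou(moment, target)


def train_generator(candidates, target, steps=2000, lr=0.5, seed=0):
    """REINFORCE on a categorical policy over candidate moments.

    The reward for a sampled moment is the (fixed) discriminator score.
    Returns the final selection probabilities, one per candidate.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    scores = [toy_discriminator(c, target) for c in candidates]
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one candidate moment from the current policy.
        i = rng.choices(range(len(candidates)), weights=probs)[0]
        # Expected reward under the policy, used as a variance-reducing baseline.
        baseline = sum(p * s for p, s in zip(probs, scores))
        advantage = scores[i] - baseline
        # d log pi(i) / d logit_j = 1[j == i] - probs[j]
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)
```

Calling `train_generator([(0, 5), (4, 9), (10, 15), (3, 8)], target=(4, 9))` returns one probability per candidate; in this toy setting the policy should shift its mass toward the candidates that the stand-in discriminator scores highest. In an adversarial setup the discriminator would be updated in alternation with the generator rather than held fixed.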
Moment is Important: Language-Based Video Moment Retrieval via Adversarial Learning