Abstract
The explosive growth of video data poses great challenges to video retrieval, which aims to find videos relevant to a query in a video collection. Users are usually not interested in all the content of a retrieved video but have more fine-grained needs. Meanwhile, most existing methods only return a ranked list of retrieved videos and lack a proper way to present the video content. In this paper, we introduce a new task, One-Stop Video Delivery (OSVD), which aims at a comprehensive retrieval system: given a natural language query and a video collection, it not only retrieves the relevant videos but also filters out irrelevant information and presents compact video content to users. To solve this task, we propose an end-to-end Hierarchical Video Graph Reasoning framework (HVGR), which models relations across different video levels and jointly accomplishes the one-stop delivery task. Specifically, we decompose each video into three levels, namely the video level, moment level, and clip level, in a coarse-to-fine manner, and apply Graph Neural Networks (GNNs) on the resulting hierarchical graph to model their relations. Furthermore, we propose a pairwise ranking loss, the Progressively Refined Loss, based on the prior knowledge that the query-video, query-moment, and query-clip similarities follow a relative order due to the different granularities of matched information. Extensive experiments on benchmark datasets demonstrate that the proposed method achieves superior performance compared with baseline methods.
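The ordering prior behind the Progressively Refined Loss can be sketched as hinge-style pairwise ranking terms that penalize violations of the expected similarity order sim(query, clip) ≥ sim(query, moment) ≥ sim(query, video). The snippet below is a minimal illustration only, not the paper's implementation; all function names, the margin value, and the toy embeddings are hypothetical.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def progressively_refined_loss(q, video, moment, clip, margin=0.05):
    """Hinge terms enforcing sim(q, clip) >= sim(q, moment) >= sim(q, video),
    reflecting the coarse-to-fine granularity prior described in the abstract.
    A sketch under assumed embeddings, not the paper's exact formulation."""
    s_v, s_m, s_c = cosine(q, video), cosine(q, moment), cosine(q, clip)
    # Each term is positive only when the expected ordering is violated by
    # more than the margin.
    return max(0.0, margin + s_v - s_m) + max(0.0, margin + s_m - s_c)

# Toy example: the clip embedding is closest to the query, the whole-video
# embedding is farthest, so the ordering prior holds and the loss is zero.
q = np.array([1.0, 0.0])
clip = np.array([1.0, 0.0])
moment = np.array([1.0, 0.5])
video = np.array([1.0, 1.0])
print(progressively_refined_loss(q, video, moment, clip))  # 0.0
```

In practice such terms would be summed over a training batch alongside a standard query-to-video triplet loss with sampled negatives; the margin controls how strictly the coarse-to-fine ordering is enforced.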
Learning Hierarchical Video Graph Networks for One-Stop Video Delivery