Learning Hierarchical Video Graph Networks for One-Stop Video Delivery

Published: 27 January 2022

Abstract

The explosive growth of video data has brought great challenges to video retrieval, which aims to find related videos in a video collection. Most users are not interested in all the content of a retrieved video but have a more fine-grained need. Meanwhile, most existing methods can only return a ranked list of retrieved videos and lack a proper way to present the video content. In this paper, we introduce a distinctively new task, namely One-Stop Video Delivery (OSVD), which aims to realize a comprehensive retrieval system with the following merits: given a natural language query and a video collection, it not only retrieves the relevant videos but also filters out irrelevant information and presents compact video content to users. To solve this task, we propose an end-to-end Hierarchical Video Graph Reasoning framework (HVGR), which considers relations at different video levels and jointly accomplishes the one-stop delivery task. Specifically, we decompose each video into three levels, namely the video level, the moment level, and the clip level, in a coarse-to-fine manner, and apply Graph Neural Networks (GNNs) to the hierarchical graph to model the relations among them. Furthermore, we propose a pairwise ranking loss, named the Progressively Refined Loss, based on the prior knowledge that there is a relative order among the query-video, query-moment, and query-clip similarities due to the different granularity of matched information. Extensive experiments on benchmark datasets demonstrate that the proposed method achieves superior performance compared with baseline methods.
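To make the abstract's two central ideas concrete, the following PyTorch sketch shows (a) a minimal attention-pooling step that aggregates fine-grained nodes into a coarser node, standing in for one upward edge of the clip/moment/video hierarchy, and (b) a margin-based version of the Progressively Refined Loss. All names (LevelAttentionPool, progressively_refined_loss) and the assumed similarity order sim(query, clip) ≥ sim(query, moment) ≥ sim(query, video) are illustrative assumptions inferred from the abstract, not the paper's published architecture or equations.

```python
# Hypothetical sketch of the hierarchy pooling and ranking loss described
# in the abstract; the names and the similarity order are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LevelAttentionPool(nn.Module):
    """Attention-pools child nodes (e.g., clips) into one parent node
    (e.g., a moment), standing in for one upward edge of the hierarchy."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, children: torch.Tensor) -> torch.Tensor:
        # children: (batch, num_children, dim)
        weights = torch.softmax(self.score(children), dim=1)  # (B, N, 1)
        return (weights * children).sum(dim=1)                # (B, dim)


def progressively_refined_loss(query, video, moment, clip, margin=0.1):
    """Pairwise ranking terms enforcing the assumed order
    sim(q, clip) >= sim(q, moment) >= sim(q, video) for matched pairs."""
    s_v = F.cosine_similarity(query, video, dim=-1)   # coarsest level
    s_m = F.cosine_similarity(query, moment, dim=-1)  # intermediate level
    s_c = F.cosine_similarity(query, clip, dim=-1)    # finest level
    # Hinge penalties for each violated adjacent ordering constraint.
    return (F.relu(margin + s_m - s_c) + F.relu(margin + s_v - s_m)).mean()


# Toy usage: pool 16 clip nodes into a moment node and score the ordering.
pool = LevelAttentionPool(dim=512)
clips = torch.randn(8, 16, 512)      # (batch, clips per moment, dim)
moment = pool(clips)                 # (8, 512)
video = moment.detach()              # placeholder for a video embedding
query = torch.randn(8, 512)
loss = progressively_refined_loss(query, video, moment, clips[:, 0])
```

In this sketch, penalizing only adjacent pairs (clip vs. moment, moment vs. video) is a design choice to keep the constraints local to each level of the coarse-to-fine hierarchy; the paper's actual loss may weight or combine these terms differently.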



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 1
  January 2022, 517 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3505205

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

Publication History

• Published: 27 January 2022
• Accepted: 1 May 2021
• Revised: 1 March 2021
• Received: 1 October 2020

Published in TOMM Volume 18, Issue 1


      Qualifiers

      • research-article
      • Refereed
