Progressive Localization Networks for Language-Based Moment Localization

Published: 6 February 2023
Abstract

This article targets the task of language-based video moment localization. The language-based setting of this task allows for an open set of target activities, resulting in large variation in the temporal lengths of video moments. Most existing methods first sample a sufficient set of candidate moments with various temporal lengths, then match them against the given query to determine the target moment. However, candidate moments generated with a fixed temporal granularity may be suboptimal for handling the large variation in moment lengths. To this end, we propose a novel multi-stage Progressive Localization Network (PLN) that progressively localizes the target moment in a coarse-to-fine manner. Specifically, each stage of PLN has a localization branch and focuses on candidate moments generated with a specific temporal granularity, and the granularities differ across stages. Moreover, we devise a conditional feature manipulation module and an upsampling connection to bridge the multiple localization branches. In this fashion, later stages are able to absorb previously learned information, facilitating more fine-grained localization. Extensive experiments on three public datasets demonstrate the effectiveness of our proposed PLN for language-based moment localization, especially for localizing short moments in long videos.
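The coarse-to-fine idea in the abstract rests on enumerating candidate moments at several temporal granularities, so that short target moments are covered by at least one stage. The following is an illustrative sketch of that candidate generation (not the paper's implementation; all names and window/stride values here are hypothetical):

```python
# Hypothetical sketch of multi-granularity candidate moment generation.
# Each stage uses sliding windows over a sequence of video clips; later
# (finer) stages use shorter windows and smaller strides.

def candidate_moments(num_clips, window, stride):
    """Enumerate [start, end) candidate moments over a clip sequence."""
    return [(s, s + window) for s in range(0, num_clips - window + 1, stride)]

# Coarse stage: long windows with a large stride.
coarse = candidate_moments(num_clips=32, window=16, stride=8)
# Fine stage: short windows with a small stride, better suited to
# short moments in long videos.
fine = candidate_moments(num_clips=32, window=4, stride=2)

print(len(coarse))  # 3 candidate moments at the coarse granularity
print(len(fine))    # 15 candidate moments at the fine granularity
```

In PLN each stage would score only the candidates at its own granularity, with the conditional feature manipulation module and upsampling connection passing the earlier stage's information forward rather than re-scoring a single fixed candidate set.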


Published in
ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 2 (March 2023), 540 pages.
ISSN: 1551-6857 | EISSN: 1551-6865
DOI: 10.1145/3572860
Editor: Abdulmotaleb El Saddik


Publisher: Association for Computing Machinery, New York, NY, United States

Publication History
• Received: 26 December 2021
• Revised: 30 April 2022
• Accepted: 31 May 2022
• Online AM: 11 June 2022
• Published: 6 February 2023
