research-article

Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary

Published: 16 March 2023

Abstract

Text-to-video temporal grounding aims to locate the moment in an untrimmed video that semantically corresponds to a given sentence query. Fully supervised methods for this task require a text description of each event along with its temporal segment coordinates for training, which is labor-intensive to annotate. Weakly supervised methods require only video-sentence pairs but cannot achieve satisfactory performance. Meanwhile, many readily available annotations in the form of coarse temporal boundaries for sentences remain unexploited. Such coarse boundaries are common on streaming media platforms and can be collected mechanically. We propose a novel approach that performs fine-grained text-to-video temporal grounding from these coarse boundaries. We take dense video captioning as the base task and leverage the trained captioning model to identify the relevance of each video frame to the sentence query, according to the frame's participation in event captioning. To quantify this participation, we propose the event activation sequence, a simple method that highlights the temporal regions of a video that correlate strongly with the text modality. Experiments on a modified ActivityNet Captions dataset and a use case demonstrate the promising fine-grained performance of our approach.
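
The abstract does not spell out how the event activation sequence is computed, so the following is only a minimal illustrative sketch. It assumes per-frame participation can be read off a trained captioning model's decoder cross-attention weights, averaged over generated caption tokens, and then used to tighten a coarse boundary; the function names (`event_activation_sequence`, `refine_boundary`) and the attention-based heuristic are our assumptions, not the paper's actual formulation.

import numpy as np

def event_activation_sequence(cross_attn, smooth_window=5):
    """Hypothetical sketch of an event activation sequence.

    cross_attn: (num_caption_tokens, num_frames) array, where
        cross_attn[t, f] is the weight the captioning decoder placed on
        frame f while generating caption token t.
    Returns a (num_frames,) array in [0, 1] scoring how much each frame
    participated in event captioning.
    """
    # Aggregate each frame's participation over all generated tokens.
    activation = cross_attn.mean(axis=0)
    # Smooth with a moving average to suppress frame-level noise.
    kernel = np.ones(smooth_window) / smooth_window
    activation = np.convolve(activation, kernel, mode="same")
    # Normalize to [0, 1] so a fixed threshold can pick the active region.
    span = activation.max() - activation.min()
    return (activation - activation.min()) / (span + 1e-8)

def refine_boundary(activation, coarse_start, coarse_end, threshold=0.5):
    """Shrink a coarse [start, end) frame interval to the high-activation
    region inside it; falls back to the coarse boundary if none exists."""
    inside = np.flatnonzero(activation[coarse_start:coarse_end] >= threshold)
    if inside.size == 0:
        return coarse_start, coarse_end
    return coarse_start + int(inside[0]), coarse_start + int(inside[-1]) + 1

# Toy usage: 100 frames, 12 caption tokens, a coarse boundary of [20, 80).
rng = np.random.default_rng(0)
attn = rng.random((12, 100))
attn[:, 40:60] += 1.0                # pretend the decoder attends to frames 40-60
eas = event_activation_sequence(attn)
print(refine_boundary(eas, 20, 80))  # a tighter boundary around frames 40-60

The key idea this sketch captures is the one stated in the abstract: frames that the captioning model relies on when describing an event are treated as relevant to the query, so the coarse boundary can be refined without any fine-grained supervision.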



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 5, September 2023, 262 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3585398
  • Editor: Abdulmotaleb El Saddik


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 16 March 2023
      • Online AM: 12 January 2023
      • Accepted: 3 January 2023
      • Revised: 23 October 2022
      • Received: 20 June 2022
Published in TOMM Volume 19, Issue 5
