Video Captioning by Learning from Global Sentence and Looking Ahead

Published: 07 June 2023

Abstract

Video captioning aims to automatically generate natural language sentences describing the content of a video. Although encoder-decoder-based models have achieved promising progress, it is still very challenging to effectively model the linguistic behavior of humans in generating video captions. In this paper, we propose a novel video captioning model by learning from gLobal sEntence and looking AheaD, LEAD for short. Specifically, LEAD consists of two modules: a Vision Module (VM) and a Language Module (LM). VM is a novel attention network that maps visual features into a high-level language space and explicitly models entire sentences. LM not only makes effective use of the preceding sequence when generating the current word, but also looks ahead at the future word. Based on VM and LM, LEAD can therefore obtain both global sentence information and future word information, making video captioning more like a fill-in-the-blank task than word-by-word sentence generation. In addition, we propose an autonomous strategy and a multi-stage training scheme to optimize the model, which mitigates the problem of information leakage. Extensive experiments show that LEAD outperforms several state-of-the-art methods on MSR-VTT, MSVD, and VATEX, demonstrating the effectiveness of the proposed approach in video captioning. We also release the code of our proposed model publicly.1
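
To make the two-module design concrete, below is a minimal PyTorch sketch of the idea as the abstract describes it: an attention-based Vision Module pools frame features into a global sentence vector in a language-sized space, and a Language Module fuses left-to-right context with that vector and a look-ahead signal before predicting each word. All names, dimensions, and wiring here (the learned query, the GRU decoder, the zero-filled look-ahead placeholder) are assumptions for illustration, not the authors' released implementation (see footnote 1 for the official code).

```python
# Illustrative sketch only; module names, dimensions, and wiring are assumed.
import torch
import torch.nn as nn


class VisionModule(nn.Module):
    """Attends over frame features and pools them into a single global
    sentence vector in the language space (the abstract's VM)."""

    def __init__(self, vis_dim=2048, lang_dim=512, heads=8):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lang_dim)
        self.attn = nn.MultiheadAttention(lang_dim, heads, batch_first=True)
        # Learned query that summarizes the whole video as one sentence vector.
        self.query = nn.Parameter(torch.randn(1, 1, lang_dim))

    def forward(self, frames):                        # frames: (B, T, vis_dim)
        feats = self.proj(frames)                     # (B, T, lang_dim)
        q = self.query.expand(frames.size(0), -1, -1)
        global_sent, _ = self.attn(q, feats, feats)   # (B, 1, lang_dim)
        return global_sent


class LanguageModule(nn.Module):
    """Predicts each word from left-to-right context fused with the global
    sentence vector and a look-ahead signal (the abstract's LM)."""

    def __init__(self, vocab_size, lang_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, lang_dim)
        self.rnn = nn.GRU(lang_dim, lang_dim, batch_first=True)
        self.fuse = nn.Linear(3 * lang_dim, lang_dim)
        self.out = nn.Linear(lang_dim, vocab_size)

    def forward(self, tokens, global_sent, lookahead):
        left, _ = self.rnn(self.embed(tokens))        # (B, L, lang_dim)
        g = global_sent.expand(-1, left.size(1), -1)  # broadcast over positions
        fused = torch.tanh(self.fuse(torch.cat([left, g, lookahead], dim=-1)))
        return self.out(fused)                        # (B, L, vocab_size)


if __name__ == "__main__":
    vm, lm = VisionModule(), LanguageModule(vocab_size=10000)
    frames = torch.randn(2, 20, 2048)                 # 2 videos, 20 frames each
    tokens = torch.randint(0, 10000, (2, 12))         # caption token prefixes
    # Stand-in for future-word information; per the abstract, this signal must
    # come from the model itself rather than ground truth to avoid leakage.
    lookahead = torch.zeros(2, 12, 512)
    logits = lm(tokens, vm(frames), lookahead)
    print(logits.shape)                               # torch.Size([2, 12, 10000])
```

Fusing left context with global sentence and future-word information is what shifts generation from strict left-to-right decoding toward fill-in-the-blank prediction; because the look-ahead signal must not expose the ground-truth next word during training, the abstract's autonomous strategy and multi-stage training scheme are needed to prevent information leakage.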


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 5s
October 2023, 280 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3599694
Editor: Abdulmotaleb El Saddik

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 7 June 2023
        • Online AM: 9 March 2023
        • Accepted: 7 March 2023
        • Revised: 9 February 2023
        • Received: 17 October 2022
Published in TOMM Volume 19, Issue 5s

        Qualifiers

        • research-article
