Abstract
Video captioning aims to automatically generate natural language sentences describing the content of a video. Although encoder-decoder-based models have achieved promising progress, it remains very challenging to effectively model the linguistic behavior of humans when generating video captions. In this paper, we propose a novel video captioning model that learns from the gLobal sEntence and looks AheaD, LEAD for short. Specifically, LEAD consists of two modules: a Vision Module (VM) and a Language Module (LM). The VM is a novel attention network that maps visual features into a high-level language space and models entire sentences explicitly. The LM not only makes effective use of the preceding sequence when generating the current word, but also looks at the future word. Building on the VM and LM, LEAD therefore obtains both global sentence information and future word information, making video captioning more like a fill-in-the-blank task than word-by-word sentence generation. In addition, we propose an autonomous strategy and a multi-stage training scheme to optimize the model, which mitigates the problem of information leakage. Extensive experiments show that LEAD outperforms several state-of-the-art methods on MSR-VTT, MSVD, and VATEX, demonstrating the effectiveness of the proposed approach to video captioning. We also release the code of our proposed model publicly.
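To make the fill-in-the-blank idea above concrete, below is a minimal PyTorch sketch of a decoder that conditions each word on (a) the left context, (b) a one-step look-ahead into a draft caption, and (c) a global sentence vector (in LEAD, such a vector would come from the Vision Module). Everything here, including module names, dimensions, the fusion scheme, and the use of a draft caption as the look-ahead source, is an illustrative assumption, not the paper's actual implementation.

```python
# Hypothetical sketch of fill-in-the-blank caption decoding, NOT the
# authors' LEAD implementation: each word is refined from the left
# context, a look-ahead word, and a global sentence vector.
import torch
import torch.nn as nn

class FillInTheBlankDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, global_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        # Fuse left-context state, look-ahead word, and global sentence vector.
        self.fuse = nn.Linear(hidden_dim + embed_dim + global_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, draft_tokens, global_sent):
        # draft_tokens: (B, T) token ids of a draft caption (e.g., from a
        # first left-to-right pass); global_sent: (B, global_dim) sentence
        # vector assumed to be predicted from the video features.
        B, T = draft_tokens.shape
        emb = self.embed(draft_tokens)                       # (B, T, E)
        h = emb.new_zeros(B, self.lstm.hidden_size)
        c = emb.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            h, c = self.lstm(emb[:, t], (h, c))              # left context
            # Look one step ahead into the draft; zero-pad at the last step.
            ahead = emb[:, t + 1] if t + 1 < T else torch.zeros_like(emb[:, 0])
            fused = torch.tanh(self.fuse(torch.cat([h, ahead, global_sent], dim=-1)))
            logits.append(self.out(fused))                   # refine word t
        return torch.stack(logits, dim=1)                    # (B, T, vocab)

# Usage: refine a batch of two 6-token draft captions.
dec = FillInTheBlankDecoder(vocab_size=1000)
scores = dec(torch.randint(0, 1000, (2, 6)), torch.randn(2, 512))
print(scores.shape)  # torch.Size([2, 6, 1000])
```

Note that taking the look-ahead word from the ground truth at training time would let the model trivially copy it; this is the information-leakage problem the abstract's autonomous strategy and multi-stage training scheme are said to mitigate.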