Abstract
Video captioning requires a model to understand video, align video with text, and generate text. Because of the semantic gap between vision and language, video-text alignment, which maps representations from the visual domain to the language domain, is a crucial step in bridging that gap. However, existing methods often overlook this step, so the decoder must take visual representations directly as input, which increases its workload and limits its ability to generate semantically correct captions. In this paper, we propose a video-text alignment module, composed of a retrieval unit and an alignment unit, that learns video-text aligned representations for video captioning. Specifically, we first introduce a retrieval unit that retrieves sentences as additional input, serving as semantic anchors between the visual scene and its language description. We then employ an alignment unit that takes the video and the retrieved sentences as input and aligns the representations of the two modalities in a shared semantic space. The resulting video-text aligned representations are used to generate semantically correct captions; moreover, the retrieved sentences provide rich semantic concepts that help generate distinctive captions. Experiments on two public benchmarks, VATEX and MSR-VTT, demonstrate that our method outperforms state-of-the-art methods by a large margin, and qualitative analysis shows that it generates correct and distinctive captions.
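To make the two units concrete, the sketch below gives one plausible reading of the pipeline in PyTorch: a retrieval unit scores a pooled video embedding against a bank of precomputed sentence embeddings and keeps the top-k matches, and an alignment unit projects the video features and the retrieved sentences into a shared space before fusing them for the caption decoder. This is a minimal illustration, not the authors' implementation; the module names, dimensions, and the choice of a Transformer layer for fusion are all assumptions.

```python
# Minimal sketch (not the authors' code) of a retrieval unit and an
# alignment unit for video captioning. All names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalUnit(nn.Module):
    """Retrieve the k most similar corpus sentences for each video."""
    def __init__(self, sentence_bank: torch.Tensor, k: int = 5):
        super().__init__()
        # Precomputed sentence embeddings, shape (num_sentences, dim).
        self.register_buffer("bank", F.normalize(sentence_bank, dim=-1))
        self.k = k

    def forward(self, video_emb: torch.Tensor) -> torch.Tensor:
        # video_emb: (batch, dim) pooled video representation.
        sim = F.normalize(video_emb, dim=-1) @ self.bank.t()  # (batch, num_sentences)
        topk = sim.topk(self.k, dim=-1).indices               # (batch, k)
        return self.bank[topk]                                # (batch, k, dim)

class AlignmentUnit(nn.Module):
    """Project both modalities into a shared space and fuse them."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.video_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)
        self.fuse = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                               batch_first=True)

    def forward(self, video_feats: torch.Tensor,
                retrieved_sents: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, frames, dim); retrieved_sents: (batch, k, dim).
        v = self.video_proj(video_feats)
        t = self.text_proj(retrieved_sents)
        # Fusing the concatenated token sequence yields video-text aligned
        # representations that a caption decoder can attend over.
        return self.fuse(torch.cat([v, t], dim=1))
```

In practice the two projections would be trained with an alignment objective (e.g., a contrastive loss between video and sentence tokens) so that the shared space is actually aligned; that training loop is omitted here.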