Abstract
Video captioning is an emerging research topic in computer vision that aims to generate a natural-language sentence correctly reflecting the visual content of a video. The well-established approach relies on the encoder-decoder paradigm: the model learns to encode the input video and decode a variable-length output sentence in a sequence-to-sequence manner. Nevertheless, such approaches often fail to produce sentences as complex and descriptive as those written by humans, since the models cannot memorize all of the visual content and syntactic structures in the human-annotated video-sentence pairs. In this article, we introduce a Retrieval Augmentation Mechanism (RAM) that enables explicit reference to existing video-sentence pairs within any encoder-decoder captioning model. Specifically, for each query video, a video-sentence retrieval model first fetches semantically relevant sentences from the training sentence pool, together with their corresponding training videos. RAM then writes these relevant video-sentence pairs into memory and, at each timestep, reads the memorized visual content and syntactic structures from memory to facilitate word prediction. Furthermore, we present the Retrieval Augmented Convolutional Encoder-Decoder Network (R-ConvED), which integrates RAM into a convolutional encoder-decoder structure to boost video captioning. Extensive experiments on the MSVD, MSR-VTT, ActivityNet Captions, and VATEX datasets validate the superiority of our proposals and demonstrate quantitatively compelling results.
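To make the retrieve-write-read idea concrete, below is a minimal sketch in PyTorch-style Python: retrieve the top-k semantically relevant sentences (with their videos) for a query, stack them into a memory, and read that memory with soft attention at each decoding timestep. The tensor shapes, the cosine-similarity retrieval, and the output projection `out_proj` are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of the retrieve-write-read idea behind RAM, assuming
# hypothetical tensor shapes and plain dot-product attention; the paper's
# exact architecture may differ.
import torch
import torch.nn.functional as F

def retrieve_pairs(query_emb, sent_embs, video_embs, k=5):
    """Fetch the k training sentences most relevant to one query video
    (plus their paired videos) via cosine similarity in a joint space."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), sent_embs, dim=-1)
    idx = sims.topk(k).indices
    return sent_embs[idx], video_embs[idx]        # each: (k, d)

def ram_read(decoder_state, memory):
    """Read the memory of retrieved features with soft attention
    conditioned on the current decoder hidden state."""
    scores = memory @ decoder_state               # (m,) relevance scores
    weights = F.softmax(scores, dim=0)            # attention over memory slots
    return weights @ memory                       # (d,) memory readout

def decode_step(decoder_state, memory, out_proj):
    """One decoding timestep: fuse the decoder state with the memory
    readout before predicting the next word (out_proj is hypothetical,
    e.g., torch.nn.Linear(2 * d, vocab_size))."""
    readout = ram_read(decoder_state, memory)
    logits = out_proj(torch.cat([decoder_state, readout], dim=-1))
    return F.log_softmax(logits, dim=-1)          # next-word distribution
```

In this sketch the memory would simply stack the retrieved sentence and video features, e.g., `memory = torch.cat(retrieve_pairs(q, S, V), dim=0)`, so that decoding can draw on both the syntactic structure of the retrieved sentences and the visual content of their paired videos.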