Abstract
Video captioning, which bridges vision and language, is a fundamental yet challenging task in computer vision. To generate accurate and comprehensive sentences, both visual and semantic information are essential. However, most existing methods simply concatenate different types of features and ignore the interactions between them. In addition, a large semantic gap separates the visual feature space from the semantic embedding space, which further complicates the task. To address these issues, we propose a framework named semantic embedding guided attention with Explicit visual Feature Fusion for vidEo CapTioning, EFFECT for short, in which we design an explicit visual-feature fusion (EVF) scheme that captures the pairwise interactions between multiple visual modalities and fuses the multimodal visual features of videos in an explicit way. Furthermore, we propose a novel attention mechanism called semantic embedding guided attention (SEGA), which cooperates with temporal attention to generate a joint attention map. Specifically, in SEGA, the semantic word embedding information guides the model to attend to the most relevant visual features at each decoding step. In this way, the semantic gap between the visual and semantic spaces is alleviated to some extent. We evaluate the proposed model with extensive experiments on two widely used datasets, MSVD and MSR-VTT. The experimental results demonstrate that our approach achieves state-of-the-art performance on four evaluation metrics.
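To make the two mechanisms described above concrete, the following PyTorch sketch illustrates one plausible reading of them. It is not the paper's implementation: the module names (`PairwiseFusionSketch`, `SEGASketch`), the feature dimensions, and the additive combination of the two attention score maps are all our assumptions, not details confirmed by the abstract.

```python
# Illustrative sketch only; all names and dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairwiseFusionSketch(nn.Module):
    """One common way to capture pairwise interactions between two visual
    modalities (e.g., 2D appearance and 3D motion features): a low-rank
    bilinear (elementwise) product of projected features."""
    def __init__(self, d_a=1536, d_m=1024, d_f=512):
        super().__init__()
        self.pa = nn.Linear(d_a, d_f)  # project appearance features
        self.pm = nn.Linear(d_m, d_f)  # project motion features

    def forward(self, fa, fm):
        # fa: (B, T, d_a) appearance; fm: (B, T, d_m) motion.
        return torch.tanh(self.pa(fa)) * torch.tanh(self.pm(fm))  # (B, T, d_f)


class SEGASketch(nn.Module):
    """A minimal reading of semantic-embedding-guided attention: temporal
    attention scores from the decoder state are modulated by scores derived
    from the previous word's embedding, yielding a joint attention map."""
    def __init__(self, d_v=512, d_w=300, d_h=512, d_a=256):
        super().__init__()
        self.v_proj = nn.Linear(d_v, d_a)        # project frame features
        self.h_proj = nn.Linear(d_h, d_a)        # project decoder state
        self.w_proj = nn.Linear(d_w, d_a)        # project word embedding
        self.temporal_score = nn.Linear(d_a, 1)  # temporal attention head
        self.semantic_score = nn.Linear(d_a, 1)  # semantic guidance head

    def forward(self, V, h, e):
        # V: (B, T, d_v) visual features; h: (B, d_h) decoder hidden state;
        # e: (B, d_w) embedding of the previously generated word.
        Vp = self.v_proj(V)  # (B, T, d_a)
        a_t = self.temporal_score(torch.tanh(Vp + self.h_proj(h).unsqueeze(1))).squeeze(-1)
        a_s = self.semantic_score(torch.tanh(Vp + self.w_proj(e).unsqueeze(1))).squeeze(-1)
        # One plausible fusion: add the two score maps before a single
        # softmax, so the word embedding re-weights the temporal attention.
        alpha = F.softmax(a_t + a_s, dim=1)                  # joint attention map (B, T)
        return torch.bmm(alpha.unsqueeze(1), V).squeeze(1)   # attended context (B, d_v)


if __name__ == "__main__":
    B, T = 2, 26  # hypothetical batch of 2 clips, 26 frames each
    fuse, sega = PairwiseFusionSketch(), SEGASketch(d_v=512)
    V = fuse(torch.randn(B, T, 1536), torch.randn(B, T, 1024))  # fused visual features
    ctx = sega(V, torch.randn(B, 512), torch.randn(B, 300))     # one decoding step
    print(ctx.shape)  # torch.Size([2, 512])
```

Summing the two score maps before the single softmax keeps the result a proper distribution over frames, which fits the abstract's description of a "joint attention map"; other fusions, such as multiplying separately normalized maps, would be equally consistent with the text.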