Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning

Abstract
Fully mining visual cues to aid content understanding is crucial for video captioning. However, most state-of-the-art video captioning methods generate captions based only on immediate frame-level information, ignoring scenario and context. To fill this gap, we propose a novel, simple yet effective scenario-aware recurrent transformer (SART) model for video captioning. Our model contains a "scenario understanding" module that obtains a global perspective across multiple frames, providing a specific scenario to guide goal-directed description. Moreover, to achieve narrative continuity in the generated paragraph, a unified recurrent transformer is adopted. To demonstrate the effectiveness of SART, we conduct comprehensive experiments on large-scale video description datasets, including ActivityNet, YouCookII, and VideoStory. Additionally, we extend a story-oriented evaluation framework to assess the quality of the generated captions more precisely. The superior performance shows that SART generates accurate, deliberate, and narratively coherent video descriptions.
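To make the two mechanisms named in the abstract concrete, the following is a minimal PyTorch sketch of how a "scenario understanding" module (pooling multi-frame features into one global scenario vector) could be paired with a recurrent transformer decoder (carrying a memory state across sentences for narrative continuity). All class names, tensor shapes, and the memory-update rule are illustrative assumptions of ours, not the authors' released implementation.

```python
# Illustrative sketch only: a learned-query attention pool for the scenario
# vector, plus a transformer decoder whose memory state recurs across
# sentences. Hypothetical names (ScenarioEncoder, RecurrentCaptionDecoder).
import torch
import torch.nn as nn


class ScenarioEncoder(nn.Module):
    """Pools per-frame features into a single global scenario embedding."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # A learned query attends over all frames to summarize the scenario.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, n_frames, d_model)
        q = self.query.expand(frame_feats.size(0), -1, -1)
        scenario, _ = self.attn(q, frame_feats, frame_feats)
        return scenario  # (batch, 1, d_model)


class RecurrentCaptionDecoder(nn.Module):
    """Transformer decoder conditioned on the scenario vector and on a
    memory carried over from previously generated sentences."""

    def __init__(self, vocab_size: int, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, frame_feats, scenario, memory):
        # Scenario, recurrent memory, and frame features form the context.
        context = torch.cat([scenario, memory, frame_feats], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.decoder(self.embed(tokens), context, tgt_mask=causal)
        # Crude recurrent-state update: mean-pool the sentence's hidden states.
        new_memory = h.mean(dim=1, keepdim=True)
        return self.out(h), new_memory


# Usage: the memory produced for sentence t is fed back when decoding
# sentence t+1, which is what encourages paragraph-level continuity.
feats = torch.randn(2, 20, 512)                   # 2 videos, 20 frames each
enc = ScenarioEncoder()
dec = RecurrentCaptionDecoder(vocab_size=10000)
scenario = enc(feats)
memory = torch.zeros(2, 1, 512)                   # empty memory for sentence 1
tokens = torch.randint(0, 10000, (2, 12))         # one (shifted) sentence
logits, memory = dec(tokens, feats, scenario, memory)
```

This is a sketch under stated assumptions: the real model's memory update and scenario conditioning may differ, but the control flow — one shared decoder invoked sentence by sentence with a carried-over state — is the recurrent-transformer pattern the abstract describes.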