Abstract
It is interesting and challenging to translate a video to natural description sentences based on the video content. In this work, an advanced framework is built to generate sentences with coherence and rich semantic expressions for video captioning. A long short term memory (LSTM) network with an improved factored way is first developed, which takes the inspiration of LSTM with a conventional factored way and a common practice to feed multi-modal features into LSTM at the first time step for visual description. Then, the incorporation of the LSTM network with the proposed improved factored way and un-factored way is exploited, and a voting strategy is utilized to predict candidate words. In addition, for robust and abstract visual and language representation, residuals are employed to enhance the gradient signals that are learned from the residual network (ResNet), and a deeper LSTM network is constructed. Furthermore, three convolutional neural network based features extracted from GoogLeNet, ResNet101, and ResNet152, are fused to catch more comprehensive and complementary visual information. Experiments are conducted on two benchmark datasets, including MSVD and MSR-VTT2016, and competitive performances are obtained by the proposed techniques as compared to other state-of-the-art methods.
- Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. 2015. Delving deeper into convolutional networks for learning video representations. In Proceedings of the International Conference on Learning Representations.Google Scholar
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Meeting of the Association for Computational Linguistics Workshop. 65–72.Google Scholar
- Lorenzo Baraldi, Grana Costantino, and Cucchiara Rita. 2017. Hierarchical boundary-aware neural encoder for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3185–3194.Google Scholar
Cross Ref
- Yi Bin, Yang Yang, Fumin Shen, Ning Xie, Heng Tao Shen, and Xuelong Li. 2019. Describing video with attention based bidirectional LSTM. IEEE Trans. Cyber. 49, 7 (July 2019), 2631--2641.Google Scholar
Cross Ref
- João Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.Google Scholar
Cross Ref
- David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the Meeting of the Association for Computational Linguistics. 190–200. Google Scholar
Digital Library
- Yanyu Chen, Shuhui Wang, Weigang Zhang, and Qingming Huang. 2018. Less is more: Picking informative frames for video captioning. In Proceedings of the European Conference on Computer Vision.Google Scholar
Cross Ref
- Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Trans. Multimed. 17, 11 (Nov. 2015), 1875–1886.Google Scholar
Digital Library
- Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625–2634.Google Scholar
Cross Ref
- Jianfeng Dong, Xirong Li, Weiyu Lan, Yujia Huo, and Cees G. M. Snoek. 2016. Early embedding and late reranking for video captioning. In Proceedings of the ACM Conference on Multimedia. ACM, 1082–1086. Google Scholar
Digital Library
- Fang Fang, Qinyu Li, Hanli Wang, and Pengjie Tang. 2018. Refining attention: A sequential attention model for image captioning. In Proceedings of the IEEE International Conference on Multimedia and Expo. 1–6.Google Scholar
Cross Ref
- Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. 2017. Semantic compositional networks for visual captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5630–5639.Google Scholar
Cross Ref
- Lianli Gao, Zhao Guo, Hanwang Zhang, Xing Xu, and Heng Tao Shen. 2017. Video captioning with attention-based LSTM and semantic consistency. IEEE Trans. Multimed. 19, 9 (2017), 2045–2055.Google Scholar
Digital Library
- Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2013. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE International Conference on Computer Vision. 2712–2719. Google Scholar
Digital Library
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.Google Scholar
Digital Library
- Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM Conference on Multimedia. ACM, 675–678. Google Scholar
Digital Library
- Qin Jin, Jia Chen, Shizhe Chen, Yifan Xiong, and Alexander Hauptmann. 2016. Describing videos using multi-modal fusion. In Proceedings of the ACM Conference on Multimedia. ACM, 1087–1091. Google Scholar
Digital Library
- Andrej Karpathy and Fei-Fei Li. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128–3137.Google Scholar
Cross Ref
- Ryan Kiros, Ruslan Salakhutdinov, and Rich Zemel. 2014. Multimodal neural language models. In Proceedings of the International Conference on Machine Learning. 595–603. Google Scholar
Digital Library
- Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, and Sergio Guadarrama. 2013. Generating natural-language video descriptions using text-mined knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence. 541–547. Google Scholar
Digital Library
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Proceedings of the Conference on Neural Information Processing Systems. 1097–1105. Google Scholar
Digital Library
- Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the Meeting of the Association for Computational Linguistics. 21–26. Google Scholar
Digital Library
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision. 740–755.Google Scholar
- Andriy Mnih and Geoffrey E. Hinton. 2007. Three new graphical models for statistical language modelling. In Proceedings of the International Conference on Machine Learning. 641–648. Google Scholar
Digital Library
- Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4694–4702.Google Scholar
Cross Ref
- Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. 2016. Hierarchical recurrent neural encoder for video representation with application to captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1029–1038.Google Scholar
Cross Ref
- Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4594–4602.Google Scholar
Cross Ref
- Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. 2017. Video captioning with transferred semantic attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 984–992.Google Scholar
Cross Ref
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Meeting of the Association for Computational Linguistics. 311–318. Google Scholar
Digital Library
- Yunchen Pu, Martin Renqiang Min, Zhe Gan, and Lawrence Carin. 2016. Adaptive feature abstraction for translating video to language. Retrieved from arXiv preprint arXiv:1611.07837.Google Scholar
- Vasili Ramanishka, Abir Das, Dong Huk Park, Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, and Kate Saenko. 2016. Multimodal video description. In Proceedings of the ACM Conference on Multimedia. ACM, 1092–1096. Google Scholar
Digital Library
- Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, and Xiangyang Xue. 2017. Weakly supervised dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1916–1924.Google Scholar
Cross Ref
- Rakshith Shetty and Jorma Laaksonen. 2016. Frame- and segment-level features and candidate pool evaluation for video caption generation. In Proceedings of the ACM Conference on Multimedia. ACM, 1073–1076. Google Scholar
Digital Library
- Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations.Google Scholar
- Jingkuan Song, Yuyu Guo, Lianli Gao, Xuelong Li, Alan Hanjalic, and Heng Tao Shen. 2017. From deterministic to generative: Multimodal stochastic RNNs for video captioning. arXiv preprint arXiv: 1708.02478.Google Scholar
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition. 1–9.Google Scholar
Cross Ref
- Pengjie Tang, Hanli Wang, Hanzhang Wang, and Kaisheng Xu. 2017. Richer semantic visual and language representation for video captioning. In Proceedings of the ACM Conference on Multimedia. ACM, 1871–1876. Google Scholar
Digital Library
- Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond J. Mooney. 2014. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the International Conference on Computational Linguistics. 1218–1227.Google Scholar
- Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 489–4497. Google Scholar
Digital Library
- Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566–4575.Google Scholar
Cross Ref
- Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence-video to text. In Proceedings of the IEEE International Conference on Computer Vision. 4534–4542. Google Scholar
Digital Library
- Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1494–1504.Google Scholar
Cross Ref
- Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156–3164.Google Scholar
Cross Ref
- Bairui Wang, Lin Ma, Wei Zhang, and Wei Liu. 2018. Reconstruction network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7622–7631.Google Scholar
Cross Ref
- Hanzhang Wang, Hanli Wang, and Kaisheng Xu. 2018. Categorizing concepts with basic level for vision-to-language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4962–4970.Google Scholar
Cross Ref
- Junbo Wang, Wei Wang, Yan Huang, Liang Wang, and Tieniu Tan. 2018 M<sup>3</sup> . M: Multimodal memory modelling for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7512–7520.Google Scholar
Cross Ref
- Shikui Wei, Yao Zhao, Zhenfeng Zhu, and Nan Liu. 2010. Multimodal fusion for video search reranking. IEEE Trans. Knowl. Data Eng. 22, 8 (Aug. 2010), 1191–1199. Google Scholar
Digital Library
- Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton Hengel. 2016. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 203–212.Google Scholar
Cross Ref
- Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5288–5296.Google Scholar
Cross Ref
- Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. 2048–2057. Google Scholar
Digital Library
- Kaisheng Xu, Hanli Wang, and Pengjie Tang. 2017. Image captioning with deep LSTM based on sequential residual. In Proceedings of the IEEE International Conference on Multimedia and Expo. 361–366.Google Scholar
Cross Ref
- Yang Yang, Jie Zhou, Jiangbo Ai, Yi Bin, Alan Hanjalic, and Heng Tao Shen. 2018. Video captioning by adversarial LSTM. IEEE Trans. Image Proc. 27, 11 (2018), 5600–5611. Google Scholar
Digital Library
- Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507–4515. Google Scholar
Digital Library
- Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2017. Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision. 4904–4912.Google Scholar
Cross Ref
- Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4584–4593.Google Scholar
Cross Ref
- Mingxing Zhang, Yang Yang, Hanwang Zhang, Yanli Ji, Heng Tao Shen, and Tat-Seng Chua. 2018. More is better: Precise and detailed image captioning using online positive recall and missing concepts mining. IEEE Trans. Image Proc. 28, 1 (2018), 32–44. Google Scholar
Digital Library
- Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. 2018. ECO: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision.Google Scholar
Cross Ref
Index Terms
Rich Visual and Language Representation with Complementary Semantics for Video Captioning
Recommendations
Richer Semantic Visual and Language Representation for Video Captioning
MM '17: Proceedings of the 25th ACM international conference on MultimediaTranslating and summarizing a video into natural language is an interesting and challenging visual task. In this work, a novel framework is built to generate sentences for videos with more coherence and semantics. A long short term memory (LSTM) network ...
Learning Multimodal Attention LSTM Networks for Video Captioning
MM '17: Proceedings of the 25th ACM international conference on MultimediaAutomatic generation of video caption is a challenging task as video is an information-intensive media with complex variations. Most existing methods, either based on language templates or sequence learning, have treated video as a flat data sequence ...
Residual attention-based LSTM for video captioning
Recently great success has been achieved by proposing a framework with hierarchical LSTMs in video captioning, such as stacked LSTM networks. When deeper LSTM layers are able to start converging, a degradation problem has been exposed. With the number ...






Comments