Rich Visual and Language Representation with Complementary Semantics for Video Captioning

Published: 05 June 2019

Abstract

Translating a video into natural-language description sentences based on its content is an interesting and challenging task. In this work, an advanced framework is built to generate coherent and semantically rich sentences for video captioning. First, a long short-term memory (LSTM) network with an improved factored scheme is developed, inspired by the conventional factored LSTM and by the common practice of feeding multi-modal features into the LSTM at the first time step for visual description. Then, the LSTM networks with the proposed improved factored scheme and the un-factored scheme are combined, and a voting strategy is used to predict candidate words. In addition, to obtain robust and abstract visual and language representations, residual connections, in the spirit of the residual network (ResNet), are employed to strengthen gradient signals, and a deeper LSTM network is constructed. Furthermore, three convolutional neural network features, extracted from GoogLeNet, ResNet101, and ResNet152, are fused to capture more comprehensive and complementary visual information. Experiments on two benchmark datasets, MSVD and MSR-VTT2016, show that the proposed techniques achieve competitive performance compared with other state-of-the-art methods.
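To make the decoding pipeline concrete, the following PyTorch snippet is a minimal sketch, not the authors' implementation: it assumes pooled per-video features from the three CNNs, fuses them through learned projections and summation, and injects the fused vector into the LSTM at the first time step, before the word embeddings. All module names, dimensions, and the summation-based fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusedFeatureCaptioner(nn.Module):
    """Toy decoder: fused CNN features are injected at the first LSTM step."""

    def __init__(self, feat_dims=(1024, 2048, 2048), embed_dim=512,
                 hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project each CNN feature into a shared embedding space, then fuse
        # by summation (concatenation or averaging would also be plausible).
        self.projections = nn.ModuleList(
            [nn.Linear(d, embed_dim) for d in feat_dims])
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cnn_feats, captions):
        # cnn_feats: three tensors of shape (batch, feat_dim), e.g. pooled
        # GoogLeNet / ResNet101 / ResNet152 descriptors of the video.
        # captions: (batch, seq_len) word indices of the ground-truth sentence.
        fused = sum(proj(f) for proj, f in zip(self.projections, cnn_feats))
        visual = fused.unsqueeze(1)        # visual input at time step 0
        words = self.word_embed(captions)  # word inputs at steps 1..seq_len
        states, _ = self.lstm(torch.cat([visual, words], dim=1))
        # Drop the last state: the output at step t predicts caption word t,
        # so the visual step yields the logits for the first word.
        return self.classifier(states[:, :-1])
```

In the fuller model described above, the factored and un-factored LSTM decoders would each produce such word distributions and a voting strategy over their candidate words would select the output; the residual connections and the deeper LSTM stack are likewise omitted from this sketch.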

