
Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning

Published: 04 March 2022

Abstract

Fully mining visual cues to aid content understanding is crucial for video captioning. However, most state-of-the-art video captioning methods generate captions purely from straightforward local information and ignore scenario and context information. To fill this gap, we propose a novel, simple yet effective scenario-aware recurrent transformer (SART) model for video captioning. Our model contains a “scenario understanding” module that obtains a global perspective across multiple frames, providing a specific scenario that guides goal-directed description. Moreover, to achieve narrative continuity in the generated paragraph, a unified recurrent transformer is adopted. To demonstrate the effectiveness of SART, we conduct comprehensive experiments on several large-scale video description datasets, including ActivityNet, YouCookII, and VideoStory. We also extend a story-oriented evaluation framework to assess the quality of the generated captions more precisely. The superior performance shows that SART generates correct, deliberate, and narratively coherent video descriptions.
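The abstract outlines two components: a scenario-understanding module that pools a global view over the whole video, and a recurrent transformer that carries state across segments so consecutive sentences stay coherent. The PyTorch sketch below illustrates one plausible way to wire these ideas together. It is not the authors' SART implementation; every name, dimension, and design choice in it (the mean-pooled ScenarioEncoder, the GRU-based memory update, the prepended scenario and memory tokens) is an assumption made for illustration only.

# Minimal, hypothetical sketch of a scenario-aware recurrent captioner.
# Not the authors' SART code; all module names and update rules are assumptions.
import torch
import torch.nn as nn


class ScenarioEncoder(nn.Module):
    """Pools frame features from the whole video into a single scenario vector."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim)
        pooled = frame_feats.mean(dim=1)          # global view across all frames
        return torch.tanh(self.proj(pooled))      # (batch, hidden_dim)


class RecurrentCaptioner(nn.Module):
    """Decodes one sentence per video segment while carrying a memory vector
    forward, so consecutive sentences of the paragraph stay coherent."""

    def __init__(self, vocab_size, feat_dim, hidden_dim=512, num_layers=2, num_heads=8):
        super().__init__()
        self.scenario_enc = ScenarioEncoder(feat_dim, hidden_dim)
        self.frame_proj = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.memory_gate = nn.GRUCell(hidden_dim, hidden_dim)  # illustrative update rule
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feats, segment_feats, tokens, memory=None):
        # video_feats:   (batch, num_frames, feat_dim)  whole-video frames
        # segment_feats: (batch, seg_len, feat_dim)     current event segment
        # tokens:        (batch, txt_len)               shifted caption tokens
        scenario = self.scenario_enc(video_feats)                 # (batch, hidden)
        if memory is None:
            memory = torch.zeros_like(scenario)
        # The decoder cross-attends to the scenario vector, the recurrent
        # memory, and the projected segment features.
        ctx = torch.cat([scenario.unsqueeze(1), memory.unsqueeze(1),
                         self.frame_proj(segment_feats)], dim=1)
        tgt = self.embed(tokens)
        txt_len = tokens.size(1)
        causal = torch.triu(torch.full((txt_len, txt_len), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, ctx, tgt_mask=causal)          # (batch, txt, hidden)
        # Summarize the decoded sentence and fold it into the memory passed
        # on to the next segment.
        new_memory = self.memory_gate(hidden.mean(dim=1), memory)
        return self.out(hidden), new_memory


if __name__ == "__main__":
    model = RecurrentCaptioner(vocab_size=1000, feat_dim=2048)
    video = torch.randn(2, 40, 2048)                       # 40 frames, 2 videos
    segments = [torch.randn(2, 10, 2048) for _ in range(2)]
    captions = [torch.randint(0, 1000, (2, 12)) for _ in range(2)]
    mem = None
    for seg, cap in zip(segments, captions):
        logits, mem = model(video, seg, cap, mem)          # mem links the sentences
        print(logits.shape)                                # torch.Size([2, 12, 1000])

In a full system the segment loop would be driven by event proposals and trained with teacher forcing; in this sketch the memory vector passed between iterations is what ties the sentences of the paragraph together.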


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 4
November 2022, 497 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3514185
Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 4 March 2022
      • Accepted: 1 December 2021
      • Revised: 1 October 2021
      • Received: 1 August 2021
Published in TOMM Volume 18, Issue 4


      Qualifiers

      • research-article
      • Refereed
