DOI: 10.1145/3474085.3475236
Research article

Latent Memory-augmented Graph Transformer for Visual Storytelling

Published: 17 October 2021

ABSTRACT

Visual storytelling aims to automatically generate a human-like short story from an image stream. Most existing works use either scene-level or object-level representations, neglecting the interactions among objects within each image and the sequential dependencies between consecutive images. In this paper, we present the Latent Memory-augmented Graph Transformer (LMGT), a novel Transformer-based framework for visual story generation. LMGT inherits the merits of the Transformer and is further enhanced with two carefully designed components: a graph encoding module and a latent memory unit. Specifically, the graph encoding module exploits the semantic relationships among image regions and attentively aggregates critical visual features based on the parsed scene graphs. Furthermore, to better preserve inter-sentence coherence and topic consistency, we introduce an augmented latent memory unit that learns and records highly summarized latent information, serving as the story line, from the image stream and the sentence history. Experimental results on three widely used datasets demonstrate the superior performance of LMGT over state-of-the-art methods.
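Read in that light, the two additions amount to (i) attention over detected region features restricted to the edges of a parsed scene graph and (ii) a small latent memory that is read during decoding and updated between sentences. The following is a minimal, hypothetical PyTorch sketch of that wiring; the module names, slot count, and gated write rule are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the two components described in the abstract.
# Shapes, names, and the memory update rule are assumptions for illustration.
import torch
import torch.nn as nn


class GraphEncodingModule(nn.Module):
    """Aggregates region features along parsed scene-graph edges via masked self-attention."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, regions, adjacency):
        # regions:   (B, N, d_model) detector region features
        # adjacency: (B, N, N) boolean scene-graph edges (True = edge present)
        eye = torch.eye(adjacency.size(1), dtype=torch.bool, device=adjacency.device)
        adj = adjacency | eye                                  # keep self-loops so no row is fully masked
        mask = (~adj).repeat_interleave(self.attn.num_heads, dim=0)  # True = attention blocked
        out, _ = self.attn(regions, regions, regions, attn_mask=mask)
        return self.norm(regions + out)                        # residual + LayerNorm, Transformer-style


class LatentMemoryUnit(nn.Module):
    """A few latent slots summarising the image stream and sentence history (the story line)."""

    def __init__(self, d_model=512, n_slots=4, n_heads=8):
        super().__init__()
        self.init_slots = nn.Parameter(0.02 * torch.randn(n_slots, d_model))
        self.write_gate = nn.Linear(2 * d_model, d_model)
        self.read_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def initial(self, batch_size):
        # Learned initial memory, replicated per story in the batch.
        return self.init_slots.unsqueeze(0).expand(batch_size, -1, -1)

    def write(self, memory, summary):
        # memory: (B, S, d_model); summary: (B, d_model) pooled encoding of the latest sentence/image.
        # Gated write (GRU-like) is an assumed update rule.
        s = summary.unsqueeze(1).expand_as(memory)
        g = torch.sigmoid(self.write_gate(torch.cat([memory, s], dim=-1)))
        return g * memory + (1.0 - g) * s

    def read(self, decoder_states, memory):
        # Decoder states attend over the memory slots to keep sentences on one story line.
        out, _ = self.read_attn(decoder_states, memory, memory)
        return decoder_states + out


# Usage sketch (shapes only):
#   enc, mem = GraphEncodingModule(), LatentMemoryUnit()
#   visual = enc(regions, adjacency)          # per-image graph-encoded features
#   memory = mem.initial(batch_size)          # before the first sentence
#   ... decode one sentence with a Transformer decoder that also calls mem.read(...) ...
#   memory = mem.write(memory, sentence_summary)   # after each generated sentence
```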


Supplemental Material

mfp0419_video.mp4 (MP4, 40.1 MB)


Published in

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021, 5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%

