ABSTRACT
Visual storytelling aims to automatically generate a human-like short story from an image stream. Most existing works use either scene-level or object-level representations, neglecting both the interactions among objects within each image and the sequential dependencies between consecutive images. In this paper, we present the Latent Memory-augmented Graph Transformer (LMGT), a novel Transformer-based framework for visual story generation. LMGT inherits the merits of the Transformer and is further enhanced with two carefully designed components: a graph encoding module and a latent memory unit. Specifically, the graph encoding module exploits the semantic relationships among image regions and attentively aggregates critical visual features based on the parsed scene graphs. Furthermore, to better preserve inter-sentence coherence and topic consistency, we introduce a latent memory unit that learns and records highly summarized latent information, serving as the story line, from the image stream and the sentence history. Experimental results on three widely used datasets demonstrate the superior performance of LMGT over state-of-the-art methods.
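To make the two components concrete, the following is a minimal PyTorch sketch of the general idea, not the authors' implementation: region-to-region attention restricted to parsed scene-graph edges, and a recurrent latent memory that folds each step's summary into a running story state. All class names, tensor shapes, and the GRU-style memory update are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    """Attention over image regions, masked to parsed scene-graph edges."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = dim ** -0.5

    def forward(self, regions, adj):
        # regions: (B, N, D) region features; adj: (B, N, N) scene-graph edges (0/1)
        scores = self.q(regions) @ self.k(regions).transpose(-2, -1) * self.scale
        # Add self-loops so every region attends at least to itself,
        # then block attention between unconnected regions.
        eye = torch.eye(adj.size(-1), device=adj.device, dtype=torch.bool)
        scores = scores.masked_fill(~(adj.bool() | eye), float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        return attn @ self.v(regions)  # (B, N, D) relation-aware features

class LatentMemory(nn.Module):
    """GRU-style unit keeping a highly summarized story state across sentences."""
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, memory, step_summary):
        # memory: (B, D) running story line; step_summary: (B, D) pooled
        # visual + sentence-history features for the current step.
        return self.cell(step_summary, memory)

# Usage sketch: one image of the stream per step.
enc, mem = GraphEncoder(512), LatentMemory(512)
regions = torch.randn(2, 36, 512)                 # e.g. 36 detected regions
adj = (torch.rand(2, 36, 36) > 0.8).float()       # parsed scene-graph edges
memory = torch.zeros(2, 512)                      # initial story state
feats = enc(regions, adj)
memory = mem(memory, feats.mean(dim=1))           # fold this step into the story line
```

In this reading, the masked attention plays the role of the graph encoding module (aggregating features only along semantic relationships), while the recurrent state is what would condition each sentence's decoder to keep the story topically consistent.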