Abstract
Multimodal summarization aims to extract the most important information from multimedia input. It has become increasingly popular due to the rapid growth of multimedia data in recent years. Various studies have focused on different multimodal summarization tasks; however, existing methods can generate either single-modal output or multimodal output, but not both within one framework. In addition, most of them require a large number of annotated samples for training, which makes them difficult to generalize to other tasks or domains. Motivated by this, we propose a unified framework for multimodal summarization that covers both single-modal output summarization and multimodal output summarization. Within this framework, we consider three scenarios and propose corresponding unsupervised graph-based multimodal summarization models that require no manually annotated document-summary pairs for training: (1) generic multimodal ranking, (2) modal-dominated multimodal ranking, and (3) non-redundant text-image multimodal ranking. Furthermore, we introduce an image-text similarity estimation model to measure the semantic similarity between images and text. Experiments show that our proposed models outperform single-modal summarization methods on both automatic and human evaluation metrics, and that multimedia information can further improve single-modal summarization. This study can serve as a benchmark for further research on the multimodal summarization task.
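The abstract describes ranking text and image nodes on a similarity graph. As an illustration only, the sketch below shows a LexRank-style power iteration over a mixed set of sentence and image nodes; the function name `multimodal_rank`, the damping value, and the toy similarity matrix are our assumptions, not the paper's exact formulation.

```python
# A minimal sketch (an assumption, not the authors' exact model) of graph-based
# ranking over mixed text/image nodes: scores are the stationary distribution
# of a damped random walk on the similarity graph, as in LexRank/PageRank.
import numpy as np

def multimodal_rank(sim, damping=0.85, tol=1e-8, max_iter=200):
    """Rank nodes (sentences + images) given a nonnegative similarity matrix."""
    n = sim.shape[0]
    # Row-normalize similarities into transition probabilities;
    # dangling nodes (all-zero rows) jump uniformly.
    row_sums = sim.sum(axis=1, keepdims=True)
    trans = np.where(row_sums > 0, sim / np.maximum(row_sums, 1e-12), 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = (1 - damping) / n + damping * (trans.T @ scores)
        if np.abs(new - scores).sum() < tol:
            scores = new
            break
        scores = new
    return scores

# Toy example: nodes 0-2 are sentences, node 3 is an image; entry [i, j] is the
# intra- or cross-modal similarity between nodes i and j.
sim = np.array([
    [0.0, 0.6, 0.1, 0.5],
    [0.6, 0.0, 0.2, 0.4],
    [0.1, 0.2, 0.0, 0.1],
    [0.5, 0.4, 0.1, 0.0],
])
scores = multimodal_rank(sim)
```

Nodes with the highest scores (here, the two well-connected sentences) would be selected for the summary; the modal-dominated and non-redundant variants described in the abstract would modify the walk or the selection step rather than this basic iteration.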
Graph-based Multimodal Ranking Models for Multimodal Summarization