research-article
Open Access

Graph-based Multimodal Ranking Models for Multimodal Summarization

Published: 26 May 2021

Abstract

Multimodal summarization aims to extract the most important information from multimedia input. It has become increasingly popular with the rapid growth of multimedia data in recent years. Various studies have addressed different multimodal summarization tasks, but existing methods can generate either single-modal output or multimodal output, not both. In addition, most of them require large amounts of annotated samples for training, which makes them difficult to generalize to other tasks or domains. Motivated by this, we propose a unified framework for multimodal summarization that covers both single-modal-output and multimodal-output summarization. Within this framework, we consider three different scenarios and propose corresponding unsupervised graph-based multimodal summarization models that require no manually annotated document-summary pairs for training: (1) generic multimodal ranking, (2) modal-dominated multimodal ranking, and (3) non-redundant text-image multimodal ranking. Furthermore, we introduce an image-text similarity estimation model to measure the semantic similarity between images and text. Experiments show that our proposed models outperform single-modal summarization methods on both automatic and human evaluation metrics. Moreover, our models can also improve single-modal summarization with the guidance of multimedia information. This study can serve as a benchmark for further work on the multimodal summarization task.
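The core idea of graph-based ranking that the abstract refers to can be illustrated with a minimal LexRank/PageRank-style sketch: summarization units (sentences and images) are nodes of a similarity graph, and a centrality score computed by power iteration determines each item's salience. This is only an illustrative sketch of the general technique, not the paper's actual models; the toy similarity values, the damping factor, and the `rank_items` function below are assumptions for demonstration.

```python
# Illustrative sketch of graph-based centrality ranking over a multimodal
# item graph (LexRank/PageRank-style). NOT the paper's exact models.
import numpy as np

def rank_items(sim: np.ndarray, damping: float = 0.85,
               tol: float = 1e-6, max_iter: int = 100) -> np.ndarray:
    """Power iteration over a row-normalized similarity graph."""
    n = sim.shape[0]
    # Row-normalize so each row is a probability distribution over neighbors.
    rowsum = sim.sum(axis=1, keepdims=True)
    rowsum[rowsum == 0] = 1.0
    P = sim / rowsum
    scores = np.full(n, 1.0 / n)  # uniform initialization
    for _ in range(max_iter):
        new = (1 - damping) / n + damping * (P.T @ scores)
        if np.abs(new - scores).sum() < tol:
            scores = new
            break
        scores = new
    return scores

# Toy graph: three sentences and one image. In the paper's setting the
# text-image entries would come from a learned image-text similarity model;
# here they are hand-made numbers purely for illustration.
sim = np.array([
    [0.0, 0.8, 0.1, 0.6],
    [0.8, 0.0, 0.2, 0.5],
    [0.1, 0.2, 0.0, 0.1],
    [0.6, 0.5, 0.1, 0.0],
])
scores = rank_items(sim)
order = np.argsort(-scores)  # most salient items first
```

Items with strong similarity ties to many other items accumulate high centrality and are selected for the summary; in the toy matrix above, the weakly connected third item ranks last.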



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 4, July 2021. 419 pages.
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3465463

      Copyright © 2021 Association for Computing Machinery.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 August 2019
• Revised: 1 October 2020
• Accepted: 1 December 2020
• Published: 26 May 2021

      Qualifiers

      • research-article
      • Refereed
