From Coarse to Fine: Hierarchical Structure-aware Video Summarization

Published: 25 January 2022

Abstract

Hierarchical structure is a common characteristic of certain kinds of videos (e.g., sports videos, game videos): such videos are composed of actions arranged hierarchically, temporal dependencies exist among segments at different scales, and the action labels can be enumerated. Our approach is based on two observations. First, actions are the fundamental units through which people understand these videos. Second, humans summarize a video by iteratively observing and refining, i.e., observing segments of the video and hierarchically refining the boundaries of important actions. Based on these insights, we generate action proposals to construct the structure of the video and formulate summarization as a hierarchical refining process. We train a hierarchical summarization network with deep Q-learning (HQSN) to carry out this refining process and to exploit temporal dependencies. In addition, we collect a new dataset consisting of structured game videos with fine-grained action and importance annotations. Experimental results demonstrate the effectiveness of the proposed method.


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 1s (February 2022), 352 pages.
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3505206

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 25 January 2022
• Accepted: 1 September 2021
• Revised: 1 July 2021
• Received: 1 January 2021

Published in TOMM Volume 18, Issue 1s.


      Qualifiers

      • research-article
      • Refereed
