Abstract
Hierarchical structure is a common characteristic of some kinds of videos (e.g., sports videos, game videos): the videos are composed of several actions hierarchically, there exist temporal dependencies among segments of different scales, and the action labels can be enumerated. Our ideas are based on two observations. First, actions are the fundamental units by which people understand these videos. Second, humans summarize a video by iteratively observing and refining, i.e., observing segments of the video and hierarchically refining the boundaries of important actions. Based on these insights, we generate action proposals to construct the structure of the video and formulate summarization as a hierarchical refining process. We train a hierarchical summarization network with deep Q-learning (HQSN) to carry out the refining process and explore temporal dependencies. In addition, we collect a new dataset that consists of structured game videos with fine-grained action and importance annotations. The experimental results demonstrate the effectiveness of the proposed method.
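To make the idea of coarse-to-fine refinement concrete, the following is a minimal sketch and not the HQSN implementation. It assumes a segment is a half-open frame range, that each refinement step may zoom into the left, middle, or right half of the current segment, and that a hand-written scoring function (mean per-frame importance) stands in for a learned Q-function. The names `children`, `refine`, `mean_importance`, and the toy `importance` list are all hypothetical.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # half-open frame range [start, end)


def children(seg: Segment) -> List[Segment]:
    """Candidate sub-segments: left half, middle half, right half (assumed action set)."""
    s, e = seg
    half = (e - s) // 2
    quarter = half // 2
    return [(s, s + half), (s + quarter, s + quarter + half), (e - half, e)]


def refine(seg: Segment, q_value: Callable[[Segment], float], min_len: int = 2) -> List[Segment]:
    """Greedily zoom into the best-scoring child until no child improves the score."""
    path = [seg]
    while (seg[1] - seg[0]) // 2 >= min_len:
        best = max(children(seg), key=q_value)
        if q_value(best) <= q_value(seg):
            break  # "stop" action: refining further does not help
        seg = best
        path.append(seg)
    return path


# Toy per-frame importance; a real system would use annotations or a trained network.
importance = [0, 0, 0, 0, 1, 1, 0, 0]


def mean_importance(seg: Segment) -> float:
    """Stand-in for a learned Q-function: average importance inside the segment."""
    s, e = seg
    return sum(importance[s:e]) / (e - s)


path = refine((0, len(importance)), mean_importance)
print(path)
```

In this toy run the search starts from the whole video and narrows step by step toward the two important frames. With a trained Q-network, the greedy comparison would be replaced by choosing the action with the highest predicted value.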