Abstract
With the rapid growth of video data, video summarization has become a promising way to condense a lengthy video into a compact version. Although supervised summarization approaches have achieved state-of-the-art performance, they require frame-level annotations, which are time-consuming and tedious to produce. In this article, we propose a novel deep summarization framework named Deep Semantic and Attentive Network for Video Summarization (DSAVS), which selects the most semantically representative summary by minimizing the distance between the video representation and the text representation, without any frame-level labels. Another challenge in video summarization is the difficulty of modeling temporal information over long time spans. Long Short-Term Memory (LSTM) networks model temporal dependencies well but perform poorly on long video clips. Therefore, we introduce a self-attention mechanism into our summarization framework to capture the long-range temporal dependencies among the frames. Extensive experiments on two popular benchmark datasets, i.e., SumMe and TVSum, show that our proposed framework outperforms other state-of-the-art unsupervised approaches and even most supervised methods.
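The two ideas named in the abstract, self-attention across frames and a distance between video and text representations, can be illustrated with a small NumPy sketch. This is not the DSAVS architecture: the projection matrices, the sigmoid scoring head `w_s`, the importance-weighted pooling, the use of cosine distance, and the random stand-in features are all assumptions chosen only to make the mechanism concrete.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(frames, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a (T, d) sequence of frame features."""
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    # (T, T) scores: every frame attends to every other frame directly,
    # so distant frames interact without passing through a recurrent chain.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def semantic_distance(video_vec, text_vec):
    """Cosine distance between a video vector and a text vector (assumed metric)."""
    cos = video_vec @ text_vec / (np.linalg.norm(video_vec) * np.linalg.norm(text_vec))
    return 1.0 - cos

rng = np.random.default_rng(0)
T, d = 120, 16                                   # hypothetical: 120 frames, 16-dim features
frames = rng.normal(size=(T, d))                 # stand-in for CNN frame features
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
text_vec = rng.normal(size=d)                    # stand-in for a sentence embedding

context, weights = self_attention(frames, w_q, w_k, w_v)

# Hypothetical scoring head: one importance score in (0, 1) per frame.
w_s = rng.normal(size=d)
importance = 1.0 / (1.0 + np.exp(-(context @ w_s)))

# Importance-weighted pooling gives a single vector for the candidate summary.
video_vec = (importance[:, None] * context).sum(axis=0) / importance.sum()

# In training, a quantity like this would be minimized with respect to the weights.
loss = semantic_distance(video_vec, text_vec)
print(f"semantic distance: {loss:.4f}")
```

Because the attention matrix is `(T, T)`, the first and last frames of a clip are connected in a single step, which is the property the abstract contrasts with LSTM-based modeling.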
Deep Semantic and Attentive Network for Unsupervised Video Summarization