Abstract
Video summarization and action recognition are two important areas of multimedia video analysis. While these two areas have been tackled separately to date, in this article we present a latent structural SVM framework that recognizes the action and derives the summary of a video jointly. Efficient inference is provided by a submodular score function that accounts for the action and the summary together. We also define a novel measure to evaluate the quality of a predicted video summary against the annotations of multiple annotators. Quantitative and qualitative results over two challenging action datasets (ACE and MSR DailyActivity3D) show that the proposed joint approach leads to higher action recognition accuracy and equivalent or better summary quality than comparable approaches that perform these tasks separately.
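To illustrate why a submodular score function makes inference efficient, the following is a minimal sketch of greedy maximization of a monotone submodular set function, for which the greedy solution is guaranteed to be within a factor of (1 - 1/e) of the optimum under a cardinality constraint. The abstract does not specify the actual score function, so everything here is an assumption for illustration: the facility-location coverage score, the names `facility_location` and `greedy_summary`, and the frame-similarity matrix are all hypothetical and are not the article's method.

```python
import numpy as np


def facility_location(similarity, selected):
    """Coverage score (an assumed stand-in for the article's score function).

    Each frame is credited with its best similarity to any selected frame.
    With non-negative similarities this function is monotone and submodular.
    """
    if not selected:
        return 0.0
    return float(similarity[:, selected].max(axis=1).sum())


def greedy_summary(similarity, budget):
    """Select up to `budget` frames, adding the largest marginal gain each step."""
    selected = []
    remaining = set(range(similarity.shape[0]))
    for _ in range(min(budget, len(remaining))):
        current = facility_location(similarity, selected)
        best_gain, best_frame = -1.0, None
        # Iterate in sorted order so that ties are broken deterministically.
        for frame in sorted(remaining):
            gain = facility_location(similarity, selected + [frame]) - current
            if gain > best_gain:
                best_gain, best_frame = gain, frame
        selected.append(best_frame)
        remaining.remove(best_frame)
    return selected
```

Given a non-negative frame-to-frame similarity matrix (for example, cosine similarity of non-negative frame descriptors), `greedy_summary(similarity, budget)` returns the indices of the chosen frames. A joint model of the kind described above would additionally condition the score on the action label; that part is not sketched here.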
V-JAUNE: A Framework for Joint Action Recognition and Video Summarization