research-article

V-JAUNE: A Framework for Joint Action Recognition and Video Summarization

Published: 26 April 2017

Abstract

Video summarization and action recognition are two important areas of multimedia video analysis. While these two areas have been tackled separately to date, in this article we present a latent structural SVM framework that recognizes the action and derives the summary of a video jointly. Efficient inference is provided by a submodular score function that accounts for the action and the summary together. We also define a novel measure to evaluate the quality of a predicted video summary against the annotations of multiple annotators. Quantitative and qualitative results over two challenging action datasets (ACE and MSR DailyActivity3D) show that the proposed joint approach leads to higher action recognition accuracy and equivalent or better summary quality than comparable approaches that perform these tasks separately.
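
To illustrate why a submodular score function makes inference efficient, the following is a minimal sketch of greedy frame selection. It assumes a generic facility-location coverage score over a toy frame-similarity matrix; this is not the joint action-and-summary score defined in the article, and the names `coverage_score`, `greedy_summary`, and the toy `features` are hypothetical. For a monotone submodular score under a cardinality budget, greedy selection of this kind is known to reach at least a (1 - 1/e) fraction of the optimal score.

```python
def coverage_score(selected, similarity):
    """Facility-location score: each frame contributes its similarity
    to the most similar selected frame (assumed stand-in score)."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in similarity)


def greedy_summary(similarity, budget):
    """Greedily pick up to `budget` frame indices, each time adding the
    frame with the largest marginal gain in the coverage score."""
    selected = []
    remaining = set(range(len(similarity)))
    for _ in range(budget):
        base = coverage_score(selected, similarity)
        best_gain, best_frame = 0.0, None
        for frame in sorted(remaining):
            gain = coverage_score(selected + [frame], similarity) - base
            if gain > best_gain:
                best_gain, best_frame = gain, frame
        if best_frame is None:  # no frame improves the score any further
            break
        selected.append(best_frame)
        remaining.remove(best_frame)
    return selected


# Toy data (hypothetical): four frames described by 2-D non-negative features.
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.3, 0.7]]
# Non-negative dot-product similarities keep the score monotone submodular.
similarity = [[sum(a * b for a, b in zip(u, v)) for v in features]
              for u in features]

summary = greedy_summary(similarity, budget=2)
print(summary)  # [0, 2]
```

In this toy case the greedy pass first picks frame 0 and then frame 2, the frame least covered by the first choice, so the two selected frames represent both groups of similar frames.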
