Semantic Feature Mining for Video Event Understanding

Published: 03 August 2016

Abstract

Content-based video understanding is extremely difficult due to the semantic gap between low-level visual signals and the various semantic concepts (objects, actions, and scenes) in videos. Although feature extraction from videos has made significant progress, most previous methods rely only on low-level features, such as appearance and motion features. Recently, visual feature extraction has improved substantially with machine-learning algorithms, especially deep learning; however, little work has focused on extracting semantic features directly from videos. The goal of this article is to use unlabeled videos, with the help of their text descriptions, to learn an embedding function that can extract more effective semantic features from videos when only a few labeled samples are available for video recognition. To achieve this goal, we propose a novel embedding convolutional neural network (ECNN). We evaluate our algorithm on three challenging benchmarks against several popular state-of-the-art methods. Extensive experimental results show that the proposed ECNN consistently and significantly outperforms the existing methods.
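The core idea in the abstract, mapping videos and their text descriptions into a shared embedding space, can be sketched in miniature. The paper's actual ECNN architecture and training objective are not detailed here, so the sketch below is purely illustrative: it uses random linear projections in place of the convolutional networks and a margin-based ranking loss, a common (but here assumed, not confirmed) choice for visual-semantic embeddings.

```python
import numpy as np

# Hypothetical sketch of a joint video-text embedding; the projection
# matrices, dimensions, and margin value are illustrative assumptions,
# not the paper's ECNN.

rng = np.random.default_rng(0)

def embed(x, W):
    """Project features into the shared space and L2-normalize rows."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def ranking_loss(v, t, margin=0.2):
    """Hinge ranking loss: a matched video/text pair should score higher
    than any mismatched pair by at least `margin`."""
    sim = v @ t.T                     # cosine similarities (rows normalized)
    pos = np.diag(sim)                # matched pairs lie on the diagonal
    # penalize mismatched pairs that come within `margin` of the match
    cost = np.maximum(0.0, margin + sim - pos[:, None])
    np.fill_diagonal(cost, 0.0)
    return float(cost.mean())

# Toy batch: 4 videos (512-d visual features) paired with
# 4 descriptions (300-d text features).
videos = rng.normal(size=(4, 512))
texts = rng.normal(size=(4, 300))
Wv = rng.normal(size=(512, 128)) * 0.01   # video projection (stand-in for the CNN)
Wt = rng.normal(size=(300, 128)) * 0.01   # text projection

loss = ranking_loss(embed(videos, Wv), embed(texts, Wt))
print(loss)
```

In a real system the two projections would be learned by minimizing this loss over many (video, description) pairs; the learned video-side mapping then serves as the semantic feature extractor for recognition with few labeled samples, which is the setting the abstract targets.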

References

  1. Kobus Barnard, Pinar Duygulu, David A. Forsyth, Nando de Freitas, David M. Blei, and Michael I. Jordan. 2003. Matching words and pictures. Journal of Machine Learning Research 3, 1107--1135. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2006. Greedy layer-wise training of deep networks. In NIPS. 153--160. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis R. Bach, Ivan Laptev, Jean Ponce, and Cordelia Schmid. 2015. Weakly-supervised alignment of video with text. In 2015 IEEE International Conference on Computer Vision (ICCV’15), Santiago, Chile, December 7--13, 2015, 4462--4470. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Xinlei Chen and C. Lawrence Zitnick. 2015. Mind’s eye: A recurrent visual representation for image caption generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, June 7--12, 2015, 2422--2431.Google ScholarGoogle ScholarCross RefCross Ref
  5. Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell, and Kate Saenko. 2015. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, June 7--12, 2015, 2625--2634.Google ScholarGoogle ScholarCross RefCross Ref
  6. Lixin Duan, Dong Xu, and Shih-Fu Chang. 2012. Exploiting web images for event recognition in consumer videos: A multiple source domain adaptation approach. In IEEE CVPR. 1338--1345. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 2121--2159. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Pinar Duygulu, Kobus Barnard, João F. G. de Freitas, and David A. Forsyth. 2002. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of Computer Vision - 7th European Conference on Computer Vision (ECCV’02), Part IV. Copenhagen, Denmark, May 28--31, 2002, 97--112. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Ali Farhadi, Seyyed Mohammad Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David A. Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proceedings of Computer Vision - 11th European Conference on Computer Vision (ECCV’10), Part IV. Heraklion, Crete, Greece, September 5--11, 2010, 15--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5--8, 2013, Lake Tahoe, NV, 2121--2129.Google ScholarGoogle Scholar
  11. Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. 2014. A multi-view embedding space for modeling Internet images, tags, and their semantics. International Journal of Computer Vision 106, 2, 210--233. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond J. Mooney, Trevor Darrell, and Kate Saenko. 2013. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision (ICCV’13), Sydney, Australia, December 1--8, 2013, 2712--2719. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. AmirHossein Habibian, Thomas Mensink, and Cees G. M. Snoek. 2014. VideoStory: A new multimedia embedding for few-example recognition and translation of events. In Proceedings of the ACM MM. 17--26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18, 7, 1527--1554. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Junlin Hu, Jiwen Lu, and Yap-Peng Tan. 2014. Discriminative deep metric learning for face verification in the wild. In CVPR. 1875--1882. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Naveed Imran, Jingen Liu, Jiebo Luo, and Mubarak Shah. 2009. Event recognition from photo collections via PageRank. In ACM Multimedia. 621--624. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2013. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1, 221--231. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.Google ScholarGoogle Scholar
  19. Lu Jiang, Alexander G. Hauptmann, and Guang Xiang. 2012. Leveraging high-level and low-level features for multimedia event detection. In ACM MM. 449--458. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel P. W. Ellis, and Alexander C. Loui. 2011. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ICMR. 29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In ACLs. 655--665.Google ScholarGoogle Scholar
  22. Andrej Karpathy, Armand Joulin, and Fei-Fei Li. 2014a. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, December 8--13, 2014, Montreal, Quebec, Canada, 1889--1897. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Andrej Karpathy and Fei-Fei Li. 2015. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15), Boston, MA, June 7--12, 2015. 3128--3137.Google ScholarGoogle ScholarCross RefCross Ref
  24. Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014b. Large-scale video classification with convolutional neural networks. In CVPR. 1725--1732. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. CoRR abs/1411.2539.Google ScholarGoogle Scholar
  26. Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, and Sergio Guadarrama. 2013. Generating natural-language video descriptions using text-mined knowledge. In Proceedings of the 27th AAAI Conference on Artificial Intelligence, July 14--18, 2013, Bellevue, WA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS. 1106--1114.Google ScholarGoogle Scholar
  28. Polina Kuznetsova, Vicente Ordonez, Tamara L. Berg, and Yejin Choi. 2014. TREETALK: Composition and compression of trees for image descriptions. TACL 2, 351--362.Google ScholarGoogle Scholar
  29. Rémi Lebret, Pedro O. Pinheiro, and Ronan Collobert. 2015. Phrase-based image captioning. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15), Lille, France, July 6--11, 2015. 2085--2094.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of IEEE 86, 11, 2278--2324.Google ScholarGoogle ScholarCross RefCross Ref
  31. Mengyi Liu, Xin Liu, Yan Li, Xilin Chen, Alexander G. Hauptmann, and Shiguang Shan. 2015. Exploiting feature hierarchies with convolutional neural networks for cultural event recognition. In 2015 IEEE International Conference on Computer Vision Workshop (ICCV Workshops’15), Santiago, Chile, December 7--13, 2015. 274--279. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Jiebo Luo, Jie Yu, Dhiraj Joshi, and Wei Hao. 2008. Event recognition: Viewing the world with a third eye. In Proceedings of the 16th ACM MM. 1071--1080. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Zhigang Ma, Yi Yang, Zhongwen Xu, Nicu Sebe, and Alexander G. Hauptmann. 2013. We are not equally negative: Fine-grained labeling for multimedia event detection. In ACM MM. 293--302. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN). CoRR abs/1412.6632.Google ScholarGoogle Scholar
  35. Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. 3111--3119.Google ScholarGoogle Scholar
  36. Paul Over, Jon Fiscus, Greg Sanders, David Joy, Martial Michel, George Awad, Alan F. Smeaton, Wessel Kraaij, and Georges Quenot. 2013. TRECVID 2013 -- An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2013. NIST.Google ScholarGoogle Scholar
  37. Shengsheng Qian, Tianzhu Zhang, Richang Hong, and Changsheng Xu. 2015. Cross-domain collaborative learning in social multimedia. In Proceedings of the 23rd Annual ACM Conference on Multimedia Conference (MM’15), Brisbane, Australia, October 26--30, 2015, 99--108. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Shengsheng Qian, Tianzhu Zhang, Changsheng Xu, and M. Shamim Hossain. 2014. Social event classification via boosted multimodal supervised latent Dirichlet allocation. ACM Transactions on Multimedia Computing 11, 2, 27:1--27:22. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Shengsheng Qian, Tianzhu Zhang, Changsheng Xu, and Jie Shao. 2016. Multi-modal event topic model for social event analysis. IEEE Transactions on Multimedia 18, 2, 233--246.Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Vignesh Ramanathan, Percy Liang, and Fei-Fei Li. 2013. Video event understanding using natural language descriptions. In Proceedings of the IEEE ICCV. 905--912. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15), Boston, MA, June 7--12, 2015. 3202--3212.Google ScholarGoogle ScholarCross RefCross Ref
  42. Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. 2013. Translating video content to natural language descriptions. In IEEE International Conference on Computer Vision (ICCV’13), Sydney, Australia, December 1--8, 2013, 433--440. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Rasmus Rothe, Radu Timofte, and Luc J. Van Gool. 2015. DLDR: Deep linear discriminative retrieval for cultural event classification from a single image. In 2015 IEEE International Conference on Computer Vision Workshop (ICCV Workshops’15), Santiago, Chile, December 7--13, 2015, 295--302. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3, 211--252. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Ruslan Salakhutdinov and Geoffrey E. Hinton. 2009. Deep Boltzmann machines. In Proceedings of the International Conference on AISTATS. 448--455.Google ScholarGoogle Scholar
  46. Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.Google ScholarGoogle Scholar
  47. Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. TACL 2, 207--218.Google ScholarGoogle ScholarCross RefCross Ref
  48. Stephanie Strassel, Amanda Morris, Jonathan G. Fiscus, Christopher Caruso, Haejoong Lee, Paul Over, James Fiumara, Barbara Shaw, Brian Antonishek, and Martial Michel. 2012. Creating HAVIC: Heterogeneous audio visual Internet collection. In Proceedings of the 8th International Conference on Language Resources and Evaluation. 2573--2577.Google ScholarGoogle Scholar
  49. Yi Sun, Xiaogang Wang, and Xiaoou Tang. 2013. Deep convolutional network cascade for facial point detection. In CVPR. 3476--3483. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going deeper with convolutions. CoRR abs/1409.4842.Google ScholarGoogle Scholar
  51. Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen. 2015. Book2Movie: Aligning video scenes with book chapters. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15), Boston, MA, USA, June 7--12, 2015, 1827--1835.Google ScholarGoogle ScholarCross RefCross Ref
  52. Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond J. Mooney. 2014. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING’14), Technical Papers, August 23--29, 2014, Dublin, Ireland, 1218--1227.Google ScholarGoogle Scholar
  53. Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond J. Mooney, Trevor Darrell, and Kate Saenko. 2015a. Sequence to sequence - Video to text. In 2015 IEEE International Conference on Computer Vision (ICCV’15), Santiago, Chile, December 7--13, 2015, 4534--4542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Saenko. 2015b. Translating videos to natural language using deep recurrent neural networks. In 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT’15), Denver, CO, May 31 - June 5, 2015, 1494--1504.Google ScholarGoogle ScholarCross RefCross Ref
  55. Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the ICML. 1096--1103. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15), Boston, MA, June 7--12, 2015, 3156--3164.Google ScholarGoogle ScholarCross RefCross Ref
  57. Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. 2011. Action recognition by dense trajectories. In CVPR. 3169--3176. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. 2014. Learning fine-grained image similarity with deep ranking. In CVPR. 1386--1393. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Limin Wang, Zhe Wang, Wenbin Du, and Yu Qiao. 2015a. Object-scene convolutional neural networks for event recognition in images. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), Boston, MA, June 7--12, 2015, 30--35.Google ScholarGoogle ScholarCross RefCross Ref
  61. Limin Wang, Zhe Wang, Sheng Guo, and Yu Qiao. 2015b. Better exploiting OS-CNNs for better event recognition in images. In 2015 IEEE International Conference on Computer Vision Workshop (ICCV Workshops’15), Santiago, Chile, December 7--13, 2015, 287--294. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Xiaoshan Yang, Tianzhu Zhang, and Changsheng Xu. 2015a. Cross-domain feature learning in multimedia. IEEE Transactions on Multimedia 17, 1, 64--78.Google ScholarGoogle ScholarCross RefCross Ref
  63. Xiaoshan Yang, Tianzhu Zhang, Changsheng Xu, and M. Shamim Hossain. 2015b. Automatic visual concept learning for social event understanding. IEEE Transactions on Multimedia 17, 3, 346--358.Google ScholarGoogle ScholarCross RefCross Ref
  64. Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, and Alexander G. Hauptmann. 2013. How related exemplars help complex event detection in web videos. In ICCV. 2104--2111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher J. Pal, Hugo Larochelle, and Aaron C. Courville. 2015. Describing videos by exploiting temporal structure. In 2015 IEEE International Conference on Computer Vision (ICCV’15), Santiago, Chile, December 7--13, 2015, 4507--4515. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In 13th ECCV. 818--833.Google ScholarGoogle Scholar
  67. Tianzhu Zhang and Changsheng Xu. 2014. Cross-domain multi-event tracking via CO-PMHT. ACM Transactions on Multimedia Computing 10, 4, 31:1--31:19. Google ScholarGoogle ScholarDigital LibraryDigital Library


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 12, Issue 4
  August 2016, 219 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/2983297

  Copyright © 2016 ACM

  Publisher: Association for Computing Machinery, New York, NY, United States

  Publication History

  • Received: 1 November 2015
  • Revised: 1 May 2016
  • Accepted: 1 May 2016
  • Published: 3 August 2016

          Qualifiers

          • research-article
          • Research
          • Refereed
