Abstract
Content-based video understanding is difficult because of the semantic gap between low-level vision signals and the semantic concepts (objects, actions, and scenes) that appear in videos. Although feature extraction from videos has made significant progress, most previous methods rely only on low-level features, such as appearance and motion features. Recently, visual-feature extraction has improved substantially with machine-learning algorithms, especially deep learning. However, little work has focused on extracting semantic features directly from videos. The goal of this article is to exploit unlabeled videos, together with their text descriptions, to learn an embedding function that extracts more effective semantic features from videos when only a few labeled samples are available for video recognition. To achieve this goal, we propose a novel embedding convolutional neural network (ECNN). We evaluate our algorithm by comparing its performance on three challenging benchmarks with several popular state-of-the-art methods. Extensive experimental results show that the proposed ECNN consistently and significantly outperforms the existing methods.
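The core idea of learning an embedding that maps video features toward the space of their text descriptions can be illustrated with a toy sketch. The code below is not the paper's ECNN; it is a minimal, hypothetical example of the general approach: a linear embedding trained with a pairwise hinge ranking loss so that each video scores higher with its own description than with mismatched ones. All dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 videos with 10-d visual features, each paired with an 8-d
# text-description embedding (e.g., an average of word vectors).
V = rng.standard_normal((4, 10))        # low-level video features
T = rng.standard_normal((4, 8))         # text-description embeddings

W = rng.standard_normal((10, 8)) * 0.1  # linear embedding to be learned
margin, lr = 0.2, 0.01

def loss_and_grad(W):
    """Pairwise hinge ranking loss: a video should be more similar to its
    own description than to any other description, by at least `margin`."""
    E = V @ W              # embed videos into the text space
    S = E @ T.T            # similarity matrix (videos x descriptions)
    pos = np.diag(S)       # matched-pair similarities
    total, grad = 0.0, np.zeros_like(W)
    for i in range(len(V)):
        for j in range(len(T)):
            if i == j:
                continue
            viol = margin + S[i, j] - pos[i]
            if viol > 0:   # mismatched pair ranked too high
                total += viol
                # d/dW of (v_i W t_j - v_i W t_i) = v_i (t_j - t_i)^T
                grad += np.outer(V[i], T[j] - T[i])
    return total, grad

before, _ = loss_and_grad(W)
for _ in range(200):       # a few subgradient steps
    _, g = loss_and_grad(W)
    W -= lr * g
after, _ = loss_and_grad(W)
# After training, the ranking loss should have decreased: after < before.
```

In practice the linear map `W` would be replaced by a convolutional network over the video, and the learned embedding would then serve as a semantic feature extractor for recognition with few labeled samples.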
Index Terms
Semantic Feature Mining for Video Event Understanding