Abstract
This article addresses the detection and search of events in videos, where video examples are either scarce or entirely absent during training. ImageNet concept banks have been shown to be effective for such event detection and search. Rather than employing the standard concept bank of 1,000 ImageNet classes, we leverage the full 21,841-class dataset. We identify two problems with using the full dataset: (i) the number of examples per concept is imbalanced, and (ii) not all concepts are equally relevant for events. In this article, we propose to balance large-scale image hierarchies for pre-training. We shuffle concepts using bottom-up and top-down operations to overcome the problems of example imbalance and concept relevance. Using this strategy, we arrive at the shuffled ImageNet bank, a concept bank with an order of magnitude more concepts than standard ImageNet banks. Compared to standard ImageNet pre-training, our shuffles result in more discriminative representations for training event models from limited video event examples. For event search, the broad range of concepts enables a closer match between textual queries of events and concept detections in videos. Experimentally, we show the benefit of the proposed bank for event detection and event search, with state-of-the-art performance for both tasks on the challenging TRECVID Multimedia Event Detection and Ad-Hoc Video Search benchmarks.
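To make the idea of balancing a concept hierarchy concrete, the following is a minimal sketch under stated assumptions, not the article's actual procedure. The abstract does not specify the bottom-up and top-down operations, so the two functions below (`roll_up` and `cap_counts`), the toy hierarchy, and the thresholds are all hypothetical. The sketch shows one plausible bottom-up step, where concepts with too few examples are merged into their parent, followed by a cap on over-represented concepts.

```python
def roll_up(parent, counts, min_examples):
    """Hypothetical bottom-up merge: fold sparse concepts into their parent.

    parent: dict mapping each concept to its parent (the root maps to None)
    counts: dict mapping a concept to its number of directly labelled images
    Returns a dict of surviving concepts and their merged example counts.
    """
    def depth(node):
        d = 0
        while parent[node] is not None:
            node = parent[node]
            d += 1
        return d

    merged = dict(counts)
    # Visit the deepest concepts first so that merges propagate upward.
    for node in sorted(parent, key=depth, reverse=True):
        p = parent[node]
        if p is not None and merged.get(node, 0) < min_examples:
            merged[p] = merged.get(p, 0) + merged.pop(node, 0)
    # Drop concepts that ended up without any examples.
    return {c: n for c, n in merged.items() if n > 0}


def cap_counts(counts, max_examples):
    """Hypothetical subsampling step: limit the examples kept per concept."""
    return {c: min(n, max_examples) for c, n in counts.items()}


# Toy hierarchy and image counts, invented for illustration only.
parent = {"animal": None, "dog": "animal", "cat": "animal",
          "husky": "dog", "poodle": "dog"}
counts = {"husky": 40, "poodle": 900, "dog": 100, "cat": 1200}

balanced = cap_counts(roll_up(parent, counts, min_examples=500),
                      max_examples=1000)
print(balanced)
```

In this toy run, `husky` is folded into `dog`, which is still too sparse and is in turn folded into `animal`, while `cat` is capped at 1,000 examples. The root is never merged, so it can remain below the threshold; a real pipeline would need a policy for that case.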
Index Terms
Shuffled ImageNet Banks for Video Event Detection and Search