Shuffled ImageNet Banks for Video Event Detection and Search

Published: 22 May 2020

Abstract

This article addresses the detection and search of events in videos when video examples are scarce or even absent during training. ImageNet concept banks have been shown to be effective for such event detection and search. Rather than employing the standard concept bank of 1,000 ImageNet classes, we leverage the full 21,841-class dataset. We identify two problems with using the full dataset: (i) the number of examples per concept is imbalanced, and (ii) not all concepts are equally relevant for events. In this article, we propose to balance large-scale image hierarchies for pre-training. We shuffle concepts based on bottom-up and top-down operations to overcome the problems of example imbalance and concept relevance. Using this strategy, we arrive at the shuffled ImageNet bank, a concept bank with an order of magnitude more concepts than standard ImageNet banks. Compared to standard ImageNet pre-training, our shuffles result in more discriminative representations for training event models from limited video event examples. For event search, the broad range of concepts enables a closer match between textual queries of events and concept detections in videos. Experimentally, we show the benefit of the proposed bank for event detection and event search, with state-of-the-art performance for both tasks on the challenging TRECVID Multimedia Event Detection and Ad-Hoc Video Search benchmarks.
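
As a rough illustration of what a bottom-up balancing pass over a concept hierarchy could look like, the sketch below folds concepts with too few images into their parent concept. It is not the article's procedure, which the abstract does not specify: the function name `merge_small_concepts`, the threshold `min_images`, and the toy hierarchy and counts are all hypothetical.

```python
def merge_small_concepts(parent, counts, min_images):
    """Fold concepts with fewer than min_images examples into their parent.

    parent: dict mapping each concept to its parent (the root maps to None).
    counts: dict mapping each concept to its number of directly labeled images.
    Returns a dict mapping each kept concept to its accumulated image count.
    Illustrative only; the merge rule is an assumption, not the paper's method.
    """
    def depth(node):
        # Number of edges between this node and the root.
        d = 0
        while parent[node] is not None:
            node = parent[node]
            d += 1
        return d

    merged = dict(counts)
    # Visit the deepest concepts first so that counts propagate upward.
    for node in sorted(parent, key=depth, reverse=True):
        par = parent[node]
        if par is not None and merged.get(node, 0) < min_images:
            merged[par] = merged.get(par, 0) + merged.pop(node, 0)
    return merged


# Toy hierarchy (hypothetical concepts and image counts).
parent = {
    "animal": None,
    "dog": "animal",
    "cat": "animal",
    "husky": "dog",
    "poodle": "dog",
    "siamese": "cat",
}
counts = {"animal": 0, "dog": 100, "cat": 10,
          "husky": 20, "poodle": 600, "siamese": 30}

result = merge_small_concepts(parent, counts, min_images=200)
print(result)
```

In this toy run only `poodle` has enough images to survive as its own concept; every other concept is absorbed upward into `animal`. The root is always kept, even when its accumulated count stays below the threshold, so a real pipeline would need a separate rule for that case.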

    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 2
      May 2020
      390 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3401894

      Copyright © 2020 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 May 2020
      • Online AM: 7 May 2020
      • Accepted: 1 January 2020
      • Revised: 1 December 2019
      • Received: 1 August 2019

      Qualifiers

      • research-article
      • Research
      • Refereed
