
How Deep Features Have Improved Event Recognition in Multimedia: A Survey

Published: 05 June 2019

Abstract

Event recognition is one of the areas in multimedia that is attracting great attention from researchers. Because it applies to a wide range of scenarios, from personal to collective events, a number of interesting solutions for event recognition using multimedia information sources have been proposed. Meanwhile, following its immense success in classification, object recognition, and detection, deep learning has been shown to perform well in event recognition tasks as well. As a result, a large portion of the literature on event analysis now relies on deep learning architectures. In this article, we provide an extensive overview of the existing literature in this field, analyzing how deep features and deep learning architectures have changed the performance of event recognition frameworks. The literature on event-based analysis of multimedia content can be categorized into four groups, namely (i) event recognition in single images; (ii) event recognition in personal photo collections; (iii) event recognition in videos; and (iv) event recognition in audio recordings. We extensively review different deep-learning-based frameworks for event recognition in these four domains. We also review some benchmark datasets made available to the scientific community to validate novel event recognition pipelines. In the final part of the manuscript, we discuss the basic insights gathered from the literature review and identify future trends and challenges.

References

  1. Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola, and Tuomas Virtanen. 2017. Sound event detection in multichannel audio using spatial and harmonic features. arXiv preprint arXiv:1706.02293 (2017).Google ScholarGoogle Scholar
  2. Sharath Adavanne, Archontis Politis, and Tuomas Virtanen. 2018. Multichannel sound event detection using 3D convolutional neural networks for learning inter-channel features. arXiv preprint arXiv:1801.09522 (2018).Google ScholarGoogle Scholar
  3. Kashif Ahmad, Nicola Conci, Giulia Boato, and Francesco G. B. De Natale. 2016. USED: A large-scale social event detection dataset. In Proceedings of the 7th International Conference on Multimedia Systems. ACM, 50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Kashif Ahmad, Nicola Conci, Giulia Boato, and Francesco G. B. De Natale. 2017. Event recognition in personal photo collections via multiple instance learning-based classification of multiple images. Journal of Electronic Imaging 26, 6 (2017), 060502.Google ScholarGoogle ScholarCross RefCross Ref
  5. Kashif Ahmad, Nicola Conci, and F. G. B. De Natale. 2018. A saliency-based approach to event recognition. Signal Processing: Image Communication 60 (2018), 42--51.Google ScholarGoogle ScholarCross RefCross Ref
  6. Kashif Ahmad, Francesco De Natale, Giulia Boato, and Andrea Rosani. 2016. A hierarchical approach to event discovery from single images using MIL framework. In Proceedings of the 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP). IEEE, 1223--1227.Google ScholarGoogle ScholarCross RefCross Ref
  7. Kashif Ahmad, M. L. Mekhalfi, Nicola Conci, Giliua Boato, F. Melgani, and F. G. B. De Natale. 2017. A pool of deep models for event recognition. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2886--2890.Google ScholarGoogle ScholarCross RefCross Ref
  8. Kashif Ahmad, Mohamed Lamine Mekhalfi, and Nicola Conci. 2018. Event recognition in personal photo collections: An active learning approach. Electronic Imaging 2018, 2 (2018), 1--5.Google ScholarGoogle ScholarCross RefCross Ref
  9. Kashif Ahmad, Mohamed Lamine Mekhalfi, Nicola Conci, Farid Melgani, and Francesco De Natale. 2018. Ensemble of deep models for event recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14, 2 (2018), 51. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Kashif Ahmad, Konstantin Pogorelov, Michael Riegler, Nicola Conci, and Pål Halvorsen. 2018. Social media and satellites. Multimedia Tools and Applications (2018), 1--39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Kashif Ahmad, Konstantin Pogorelov, Michael Riegler, Nicola Conci, and H. Pal. 2017. CNN and GAN based satellite and social media data fusion for disaster detection. In Proceedings of the MediaEval 2017 Workshop, Dublin, Ireland.Google ScholarGoogle Scholar
  12. Kashif Ahmad, Amir Sohail, Nicola Conci, and Francesco De Natale. 2018. A comparative study of global and deep features for the analysis of user-generated natural disaster related images. In Proceedings of the 2018 IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP). IEEE, 1--5.Google ScholarGoogle ScholarCross RefCross Ref
  13. Sheharyar Ahmad, Kashif Ahmad, Nasir Ahmad, and Nicola Conci. 2017. Convolutional neural networks for disaster images retrieval. In Proceedings of the MediaEval 2017 Workshop (Sept. 13--15, 2017). Dublin, Ireland.Google ScholarGoogle Scholar
  14. Siti Nor Khuzaimah Binti Amit, Soma Shiraishi, Tetsuo Inoshita, and Yoshimitsu Aoki. 2016. Analysis of satellite images for disaster detection. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 5189--5192.Google ScholarGoogle ScholarCross RefCross Ref
  15. Nazia Attari, Ferda Ofli, Mohammad Awad, Ji Lucas, and Sanjay Chawla. 2016. Nazr-CNN: Fine-grained classification of UAV imagery for damage assessment. arXiv preprint arXiv:1611.06474 (2016).Google ScholarGoogle Scholar
  16. Konstantinos Avgerinakis, Anastasia Moumtzidou, Stelios Andreadis, Emmanouil Michail, Ilias Gialampoukidis, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2017. Visual and textual analysis of social media and satellite images for flood detection@ multimedia satellite task MediaEval 2017. In Proceedings of the Working Notes Proceeding MediaEval Workshop, Dublin, Ireland. 13--15.Google ScholarGoogle Scholar
  17. Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems. 892--900. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Elham Babaee, Nor Badrul Anuar, Ainuddin Wahid Abdul Wahab, Shahaboddin Shamshirband, and Anthony T. Chronopoulos. 2018. An overview of audio event detection methods from feature extraction to classification. Applied Artificial Intelligence (2018), 1--54.Google ScholarGoogle Scholar
  19. Siham Bacha, Mohand Said Allili, and Nadjia Benblidia. 2016. Event recognition in photo albums using probabilistic graphical models and feature relevance. Journal of Visual Communication and Image Representation 40 (2016), 546--558. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Lamberto Ballan, Alessio Bazzica, Marco Bertini, Alberto Del Bimbo, and Giuseppe Serra. 2009. Deep networks for audio event classification in soccer videos. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’09). IEEE, 474--477. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. Surf: Speeded up robust features. In Proceedings of the European Conference on Computer Vision. Springer, 404--417. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Benjamin Bischke, Prakriti Bhardwaj, Aman Gautam, Patrick Helber, D. Borth, and A. Dengel. 2017. Detection of flooding events in social multimedia and satellite imagery using deep neural networks. In Proceedings of the Working Notes Proceeding MediaEval Workshop, Dublin, Ireland.Google ScholarGoogle Scholar
  23. Benjamin Bischke, Damian Borth, Christian Schulze, and Andreas Dengel. 2016. Contextual enrichment of remote-sensed events with social media streams. In Proceedings of the 2016 ACM Multimedia Conference. ACM, 1077--1081. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Benjamin Bischke, Patrick Helber, Christian Schulze, Srinivasan Venkat, Andreas Dengel, and Damian Borth. 2017. The multimedia satellite task at MediaEval 2017: Emergence response for flooding events. In Proceedings of the MediaEval 2017 Workshop (Sept. 13-15, 2017). Dublin, Ireland.Google ScholarGoogle Scholar
  25. Anna Bosch, Andrew Zisserman, and Xavier Munoz. 2007. Representing shape with a spatial pyramid kernel. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval. ACM, 401--408. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2013. Event recognition in photo collections with a stopwatch HMM. In Proceedings of the IEEE International Conference on Computer Vision. 1193--1200. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Markus Brenner and Ebroul Izquierdo. 2011. MediaEval benchmark: Social event detection in collaborative photo collections. In MediaEval. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Emre Cakir, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen. 2015. Polyphonic sound event detection using multi label deep neural networks. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 1--7.Google ScholarGoogle ScholarCross RefCross Ref
  29. Emre Cakir, Ezgi Can Ozan, and Tuomas Virtanen. 2016. Filterbank learning for deep neural network based polyphonic sound event detection. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 3399--3406.Google ScholarGoogle ScholarCross RefCross Ref
  30. Liangliang Cao, Shih-Fu Chang, Noel Codella, Courtenay Cotton, Dan Ellis, Leiguang Gong, Matthew Hill, Gang Hua, John Kender, Michele Merler, Yadong Mu, Apostol Natsev, and John R. Smith. 2011. IBM research and Columbia University TRECVID-2011 multimedia event detection (MED) system. In NIST TRECVID Workshop, Vol. 28.Google ScholarGoogle Scholar
  31. Xiaojun Chang, Yao-Liang Yu, Yi Yang, and Eric P. Xing. 2017. Semantic pooling for complex event analysis in untrimmed videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 8 (2017), 1617--1632.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. S. Chatzichristofis, Y. Boutalis, and Mathias Lux. 2009. Selection of the proper compact composite descriptor for improving content based image retrieval. In Proceedings of the 6th IASTED International Conference, Vol. 134643. 064.Google ScholarGoogle Scholar
  33. Savvas A. Chatzichristofis and Yiannis S. Boutalis. 2008. CEDD: Color and edge directivity descriptor: A compact descriptor for image indexing and retrieval. In Proceedings of the International Conference on Computer Vision Systems. Springer, 312--322. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Ming-yu Chen and Alexander Hauptmann. 2009. Mosift: Recognizing human actions in surveillance videos. (2009).Google ScholarGoogle Scholar
  35. Tao Chen, Damian Borth, Trevor Darrell, and Shih-Fu Chang. 2014. DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks. arXiv preprint arXiv:1410.8586 (2014).Google ScholarGoogle Scholar
  36. Inkyu Choi, Kisoo Kwon, Soo Hyun Bae, and Nam Soo Kim. 2016. DNN-based sound event detection with exemplar-based approach for noise reduction. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016). 16--19.Google ScholarGoogle Scholar
  37. Selina Chu, Shrikanth Narayanan, C.-C. Jay Kuo, and Maja J. Mataric. 2006. Where am I? Scene recognition for mobile robots using audio features. In Proceedings of the 2006 IEEE International Conference on Multimedia and Expo, IEEE, 885--888.Google ScholarGoogle Scholar
  38. Courtenay V. Cotton and Daniel P. W. Ellis. 2011. Spectral vs. spectro-temporal features for acoustic event detection. In Proceedings of the 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 69--72.Google ScholarGoogle Scholar
  39. Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. 2018. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018).Google ScholarGoogle Scholar
  40. Juncheng Li Dai Wei, Phuong Pham, Samarjit Das, Shuhui Qu, and Florian Metze. 2016. Sound event detection for real life audio DCASE challenge. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events.Google ScholarGoogle Scholar
  41. Minh-Son Dao, Duc-Tien Dang-Nguyen, and Francesco G. B. De Natale. 2014. Robust event discovery from photo collections using Signature Image Bases (SIBs). Multimedia Tools and Applications 70, 1 (2014), 25--53. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE, 248--255.Google ScholarGoogle ScholarCross RefCross Ref
  43. Terrance DeVries and Graham W. Taylor. 2017. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538 (2017).Google ScholarGoogle Scholar
  44. Shengyong Ding, Liang Lin, Guangrun Wang, and Hongyang Chao. 2015. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48, 10 (2015), 2993--3003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Sergio Escalera, Junior Fabian, Pablo Pardo, Xavier Baró, Jordi Gonzalez, Hugo J. Escalante, Dusan Misevic, Ulrich Steiner, and Isabelle Guyon. 2015. Chalearn looking at people 2015: Apparent age and cultural event recognition datasets and results. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 1--9. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Lijie Fan, Wenbing Huang, Stefano Ermon Chuang Gan, Boqing Gong, and Junzhou Huang. 2018. End-to-end learning of motion representation for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6016--6025.Google ScholarGoogle ScholarCross RefCross Ref
  47. Yachuang Feng, Yuan Yuan, and Xiaoqiang Lu. 2016. Deep representation for abnormal event detection in crowded scenes. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, 591--595. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Jonathan G. Fiscus. 2010. TRECVID multimedia event detection 2010 evaluation. (2010).Google ScholarGoogle Scholar
  49. Pasquale Foggia, Nicolai Petkov, Alessia Saggese, Nicola Strisciuglio, and Mario Vento. 2015. Reliable detection of audio events in highly noisy environments. Pattern Recognition Letters 65 (2015), 22--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Pasquale Foggia, Nicolai Petkov, Alessia Saggese, Nicola Strisciuglio, and Mario Vento. 2016. Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Transactions on Intelligent Transportation Systems 17, 1 (2016), 279--288.Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 411--412. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Alexandre R. J. Francois, Ram Nevatia, Jerry Hobbs, Robert C. Bolles, and John R. Smith. 2005. VERL: An ontology framework for representing and annotating video events. IEEE Multimedia 12, 4 (2005), 76--86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Steve Frolking, Jianjun Qiu, Stephen Boles, Xiangming Xiao, Jiyuan Liu, Yahui Zhuang, Changsheng Li, and Xiaoguang Qin. 2002. Combining remote sensing and ground census data to develop new maps of the distribution of rice agriculture in China. Global Biogeochemical Cycles 16, 4 (2002).Google ScholarGoogle Scholar
  54. Jianlong Fu, Yue Wu, Tao Mei, Jinqiao Wang, Hanqing Lu, and Yong Rui. 2015. Relaxing from vocabulary: Robust weakly-supervised deep learning for vocabulary-free image tagging. In Proceedings of the IEEE International Conference on Computer Vision. 1985--1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Chuang Gan, Chen Sun, Lixin Duan, and Boqing Gong. 2016. Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In Proceedings of the European Conference on Computer Vision. Springer, 849--866.Google ScholarGoogle ScholarCross RefCross Ref
  56. Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alex G. Hauptmann. 2015. Devnet: A deep event network for multimedia event detection and evidence recounting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2568--2577.Google ScholarGoogle Scholar
  57. Chuang Gan, Ting Yao, Kuiyuan Yang, Yi Yang, and Tao Mei. 2016. You lead, we exceed: Labor-free video concept learning by jointly exploiting web videos and images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 923--932.Google ScholarGoogle ScholarCross RefCross Ref
  58. Oguzhan Gencoglu, Tuomas Virtanen, and Heikki Huttunen. 2014. Recognition of acoustic events using deep neural networks. In Proceedings of the 2014 22nd European Signal Processing Conference (EUSIPCO). IEEE, 506--510.Google ScholarGoogle Scholar
  59. D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. Plumbley. 2013. IEEE AASP challenge: Detection and classification of acoustic scenes and events. Queen Mary University of London: London, UK (2013).Google ScholarGoogle Scholar
  60. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580--587. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. 2016. Deep Learning. Vol. 1. MIT Press, Cambridge. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672--2680. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Cong Guo and Xinmei Tian. 2015. Event recognition in personal photo collections using hierarchical model and multiple features. In Proceedings of the 2015 IEEE 17th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 1--6.Google ScholarGoogle Scholar
  64. Cong Guo, Xinmei Tian, and Tao Mei. 2017. Multi-granular event recognition of personal photo albums. IEEE Transactions on Multimedia (2017).Google ScholarGoogle Scholar
  65. Aki Harma, Martin F. McKinney, and Janto Skowronek. 2005. Automatic surveillance of the acoustic activity in our living environment. In Proceedings of the 2005 IEEE International Conference on Multimedia and Expo, 2005 (ICME 2005). IEEE, 4--pp.Google ScholarGoogle ScholarCross RefCross Ref
  66. Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Takaaki Hori, Jonathan Le Roux, and Kazuya Takeda. 2016. Bidirectional LSTM-HMM hybrid system for polyphonic sound event detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016). 35--39.Google ScholarGoogle Scholar
  67. Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Takaaki Hori, Jonathan Le Roux, and Kazuya Takeda. 2017. BLSTM-HMM hybrid system combined with sound activity detection network for polyphonic sound event detection. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 766--770.Google ScholarGoogle ScholarCross RefCross Ref
  68. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015).Google ScholarGoogle Scholar
  69. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.Google ScholarGoogle ScholarCross RefCross Ref
  70. Toni Heittola, Annamaria Mesaros, Tuomas Virtanen, and Moncef Gabbouj. 2013. Supervised model training for overlapping sound events based on unsupervised source separation. In Proceedings of ICASSP. 8677--8681.Google ScholarGoogle ScholarCross RefCross Ref
  71. Somboon Hongeng, Ram Nevatia, and Francois Bremond. 2004. Video-based event recognition: Activity representation and probabilistic recognition methods. Computer Vision and Image Understanding 96, 2 (2004), 129--162. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. Yuanbo Hou and Shengchen Li. 2017. Sound Event Detection in Real Life Audio Using Multimodel System. Technical Report. DCASE2017 Challenge, Tech. Rep.Google ScholarGoogle Scholar
  73. Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2016. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision 116, 1 (2016), 1--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. I-Hong Jhuo and D. T. Lee. 2014. Video event detection via multi-modality deep learning. In Proceedings of the 2014 22nd International Conference on Pattern Recognition (ICPR). IEEE, 666--671. Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 675--678. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Lu Jiang, Alexander G. Hauptmann, and Guang Xiang. 2012. Leveraging high-level and low-level features for multimedia event detection. In Proceedings of the 20th ACM International Conference on Multimedia. ACM, 449--458. Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, and Shih-Fu Chang. 2018. Exploiting feature and class relationships in video categorization with regularized deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 2 (2018), 352--364. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Brendan Jou and Shih-Fu Chang. 2016. Deep cross residual learning for multitask visual recognition. In Proceedings of the ACM Conference on Multimedia. ACM, 998--1007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. Andreas Kamilaris and Francesc X. Prenafeta-Boldú. 2018. Disaster monitoring using unmanned aerial vehicles and deep learning. arXiv preprint arXiv:1807.11805 (2018).Google ScholarGoogle Scholar
  80. Keiller Nogueira, Samuel G. Fadel, Ícaro C. Dourado, Rafael de O. Werneck, Javier A. V. Muñoz, Otávio A. B. Penatti, Rodrigo T. Calumby, Lin Tzy Li, Jefersson A. dos Santos, and Ricardo da S. Torres. 2017. Data-driven flood detection using neural networks. In Proceedings of the MediaEval 2017 Workshop (Sept. 13--15, 2017). Dublin, Ireland.Google ScholarGoogle Scholar
  81. Zvi Kons and Orith Toledo-Ronen. 2013. Audio event classification using deep neural networks. In Interspeech. 1482--1486.Google ScholarGoogle Scholar
  82. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097--1105. Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Julian Kürby, Rene Grzeszick, Axel Plinge, and Gernot A. Fink. 2016. Bag-of-features acoustic event detection for sensor networks. In Proceedings on the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE'16). 55--59.Google ScholarGoogle Scholar
  84. Ying-Hui Lai, Chun-Hao Wang, Shi-Yan Hou, Bang-Yin Chen, Yu Tsao, and Yi-Wen Liu. 2016. DCASE report for task 3: Sound event detection in real life audio. IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events (2016).Google ScholarGoogle Scholar
  85. Zhen-Zhong Lan, Lu Jiang, Shoou-I Yu, Shourabh Rawat, Yang Cai, Chenqiang Gao, Shicheng Xu, Haoquan Shen, Xuanchong Li, Yipei Wang, Waito Sze, Yan Yan, Zhigang Ma, Wei Tong, Yi Yang, Susanne Burger, Florian Metze, Rita Singh, Bhiksha Raj, Richard Stern, Teruko Mitamura, Eric Nyberg, and Alexander Hauptmann. 2013. CMU-informedia at TRECVID 2013 multimedia event detection. In TRECVID 2013 Workshop, Vol. 1. 5.Google ScholarGoogle Scholar
  86. Donmoon Lee, Subin Lee, Yoonchang Han, and Kyogu Lee. 2017. Ensemble of Convolutional Neural Networks for Weakly-supervised Sound Event Detection Using Multiple Scale Input. Technical Report. Tech. Rep., DCASE2017 Challenge.Google ScholarGoogle Scholar
  87. Li-Jia Li and Li Fei-Fei. 2007. What, where and who? Classifying events by scene and object recognition. In Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV’07). IEEE, 1--8.Google ScholarGoogle ScholarCross RefCross Ref
  88. Hyungui Lim, Jeongsoo Park, Kyogu Lee, and Yoonchang Han. 2017. Rare sound event detection using 1D convolutional recurrent neural networks. In Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop.Google ScholarGoogle Scholar
  89. Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400 (2013).Google ScholarGoogle Scholar
  90. Mengyi Liu, Xin Liu, Yan Li, Xilin Chen, Alexander G. Hauptmann, and Shiguang Shan. 2015. Exploiting feature hierarchies with convolutional neural networks for cultural event recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 32--37. Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. Xueliang Liu and Benoit Huet. 2013. Heterogeneous features and model selection for event-based media classification. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval. ACM, 151--158. Google ScholarGoogle ScholarDigital LibraryDigital Library
  92. Ying Liu and Linzhi Wu. 2016. Geological disaster recognition on optical remote sensing images using deep learning. Procedia Computer Science 91 (2016), 566--575.Google ScholarGoogle ScholarCross RefCross Ref
  93. Xiang Long, Chuang Gan, Gerard de Melo, Xiao Liu, Yandong Li, Fu Li, and Shilei Wen. 2018. Multimodal keyless attention fusion for video classification. Thirty-Second AAAI Conference on Artificial Intelligence.Google ScholarGoogle ScholarCross RefCross Ref
  94. Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, and Shilei Wen. 2018. Attention clusters: Purely attention based local feature integration for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7834--7843.Google ScholarGoogle ScholarCross RefCross Ref
  95. Laura Lopez-Fuentes, Joost van de Weijer, Marc Bolanos, and Harald Skinnemoen. 2017. Multi-modal deep learning approach for flood detection. In Proceedings of the MediaEval 2017 Workshop (Sept. 13--15, 2017). Dublin, Ireland.Google ScholarGoogle Scholar
  96. David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91--110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  97. Mathias Lux, Michael Riegler, Pål Halvorsen, Konstantin Pogorelov, and Nektarios Anagnostopoulos. 2016. LIRE: Open source visual information retrieval. In Proceedings of the 7th International Conference on Multimedia Systems. ACM, 30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  98. R. Mattivi, G. Boato, and F. G. B. De Natale. 2011. Event-based media organization and indexing. Infocommunications Journal 3, 3 (2011), 9--18.Google ScholarGoogle Scholar
  99. Annamaria Mesaros, Toni Heittola, Aleksandr Diment, Benjamin Elizalde, Ankit Shah, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen. 2017. DCASE 2017 challenge setup: Tasks, datasets and baseline system. In Proceedings of the DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events.Google ScholarGoogle Scholar
  100. Annamaria Mesaros, Toni Heittola, Antti Eronen, and Tuomas Virtanen. 2010. Acoustic event detection in real life recordings. In Proceedings of the 2010 18th European Signal Processing Conference. IEEE, 1267--1271.Google ScholarGoogle Scholar
  101. Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. 2016. Metrics for polyphonic sound event detection. Applied Sciences 6, 6 (2016), 162.Google ScholarGoogle ScholarCross RefCross Ref
  102. Pascal Mettes, Dennis C. Koelma, and Cees G. M. Snoek. 2016. The imagenet shuffle: Reorganized pre-training for video event detection. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 175--182. Google ScholarGoogle ScholarDigital LibraryDigital Library
  103. Matthias Meyer, Lukas Cavigelli, and Lothar Thiele. 2017. Efficient convolutional neural network for audio event detection. arXiv preprint arXiv:1709.09888 (2017).Google ScholarGoogle Scholar
  104. Dao Minh-Son, Pham Quang-Nhat-Minh, and Dang-Nguyen Duc-Tien. 2017. A domain-based late-fusion for disaster image retrieval from social media. In Proc. of the MediaEval Workshop (Sept. 13--15, 2017). Dublin, Ireland.Google ScholarGoogle Scholar
  105. Hanif Muhammad, Atif Muhammad, Khan Mahrukh, and Rafi Mohammad. 2017. Flood detection using social media data and spectral regression based kernel discriminant analysis. In Proceedings of the MediaEval 2017 Workshop (Sept. 13--15, 2017). Dublin, Ireland.Google ScholarGoogle Scholar
  106. Milind Naphade, John R. Smith, Jelena Tesic, Shih-Fu Chang, Winston Hsu, Lyndon Kennedy, Alexander Hauptmann, and Jon Curtis. 2006. Large-scale concept ontology for multimedia. IEEE Multimedia 13, 3 (2006), 86--91. Google ScholarGoogle ScholarDigital LibraryDigital Library
  107. Keiller Nogueira, Samuel G. Fadel, Ícaro C. Dourado, Rafael de O. Werneck, Javier A. V. Muñoz, Otávio A. B. Penatti, Rodrigo T. Calumby, Lin Tzy Li, Jefersson A. dos Santos, and Ricardo da S. Torres. 2017. Exploiting ConvNet diversity for flooding identification. arXiv preprint arXiv:1711.03564 (2017).Google ScholarGoogle Scholar
  108. Dan Oneata, Matthijs Douze, Jérôme Revaud, Schwenninger Jochen, Danila Potapov, Heng Wang, Zaid Harchaoui, Jakob Verbeek, Cordelia Schmid, Robin Aly, Kevin Mcguiness, Shu Chen, Noel O'ConnorKen ChatfieldOmkar Parkhi, Relja Arandjelovic, Andrew Zisserman, Fernando Basura, and Tinne Tuytelaars. 2012. Axes at TRECVID 2012: KIS, INS, and MED. In TRECVID Workshop.Google ScholarGoogle Scholar
  109. Paul Over, Jon Fiscus, Greg Sanders, David Joy, Martial Michel, George Awad, Alan Smeaton, Wessel Kraaij, and Georges Quénot. 2014. TRECVID 2014--An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID. 52.
  110. Symeon Papadopoulos, Raphael Troncy, Vasileios Mezaris, Benoit Huet, and Ioannis Kompatsiaris. 2011. Social event detection at MediaEval 2011: Challenges, dataset and evaluation. In Proceedings of MediaEval.
  111. Giambattista Parascandolo, Heikki Huttunen, and Tuomas Virtanen. 2016. Recurrent neural networks for polyphonic sound event detection in real life recordings. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6440--6444.
  112. Sungheon Park and Nojun Kwak. 2015. Cultural event recognition by subregion classification with convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 45--50.
  113. Georgios Petkos, Symeon Papadopoulos, Vasileios Mezaris, Raphael Troncy, Philipp Cimiano, Timo Reuter, and Yiannis Kompatsiaris. 2014. Social event detection at MediaEval: A three-year retrospect of tasks and results. In Proceedings of the International Conference on Multimedia Retrieval Workshop on Social Events in Web Multimedia (SEWM).
  114. Huy Phan, Lars Hertel, Marco Maass, and Alfred Mertins. 2016. Robust audio event recognition with 1-max pooling convolutional neural networks. arXiv preprint arXiv:1604.06338 (2016).
  115. Karol J. Piczak. 2015. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 1015--1018.
  116. Axel Plinge, Rene Grzeszick, and Gernot A. Fink. 2014. A bag-of-features approach to acoustic event detection. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3704--3708.
  117. Samira Pouyanfar and Shu-Ching Chen. 2016. Semantic event detection using ensemble deep learning. In Proceedings of the 2016 IEEE International Symposium on Multimedia (ISM). IEEE, 203--208.
  118. Samira Pouyanfar and Shu-Ching Chen. 2017. Automatic video event detection for imbalance data using enhanced ensemble deep learning. International Journal of Semantic Computing 11, 1 (2017), 85--109.
  119. Reza Fuad Rachmadi, Keiichi Uchimura, and Gou Koutaki. 2016. Combined convolutional neural network for event recognition. In Proceedings of the Korea-Japan Joint Workshop on Frontiers of Computer Vision. 85--90.
  120. Timo Reuter, Symeon Papadopoulos, Giorgos Petkos, Vasileios Mezaris, Yiannis Kompatsiaris, Philipp Cimiano, Christopher de Vries, and Shlomo Geva. 2013. Social event detection at MediaEval 2013: Challenges, datasets, and evaluation. In Proceedings of the MediaEval Multimedia Benchmark Workshop, Barcelona, Spain, October 18--19, 2013.
  121. Jinyoung Rhee, Jungho Im, and Gregory J. Carbone. 2010. Monitoring agricultural drought for arid and humid regions using multi-sensor remote sensing data. Remote Sensing of Environment 114, 12 (2010), 2875--2887.
  122. Seyed Morteza Safdarnejad, Xiaoming Liu, Lalita Udpa, Brooks Andrus, John Wood, and Dean Craven. 2015. Sports videos in the wild (SVW): A video dataset for sports analysis. In Proceedings of the 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). Vol. 1. IEEE, 1--7.
  123. Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. 2014. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 1041--1044.
  124. Amaia Salvador, Matthias Zeppelzauer, Daniel Manchon-Vizuete, Andrea Calafell, and Xavier Giro-i Nieto. 2015. Cultural event recognition with visual convnets and temporal models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 36--44.
  125. Emmanouil Schinas, Georgios Petkos, Symeon Papadopoulos, and Yiannis Kompatsiaris. 2012. CERTH@ MediaEval 2012 social event detection task. In MediaEval. Citeseer.
  126. Yuhui Shi and Russell C. Eberhart. 1999. Empirical study of particle swarm optimization. In Proceedings of the 1999 Congress on Evolutionary Computation (CEC’99). Vol. 3. IEEE, 1945--1950.
  127. Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  128. Bharat Singh, Xintong Han, Zhe Wu, Vlad I. Morariu, and Larry S. Davis. 2015. Selecting relevant web trained concepts for automated event retrieval. In Proceedings of the IEEE International Conference on Computer Vision. 4561--4569.
  129. Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world anomaly detection in surveillance videos. Center for Research in Computer Vision (CRCV). Technical Report. University of Central Florida (UCF).
  130. Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang, and Wei Zhang. 2018. Optical flow guided feature: A fast and robust motion representation for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1390--1399.
  131. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
  132. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818--2826.
  133. Naoya Takahashi, Michael Gygli, Beat Pfister, and Luc Van Gool. 2016. Deep convolutional neural networks and data augmentation for acoustic event detection. arXiv preprint arXiv:1604.07160 (2016).
  134. Naoya Takahashi, Michael Gygli, and Luc Van Gool. 2018. AENet: Learning deep audio features for video analysis. IEEE Transactions on Multimedia 20, 3 (2018), 513--524.
  135. Planet Team. 2016. Planet application program interface: In Space for Life on Earth. San Francisco, CA.
  136. Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research. Commun. ACM 59, 2 (2016), 64--73.
  137. Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489--4497.
  138. Shen-Fu Tsai, Thomas S. Huang, and Feng Tang. 2011. Album-based object-centric event recognition. In Proceedings of the 2011 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1--6.
  139. Christos Tzelepis, Zhigang Ma, Vasileios Mezaris, Bogdan Ionescu, Ioannis Kompatsiaris, Giulia Boato, Nicu Sebe, and Shuicheng Yan. 2016. Event-based media processing and analysis: A survey of the literature. Image and Vision Computing 53 (2016), 3--19.
  140. Dmitrii Ubskii and Alexei Pugachev. 2016. Sound event detection in real-life audio. IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events (2016).
  141. Jasper R. R. Uijlings, Koen E. A. Van De Sande, Theo Gevers, and Arnold W. M. Smeulders. 2013. Selective search for object recognition. International Journal of Computer Vision 104, 2 (2013), 154--171.
  142. M. W. W. Van Grootel, Tjeerd C. Andringa, and J. D. Krijnders. 2009. DARES-G1: Database of annotated real-world everyday sounds. In Proceedings of the NAG/DAGA International Conference on Acoustics.
  143. Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. 2011. Action recognition by dense trajectories. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3169--3176.
  144. Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 3551--3558.
  145. Jun Wang and Jean-Daniel Zucker. 2000. Solving multiple-instance problem: A lazy learning approach. (2000).
  146. Limin Wang, Zhe Wang, Sheng Guo, and Yu Qiao. 2015. Better exploiting OS-CNNs for better event recognition in images. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 45--52.
  147. Limin Wang, Zhe Wang, Yu Qiao, and Luc Van Gool. 2017. Transferring deep object and scene representations for event recognition in still images. International Journal of Computer Vision (2017), 1--20.
  148. Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
  149. Xiaolong Wang and Abhinav Gupta. 2018. Videos as space-time region graphs. arXiv preprint arXiv:1806.01810 (2018).
  150. Xiaoyang Wang and Qiang Ji. 2015. Video event recognition with deep hierarchical context model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4418--4427.
  151. Yun Wang and Florian Metze. 2016. Recurrent support vector machines for audio-based multimedia event detection. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 265--269.
  152. Yun Wang and Florian Metze. 2017. A first attempt at polyphonic sound event detection using connectionist temporal classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2986--2990.
  153. Yun Wang, Leonardo Neves, and Florian Metze. 2016. Audio-based multimedia event detection using deep recurrent neural networks. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2742--2746.
  154. Xiu-Shen Wei, Bin-Bin Gao, and Jianxin Wu. 2015. Deep spatial pyramid ensemble for cultural event recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 38--44.
  155. Sebastien C. Wong, Adam Gatt, Victor Stamatescu, and Mark D. McDonnell. 2016. Understanding data augmentation for classification: When to warp? arXiv preprint arXiv:1609.08764 (2016).
  156. Zifeng Wu, Yongzhen Huang, and Liang Wang. 2015. Learning representative deep features for image set analysis. IEEE Transactions on Multimedia 17, 11 (2015), 1960--1968.
  157. Yuanjun Xiong, Kai Zhu, Dahua Lin, and Xiaoou Tang. 2015. Recognize complex events from static images by fusing deep channels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1600--1609.
  158. Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. 2015. Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553 (2015).
  159. Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. 2015. A discriminative CNN video representation for event detection. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1798--1807.
  160. Ronald R. Yager and Dimitar P. Filev. 1999. Induced ordered weighted averaging operators. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 29, 2 (1999), 141--150.
  161. Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507--4515.
  162. Guangnan Ye, Yitong Li, Hongliang Xu, Dong Liu, and Shih-Fu Chang. 2015. EventNet: A large scale structured concept library for complex event detection in video. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 471--480.
  163. Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. 2018. Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision 126, 2--4 (2018), 375--389.
  164. Litao Yu, Xiaoshuai Sun, and Zi Huang. 2016. Robust spatial-temporal deep model for multimedia event detection. Neurocomputing 213 (2016), 48--53.
  165. Shoou-I Yu, Lu Jiang, Zexi Mao, Xiaojun Chang, Xingzhong Du, Chuang Gan, Zhenzhong Lan, Zhongwen Xu, Xuanchong Li, Yang Cai, Anurag Kumar, Yajie Miao, Lara Martin, Nikolas Wolfe, Shicheng Xu, Huan Li, Ming Lin, Zhigang Ma, Yi Yang, Deyu Meng, Shiguang Shan, Pinar Duygulu Sahin, Susanne Burger, Florian Metze, Rita Singh, Bhiksha Raj, Teruko Mitamura, Richard Stern, and Alexander Hauptmann. 2014. MER. In Proceedings of the NIST TRECVID Video Retrieval Evaluation Workshop, Vol. 24.
  166. Joe Yue-Hei Ng, Fan Yang, and Larry S. Davis. 2015. Exploiting local features from deep networks for image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 53--61.
  167. Shengxin Zha, Florian Luisier, Walter Andrews, Nitish Srivastava, and Ruslan Salakhutdinov. 2015. Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint arXiv:1503.04144 (2015).
  168. Dongqing Zhang and Dan Ellis. 2001. Detecting sound events in basketball video archive. Dept. Electronic Eng., Columbia Univ., New York (2001).
  169. Xishan Zhang, Hanwang Zhang, Yongdong Zhang, Yang Yang, Meng Wang, Huanbo Luan, Jintao Li, and Tat-Seng Chua. 2016. Deep fusion of multiple semantic cues for complex event recognition. IEEE Transactions on Image Processing 25, 3 (2016), 1033--1046.
  170. Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems. 487--495.
  171. Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. 2017. Uncovering the temporal context for video question answering. International Journal of Computer Vision 124, 3 (2017), 409--421.
