Abstract
Event recognition is one of the areas of multimedia research attracting considerable attention. Because it applies to a wide range of scenarios, from personal to collective events, a number of interesting solutions for event recognition using multimedia information sources have been proposed. Meanwhile, following its immense success in classification, object recognition, and detection, deep learning has been shown to perform well in event recognition tasks as well. Thus, a large portion of the literature on event analysis now relies on deep learning architectures. In this article, we provide an extensive overview of the existing literature in this field, analyzing how deep features and deep learning architectures have changed the performance of event recognition frameworks. The literature on event-based analysis of multimedia content can be categorized into four groups, namely (i) event recognition in single images; (ii) event recognition in personal photo collections; (iii) event recognition in videos; and (iv) event recognition in audio recordings. We extensively review different deep-learning-based frameworks for event recognition in these four domains. Furthermore, we review some benchmark datasets made available to the scientific community to validate novel event recognition pipelines. In the final part of the manuscript, we discuss in detail the basic insights gathered from the literature review and identify future trends and challenges.
Cross Ref
- Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818--2826.Google Scholar
Cross Ref
- Naoya Takahashi, Michael Gygli, Beat Pfister, and Luc Van Gool. 2016. Deep convolutional neural networks and data augmentation for acoustic event detection. arXiv preprint arXiv:1604.07160 (2016).Google Scholar
- Naoya Takahashi, Michael Gygli, and Luc Van Gool. 2018. Aenet: Learning deep audio features for video analysis. IEEE Transactions on Multimedia 20, 3 (2018), 513--524. Google Scholar
Digital Library
- Planet Team. 2016. Planet application program interface: In Space for Life on Earth. San Francisco, CA.Google Scholar
- Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. YFCC100M: The new data in multimedia research. Commun. ACM 59, 2 (2016), 64--73. Google Scholar
Digital Library
- Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489--4497. Google Scholar
Digital Library
- Shen-Fu Tsai, Thomas S. Huang, and Feng Tang. 2011. Album-based object-centric event recognition. In Proceedings of the 2011 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1--6. Google Scholar
Digital Library
- Christos Tzelepis, Zhigang Ma, Vasileios Mezaris, Bogdan Ionescu, Ioannis Kompatsiaris, Giulia Boato, Nicu Sebe, and Shuicheng Yan. 2016. Event-based media processing and analysis: A survey of the literature. Image and Vision Computing 53 (2016), 3--19. Google Scholar
Digital Library
- Dmitrii Ubskii and Alexei Pugachev. 2016. Sound event detection in real-life audio. IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events (2016).Google Scholar
- Jasper R. R. Uijlings, Koen E. A. Van De Sande, Theo Gevers, and Arnold W. M. Smeulders. 2013. Selective search for object recognition. International Journal of Computer Vision 104, 2 (2013), 154--171. Google Scholar
Digital Library
- MWW Van Grootel, Tjeerd C Andringa, and JD Krijnders. 2009. DARES-G1: Database of annotated real-world everyday sounds. In Proceedings of the NAG/DAGA International Conference on Acoustics.Google Scholar
- Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. 2011. Action recognition by dense trajectories. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3169--3176. Google Scholar
Digital Library
- Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 3551--3558. Google Scholar
Digital Library
- Jun Wang and Jean-Daniel Zucker. 2000. Solving multiple-instance problem: A lazy learning approach. (2000).Google Scholar
- Limin Wang, Zhe Wang, Sheng Guo, and Yu Qiao. 2015. Better exploiting OS-CNNS for better event recognition in images. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 45--52. Google Scholar
Digital Library
- Limin Wang, Zhe Wang, Yu Qiao, and Luc Van Gool. 2017. Transferring deep object and scene representations for event recognition in still images. International Journal of Computer Vision (2017), 1--20. Google Scholar
Digital Library
- Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.Google Scholar
Cross Ref
- Xiaolong Wang and Abhinav Gupta. 2018. Videos as space-time region graphs. arXiv preprint arXiv:1806.01810 (2018).Google Scholar
- Xiaoyang Wang and Qiang Ji. 2015. Video event recognition with deep hierarchical context model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4418--4427.Google Scholar
Cross Ref
- Yun Wang and Florian Metze. 2016. Recurrent support vector machines for audio-based multimedia event detection. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 265--269. Google Scholar
Digital Library
- Yun Wang and Florian Metze. 2017. A first attempt at polyphonic sound event detection using connectionist temporal classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2986--2990.Google Scholar
Cross Ref
- Yun Wang, Leonardo Neves, and Florian Metze. 2016. Audio-based multimedia event detection using deep recurrent neural networks. In Proceedings of the 2016 I/eee International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2742--2746.Google Scholar
Cross Ref
- Xiu-Shen Wei, Bin-Bin Gao, and Jianxin Wu. 2015. Deep spatial pyramid ensemble for cultural event recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 38--44. Google Scholar
Digital Library
- Sebastien C. Wong, Adam Gatt, Victor Stamatescu, and Mark D. McDonnell. 2016. Understanding data augmentation for classification: When to warp? arXiv preprint arXiv:1609.08764 (2016).Google Scholar
- Zifeng Wu, Yongzhen Huang, and Liang Wang. 2015. Learning representative deep features for image set analysis. IEEE Transactions on Multimedia 17, 11 (2015), 1960--1968.Google Scholar
Digital Library
- Yuanjun Xiong, Kai Zhu, Dahua Lin, and Xiaoou Tang. 2015. Recognize complex events from static images by fusing deep channels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1600--1609.Google Scholar
- Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. 2015. Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553 (2015).Google Scholar
- Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. 2015. A discriminative CNN video representation for event detection. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1798--1807.Google Scholar
- Ronald R. Yager and Dimitar P. Filev. 1999. Induced ordered weighted averaging operators. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 29, 2 (1999), 141--150. Google Scholar
Digital Library
- Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507--4515. Google Scholar
Digital Library
- Guangnan Ye, Yitong Li, Hongliang Xu, Dong Liu, and Shih-Fu Chang. 2015. Eventnet: A large scale structured concept library for complex event detection in video. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 471--480. Google Scholar
Digital Library
- Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. 2018. Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision 126, 2--4 (2018), 375--389. Google Scholar
Digital Library
- Litao Yu, Xiaoshuai Sun, and Zi Huang. 2016. Robust spatial-temporal deep model for multimedia event detection. Neurocomputing 213 (2016), 48--53.Google Scholar
Cross Ref
- Shoou-I Yu, Lu Jiang, Zexi Mao, Xiaojun Chang, Xingzhong Du, Chuang Gan, Zhenzhong Lan, Zhongwen Xu, Xuanchong Li, Yang Cai, Anurag Kumar, Yajie Miao, Lara Martin, Nikolas Wolfe, Shicheng Xu, Huan Li, Ming Lin, Zhigang Ma, Yi Yang, Deyu Meng, Shiguang Shan, Pinar Duygulu Sahin, Susanne Burger, Florian Metze, Rita Singh, Bhiksha Raj, Teruko Mitamura, Richard Stern, and Alexander Hauptmann. 2014. MER. In Proceedings of the NIST TRECVID Video Retrieval Evaluation Workshop, Vol. 24.Google Scholar
- Joe Yue-Hei Ng, Fan Yang, and Larry S. Davis. 2015. Exploiting local features from deep networks for image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 53--61.Google Scholar
- Shengxin Zha, Florian Luisier, Walter Andrews, Nitish Srivastava, and Ruslan Salakhutdinov. 2015. Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint arXiv:1503.04144 (2015).Google Scholar
- Dongqing Zhang and Dan Ellis. 2001. Detecting sound events in basketball video archive. Dept. Electronic Eng., Columbia Univ., New York (2001).Google Scholar
- Xishan Zhang, Hanwang Zhang, Yongdong Zhang, Yang Yang, Meng Wang, Huanbo Luan, Jintao Li, and Tat-Seng Chua. 2016. Deep fusion of multiple semantic cues for complex event recognition. IEEE Transactions on Image Processing 25, 3 (2016), 1033--1046.Google Scholar
Digital Library
- Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems. 487--495. Google Scholar
Digital Library
- Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. 2017. Uncovering the temporal context for video question answering. International Journal of Computer Vision 124, 3 (2017), 409--421. Google Scholar
Digital Library
How Deep Features Have Improved Event Recognition in Multimedia: A Survey