Abstract
This article addresses the problem of recognizing an event from a single related picture. The problem is known to be particularly hard because of the large number of event classes and the limited information contained in a single shot. To achieve reliable detection, we propose combining multiple classifiers, and we compare three alternative strategies for fusing their results: (i) induced ordered weighted averaging operators, (ii) genetic algorithms, and (iii) particle swarm optimization. Each method aims to determine the optimal weights to assign to the decision scores yielded by different deep models, according to its own optimization strategy. Experiments on three event recognition datasets evaluate the performance of various deep models, both alone and selectively combined. The results demonstrate that the proposed approach outperforms both traditional multiple-classifier solutions based on uniform weighting and recent state-of-the-art approaches.
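To make the idea of weighted score fusion concrete, the following is a minimal sketch of one of the three strategies, particle swarm optimization, searching for per-model weights. It is not the authors' implementation. The function names (`fuse`, `accuracy`, `pso_weights`, `make_toy_scores`), the swarm hyperparameters, the synthetic scores, and the use of classification accuracy as the fitness function are all assumptions made for illustration.

```python
import numpy as np

def fuse(scores, weights):
    """Weighted late fusion of decision scores.
    scores: (n_models, n_samples, n_classes); weights: (n_models,)."""
    w = np.clip(weights, 0.0, None)
    w = w / (w.sum() + 1e-12)                # normalize so weights sum to 1
    return np.tensordot(w, scores, axes=1)   # -> (n_samples, n_classes)

def accuracy(scores, labels, weights):
    """Fitness (assumed): fraction of samples whose fused argmax is correct."""
    return float(np.mean(fuse(scores, weights).argmax(axis=1) == labels))

def pso_weights(scores, labels, n_particles=20, n_iters=30,
                inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO over fusion weights (hyperparameters are assumptions)."""
    rng = np.random.default_rng(seed)
    n_models = scores.shape[0]
    pos = rng.random((n_particles, n_models))
    pos[0] = 1.0 / n_models                  # one particle starts at uniform weighting
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                    # per-particle best positions
    best_fit = np.array([accuracy(scores, labels, p) for p in pos])
    g = best_pos[best_fit.argmax()].copy()   # swarm-wide best position
    g_fit = float(best_fit.max())
    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = inertia * vel + c1 * r1 * (best_pos - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)
        fit = np.array([accuracy(scores, labels, p) for p in pos])
        improved = fit > best_fit
        best_pos[improved] = pos[improved]
        best_fit[improved] = fit[improved]
        if best_fit.max() > g_fit:
            g_fit = float(best_fit.max())
            g = best_pos[best_fit.argmax()].copy()
    w = np.clip(g, 0.0, None)
    return w / (w.sum() + 1e-12), g_fit

def make_toy_scores(n_samples=200, n_classes=5, noise=(0.6, 1.0, 2.0), seed=1):
    """Synthetic stand-in for the scores of three models of differing reliability."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_classes, n_samples)
    one_hot = np.eye(n_classes)[labels]
    scores = np.stack([one_hot + s * rng.standard_normal(one_hot.shape)
                       for s in noise])
    return scores, labels

scores, labels = make_toy_scores()
weights, fitness = pso_weights(scores, labels)
uniform = np.full(scores.shape[0], 1.0 / scores.shape[0])
print("uniform accuracy:", accuracy(scores, labels, uniform))
print("PSO weights:", weights, "accuracy:", fitness)
```

Because one particle is initialized at uniform weights, the returned fitness can never fall below the uniform-weighting baseline on the data used for the search. In practice the weights would be fitted on a held-out validation split rather than on the test images.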
Index Terms
Ensemble of Deep Models for Event Recognition