
Ensemble of Deep Models for Event Recognition

Published: 01 May 2018

Abstract

In this article, we address the problem of recognizing an event from a single related picture. Given the large number of event classes and the limited information contained in a single shot, the problem is known to be particularly hard. To achieve reliable recognition, we combine multiple classifiers and compare three alternative strategies for fusing their outputs: (i) induced ordered weighted averaging operators, (ii) genetic algorithms, and (iii) particle swarm optimization. Each method aims to determine the weights to assign to the decision scores produced by the different deep models, according to its own optimization strategy. Experiments on three event recognition datasets evaluate the performance of various deep models, both alone and selectively combined. The results show that the proposed approach outperforms both traditional multiple-classifier solutions based on uniform weighting and recent state-of-the-art approaches.
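
The weighted score fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the fitness function, the swarm settings, or the data layout, so the function names (`fuse_scores`, `pso_weights`), the use of validation accuracy as fitness, the inertia and acceleration constants, and the choice to seed one particle with uniform weights are all assumptions made for illustration. The sketch covers only the particle swarm variant and assumes each model outputs a per-class score for every image.

```python
import numpy as np


def normalize(weights):
    """Clip weights to be non-negative and scale them to sum to 1."""
    w = np.clip(np.asarray(weights, dtype=float), 0.0, None)
    total = w.sum()
    if total == 0.0:
        # Degenerate case: fall back to uniform weighting.
        return np.full(w.shape, 1.0 / w.size)
    return w / total


def fuse_scores(scores, weights):
    """Weighted late fusion of decision scores.

    scores: array of shape (n_models, n_samples, n_classes)
    weights: one weight per model
    Returns fused scores of shape (n_samples, n_classes).
    """
    return np.tensordot(normalize(weights), scores, axes=1)


def accuracy(scores, weights, labels):
    """Fraction of samples whose fused top-scoring class matches the label."""
    fused = fuse_scores(scores, weights)
    return float(np.mean(fused.argmax(axis=1) == labels))


def pso_weights(scores, labels, n_particles=20, n_iters=50, seed=0):
    """Search for fusion weights with a basic particle swarm (assumed settings)."""
    rng = np.random.default_rng(seed)
    n_models = scores.shape[0]

    pos = rng.random((n_particles, n_models))
    # Assumption: one particle starts at uniform weights, so the result
    # is never worse than uniform weighting on the data used for fitness.
    pos[0] = 1.0 / n_models
    vel = np.zeros_like(pos)

    pbest = pos.copy()
    pbest_fit = np.array([accuracy(scores, p, labels) for p in pos])
    best = pbest[pbest_fit.argmax()].copy()
    best_fit = float(pbest_fit.max())

    for _ in range(n_iters):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Standard velocity update: inertia + cognitive + social terms.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (best - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)

        fit = np.array([accuracy(scores, p, labels) for p in pos])
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]

        if pbest_fit.max() > best_fit:
            best = pbest[pbest_fit.argmax()].copy()
            best_fit = float(pbest_fit.max())

    return normalize(best), best_fit
```

In practice the weights would be fitted on a held-out validation split and then applied unchanged to test scores. Accuracy measured on the same data used for the search overstates the gain over uniform weighting.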

