Visual Semantic-Based Representation Learning Using Deep CNNs for Scene Recognition

Published: 11 May 2021

Abstract

In this work, we address the task of scene recognition from image data. A scene is a spatially correlated arrangement of various visual semantic contents, also known as concepts, e.g., “chair,” “car,” “sky,” etc. Representation learning using visual semantic content can be regarded as one of the most intuitive ideas, as it mimics the way humans perceive visual information. Semantic multinomial (SMN) representation is one such representation; it captures semantic information using the posterior probabilities of concepts. The core part of obtaining an SMN representation is building the concept models, which requires ground-truth (true) concept labels for every concept present in an image. However, manually labeling concepts is practically infeasible given the large number of images in a dataset. To address this issue, we propose an approach for generating pseudo-concepts in the absence of true concept labels. We use pre-trained deep CNN-based architectures, in which activation maps (filter responses) from convolutional layers are considered as initial cues to the pseudo-concepts. Non-significant activation maps are removed using the proposed filter-specific threshold-based approach, which removes non-prominent concepts from the data. Further, we propose a grouping mechanism that groups identical pseudo-concepts using subspace modeling of filter responses to achieve a non-redundant representation. Experimental studies show that the SMN representation generated using pseudo-concepts achieves comparable results for scene recognition tasks on standard datasets such as MIT-67 and SUN-397, even in the absence of true concept labels.
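To make two of the ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that an SMN vector can be obtained by normalizing per-concept scores with a softmax, and that a filter-specific threshold can be taken as the mean peak response of each filter over a set of images. The abstract specifies neither rule, and the function names `semantic_multinomial`, `filter_thresholds`, and `significant_filters` are hypothetical.

```python
import math
from typing import List, Sequence


def semantic_multinomial(concept_scores: Sequence[float]) -> List[float]:
    """Turn raw per-concept scores into posterior-like probabilities.

    Assumption: a softmax is used for normalization; the abstract only says
    that SMN uses posterior probabilities of concepts.
    """
    peak = max(concept_scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in concept_scores]
    total = sum(exps)
    return [value / total for value in exps]


def filter_thresholds(peak_responses: Sequence[Sequence[float]]) -> List[float]:
    """Compute one threshold per filter.

    peak_responses[i][k] is the peak activation of filter k on image i.
    Assumption: the threshold is the mean peak response over all images.
    """
    n_images = len(peak_responses)
    n_filters = len(peak_responses[0])
    return [
        sum(row[k] for row in peak_responses) / n_images
        for k in range(n_filters)
    ]


def significant_filters(image_peaks: Sequence[float],
                        thresholds: Sequence[float]) -> List[int]:
    """Return indices of filters whose peak response reaches its threshold."""
    return [
        k for k, (peak, threshold) in enumerate(zip(image_peaks, thresholds))
        if peak >= threshold
    ]


if __name__ == "__main__":
    # Toy data: 2 images, 3 filters (values are illustrative only).
    peaks = [[4.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
    thresholds = filter_thresholds(peaks)
    for image_peaks in peaks:
        kept = significant_filters(image_peaks, thresholds)
        print(kept, semantic_multinomial([image_peaks[k] for k in kept]))
```

In this sketch, the filters that survive thresholding play the role of pseudo-concept cues for an image. Grouping redundant filters by subspace modeling, as the abstract describes, is outside its scope.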

