Abstract
In this work, we address the task of scene recognition from image data. A scene is a spatially correlated arrangement of visual semantic contents, also known as concepts, e.g., “chair,” “car,” and “sky.” Learning representations from visual semantic content is one of the most natural ideas, as it mimics how humans perceive visual information. The semantic multinomial (SMN) representation is one such representation: it captures semantic information using the posterior probabilities of concepts. The core step in obtaining an SMN representation is building the concept models, which requires ground-truth (true) concept labels for every concept present in an image. However, manually labeling concepts is not feasible in practice because of the large number of images in a dataset. To address this issue, we propose an approach for generating pseudo-concepts in the absence of true concept labels. We use pre-trained deep CNN-based architectures, in which the activation maps (filter responses) from convolutional layers serve as initial cues to the pseudo-concepts. Non-significant activation maps are removed using the proposed filter-specific threshold-based approach, which eliminates non-prominent concepts from the data. Further, we propose a grouping mechanism that merges identical pseudo-concepts using subspace modeling of filter responses to achieve a non-redundant representation. Experimental studies show that the SMN representation generated using pseudo-concepts achieves comparable results for scene recognition on standard datasets such as MIT-67 and SUN-397, even in the absence of true concept labels.
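As a rough illustration of two ideas named in the abstract, the sketch below prunes activation maps with one threshold per filter and turns concept scores into an SMN-style probability vector. It is not the paper's method: the threshold rule (a fraction `alpha` of each filter's mean peak response), the softmax normalization, and all function names are assumptions chosen for illustration, and plain Python lists are used only to keep the sketch dependency-free.

```python
import math

def peak(activation_map):
    """Largest response in one 2-D activation map (a list of rows)."""
    return max(max(row) for row in activation_map)

def filter_thresholds(dataset, alpha=0.5):
    """One threshold per filter; dataset[n][k] is the map of filter k on image n.

    Assumed rule (not from the paper): alpha times the filter's mean
    peak response over all images.
    """
    n_filters = len(dataset[0])
    return [alpha * sum(peak(img[k]) for img in dataset) / len(dataset)
            for k in range(n_filters)]

def prominent_filters(image_maps, thresholds):
    """Indices of filters whose peak on this image exceeds their own threshold."""
    return [k for k, m in enumerate(image_maps) if peak(m) > thresholds[k]]

def semantic_multinomial(scores):
    """Softmax that maps per-concept scores to non-negative values summing to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In this sketch, a filter that never responds gets a threshold of zero and is still discarded, because the comparison is strict. The grouping of pseudo-concepts by subspace modeling is not covered here.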
Index Terms
Visual Semantic-Based Representation Learning Using Deep CNNs for Scene Recognition