research-article

Deep Patch Representations with Shared Codebook for Scene Classification

Published: 24 January 2019

Abstract

Scene classification is a challenging problem. Compared with object images, scene images are more abstract because they are composed of objects, and the two kinds of images differ in scale and compositional structure. An important open question for scene classification is how to effectively integrate local mid-level semantic representations that cover both object and scene concepts. In this article, the idea of a shared codebook is introduced, which integrates deep learning, concept features, and local feature encoding techniques. More specifically, the shared local feature codebook is generated from the combined ImageNet1K and Places365 concepts (Mixed1365) using convolutional neural networks. Because the Mixed1365 features cover the semantic information of both object and scene concepts, a shared codebook can be extracted from them that contains only a subset of the full 1,365 concepts while keeping the same codebook size. The shared codebook not only provides complementary representations without additional codebook training but can also be adaptively extracted for different scene classification tasks. A method is proposed that fuses the features encoded with the original codebook and with the shared codebook for scene classification, which yields more comprehensive and representative image features. Extensive experiments on two public datasets validate the effectiveness of the proposed method, and several observations further illustrate the advantages of the shared codebook.
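The abstract does not say how the shared codebook is extracted, which encoder is used, or how the two encodings are fused. The following is a minimal sketch of one possible reading, under three assumptions that are not confirmed by the abstract: the shared codebook is obtained by restricting each codeword of the original codebook to a chosen subset of concept dimensions (so the number of codewords is unchanged and no retraining is needed), the encoder is a simple VLAD-style residual aggregation, and fusion is plain concatenation. All function names and the random stand-in data are hypothetical.

```python
import numpy as np

N_OBJECT, N_SCENE = 1000, 365        # ImageNet1K + Places365 concepts
N_CONCEPTS = N_OBJECT + N_SCENE      # Mixed1365


def slice_codebook(codebook, concept_idx):
    """Assumed extraction: keep every codeword, but only a subset of concepts."""
    return codebook[:, concept_idx]


def vlad_encode(patches, codebook):
    """VLAD-style encoding (an assumed encoder, chosen for illustration)."""
    # Squared distance from every patch to every codeword: (n_patches, K)
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    enc = np.zeros_like(codebook, dtype=float)
    for k in range(codebook.shape[0]):
        members = patches[nearest == k]
        if len(members):
            # Sum of residuals of the patches assigned to codeword k
            enc[k] = (members - codebook[k]).sum(axis=0)
    enc = enc.ravel()
    norm = np.linalg.norm(enc)
    return enc / norm if norm > 0 else enc


def fused_representation(patches, codebook, concept_idx):
    """Concatenate the encodings from the original and the shared codebook."""
    shared = slice_codebook(codebook, concept_idx)
    full_enc = vlad_encode(patches, codebook)
    shared_enc = vlad_encode(patches[:, concept_idx], shared)
    return np.concatenate([full_enc, shared_enc])


# Stand-in data: 20 patch features with 1,365 concept scores each, and a
# 4-codeword "codebook" picked from the patches instead of a trained one.
rng = np.random.default_rng(0)
patches = rng.random((20, N_CONCEPTS))
codebook = patches[:4].copy()
scene_only = np.arange(N_OBJECT, N_CONCEPTS)   # hypothetical concept subset
feature = fused_representation(patches, codebook, scene_only)
print(feature.shape)
```

With K codewords and a subset of S concepts, the fused vector in this sketch has K × (1365 + S) entries. Choosing a different `concept_idx` per task is one way to read "adaptively extracted" without training a new codebook.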

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 1s
Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data
January 2019
265 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3309769

Copyright © 2019 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 24 January 2019
• Accepted: 1 June 2018
• Revised: 1 March 2018
• Received: 1 October 2017

Qualifiers

• research-article
• Research
• Refereed