research-article

Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning

Published: 05 February 2019

Abstract

Recent studies have shown that spatial relationships among objects are very important for visual recognition, since they provide rich clues about object contexts within images. In this article, we introduce a novel method to learn a Semantic Feature Map (SFM) with attention-based deep neural networks for image and video classification in an end-to-end manner, aiming to explicitly model the spatial object contexts within images. In particular, we apply designed gate units to the extracted object features for important object selection and noise removal. The selected object features are then organized into the proposed SFM, a compact and discriminative representation that preserves the spatial relationships among objects. Finally, we employ either Fully Convolutional Networks (FCN) or Long Short-Term Memory (LSTM) as the classifier on top of the SFM for content recognition. A novel multi-task learning framework with image classification loss, object localization loss, and grid labeling loss is also introduced to help better learn the model parameters. We conduct extensive evaluations and comparative studies to verify the effectiveness of the proposed approach on the Pascal VOC 2007/2012 and MS-COCO benchmarks for image classification. In addition, the experimental results show that SFMs learned in the image domain can be successfully transferred to the CCV and FCVID benchmarks for video classification.
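The core idea described above, gating each detected object's feature vector by a learned importance score and then depositing the gated features into a spatial grid, can be sketched in a few lines of NumPy. This is a minimal illustration of the mechanism, not the authors' implementation: the function name `build_sfm`, the scalar sigmoid gate, and the uniform gate weights are all illustrative assumptions, and a real model would learn the gate parameters end-to-end alongside the classifier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_sfm(object_feats, box_centers, grid_size=8, feat_dim=4,
              gate_w=None, gate_b=0.0):
    """Assemble a toy Semantic Feature Map: gate each object's feature
    vector by a learned importance score, then place the gated features
    into the grid cell covering the object's (normalized) box center,
    so spatial relationships among objects are preserved."""
    sfm = np.zeros((grid_size, grid_size, feat_dim))
    if gate_w is None:
        # Hypothetical gate parameters; learned jointly in the real model.
        gate_w = np.ones(feat_dim) / feat_dim
    for feat, (cx, cy) in zip(object_feats, box_centers):
        g = sigmoid(feat @ gate_w + gate_b)        # scalar gate in (0, 1)
        gx = min(int(cx * grid_size), grid_size - 1)
        gy = min(int(cy * grid_size), grid_size - 1)
        sfm[gy, gx] += g * feat                    # noisy objects (g ~ 0) are suppressed
    return sfm
```

A classifier such as an FCN or an LSTM scanning the grid cells would then operate on the resulting `grid_size x grid_size x feat_dim` tensor in place of raw convolutional features.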


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 1s
Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data
January 2019, 265 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3309769

Copyright © 2019 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 October 2017
• Revised: 1 March 2018
• Accepted: 1 June 2018
• Published: 5 February 2019
