Abstract
Recent studies have shown that spatial relationships among objects are important for visual recognition, as they provide rich clues about object context within an image. In this article, we introduce a novel method that learns a Semantic Feature Map (SFM) with attention-based deep neural networks for image and video classification in an end-to-end manner, explicitly modeling the spatial context of objects within an image. In particular, we apply designed gate units to the extracted object features to select important objects and remove noise. The selected object features are then organized into the proposed SFM, a compact and discriminative representation that preserves the spatial relationships among objects. Finally, we employ either Fully Convolutional Networks (FCNs) or Long Short-Term Memory (LSTM) as the classifier on top of the SFM for content recognition. A novel multi-task learning framework with an image classification loss, an object localization loss, and a grid labeling loss is also introduced to better learn the model parameters. We conduct extensive evaluations and comparative studies on the Pascal VOC 2007/2012 and MS-COCO benchmarks to verify the effectiveness of the proposed approach for image classification. The experimental results also show that SFMs learned in the image domain transfer successfully to the CCV and FCVID benchmarks for video classification.
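To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of how gated object features might be assembled into a grid-shaped semantic feature map. The gating weights, grid size, and max-pooling for colliding cells are illustrative assumptions; in the paper these components are learned end-to-end inside the network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_sfm(obj_feats, boxes, w_gate, grid=7):
    """Assemble a toy Semantic Feature Map from per-object features.

    obj_feats: (N, D) features extracted for N detected objects.
    boxes:     (N, 4) normalized [x1, y1, x2, y2] coordinates in [0, 1].
    w_gate:    (D,) gating weights (random here, an assumption; learned
               in the actual model).
    Returns a (grid, grid, D) map: each gated feature is placed at the
    grid cell containing its box center, preserving spatial layout.
    """
    n, d = obj_feats.shape
    sfm = np.zeros((grid, grid, d))
    gates = sigmoid(obj_feats @ w_gate)          # per-object importance in (0, 1)
    for i in range(n):
        cx = (boxes[i, 0] + boxes[i, 2]) / 2.0   # box center, normalized
        cy = (boxes[i, 1] + boxes[i, 3]) / 2.0
        gx = min(int(cx * grid), grid - 1)       # map center to a grid cell
        gy = min(int(cy * grid), grid - 1)
        # the gate suppresses noisy objects; max-pool when two objects
        # fall into the same cell (one of several plausible choices)
        sfm[gy, gx] = np.maximum(sfm[gy, gx], gates[i] * obj_feats[i])
    return sfm

# toy usage: three objects with 8-dimensional features
feats = rng.standard_normal((3, 8))
boxes = np.array([[0.0, 0.0, 0.2, 0.2],
                  [0.4, 0.4, 0.6, 0.6],
                  [0.8, 0.1, 1.0, 0.3]])
sfm = build_sfm(feats, boxes, w_gate=rng.standard_normal(8))
print(sfm.shape)  # (7, 7, 8)
```

A classifier (an FCN or an LSTM scanning the grid cells) would then operate on this map, and the multi-task losses would supervise both the final labels and the per-cell object assignments.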
Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning