Abstract
While convolutional neural networks (CNNs) have excelled at object recognition, the greater spatial variability of scene images typically makes standard full-image CNN features suboptimal for scene classification. In this article, we investigate a framework that allows greater spatial flexibility, in which the Fisher vector (FV)-encoded distribution of local CNN features, obtained from a multitude of region proposals per image, is considered instead. The CNN features are computed from an augmented pixel-wise representation consisting of multiple modalities of RGB, HHA, and surface normals, as extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity, i.e., that only a small variety of region proposals and their corresponding FV GMM components contribute to scene discriminability, and (2) modal nonsparsity, i.e., that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUN RGB-D Dataset and NYU Depth Dataset V2. Moreover, we apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other well-structured multimodal features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts, and intermodal nonsparsity to let informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also very effective for action recognition, achieving very competitive performance compared with the state of the art.
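The two structured-sparsity penalties named in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: the toy weight vector and group index sets are invented for this example. It contrasts group lasso, which tends to zero out whole groups (component sparsity over GMM components), with exclusive group lasso, which penalizes concentration within a group so every group retains some weight (modal nonsparsity across modalities).

```python
import numpy as np

def group_lasso(w, groups):
    """Sum of L2 norms over groups: drives entire groups of weights
    to zero, so only a few GMM components survive."""
    return sum(np.linalg.norm(w[idx]) for idx in groups)

def exclusive_group_lasso(w, groups):
    """Sum of squared L1 norms within groups: induces competition
    inside each group while keeping every group nonzero overall,
    so all modalities stay represented."""
    return sum(np.abs(w[idx]).sum() ** 2 for idx in groups)

# Toy setup: 6 weights, 3 hypothetical GMM-component groups,
# 2 hypothetical modality groups interleaved over the same weights.
w = np.array([0.5, 0.0, 0.0, 0.0, 0.3, 0.4])
component_groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
modality_groups = [np.array([0, 2, 4]), np.array([1, 3, 5])]

print(group_lasso(w, component_groups))            # 0.5 + 0.0 + 0.5 = 1.0
print(exclusive_group_lasso(w, modality_groups))   # 0.8**2 + 0.4**2 = 0.8
```

Note how the middle component group, being entirely zero, contributes nothing to the group lasso term, while the squared-L1 structure of the exclusive penalty grows quickly if one modality group absorbs all the weight.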
References

- Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T. Barron, Ferran Marques, and Jitendra Malik. 2014. Multiscale combinatorial grouping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’14). 328--335.
- Dan Banica and Cristian Sminchisescu. 2015. Second-order constrained parametric proposals and sequential search-based structured prediction for semantic segmentation in RGB-D images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). 3517--3526.
- Liefeng Bo, Xiaofeng Ren, and Dieter Fox. 2011. Hierarchical matching pursuit for image classification: Architecture and fast algorithms. In Conference on Neural Information Processing Systems (NIPS’11). 2115--2123.
- Joao Carreira, Rui Caseiro, Jorge Batista, and Cristian Sminchisescu. 2012. Semantic segmentation with second-order pooling. In European Conference on Computer Vision (ECCV’12). 430--443.
- Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference (BMVC’14).
- Anoop Cherian, Julien Mairal, Karteek Alahari, and Cordelia Schmid. 2014. Mixing body-part sequences for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’14). 2353--2360.
- Guilhem Chéron, Ivan Laptev, and Cordelia Schmid. 2015. P-CNN: Pose-based CNN features for action recognition. In IEEE International Conference on Computer Vision (ICCV’15). 3218--3226.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). 248--255.
- Victor Escorcia and Juan Niebles. 2013. Spatio-temporal human-object interactions for action recognition in videos. In IEEE International Conference on Computer Vision Workshops. 508--514.
- Gunnar Farnebäck. 2003. Two-frame motion estimation based on polynomial expansion. In Scandinavian Conference on Image Analysis. Springer, 363--370.
- Georgia Gkioxari and Jitendra Malik. 2015. Finding action tubes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). 759--768.
- Yunchao Gong, Liwei Wang, Ruiqi Guo, and Svetlana Lazebnik. 2014. Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision (ECCV’14). 392--407.
- Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. 2014. Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation. International Journal of Computer Vision 112, 2 (2014), 133--149.
- Saurabh Gupta, Pablo Arbeláez, and Jitendra Malik. 2013. Perceptual organization and recognition of indoor scenes from RGB-D images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13). 564--571.
- Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. 2014. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision (ECCV’14). 345--360.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735--1780.
- Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, and Jianguo Zhang. 2015. Jointly learning heterogeneous features for RGB-D activity recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). 5344--5352.
- Junzhou Huang, Tong Zhang, and Dimitris Metaxas. 2011. Learning with structured sparsity. Journal of Machine Learning Research 12 (2011), 3371--3412.
- Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T. Barron, Mario Fritz, Kate Saenko, and Trevor Darrell. 2013. A category-level 3D object dataset: Putting the Kinect to work. In Consumer Depth Cameras for Computer Vision. Springer, 141--165.
- Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. 2010. Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’10). 3304--3311.
- Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. 2013. Towards understanding action recognition. In IEEE International Conference on Computer Vision (ICCV’13). 3192--3199.
- Deguang Kong, Ryohei Fujimaki, Ji Liu, Feiping Nie, and Chris Ding. 2014. Exclusive feature learning on arbitrary structures via ℓ1,2-norm. In Conference on Neural Information Processing Systems (NIPS’14). 1655--1663.
- Yu Kong and Yun Fu. 2015. Bilinear heterogeneous information machine for RGB-D action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). 1054--1062.
- Hema Swetha Koppula, Rudhir Gupta, and Ashutosh Saxena. 2013. Learning human activities and object affordances from RGB-D videos. The International Journal of Robotics Research 32, 8 (2013), 951--970.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Conference on Neural Information Processing Systems (NIPS’12). 1097--1105.
- Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. 2008. Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’08). 1--8.
- Yiyi Liao, Sarath Kodagoda, Yue Wang, Lei Shi, and Yong Liu. 2015. Understand scene categories by objects: A semantic regularized scene classifier using convolutional neural networks. arXiv preprint arXiv:1509.06470 (2015).
- Ivan Lillo, Juan Carlos Niebles, and Alvaro Soto. 2016. A hierarchical pose-based approach to complex action understanding using dictionaries of actionlets and motion poselets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). 1981--1990.
- Marcin Marszalek, Ivan Laptev, and Cordelia Schmid. 2009. Actions in context. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). 2929--2936.
- Nikhil Naikal, Allen Y. Yang, and S. Shankar Sastry. 2011. Informative feature selection for object recognition via sparse PCA. In IEEE International Conference on Computer Vision (ICCV’11). 818--825.
- Omar Oreifej and Zicheng Liu. 2013. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13). 716--723.
- Florent Perronnin, Jorge Sánchez, and Thomas Mensink. 2010. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision (ECCV’10). 143--156.
- Xiaofeng Ren, Liefeng Bo, and Dieter Fox. 2012. RGB-(D) scene labeling: Features and algorithms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’12). 2759--2766.
- Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. 2012. A database for fine grained activity detection of cooking activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’12). 1194--1201.
- Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. 2013. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision 105, 3 (2013), 222--245.
- Amir Shahroudy, Tian-Tsong Ng, Qingxiong Yang, and Gang Wang. 2016. Multimodal multipart learning for action recognition in depth videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 10 (2016), 2123--2129.
- Nathan Silberman and Rob Fergus. 2011. Indoor scene segmentation using a structured light sensor. In IEEE International Conference on Computer Vision Workshops. 601--608.
- Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (ECCV’12). 746--760.
- Nathan Silberman, David Sontag, and Rob Fergus. 2014. Instance segmentation of indoor scenes using a coverage loss. In European Conference on Computer Vision (ECCV’14). 616--631.
- Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Conference on Neural Information Processing Systems (NIPS’14). 568--576.
- Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. 2015. SUN RGB-D: A RGB-D scene understanding benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). 567--576.
- Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1 (1996), 267--288.
- Anran Wang, Jianfei Cai, Jiwen Lu, and Tat-Jen Cham. 2016. Modality and component aware feature fusion for RGB-D scene classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16).
- Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. 2011. Action recognition by dense trajectories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’11). 3169--3176.
- Hua Wang, Feiping Nie, Heng Huang, Shannon Risacher, Chibiao Ding, Andrew J. Saykin, and Li Shen. 2011. Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance. In IEEE International Conference on Computer Vision (ICCV’11). 557--562.
- Hua Wang, Feiping Nie, Heng Huang, Shannon L. Risacher, Andrew J. Saykin, and Li Shen. 2012. Identifying disease sensitive and quantitative trait-relevant biomarkers from multidimensional heterogeneous imaging genetics data via sparse multimodal multitask learning. Bioinformatics 28, 12 (2012), i127--i136.
- Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision (ICCV’13). 3551--3558.
- Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. 2014. Learning actionlet ensemble for 3D human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 5 (2014), 914--927.
- John Wright, Allen Y. Yang, Arvind Ganesh, Shankar S. Sastry, and Yi Ma. 2009. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 2 (2009), 210--227.
- Jianxiong Xiao, Andrew Owens, and Antonio Torralba. 2013. SUN3D: A database of big spaces reconstructed using SfM and object labels. In IEEE International Conference on Computer Vision (ICCV’13). 1625--1632.
- Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, and Jianfei Cai. 2015. Can partial strong labels boost multi-label object recognition? arXiv preprint arXiv:1504.05843 (2015).
- Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma. 2010. Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19, 11 (2010), 2861--2873.
- Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei. 2011. Human action recognition by learning bases of action attributes and parts. In IEEE International Conference on Computer Vision (ICCV’11). 1331--1338.
- Donggeun Yoo, Sunggyun Park, Joon-Young Lee, and In So Kweon. 2014. Fisher kernel for deep neural activations. arXiv preprint arXiv:1412.1628 (2014).
- Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 1 (2006), 49--67.
- Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). 4694--4702.
- Yu Zhang, Xiu-Shen Wei, Jianxin Wu, Jianfei Cai, Jiangbo Lu, Viet-Anh Nguyen, and Minh N. Do. 2015. Weakly supervised fine-grained image categorization. arXiv preprint arXiv:1504.04943 (2015).
- Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using places database. In Conference on Neural Information Processing Systems (NIPS’14). 487--495.
- Yang Zhou, Rong Jin, and Steven Hoi. 2010. Exclusive lasso for multi-task feature selection. In International Conference on Artificial Intelligence and Statistics (AISTATS’10). 988--995.
- Yang Zhou, Bingbing Ni, Richang Hong, Meng Wang, and Qi Tian. 2015. Interaction part mining: A mid-level approach for fine-grained action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). 3323--3331.
- Zhen Zuo, Gang Wang, Bing Shuai, Lifan Zhao, Qingxiong Yang, and Xudong Jiang. 2014. Learning discriminative and shareable features for scene classification. In European Conference on Computer Vision (ECCV’14). 552--568.
Index Terms
Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond