Structure-Aware Multimodal Feature Fusion for RGB-D Scene Classification and Beyond

Published: 22 May 2018

Abstract

While convolutional neural networks (CNNs) excel at object recognition, the greater spatial variability of scene images typically makes standard full-image CNN features suboptimal for scene classification. In this article, we investigate a framework that allows greater spatial flexibility: instead of full-image features, we consider the Fisher vector (FV)-encoded distribution of local CNN features obtained from a multitude of region proposals per image. The CNN features are computed from an augmented pixel-wise representation comprising multiple modalities, namely RGB, HHA, and surface normals, all extracted from RGB-D data. More significantly, we make two postulates: (1) component sparsity, that only a small variety of region proposals and their corresponding FV Gaussian mixture model (GMM) components contribute to scene discriminability, and (2) modal nonsparsity, that features from all modalities are encouraged to coexist. In our proposed feature fusion framework, these postulates are implemented through regularization terms that apply group lasso to GMM components and exclusive group lasso across modalities. By learning and combining regressors for both proposal-based FV features and global CNN features, we achieve state-of-the-art scene classification performance on the SUNRGBD Dataset and NYU Depth Dataset V2. We further apply our feature fusion framework to an action recognition task to demonstrate that it generalizes to other multimodal, well-structured features. In particular, for action recognition, we enforce interpart sparsity to choose more discriminative body parts, and intermodal nonsparsity so that informative features from both appearance and motion modalities coexist. Experimental results on the JHMDB and MPII Cooking Datasets show that our feature fusion is also effective for action recognition, achieving performance competitive with the state of the art.
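To make the two regularizers concrete, the following is a minimal sketch using the textbook definitions of group lasso (a sum of per-group L2 norms) and exclusive group lasso (a sum of squared per-group L1 norms). It is not the article's exact objective, which this abstract does not state. The weight layout `(modalities, components, dims)` and the function names are assumptions made for illustration only.

```python
import numpy as np

def component_group_lasso(W):
    """Group lasso with one group per GMM component (assumed layout).

    W has shape (modalities, components, dims). Each group gathers the
    weights of one component across all modalities, so the penalty
    pushes entire components toward zero (component sparsity).
    """
    per_component_l2 = np.sqrt((W ** 2).sum(axis=(0, 2)))
    return float(per_component_l2.sum())

def modality_exclusive_lasso(W):
    """Exclusive group lasso with one group per modality (assumed layout).

    The squared L1 norm of each modality is summed. For a fixed total
    weight magnitude this is smaller when the weight is spread across
    modalities, which discourages any modality from being dropped.
    """
    per_modality_l1 = np.abs(W).sum(axis=(1, 2))
    return float((per_modality_l1 ** 2).sum())

# Two hypothetical weight tensors: 2 modalities, 3 components, 2 dims.
# Both use only component 1, with the same overall L2 energy.
W_single = np.zeros((2, 3, 2))
W_single[0, 1, :] = [3.0, 4.0]   # all weight in modality 0

W_spread = np.zeros((2, 3, 2))
W_spread[0, 1, :] = [3.0, 0.0]   # weight shared by modality 0 ...
W_spread[1, 1, :] = [0.0, 4.0]   # ... and modality 1

print(component_group_lasso(W_single), component_group_lasso(W_spread))
print(modality_exclusive_lasso(W_single), modality_exclusive_lasso(W_spread))
```

In this toy case both tensors incur the same group lasso penalty (5.0), because they activate the same single component, while the exclusive term is lower for the tensor that spreads weight over both modalities (25.0 versus 49.0). That is the qualitative behavior the two postulates ask for.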

