research-article

CovLets: A Second-Order Descriptor for Modeling Multiple Features

Published: 17 April 2020

Abstract

State-of-the-art techniques for image and video classification take a bottom-up approach in which local features are aggregated into a global final representation. Existing frameworks (e.g., bag of words or Fisher vectors) are specifically designed to aggregate vector-valued features such as SIFT descriptors. In this article, we propose a technique to aggregate local descriptors in the form of covariance descriptors (CovDs) into a rich descriptor, which in essence benefits from second-order statistics along the coding pipeline. The difficulty in aggregating CovDs arises from the fact that CovDs lie on the Riemannian manifold of symmetric positive definite (SPD) matrices. Therefore, the aggregation scheme must take advantage of the metrics and geometry of the SPD manifold. In our proposal, we make use of the Stein divergence and the Nyström method to embed the SPD manifold into a Hilbert space. We compare our proposal, dubbed CovLets, against state-of-the-art methods on several image and video classification problems, including facial expression recognition and action recognition.
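
The embedding step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only assumes the standard definition of the Stein divergence, S(X, Y) = log det((X + Y)/2) - (1/2) log det(XY), a kernel of the form exp(-beta * S), and a plain Nyström approximation built from a handful of landmark matrices. The function names (`stein_kernel`, `fit_nystrom`, `random_spd`), the matrix size, the number of landmarks, and the value of `beta` are all hypothetical choices made for illustration; `beta` is set above (d - 1)/2, a range in which this kernel is reported to be positive definite.

```python
import numpy as np


def stein_divergence(X, Y):
    """Stein divergence: log det((X + Y) / 2) - 0.5 * log det(X Y)."""
    _, logdet_mid = np.linalg.slogdet((X + Y) / 2.0)
    _, logdet_x = np.linalg.slogdet(X)
    _, logdet_y = np.linalg.slogdet(Y)
    return logdet_mid - 0.5 * (logdet_x + logdet_y)


def stein_kernel(A, B, beta):
    """Kernel matrix with entries exp(-beta * S(A[i], B[j]))."""
    return np.array(
        [[np.exp(-beta * stein_divergence(X, Y)) for Y in B] for X in A]
    )


def fit_nystrom(landmarks, beta, eps=1e-10):
    """Return a function mapping SPD matrices to Nystrom feature vectors."""
    K = stein_kernel(landmarks, landmarks, beta)
    eigvals, eigvecs = np.linalg.eigh(K)
    keep = eigvals > eps  # drop numerically null directions
    projection = eigvecs[:, keep] / np.sqrt(eigvals[keep])

    def embed(mats):
        # Each row is the Euclidean feature vector of one SPD matrix.
        return stein_kernel(mats, landmarks, beta) @ projection

    return embed


def random_spd(rng, d):
    """Hypothetical stand-in for a CovD: sample covariance plus a small ridge."""
    feats = rng.standard_normal((4 * d, d))
    return np.cov(feats, rowvar=False) + 1e-3 * np.eye(d)


rng = np.random.default_rng(0)
d, beta = 3, 1.5  # assumed values; beta > (d - 1) / 2
landmarks = [random_spd(rng, d) for _ in range(8)]
embed = fit_nystrom(landmarks, beta)
Z = embed(landmarks)  # one row per landmark matrix
```

On the landmarks themselves, inner products of the rows of `Z` reproduce the kernel matrix up to numerical error, so vector-based aggregation schemes can operate on `Z` in place of the original SPD matrices.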

    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 1s
      Special Issue on Multimodal Machine Learning for Human Behavior Analysis and Special Issue on Computational Intelligence for Biomedical Data and Imaging
      January 2020
      376 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3388236

      Copyright © 2020 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 April 2020
      • Accepted: 1 August 2019
      • Revised: 1 July 2019
      • Received: 1 April 2019

      Qualifiers

      • research-article
      • Research
      • Refereed
