Abstract
State-of-the-art techniques for image and video classification take a bottom-up approach in which local features are aggregated into a global final representation. Existing frameworks (e.g., bag of words or Fisher vectors) are specifically designed to aggregate vector-valued features such as SIFT descriptors. In this article, we propose a technique to aggregate local descriptors in the form of covariance descriptors (CovDs) into a rich descriptor, which in essence benefits from second-order statistics along the coding pipeline. The difficulty in aggregating CovDs arises from the fact that CovDs lie on the Riemannian manifold of symmetric positive definite (SPD) matrices. Therefore, the aggregation scheme must take advantage of the metrics and geometry of the SPD manifold. In our proposal, we make use of the Stein divergence and the Nyström method to embed the SPD manifold into a Hilbert space. We compare our proposal, dubbed CovLets, against state-of-the-art methods on several image and video classification problems, including facial expression recognition and action recognition.
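As a rough illustration of the embedding step described above, the following sketch builds covariance descriptors from local feature vectors, compares them with the Stein divergence S(X, Y) = log det((X + Y)/2) - (1/2) log det(XY), and uses a Nyström approximation over a small set of landmark matrices to map each SPD matrix to an ordinary vector. It is not the authors' implementation: the function names, the ridge term, the kernel form exp(-beta * S), the value of beta, and the randomly generated landmarks are all assumptions made for the example.

```python
import numpy as np


def covariance_descriptor(features, ridge=1e-6):
    """Covariance of local features (rows = samples), with a small ridge
    (an assumed regulariser) so the result is strictly positive definite."""
    C = np.cov(features, rowvar=False)
    return C + ridge * np.eye(C.shape[0])


def stein_divergence(X, Y):
    """S(X, Y) = log det((X + Y) / 2) - 0.5 * log det(X Y) for SPD X, Y."""
    _, logdet_mid = np.linalg.slogdet((X + Y) / 2.0)
    _, logdet_x = np.linalg.slogdet(X)
    _, logdet_y = np.linalg.slogdet(Y)
    return logdet_mid - 0.5 * (logdet_x + logdet_y)


def stein_kernel(X, Y, beta):
    """Kernel exp(-beta * S(X, Y)). It is positive definite only for
    suitable beta; the value used below is an assumption for 3x3 inputs."""
    return np.exp(-beta * stein_divergence(X, Y))


def nystrom_map(landmarks, beta, eps=1e-10):
    """Return a function that embeds an SPD matrix into R^r (r <= number
    of landmarks) so that dot products approximate the Stein kernel."""
    K = np.array([[stein_kernel(A, B, beta) for B in landmarks]
                  for A in landmarks])
    eigvals, eigvecs = np.linalg.eigh(K)
    keep = eigvals > eps                      # drop numerically null directions
    proj = eigvecs[:, keep] / np.sqrt(eigvals[keep])

    def embed(X):
        k = np.array([stein_kernel(X, D, beta) for D in landmarks])
        return proj.T @ k

    return embed


# Usage on synthetic data: 5 landmark CovDs built from 3-dimensional features.
rng = np.random.default_rng(0)
landmarks = [
    covariance_descriptor(rng.standard_normal((50, 3)) * rng.uniform(0.5, 2.0, size=3))
    for _ in range(5)
]
embed = nystrom_map(landmarks, beta=2.0)
query = covariance_descriptor(rng.standard_normal((50, 3)))
print(embed(query).shape)
```

Once the matrices are mapped to vectors in this way, standard vector-based aggregation can be applied to them; on the landmark matrices themselves, the dot products of the embedded vectors reproduce the kernel values up to numerical error.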
CovLets: A Second-Order Descriptor for Modeling Multiple Features