Abstract
In this article, we propose a novel method for unsupervised learning of human action categories in still images. In contrast to previous methods, the proposed method mines distinctive action information directly from unlabeled image databases and learns, without supervision, discriminative deep representations that distinguish different actions. Action image collections can therefore be used without manual annotation. Specifically, (i) because discriminative deep representations are difficult to learn without supervision, the method builds a training dataset with surrogate labels from the unlabeled dataset and then learns discriminative representations by alternately updating the convolutional neural network (CNN) parameters and the surrogate training dataset; (ii) to exploit the discriminative information among different action categories, training batches are built from triplet groups and the CNN parameters are updated with a triplet loss function; and (iii) to learn more discriminative deep representations, a Random Forest classifier is used to update the surrogate training dataset, so that more beneficial triplet groups can be built from the updated dataset. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed method.
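To make step (ii) more concrete, the following is a minimal sketch of how triplet groups might be drawn from surrogate labels and scored with a triplet loss. The abstract does not state the loss formula or the sampling strategy, so this sketch assumes the common hinge form, max(0, d(a, p) - d(a, n) + margin), with squared Euclidean distance and uniform random sampling. All function names, the margin value, and the toy embeddings are hypothetical and are not taken from the article.

```python
import random
from typing import Dict, List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """Assumed hinge form: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


def build_triplets(labels: Sequence[int], num_triplets: int,
                   seed: int = 0) -> List[Tuple[int, int, int]]:
    """Sample (anchor, positive, negative) index triplets from surrogate labels.

    Anchor and positive share a surrogate label; the negative has a different one.
    """
    rng = random.Random(seed)
    by_label: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    # Only labels with at least two members can supply an anchor/positive pair.
    anchor_labels = [lab for lab, idx in by_label.items() if len(idx) >= 2]
    if not anchor_labels or len(by_label) < 2:
        return []
    triplets = []
    for _ in range(num_triplets):
        pos_label = rng.choice(anchor_labels)
        neg_label = rng.choice([lab for lab in by_label if lab != pos_label])
        anchor, positive = rng.sample(by_label[pos_label], 2)
        negative = rng.choice(by_label[neg_label])
        triplets.append((anchor, positive, negative))
    return triplets


# Toy example: four 2-D embeddings forming two well-separated surrogate classes.
embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
surrogate_labels = [0, 0, 1, 1]
triplets = build_triplets(surrogate_labels, num_triplets=4)
batch_loss = sum(
    triplet_loss(embeddings[a], embeddings[p], embeddings[n])
    for a, p, n in triplets
) / len(triplets)
```

In this toy case every negative is already farther from its anchor than the positive by more than the margin, so the batch loss is zero. In a real training loop the loss would be back-propagated through the CNN that produces the embeddings, and the surrogate labels would be refreshed between rounds, which the abstract attributes to a Random Forest classifier. That refresh step is not sketched here.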
Index Terms
Unsupervised Learning of Human Action Categories in Still Images with Deep Representations