research-article

Unsupervised Learning of Human Action Categories in Still Images with Deep Representations

Published: 16 December 2019

Abstract

In this article, we propose a novel method for unsupervised learning of human action categories in still images. In contrast to previous methods, the proposed method mines the distinctive information of actions directly from unlabeled image databases and learns discriminative deep representations in an unsupervised manner to distinguish different actions, so action image collections can be used without manual annotations. Specifically, (i) because discriminative deep representations are difficult to learn without supervision, the proposed method builds a training dataset with surrogate labels from the unlabeled dataset, then learns discriminative representations by alternately updating the convolutional neural network (CNN) parameters and the surrogate training dataset; (ii) to exploit the discriminative information among different action categories, the training batches used to update the CNN parameters are built from triplet groups, and the CNN is trained with a triplet loss function; and (iii) to learn more discriminative deep representations, a Random Forest classifier is adopted to update the surrogate training dataset, so that more useful triplet groups can then be built from the updated surrogate training dataset. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed method.
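To make step (ii) more concrete, the following is a minimal sketch of how triplet groups might be drawn from surrogate labels and scored with a triplet loss. The abstract does not state the loss formula or the sampling strategy, so the hinge-style loss on squared Euclidean distances, the `margin` value, and the helper names `squared_distance`, `triplet_loss`, and `build_triplets` are all assumptions made for illustration. The CNN update and the Random Forest relabeling of steps (i) and (iii) are not shown.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form, not taken from the article).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` in squared distance.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


def build_triplets(surrogate_labels, rng):
    """Draw one (anchor, positive, negative) index triplet per sample.

    Positives share the anchor's surrogate label; negatives do not.
    Samples without a valid positive or negative are skipped.
    """
    by_label = {}
    for idx, label in enumerate(surrogate_labels):
        by_label.setdefault(label, []).append(idx)

    triplets = []
    for anchor, label in enumerate(surrogate_labels):
        positives = [i for i in by_label[label] if i != anchor]
        negatives = [i for i, other in enumerate(surrogate_labels) if other != label]
        if positives and negatives:
            triplets.append((anchor, rng.choice(positives), rng.choice(negatives)))
    return triplets


# Toy features standing in for CNN outputs; the labels are hypothetical
# surrogate labels, not ground-truth annotations.
features = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0], [3.0, 4.1]]
surrogate_labels = [0, 0, 1, 1]

triplets = build_triplets(surrogate_labels, random.Random(0))
losses = [triplet_loss(features[a], features[p], features[n]) for a, p, n in triplets]
```

In this toy example the two surrogate groups are already well separated, so every sampled triplet satisfies the margin and contributes no loss. In an iterative scheme like the one the abstract describes, the surrogate labels would be refreshed between rounds, which changes which triplets are drawn.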


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 4
November 2019
322 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3376119

Copyright © 2019 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

  • Published: 16 December 2019
  • Accepted: 1 September 2019
  • Revised: 1 June 2019
  • Received: 1 August 2018
Qualifiers

  • Research article
  • Refereed