Research Article

Am I Done? Predicting Action Progress in Videos

Published: 17 December 2020

Abstract

In this article, we deal with the problem of predicting action progress in videos. We argue that this is an extremely important task, since it can be valuable for a wide range of interaction applications. To this end, we introduce a novel approach, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution. To provide a general definition of action progress, we ground our work in the linguistics literature, borrowing terms and concepts to understand which actions can be the subject of progress estimation. As a result, we define a categorization of actions and their phases. Motivated by the recent success of combining Convolutional and Recurrent Neural Networks, our model couples the Faster R-CNN framework, which makes framewise spatial predictions, with LSTM networks, which estimate action progress through time. After introducing two evaluation protocols for the task at hand, we demonstrate the capability of our model to effectively predict action progress on the UCF-101 and J-HMDB datasets.
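To make the architecture outlined in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the core idea: per-frame region features (such as would come from a Faster R-CNN backbone) feed an LSTM whose hidden states are regressed to a progress value in [0, 1] at every frame. The module name, feature and hidden dimensions, sigmoid output, and the linear 0-to-1 training target are all illustrative assumptions, not the authors' exact design.

```python
# A minimal sketch, assuming per-frame ROI features are already extracted
# by a Faster R-CNN backbone; only the progress-regression head is shown.
import torch
import torch.nn as nn

class ProgressHead(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 256):
        super().__init__()
        # The LSTM accumulates temporal context along the action tube.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # A sigmoid keeps the predicted progress inside [0, 1].
        self.regressor = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, tube_feats: torch.Tensor) -> torch.Tensor:
        # tube_feats: (batch, time, feat_dim) ROI-pooled frame features.
        hidden, _ = self.lstm(tube_feats)
        return self.regressor(hidden).squeeze(-1)  # (batch, time) progress

# Toy usage: 2 tubes of 16 frames each. The assumed target is the fraction
# of the action elapsed at each frame (0 at its start, 1 at its end).
model = ProgressHead()
feats = torch.randn(2, 16, 1024)
target = torch.linspace(0.0, 1.0, 16).expand(2, -1)
loss = nn.functional.mse_loss(model(feats), target)
loss.backward()
```

The linear ramp target and MSE loss are the simplest way to pose progress estimation as regression; the paper's phase-aware definition of progress would replace this target with one derived from its action categorization.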


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 4 (November 2020), 372 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3444749

Copyright © 2020 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 March 2020
• Revised: 1 April 2020
• Accepted: 1 May 2020
• Published: 17 December 2020

