Am I Done? Predicting Action Progress in Videos

Abstract
In this article, we deal with the problem of predicting action progress in videos. We argue that this is an extremely important task, since it can be valuable for a wide range of interaction applications. To this end, we introduce a novel approach, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution. To provide a general definition of action progress, we ground our work in the linguistics literature, borrowing terms and concepts to understand which actions can be the subject of progress estimation. As a result, we define a categorization of actions and their phases. Motivated by the recent success of combining Convolutional and Recurrent Neural Networks, our model is based on a combination of the Faster R-CNN framework, to make frame-wise predictions, and LSTM networks, to estimate action progress through time. After introducing two evaluation protocols for the task at hand, we demonstrate the capability of our model to effectively predict action progress on the UCF-101 and J-HMDB datasets.
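To make the detector-plus-recurrent design concrete, the sketch below shows one way such a model could be wired, assuming PyTorch: per-frame region features (e.g., ROI-pooled features from a Faster R-CNN backbone) feed an LSTM whose hidden states are regressed to a progress value in [0, 1] per frame. The names and dimensions (ProgressNetSketch, feat_dim, hidden_dim) are hypothetical; this is a minimal illustration, not the authors' exact ProgressNet architecture.

```python
# Minimal sketch (not the authors' exact architecture): per-frame region
# features feed an LSTM that regresses action progress in [0, 1].
import torch
import torch.nn as nn

class ProgressNetSketch(nn.Module):
    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 64):
        super().__init__()
        # LSTM accumulates temporal context over per-frame region features
        # (e.g., ROI-pooled features from a Faster R-CNN backbone).
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Linear head regresses a scalar progress value per time step.
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (batch, time, feat_dim), one pooled region per frame.
        h, _ = self.lstm(roi_feats)
        # Sigmoid squashes the output into [0, 1]: 0 = not started, 1 = done.
        return torch.sigmoid(self.head(h)).squeeze(-1)  # (batch, time)

# Example: a 16-frame clip; progress could be supervised with an MSE loss
# against each in-action frame's normalized timestamp t/T.
model = ProgressNetSketch()
clip_feats = torch.randn(2, 16, 1024)  # hypothetical 1024-d region features
progress = model(clip_feats)           # shape (2, 16), values in [0, 1]
```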
References
- J. K. Aggarwal and M. S. Ryoo. 2011. Human activity analysis: A review. Comput. Surveys 43, 3 (2011), 16:1--16:43.
- Mohammad Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Basura Fernando, Lars Petersson, and Lars Andersson. 2017. Encouraging LSTMs to anticipate actions very early. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 280--289.
- Michael A. Arbib. 2006. A sentence is to speech as what is to action? Cortex 42, 4 (2006), 507--514.
- Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. 2017. SST: Single-stream temporal action proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Bernard Comrie. 1976. Aspect: An Introduction to the Study of Verbal Aspect and Related Problems. Vol. 2. Cambridge University Press.
- Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees G. M. Snoek, and Tinne Tuytelaars. 2016. Online action detection. In European Conference on Computer Vision. 269--284.
- Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. 2019. Temporal cycle-consistency learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1801--1810.
- Victor Escorcia, Cuong D. Dao, Mihir Jain, Bernard Ghanem, and Cees Snoek. 2020. Guess where? Actor-supervision for spatiotemporal action localization. Computer Vision and Image Understanding 192 (2020), 102886.
- Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In IEEE Conference on Computer Vision and Pattern Recognition. 961--970.
- Cornelia Fermüller, Fang Wang, Yezhou Yang, Konstantinos Zampogiannis, Yi Zhang, Francisco Barranco, and Michael Pfeiffer. 2016. Prediction of manipulation actions. International Journal of Computer Vision (2016), 1--17.
- J. Randall Flanagan and Roland S. Johansson. 2003. Action plans used in action observation. Nature 424, 6950 (2003), 769.
- Adrien Gaidon, Zaid Harchaoui, and Cordelia Schmid. 2013. Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 11 (2013), 2782--2795.
- Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, and Ram Nevatia. 2017. TURN TAP: Temporal unit regression network for temporal action proposals. In IEEE International Conference on Computer Vision.
- Georgia Gkioxari and Jitendra Malik. 2015. Finding action tubes. In IEEE International Conference on Computer Vision. 759--768.
- Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics Conference. 249--256.
- Tengda Han, Jue Wang, Anoop Cherian, and Stephen Gould. 2017. Human action forecasting by learning task grammars. arXiv preprint arXiv:1709.06391 (2017).
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision. 346--361.
- Farnoosh Heidarivincheh, Majid Mirmehdi, and Dima Damen. 2016. Beyond action recognition: Action completion in RGB-D data. In British Machine Vision Conference.
- Farnoosh Heidarivincheh, Majid Mirmehdi, and Dima Damen. 2018. Action completion: A temporal model for moment detection. In Proceedings of the British Machine Vision Conference (BMVC).
- Farnoosh Heidarivincheh, Majid Mirmehdi, and Dima Damen. 2019. Weakly-supervised completion moment detection using temporal attention. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
- Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. 2016. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In IEEE Conference on Computer Vision and Pattern Recognition. 1914--1923.
- Minh Hoai and Fernando De la Torre. 2014. Max-margin early event detectors. International Journal of Computer Vision 107, 2 (2014), 191--202.
- Rui Hou, Chen Chen, and Mubarak Shah. 2017. Tube convolutional neural network (T-CNN) for action detection in videos. In IEEE International Conference on Computer Vision.
- Haroon Idrees, Amir R. Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. 2017. The THUMOS challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding 155 (2017), 1--23.
- Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. 2013. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision. 3192--3199.
- Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. 2017. Action tubelet detector for spatio-temporal action localization. In IEEE International Conference on Computer Vision.
- Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Yu Kong, Zhiqiang Tao, and Yun Fu. 2017. Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1473--1481.
- Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. 2017. Unsupervised representation learning by sorting sequences. In IEEE International Conference on Computer Vision.
- Xinyu Li, Yanyi Zhang, Jianyu Zhang, Moliang Zhou, Shuhong Chen, Yue Gu, Yueyang Chen, Ivan Marsic, Richard A. Farneth, and Randall S. Burd. 2017. Progress estimation and phase detection for sequential processes. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 1, 3, Article 73 (Sept. 2017), 20 pages. DOI: https://doi.org/10.1145/3130936
- Junwei Liang, Lu Jiang, Juan Carlos Niebles, Alexander G. Hauptmann, and Li Fei-Fei. 2019. Peeking into the future: Predicting future person activities and locations in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5725--5734.
- Daochang Liu, Tingting Jiang, and Yizhou Wang. 2019. Completeness modeling and context separation for weakly supervised temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1298--1307.
- Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21--37.
- Ziwei Liu, Raymond A. Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. 2017. Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision. 4463--4471.
- Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. 2019. Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 344--353.
- Shugao Ma, Leonid Sigal, and Stan Sclaroff. 2016. Learning activity progression in LSTMs for activity detection and early detection. In IEEE Conference on Computer Vision and Pattern Recognition. 1942--1950.
- Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579--2605.
- Michael Mathieu, Camille Couprie, and Yann LeCun. 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 (2015).
- Davide Moltisanti, Michael Wray, Walterio Mayol-Cuevas, and Dima Damen. 2017. Trespassing the boundaries: Labeling temporal bounds for object interactions in egocentric video. In Proceedings of the IEEE International Conference on Computer Vision. 2886--2894.
- A. Neumann and Andrew Zisserman. 2019. Future event prediction: If and when. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
- Phuc Xuan Nguyen, Deva Ramanan, and Charless C. Fowlkes. 2019. Weakly-supervised action localization with background modeling. In Proceedings of the IEEE International Conference on Computer Vision. 5502--5511.
- Katerina Pastra and Yiannis Aloimonos. 2012. The minimalist grammar of action. Philosophical Transactions of the Royal Society B: Biological Sciences 367, 1585 (2012), 103--117.
- A. Patra and J. A. Noble. 2018. Sequential anatomy localization in fetal echocardiography videos. arXiv preprint arXiv:1810.11868 (2018).
- Xiaojiang Peng and Cordelia Schmid. 2016. Multi-region two-stream R-CNN for action detection. In European Conference on Computer Vision. 744--759.
- Ronald Poppe. 2010. A survey on vision-based human action recognition. Image and Vision Computing 28, 6 (2010), 976--990.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems. 91--99.
- Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H. S. Torr, and Fabio Cuzzolin. 2016. Deep learning for detecting multiple space-time action tubes in videos. In British Machine Vision Conference.
- Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. 2017. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- G. A. Sigurdsson, O. Russakovsky, and A. Gupta. 2017. What actions are needed for understanding human actions in videos? In IEEE International Conference on Computer Vision. 2137--2146.
- Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, and Fabio Cuzzolin. 2017. Online real-time multiple spatiotemporal action localisation and prediction. In IEEE International Conference on Computer Vision.
- Khurram Soomro, Amir R. Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
- Bilge Soran, Ali Farhadi, and Linda Shapiro. 2015. Generating notifications for missing actions: Don’t forget to turn the lights off! In IEEE International Conference on Computer Vision. 4669--4677.
- James Steele, Pier Francesco Ferrari, and Leonardo Fogassi. 2012. From action to language: Comparative perspectives on primate tool use, gesture and the evolution of human language. Philosophical Transactions of the Royal Society B: Biological Sciences 367, 1585 (2012), 4.
- Andru Putra Twinanda, Gaurav Yengera, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. 2018. RSDNet: Learning to predict remaining surgery duration from laparoscopic videos without manual annotations. IEEE Transactions on Medical Imaging 38, 4 (2018), 1069--1078.
- Zeno Vendler. 1957. Verbs and times. The Philosophical Review 66, 2 (1957), 143--160.
- Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Anticipating visual representations with unlabeled video. In IEEE Conference on Computer Vision and Pattern Recognition. 98--106.
- Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems. 613--621.
- Xiaolong Wang, Ali Farhadi, and Abhinav Gupta. 2016. Actions ∼ transformations. In IEEE Conference on Computer Vision and Pattern Recognition. 2658--2667.
- Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. 2015. Learning to track for spatio-temporal action localization. In IEEE International Conference on Computer Vision. 3164--3172.
- Yuanjun Xiong, Yue Zhao, Limin Wang, Dahua Lin, and Xiaoou Tang. 2017. A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716 (2017).
- Mingze Xu, Mingfei Gao, Yi-Ting Chen, Larry S. Davis, and David J. Crandall. 2019. Temporal recurrent networks for online action detection. In Proceedings of the IEEE International Conference on Computer Vision. 5532--5541.
- Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. 2016. End-to-end learning of action detection from frame glimpses in videos. In IEEE Conference on Computer Vision and Pattern Recognition. 2678--2687.
- Gang Yu and Junsong Yuan. 2015. Fast action proposals for human action detection and search. In IEEE Conference on Computer Vision and Pattern Recognition. 1302--1311.
- Zehuan Yuan, Jonathan C. Stroud, Tong Lu, and Jia Deng. 2017. Temporal action localization by structured maximal sums. In IEEE Conference on Computer Vision and Pattern Recognition. 3684--3692.
- Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. 2017. Temporal action detection with structured segment networks. In IEEE International Conference on Computer Vision.
- Hongyuan Zhu, Romain Vial, and Shijian Lu. 2017. TORNADO: A spatio-temporal convolutional regression network for video action proposal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5813--5821.