Abstract
Temporal action detection requires not only correct classification but also accurate detection of the start and end times of each action. However, traditional approaches typically rely on sliding windows or actionness scores to predict actions, and models built on sliding windows or actionness are difficult to train end-to-end. In this article, we take a different approach that detects actions end-to-end and computes action probabilities directly through a single network as part of its output. We present PCAD, a novel proposal complementary action detector for continuous, untrimmed video streams. Our approach first uses a simple fully 3D convolutional network to encode the video streams and then generates candidate temporal proposals for activities by using anchor segments. To generate more precise proposals, we also design a boundary proposal network that provides complementary information for the candidate proposals. Finally, we learn an efficient classifier that classifies the generated proposals into different activities and refines their temporal boundaries at the same time. The model is trained end-to-end by jointly optimizing classification loss and regression loss. Evaluated on the THUMOS’14 detection benchmark, PCAD achieves state-of-the-art performance among high-speed models.
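The anchor-segment and joint-loss ideas in the abstract can be illustrated with a minimal sketch. It is not PCAD's implementation: the abstract gives no anchor scales, stride, regression parameterization, or loss weighting, so the function names, the (center offset, log-length ratio) targets, the smooth L1 regression term, and the `reg_weight` parameter below are all assumptions borrowed from common anchor-based detectors.

```python
import math

def make_anchors(num_locations, stride, scales):
    """Build (start, end) anchor segments, in frames, centered on each
    temporal location of the feature map. Stride and scales are assumed."""
    anchors = []
    for i in range(num_locations):
        center = (i + 0.5) * stride
        for s in scales:
            length = s * stride
            anchors.append((center - length / 2, center + length / 2))
    return anchors

def temporal_iou(a, b):
    """Intersection over union of two 1D segments (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def regression_target(anchor, gt):
    """Assumed parameterization: center offset normalized by anchor
    length, and log ratio of ground-truth length to anchor length."""
    a_center, a_len = (anchor[0] + anchor[1]) / 2, anchor[1] - anchor[0]
    g_center, g_len = (gt[0] + gt[1]) / 2, gt[1] - gt[0]
    return ((g_center - a_center) / a_len, math.log(g_len / a_len))

def smooth_l1(x):
    """Smooth L1 penalty on a single residual."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def joint_loss(probs, label, pred_offsets, target_offsets, reg_weight=1.0):
    """Classification loss plus weighted regression loss for one proposal.
    Label 0 is treated as background and contributes no regression term."""
    cls = -math.log(max(probs[label], 1e-12))
    reg = 0.0
    if label > 0:
        reg = sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
    return cls + reg_weight * reg

if __name__ == "__main__":
    anchors = make_anchors(num_locations=4, stride=8, scales=[1, 2])
    gt = (4.0, 12.0)  # hypothetical ground-truth action segment, in frames
    best = max(anchors, key=lambda a: temporal_iou(a, gt))
    print(best, temporal_iou(best, gt), regression_target(best, gt))
```

Summing both terms into one scalar is what lets a single backward pass update the classifier and the boundary regressor together, which is the sense in which such a detector is trained end-to-end.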
Index Terms
Proposal Complementary Action Detection