research-article

Proposal Complementary Action Detection

Published: 21 June 2020

Abstract

Temporal action detection requires not only correct classification but also accurate detection of the start and end times of each action. However, traditional approaches typically rely on sliding windows or actionness scores to predict actions, and it is difficult to train such models end-to-end. In this article, we take a different approach and detect actions end-to-end, computing action probabilities directly within a single network as part of its output. We present PCAD, a novel proposal complementary action detector for continuous, untrimmed video streams. Our approach first uses a simple fully 3D convolutional network to encode the video stream and then generates candidate temporal proposals for activities using anchor segments. To generate more precise proposals, we also design a boundary proposal network that supplies complementary information for the candidate proposals. Finally, we learn an efficient classifier that assigns the generated proposals to activity classes and refines their temporal boundaries at the same time. The model is trained end-to-end by jointly optimizing a classification loss and a regression loss. On the THUMOS’14 detection benchmark, PCAD achieves state-of-the-art performance among high-speed models.
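The anchor-segment step mentioned in the abstract can be pictured with a small sketch. It is only an illustration of the general anchor idea, not the PCAD implementation: the function names, the anchor scales, and the 0.5 overlap threshold are all assumptions chosen for the example, and the abstract does not specify any of them.

```python
from typing import List, Tuple

# A temporal segment as (start, end), measured in feature-map time steps.
Segment = Tuple[float, float]


def make_anchor_segments(num_steps: int, scales: List[int]) -> List[Segment]:
    """Centre one anchor of every scale on each temporal position.

    `scales` is a hypothetical list of anchor lengths; the paper's actual
    values are not given in the abstract.
    """
    anchors: List[Segment] = []
    for t in range(num_steps):
        center = t + 0.5
        for s in scales:
            anchors.append((center - s / 2.0, center + s / 2.0))
    return anchors


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1D temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(
    anchors: List[Segment],
    ground_truth: List[Segment],
    pos_threshold: float = 0.5,  # assumed threshold, for illustration only
) -> List[int]:
    """Mark an anchor positive (1) if it overlaps any ground-truth action enough."""
    labels = []
    for a in anchors:
        best = max((temporal_iou(a, g) for g in ground_truth), default=0.0)
        labels.append(1 if best >= pos_threshold else 0)
    return labels
```

In a detector of this kind, anchors labelled positive would typically feed the classification loss, and their offsets to the matched ground-truth segment would feed the regression loss.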


    • Published in

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 2s
      Special Issue on Smart Communications and Networking for Future Video Surveillance and Special Section on Extended MMSYS-NOSSDAV 2019 Best Papers
      April 2020
      291 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3407689

      Copyright © 2020 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Accepted: 1 September 2020
      • Published: 21 June 2020
      • Online AM: 7 May 2020
      • Revised: 1 August 2019
      • Received: 1 June 2019

      Qualifiers

      • research-article
      • Research
      • Refereed
