Spatio-Temporal Deep Residual Network with Hierarchical Attentions for Video Event Recognition

Published: 21 June 2020

Abstract

Event recognition in surveillance video has gained extensive attention from the computer vision community. The task remains highly challenging due to small inter-class variations caused by factors such as severe occlusion and cluttered backgrounds. To address these issues, we propose a spatio-temporal deep residual network with hierarchical attentions (STDRN-HA) for video event recognition. In the first attention layer, the ResNet fully connected feature guides the Faster R-CNN feature to generate object-based attention (O-attention) on target objects. In the second attention layer, the O-attention further guides the ResNet convolutional feature to yield holistic attention (H-attention), which captures more details of occluded objects and the global background. In the third attention layer, the attention maps are applied to the deep features to produce attention-enhanced features, which are then fed into a deep residual recurrent network to mine further event clues from the videos. Furthermore, we design an optimized loss function, named softmax-RC, that embeds residual block regularization and a center loss to alleviate vanishing gradients in the deep network and to enlarge the inter-class distance. We also build a temporal branch to exploit long- and short-term motion information; the final results are obtained by fusing the outputs of the spatial and temporal streams. Experiments on four realistic video datasets, CCV, VIRAT 1.0, VIRAT 2.0, and HMDB51, demonstrate that the proposed method performs well and achieves state-of-the-art results.
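As a rough illustration of the optimized loss, the PyTorch sketch below combines two of the three ingredients the abstract names for softmax-RC: softmax cross-entropy and the center loss of Wen et al. (ECCV 2016). The residual block regularization term is omitted because its exact form is not given in the abstract, and the class name `SoftmaxCenterLoss` and weighting factor `lam` are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxCenterLoss(nn.Module):
    """Minimal sketch: softmax cross-entropy + center loss.

    Shows only two of the three softmax-RC ingredients; the residual
    block regularization term is omitted (its form is not specified
    in the abstract).
    """

    def __init__(self, num_classes: int, feat_dim: int, lam: float = 0.1):
        super().__init__()
        self.lam = lam  # assumed weight on the center-loss term
        # one learnable center per event class
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, logits, labels):
        # standard softmax classification loss on the logits
        ce = F.cross_entropy(logits, labels)
        # pull each sample's feature toward its own class center,
        # which enlarges the inter-class distance in feature space
        diff = feats - self.centers[labels]
        center = 0.5 * diff.pow(2).sum(dim=1).mean()
        return ce + self.lam * center
```

In use, `feats` would be the attention-enhanced features entering the classifier and `logits` its output; the class centers are trained jointly with the network, often with a smaller learning rate than the backbone. For the final fusion step, a common choice (not specified in the abstract) is late fusion of the per-stream softmax scores, e.g. averaging the spatial and temporal probabilities before taking the arg max.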



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 16, Issue 2s
  Special Issue on Smart Communications and Networking for Future Video Surveillance and Special Section on Extended MMSYS-NOSSDAV 2019 Best Papers
  April 2020
  291 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3407689

  Copyright © 2020 ACM

  Publisher

  Association for Computing Machinery, New York, NY, United States

  Publication History

  • Published: 21 June 2020
  • Online AM: 7 May 2020
  • Accepted: 1 January 2020
  • Revised: 1 November 2019
  • Received: 1 May 2019

      Qualifiers

      • research-article
      • Research
      • Refereed
