Abstract
Event recognition in surveillance video has drawn extensive attention from the computer vision community. It remains highly challenging because of small inter-class variations caused by factors such as severe occlusion and cluttered backgrounds. To address these issues, we propose a spatio-temporal deep residual network with hierarchical attentions (STDRN-HA) for video event recognition. In the first attention layer, the ResNet fully connected feature guides the Faster R-CNN feature to generate object-based attention (O-attention) for target objects. In the second attention layer, the O-attention further guides the ResNet convolutional feature to yield holistic attention (H-attention), which perceives more details of occluded objects and of the global background. In the third attention layer, the attention maps are applied to the deep features to obtain attention-enhanced features, which are then fed into a deep residual recurrent network that mines additional event clues from the video. Furthermore, an optimized loss function named softmax-RC is designed, which embeds residual-block regularization and a center loss to alleviate vanishing gradients in deep networks and to enlarge inter-class distances. We also build a temporal branch to exploit long- and short-term motion information; the final prediction is obtained by fusing the outputs of the spatial and temporal streams. Experiments on four realistic video datasets, CCV, VIRAT 1.0, VIRAT 2.0, and HMDB51, demonstrate that the proposed method performs well and achieves state-of-the-art results.
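To make the loss design concrete, the sketch below shows how a softmax cross-entropy term can be combined with a center loss that pulls same-class features toward learned class centers, in the spirit of the softmax-RC objective described above (the abstract does not give the exact formulation, so the weighting term `lam` and the plain sum of the two terms are assumptions for illustration; the residual-block regularization term is omitted here).

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Numerically stable log-softmax followed by negative log-likelihood.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def center_loss(features, labels, centers):
    # Penalize the squared distance between each feature vector and its
    # class center, tightening intra-class clusters so that inter-class
    # distances grow relative to them (cf. Wen et al., ECCV 2016).
    diffs = features - centers[labels]
    return 0.5 * (diffs ** 2).sum(axis=1).mean()

def softmax_rc_sketch(logits, features, labels, centers, lam=0.1):
    # Illustrative combined objective: classification term plus a
    # center-loss term weighted by a hypothetical balance factor lam.
    return softmax_cross_entropy(logits, labels) + lam * center_loss(features, labels, centers)
```

In practice the class centers would be updated during training (e.g., by a running mean of the features assigned to each class); here they are passed in as fixed arrays to keep the sketch self-contained.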
Spatio-Temporal Deep Residual Network with Hierarchical Attentions for Video Event Recognition