
Exploiting Attention-Consistency Loss For Spatial-Temporal Stream Action Recognition

Published: 06 October 2022

Abstract

Many current action recognition methods consider information mostly from the spatial stream. Inspired by the human visual system, we propose a new perspective that combines the spatial and temporal streams and measures their attention consistency. Specifically, we develop a branch-independent convolutional neural network (CNN) based algorithm with a novel attention-consistency loss, which encourages the temporal stream to concentrate on the same discriminative regions as the spatial stream over the same period. The consistency loss is combined with the cross-entropy loss to enhance visual attention consistency. We evaluate the proposed method on two benchmark action recognition datasets: Kinetics400 and UCF101. Despite its apparent simplicity, our framework with attention consistency outperforms most two-stream networks, achieving 75.7% top-1 accuracy on Kinetics400 and 95.7% on UCF101, while reducing computational cost by 7.1% compared with our baseline. In particular, our method attains remarkable improvements on complex action classes, showing that the proposed network can serve as a potential benchmark for handling complicated scenarios in Industry 4.0 applications.
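The abstract does not give the exact formulation of the attention-consistency loss, so the following is only a minimal PyTorch sketch of the idea it describes: class activation maps (CAMs) are computed for the spatial and temporal streams, the temporal map is penalized for deviating from the spatial one, and this penalty is added to the cross-entropy loss. The function name, the CAM construction, and the mean-squared-error consistency measure are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def attention_consistency_loss(spatial_feat, temporal_feat,
                               spatial_fc_w, temporal_fc_w,
                               logits, labels, lam=1.0):
    """Hypothetical combined loss: cross-entropy plus a consistency
    penalty that pulls the temporal stream's class activation map
    (CAM) toward the spatial stream's CAM for the true class.

    spatial_feat, temporal_feat: (B, C, H, W) last-conv features.
    spatial_fc_w, temporal_fc_w: (num_classes, C) classifier weights,
        used to build CAMs as in Zhou et al., CVPR 2016.
    logits: (B, num_classes) predictions; labels: (B,) targets.
    """
    b = spatial_feat.size(0)
    # Per-sample CAM for the ground-truth class: channel-weighted sum
    # of the feature maps, using that class's classifier weights.
    cam_s = torch.einsum('bc,bchw->bhw', spatial_fc_w[labels], spatial_feat)
    cam_t = torch.einsum('bc,bchw->bhw', temporal_fc_w[labels], temporal_feat)

    def normalize(m):
        # Rescale each map to [0, 1] so the penalty compares where
        # attention falls, not its magnitude.
        m = m.reshape(b, -1)
        m = m - m.min(dim=1, keepdim=True).values
        return m / (m.max(dim=1, keepdim=True).values + 1e-6)

    # Detach the spatial CAM so only the temporal stream is steered
    # toward the spatial stream's discriminative regions.
    consistency = F.mse_loss(normalize(cam_t), normalize(cam_s).detach())
    return F.cross_entropy(logits, labels) + lam * consistency
```

Here `lam` weighs the consistency term against cross-entropy; the sketch also assumes the two streams share spatial resolution, which in practice may require resizing one stream's maps.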


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2s (June 2022), 383 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3561949
Editor: Abdulmotaleb El Saddik


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 October 2022
      • Online AM: 28 May 2022
      • Accepted: 6 April 2022
      • Revised: 16 January 2022
      • Received: 31 October 2021
