research-article

MV2Flow: Learning Motion Representation for Fast Compressed Video Action Recognition

Published: 31 December 2020

Abstract

In video action recognition, motion is a crucial cue, and it is usually represented by optical flow. However, optical flow is computationally expensive to obtain, which makes it the efficiency bottleneck of traditional action recognition pipelines. In this article, we propose a network called MV2Flow that efficiently learns motion representations from signals readily available in the compressed domain. The network is trained with three losses. First, we use classical TV-L1 flow as a proxy ground truth to guide the learning. Second, an unsupervised image reconstruction loss further refines the estimate. Third, oriented toward the action recognition task, the two losses above are combined with a motion content loss. To evaluate our approach, we conduct extensive experiments on two benchmark datasets, UCF-101 and HMDB-51. The motion representation generated by MV2Flow achieves classification accuracy comparable to TV-L1 flow while running over 200× faster. Building on MV2Flow with a 2D-CNN-based network, we achieve state-of-the-art performance in the compressed domain; with a 3D-CNN-based network, we also match the accuracy of decoded-domain methods at higher inference speed.
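As a rough sketch of the three-term objective described above (not the paper's actual formulation: the loss weights `w1`–`w3`, the function names, and the placeholder feature vectors standing in for recognition features are all illustrative assumptions), the combined training loss might look like this:

```python
import numpy as np

def epe_loss(pred_flow, proxy_flow):
    """Endpoint error against the TV-L1 proxy ground-truth flow."""
    return float(np.mean(np.sqrt(np.sum((pred_flow - proxy_flow) ** 2, axis=-1))))

def reconstruction_loss(frame_t, warped_frame_t1):
    """Unsupervised photometric loss: the next frame, warped back by the
    predicted flow, should match the current frame."""
    return float(np.mean(np.abs(frame_t - warped_frame_t1)))

def motion_content_loss(pred_feat, proxy_feat):
    """Match recognition features of the predicted flow to those of the proxy."""
    return float(np.mean((pred_feat - proxy_feat) ** 2))

def total_loss(pred_flow, proxy_flow, frame_t, warped_frame_t1,
               pred_feat, proxy_feat, w1=1.0, w2=1.0, w3=1.0):
    # Weighted sum of the proxy-supervision, reconstruction, and
    # motion content terms; the weights are illustrative.
    return (w1 * epe_loss(pred_flow, proxy_flow)
            + w2 * reconstruction_loss(frame_t, warped_frame_t1)
            + w3 * motion_content_loss(pred_feat, proxy_feat))

# Toy check: when the prediction matches the proxy flow, the warp is
# perfect, and the features agree, every term (and the total) is zero.
flow = np.ones((8, 8, 2))
frame = np.random.rand(8, 8, 3)
feat = np.random.rand(64)
print(total_loss(flow, flow, frame, frame, feat, feat))  # 0.0
```

In practice each term would be computed on network outputs and backpropagated jointly; the sketch only shows how the three signals combine into a single scalar objective.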


Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 3s
Special Issue on Privacy and Security in Evolving Internet of Multimedia Things and Regular Papers
October 2020, 190 pages
ISSN: 1551-6857, EISSN: 1551-6865
DOI: 10.1145/3444536
Copyright © 2020 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 October 2019
• Revised: 1 March 2020
• Accepted: 1 July 2020
• Published: 31 December 2020
