Abstract
In video action recognition, motion is a crucial cue, and it is usually represented by optical flow. However, optical flow is computationally expensive to obtain, which makes it the efficiency bottleneck of traditional action recognition algorithms. In this article, we propose a network called MV2Flow that efficiently learns motion representation from signals in the compressed domain. To train the network, three losses are defined. First, we use classical TV-L1 flow as proxy ground truth to guide the learning. Second, an unsupervised image reconstruction loss is proposed to further refine the output. Third, toward the task of action recognition, these two losses are combined with a motion content loss. To evaluate our approach, we conduct extensive experiments on two benchmark datasets, UCF-101 and HMDB-51. The motion representation generated by MV2Flow achieves classification accuracy on action recognition comparable to TV-L1 flow while running more than 200× faster. Combined with a 2D-CNN-based network, MV2Flow achieves state-of-the-art performance in the compressed domain. With a 3D-CNN-based network, it also reaches accuracy comparable to decoded-domain methods at higher inference speed.
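The three training signals described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the nearest-neighbor warping, the stand-in for recognition-network features, and the equal loss weights are all assumptions made for the sketch.

```python
import numpy as np

def proxy_loss(pred_flow, tvl1_flow):
    """Supervised term: average endpoint error against TV-L1 'proxy ground truth'."""
    return float(np.mean(np.sqrt(np.sum((pred_flow - tvl1_flow) ** 2, axis=-1))))

def warp_nearest(img, flow):
    """Backward-warp img (H, W) by flow (H, W, 2) with nearest-neighbor sampling."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[y2, x2]

def reconstruction_loss(frame_t, frame_t1, pred_flow):
    """Unsupervised photometric term: warp frame t+1 back by the predicted
    flow and compare with frame t."""
    return float(np.mean(np.abs(warp_nearest(frame_t1, pred_flow) - frame_t)))

def motion_content_loss(feat_pred, feat_tvl1):
    """Task-oriented term: distance between recognition-network features of the
    predicted flow and of the TV-L1 flow (features assumed precomputed)."""
    return float(np.mean((feat_pred - feat_tvl1) ** 2))

def total_loss(pred_flow, tvl1_flow, frame_t, frame_t1, feat_pred, feat_tvl1,
               weights=(1.0, 1.0, 1.0)):  # hypothetical equal weighting
    a, b, c = weights
    return (a * proxy_loss(pred_flow, tvl1_flow)
            + b * reconstruction_loss(frame_t, frame_t1, pred_flow)
            + c * motion_content_loss(feat_pred, feat_tvl1))
```

In practice the three terms would be balanced by tuned weights and computed with differentiable bilinear warping inside the training loop; the sketch above only shows how the supervised, unsupervised, and task-oriented signals combine into one objective.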
References
- Aria Ahmadi and Ioannis Patras. 2016. Unsupervised convolutional neural networks for motion estimation. In Proceedings of the IEEE International Conference on Image Processing. 1629--1633.
- Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. 2004. High accuracy optical flow estimation based on a theory for warping. In Proceedings of the European Conference on Computer Vision. 25--36.
- Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299--6308.
- Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. 2018. Multi-fiber networks for video recognition. In Proceedings of the European Conference on Computer Vision. 352--367.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 248--255.
- Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. 2015. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 2758--2766.
- Lijie Fan, Wenbing Huang, Chuang Gan, Stefano Ermon, Boqing Gong, and Junzhou Huang. 2018. End-to-end learning of motion representation for video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6016--6025.
- Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision. 6202--6211.
- Christoph Feichtenhofer, Axel Pinz, and Richard P. Wildes. 2017. Spatiotemporal multiplier networks for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4768--4777.
- Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. 2016. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1933--1941.
- David Gadot and Lior Wolf. 2016. PatchBatch: A batch augmented loss for optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4236--4245.
- Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alex G. Hauptmann. 2015. DevNet: A deep event network for multimedia event detection and evidence recounting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2568--2577.
- Ruohan Gao, Bo Xiong, and Kristen Grauman. 2018. Im2Flow: Motion hallucination from static images for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5937--5947.
- Fatma Güney and Andreas Geiger. 2016. Deep discrete flow. In Proceedings of the Asian Conference on Computer Vision. 207--224.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
- Samitha Herath, Mehrtash Harandi, and Fatih Porikli. 2017. Going deeper into action recognition: A survey. Image Vision Comput. 60 (2017), 4--21.
- Berthold K. P. Horn and Brian G. Schunck. 1981. Determining optical flow. Artificial Intelligence 17, 1--3 (1981), 185--203.
- Wenbing Huang, Lijie Fan, Mehrtash Harandi, Lin Ma, Huaping Liu, Wei Liu, and Chuang Gan. 2019. Toward efficient action recognition: Principal backpropagation for training two-stream networks. IEEE Trans. Image Process. 28, 4 (2019), 1773--1782.
- Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. 2015. Spatial transformer networks. In Proceedings of the Advances in Neural Information Processing Systems. 2017--2025.
- Jason J. Yu, Adam W. Harley, and Konstantinos G. Derpanis. 2016. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In Proceedings of the European Conference on Computer Vision. 3--10.
- H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. 2011. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision. 2556--2563.
- Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2013. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1 (2013), 221--231.
- Vadim Kantorov and Ivan Laptev. 2014. Efficient feature extraction, encoding and classification for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2593--2600.
- Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1725--1732.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 1097--1105.
- Wei-Sheng Lai, Jia-Bin Huang, and Ming-Hsuan Yang. 2017. Semi-supervised learning for optical flow with generative adversarial networks. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 354--364.
- Didier Le Gall. 1991. MPEG: A video compression standard for multimedia applications. Commun. ACM 34, 4 (1991), 46--58.
- Jianing Li, Shiliang Zhang, and Tiejun Huang. 2019. Multi-scale 3D convolution network for video-based person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence. 8618--8625.
- Kun Liu, Wu Liu, Chuang Gan, Mingkui Tan, and Huadong Ma. 2018. T-C3D: Temporal convolutional 3D network for real-time action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Xiang Long, Chuang Gan, Gerard De Melo, Jiajun Wu, Xiao Liu, and Shilei Wen. 2018. Attention clusters: Purely attention-based local feature integration for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7834--7843.
- Simon Meister, Junhwa Hur, and Stefan Roth. 2018. UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In Proceedings of the AAAI Conference on Artificial Intelligence. 7251--7259.
- Etienne Mémin and Patrick Pérez. 1998. Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Trans. Image Process. 7, 5 (1998), 703--719.
- AJ Piergiovanni and Michael S. Ryoo. 2019. Representation flow for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9945--9953.
- Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision. 5533--5541.
- Anurag Ranjan and Michael J. Black. 2017. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4161--4170.
- Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. 2017. Unsupervised deep learning for optical flow estimation. In Proceedings of the AAAI Conference on Artificial Intelligence. 1495--1501.
- Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. 234--241.
- Laura Sevilla-Lara, Yiyi Liao, Fatma Güney, Varun Jampani, Andreas Geiger, and Michael J. Black. 2018. On the integration of optical flow and action recognition. In Proceedings of the German Conference on Pattern Recognition. 281--297.
- Zheng Shou, Zhicheng Yan, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Xudong Lin, and Shih-Fu Chang. 2019. DMC-Net: Generating discriminative motion cues for fast compressed video action recognition. arXiv:1901.03460.
- Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 568--576.
- Xiaolin Song, Cuiling Lan, Wenjun Zeng, Junliang Xing, Xiaoyan Sun, and Jingyu Yang. 2020. Temporal-spatial mapping for action recognition. IEEE Trans. Circ. Syst. Video Technol. 30, 3 (2020), 748--759.
- Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. Technical Report CRCV-TR-12-01 (2012).
- Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. 2018. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8934--8943.
- Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489--4497.
- Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. 2017. ConvNet architecture search for spatiotemporal feature learning. arXiv:1708.05038.
- Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6450--6459.
- Zhigang Tu, Wei Xie, Justin Dauwels, Baoxin Li, and Junsong Yuan. 2019. Semantic cues enhanced multimodality multistream CNN for action recognition. IEEE Trans. Circ. Syst. Video Technol. 29, 5 (2019), 1423--1437.
- Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision. 3551--3558.
- Limin Wang, Wei Li, Wen Li, and Luc Van Gool. 2018. Appearance-and-relation networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1430--1439.
- Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision. 20--36.
- Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, and Cordelia Schmid. 2013. DeepFlow: Large displacement optical flow with deep matching. In Proceedings of the IEEE International Conference on Computer Vision. 1385--1392.
- Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, and Philipp Krähenbühl. 2018. Compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6026--6035.
- Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. 2018. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision. 305--321.
- Chuohao Yeo, Parvez Ahammad, Kannan Ramchandran, and S. Shankar Sastry. 2008. High-speed action recognition and localization in compressed domain videos. IEEE Trans. Circ. Syst. Video Technol. 18, 8 (2008), 1006--1015.
- Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. 2015. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4694--4702.
- Christopher Zach, Thomas Pock, and Horst Bischof. 2007. A duality-based approach for realtime TV-L1 optical flow. In Pattern Recognition. 214--223.
- Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao, and Hanli Wang. 2016. Real-time action recognition with enhanced motion vector CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2718--2726.
- Wangjiang Zhu, Jie Hu, Gang Sun, Xudong Cao, and Yu Qiao. 2016. A key volume mining deep framework for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1991--1999.
MV2Flow: Learning Motion Representation for Fast Compressed Video Action Recognition