DEEPEYE: A Deeply Tensor-Compressed Neural Network for Video Comprehension on Terminal Devices

Published: 18 May 2020

Abstract

Video object detection and action recognition typically require deep neural networks (DNNs) with a huge number of parameters. It is therefore challenging to deploy a DNN video comprehension unit on resource-constrained terminal devices. In this article, we introduce a deeply tensor-compressed video comprehension neural network, called DEEPEYE, for inference on terminal devices. Instead of building a Long Short-Term Memory (LSTM) network directly from high-dimensional raw video input, we construct an LSTM-based spatio-temporal model from structured, tensorized time-series features for object detection and action recognition. Deep compression is achieved by tensor decomposition and trained quantization of the time-series-feature-based LSTM network. We have implemented DEEPEYE on an ARM-core-based IoT board, achieving 31 FPS while consuming only 2.4 W. Using the video datasets MOMENTS, UCF11, and HMDB51 as benchmarks, DEEPEYE achieves 228.1× model compression with only a 0.47% mAP reduction, as well as a 15k× parameter reduction with up to 8.01% accuracy improvement over competing approaches.
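The compression figures above come largely from replacing dense LSTM weight matrices with low-rank tensor factors. The following is a minimal sketch of the parameter-count arithmetic behind a tensor-train (TT) style factorization of one dense weight matrix; the mode sizes and TT-ranks here are illustrative assumptions, not the paper's actual configuration.

```python
from math import prod

def tt_parameter_count(in_modes, out_modes, ranks):
    """Total parameters in a tensor-train factorization whose k-th core
    has shape (r_{k-1}, m_k, n_k, r_k), where m_k/n_k are the input/output
    mode sizes and ranks = [r_0, r_1, ..., r_d] with r_0 = r_d = 1."""
    assert len(in_modes) == len(out_modes) == len(ranks) - 1
    return sum(ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
               for k in range(len(in_modes)))

# Illustrative example: reshape a 4096 x 4096 LSTM gate weight into
# input modes (8, 8, 8, 8) and output modes (8, 8, 8, 8), with TT-ranks 4.
in_modes, out_modes = [8, 8, 8, 8], [8, 8, 8, 8]
ranks = [1, 4, 4, 4, 1]

dense_params = prod(in_modes) * prod(out_modes)  # 4096 * 4096 = 16,777,216
tt_params = tt_parameter_count(in_modes, out_modes, ranks)  # 2,560
print(f"dense: {dense_params}, TT: {tt_params}, "
      f"ratio: {dense_params / tt_params:.1f}x")
```

With these assumed modes and ranks, the TT form stores only the four small cores (2,560 parameters versus roughly 16.8 million), which is the mechanism behind the orders-of-magnitude parameter reductions reported; the actual compression ratio depends on the chosen mode decomposition and TT-ranks.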

