Abstract
Video object detection and action recognition typically require deep neural networks (DNNs) with a huge number of parameters. It is therefore challenging to deploy a DNN video comprehension unit on resource-constrained terminal devices. In this article, we introduce a deeply tensor-compressed video comprehension neural network, called DEEPEYE, for inference on terminal devices. Instead of building a Long Short-Term Memory (LSTM) network directly from high-dimensional raw video data, we construct an LSTM-based spatio-temporal model from structured, tensorized time-series features for object detection and action recognition. Deep compression is achieved by applying tensor decomposition and trained quantization to the time-series feature-based LSTM network. We have implemented DEEPEYE on an ARM-core-based IoT board, where it runs at 31 FPS while consuming only 2.4 W of power. Using the video datasets MOMENTS, UCF11, and HMDB51 as benchmarks, DEEPEYE achieves a 228.1× model compression with only a 0.47% mAP reduction, and a 15k× parameter reduction with up to 8.01% accuracy improvement over competing approaches.
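As a rough illustration of why tensor decomposition combined with quantization can shrink a network layer so sharply, the sketch below counts the parameters of a dense weight matrix against a factorized one. The abstract does not name the decomposition format, so a tensor-train factorization is assumed here. The mode sizes, ranks, and bit widths in the usage lines are hypothetical values chosen for illustration and are not DEEPEYE's reported configuration.

```python
from math import prod


def dense_params(in_modes, out_modes):
    """Parameter count of a dense weight matrix of shape prod(in_modes) x prod(out_modes)."""
    return prod(in_modes) * prod(out_modes)


def tt_params(in_modes, out_modes, ranks):
    """Parameter count of the same matrix stored as tensor-train cores.

    Core k has shape (ranks[k], in_modes[k], out_modes[k], ranks[k + 1]),
    with boundary ranks fixed to 1 (an assumption of the tensor-train format).
    """
    d = len(in_modes)
    if len(out_modes) != d or len(ranks) != d + 1:
        raise ValueError("need d input modes, d output modes, and d + 1 ranks")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary ranks must be 1")
    return sum(
        ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1] for k in range(d)
    )


def compression_ratio(in_modes, out_modes, ranks, dense_bits=32, tt_bits=8):
    """Storage ratio of dense weights to quantized tensor-train cores.

    dense_bits and tt_bits are hypothetical bit widths, not values from the source.
    """
    dense_total = dense_params(in_modes, out_modes) * dense_bits
    tt_total = tt_params(in_modes, out_modes, ranks) * tt_bits
    return dense_total / tt_total


# Hypothetical layer: 57,600 input features mapped to 256 hidden units.
IN_MODES = (8, 20, 20, 18)
OUT_MODES = (4, 4, 4, 4)
RANKS = (1, 4, 4, 4, 1)

if __name__ == "__main__":
    print("dense parameters:", dense_params(IN_MODES, OUT_MODES))
    print("tensor-train parameters:", tt_params(IN_MODES, OUT_MODES, RANKS))
    print("storage ratio:", compression_ratio(IN_MODES, OUT_MODES, RANKS))
```

The factorized count grows with the sum of the per-core sizes rather than the product of all mode sizes, which is where most of the reduction comes from; lowering the bit width of the stored cores then multiplies the storage saving further.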
DEEPEYE: A Deeply Tensor-Compressed Neural Network for Video Comprehension on Terminal Devices