Abstract
This paper addresses the design of systolic array (SA) based convolutional neural network (CNN) accelerators for mobile and embedded domains. On- and off-chip memory accesses to the large activation inputs (sometimes called feature maps) of CNN layers contribute significantly to the total energy consumption of such accelerators. While prior work has proposed off-chip compression, activations are still stored on-chip in uncompressed form, requiring either large on-chip activation buffers or slow and energy-hungry off-chip accesses. In this paper, we propose CompAct, a new architecture that enables on-chip compression of activations for SA-based CNN accelerators. CompAct is built around several key ideas. First, CompAct identifies an SA schedule that has nearly regular access patterns, enabling the use of a modified run-length coding (RLC) scheme. Second, CompAct improves the compression ratio of the RLC scheme by using Sparse-RLC in later CNN layers and Lossy-RLC in earlier layers. Finally, CompAct proposes look-ahead snoozing, which operates synergistically with RLC to reduce the leakage energy of activation buffers. Based on detailed synthesis results, we show that CompAct enables up to a 62% reduction in activation buffer energy and a 34% reduction in total chip energy.
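To illustrate the kind of run-length coding the abstract refers to, the sketch below shows a generic zero-run-length encoder for a sparse activation stream: each nonzero value is stored together with the count of zeros preceding it. This is a minimal illustration of the general technique, not CompAct's exact format; the function names and the fixed run-length field width (`max_run=31`, i.e., a hypothetical 5-bit run field) are illustrative assumptions.

```python
def rlc_encode(activations, max_run=31):
    """Encode a 1-D activation stream as (zero_run, value) pairs.

    A run longer than max_run is split by emitting a (max_run, 0)
    pair, mimicking a fixed-width run-length field in hardware.
    """
    pairs = []
    run = 0
    for a in activations:
        if a == 0 and run < max_run:
            run += 1  # extend the current zero run
        else:
            pairs.append((run, a))  # a == 0 here only when the run field saturated
            run = 0
    if run:
        # Trailing zeros: the pair's explicit value slot carries one of them.
        pairs.append((run - 1, 0))
    return pairs


def rlc_decode(pairs):
    """Expand (zero_run, value) pairs back into the activation stream."""
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        out.append(value)
    return out
```

Because typical post-ReLU feature maps are mostly zeros, such a coder shrinks the stream roughly in proportion to its sparsity, which is what makes compressed on-chip activation buffers attractive.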
CompAct: On-chip Compression of Activations for Low Power Systolic Array Based CNN Acceleration