Abstract
With the proliferation of mobile devices, distributed learning, which enables model training with decentralized data, has attracted great interest from researchers. However, the limited training capability of edge devices significantly constrains the energy efficiency of distributed learning in practice. This article describes Efficient-Grad, an algorithm-hardware co-design approach for training deep convolutional neural networks that improves both throughput and energy efficiency during model training, with negligible loss of validation accuracy.
The key to Efficient-Grad is its exploitation of two observations. First, sparsity exists not only in activations and weights but also in gradients, together with an asymmetry residing in the gradients of conventional back propagation (BP). Second, a dedicated hardware architecture can be optimized to exploit this sparsity and streamline data movement, supporting the Efficient-Grad algorithm in a scalable manner. To the best of our knowledge, Efficient-Grad is the first approach to successfully adopt a feedback-alignment (FA)-based gradient optimization scheme for deep convolutional neural network training, which accounts for its superior energy efficiency. We present case studies demonstrating that the Efficient-Grad design outperforms prior art by 3.72x in terms of energy efficiency.
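To make the two observations above concrete, the following minimal sketch contrasts conventional BP error propagation with feedback alignment, and applies magnitude-based gradient pruning to expose sparsity. This is an illustrative toy example, not the authors' implementation: the matrix sizes, the fixed feedback matrix `B`, and the pruning threshold `tau` are all assumptions for demonstration.

```python
import random

random.seed(0)

def matvec(M, v):
    """Dense matrix-vector product over plain Python lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Hypothetical single layer: forward weights W2 (n_out x n_hidden) and a
# fixed random feedback matrix B (n_hidden x n_out) used in place of W2^T.
n_hidden, n_out = 4, 3
W2 = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
B  = [[random.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_hidden)]

# Output-layer error signal (e.g., prediction minus target), made up here.
delta_out = [0.5, -0.2, 0.1]

# Conventional BP propagates the error with the transposed forward weights;
# feedback alignment replaces W2^T with the fixed random matrix B, so the
# backward pass no longer needs to fetch (or transpose) the forward weights.
W2_T = [[W2[i][j] for i in range(n_out)] for j in range(n_hidden)]
delta_bp = matvec(W2_T, delta_out)  # conventional BP error signal
delta_fa = matvec(B, delta_out)     # feedback-alignment error signal

# Gradient pruning: zero entries below a magnitude threshold, creating the
# kind of gradient sparsity a dedicated accelerator can exploit.
tau = 0.1
delta_fa_sparse = [d if abs(d) > tau else 0.0 for d in delta_fa]

print(delta_bp)
print(delta_fa)
print(delta_fa_sparse)
```

The design point this illustrates is that FA decouples the backward pass from the forward weights, which simplifies dataflow, while pruning makes the propagated gradients sparse enough for hardware to skip work.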
Supplemental Material
Available for Download
Version of Record for “Efficient-Grad: Efficient Training Deep Convolutional Neural Networks on Edge Devices with Gradient Optimizations” by Hong et al., ACM Transactions on Embedded Computing Systems, Volume 21, No. 2 (TECS 21:2).