Abstract
Implementing embedded neural network processing at the edge requires efficient hardware acceleration that combines high computational throughput with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly being adapted to support the new functionality. Hardware designers can refer to a myriad of accelerator implementations in the literature to evaluate and compare hardware design choices. However, the sheer number of publications and their diverse optimization directions hinder an effective assessment. Existing surveys provide an overview of these works but are often limited to system-level, benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing down research progress.
In contrast to previous surveys, this work provides a quantitative overview of the neural network accelerator optimization approaches used in recent works and reports their individual effects on edge processing performance. The optimizations and their quantitative effects are presented as a construction kit, allowing the design choices for each building block to be assessed individually. Reported optimizations range from up to 10,000× memory savings to 33× energy reductions, providing chip designers with an overview of design choices for implementing efficient, low-power neural network accelerators.
- [190] 2012. Computing with spiking neuron networks. In Handbook of Natural Computing, G. Rozenberg, T. Back and J. Kok (Eds.). Springer-Verlag, 335–376.
- [191] 2019. An introduction to probabilistic spiking neural networks: Probabilistic models, learning rules, and applications. Sig. Proc. Mag. 36 (2019), 64–77.
- [192] Pentti Kanerva. 2009. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation 1, 6 (2009).
- [193] 2021. Mixed-signal computing for deep neural network inference. IEEE Trans. on Very Large Scale Integr. Sys. 29 (2021), 3–13.
- [194] 2019. The next generation of deep learning hardware: Analog computing. Proceedings of the IEEE 107 (2019), 108–122.
- [195] 2018. An always-on 3.8 µJ/86% CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28nm CMOS. In IEEE International Solid-State Circuits Conference, 2018.
- [196] 2018. BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS. CoRR, vol. abs/1804.05554, 2018.
- [197] 2016. An 8-bit, 16 input, 3.2 pJ/op switched-capacitor dot product circuit in 28-nm FDSOI CMOS. In IEEE Asian Solid-State Circuits Conference, 2016.
- [198] 2017. 14.6 A 0.62mW ultra-low-power convolutional-neural-network face-recognition processor and a CIS integrated with always-on Haar-like face detector. In IEEE Int. Solid-State Circuits Conf., 2017.
- [199] 2011. A 57mW embedded mixed-mode neuro-fuzzy accelerator for intelligent multi-core processor. In IEEE International Solid-State Circuits Conference, 2011.
- [200] 2018. Multiscale co-design analysis of energy, latency, area, and accuracy of a ReRAM analog neural training accelerator. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 8 (2018), 86–101.
- [201] 2015. Fixed point optimization of deep convolutional neural networks for object recognition. In IEEE Int. Conference on Acoustics, Speech and Signal Processing, 2015.
- [202] 2017. Stripes: Bit-serial deep neural network computing. IEEE Comp. Arch. Letters 16 (2017), 80–83.
A Construction Kit for Efficient Low Power Neural Network Accelerator Designs