Abstract
Exploiting computational and data reuse in CNNs is crucial for the successful design of resource-constrained platforms. In image recognition applications, the high levels of input locality and redundancy present in CNNs have become the golden goose for skipping costly arithmetic operations. One promising technique stores the function responses of selected input patterns in offline lookup tables and replaces online computation with search operations, which are highly efficient when implemented with emerging non-volatile memory technologies. In this work, we rethink both the algorithm and the architecture for exploiting locality and reuse opportunities by replacing entire convolutions with searches on content-addressable memories. By precomputing convolution results and building compact lookup tables with our novel clustering algorithm, activations can be evaluated in constant time, requiring only a single read of the current input tensor. We then devise a reconfigurable array of processing elements based on memristive ternary content-addressable memories (TCAMs) to implement the algorithmic solution efficiently and to meet the flexibility requirements of diverse CNN architectures. Results show that our design reduces the number of multiplications and memory accesses in proportion to the number of convolutional-layer channels. Average performance reaches 1,172 and 82 FPS for the AlexNet and VGG-16 models, respectively, outperforming state-of-the-art works by 13×.
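The lookup-based idea described above can be illustrated with a minimal software sketch. This is not the paper's actual clustering algorithm or TCAM hardware; it only shows the general flow under simple assumptions: representative input patches (here, arbitrary "centroids") are chosen offline, the convolution response of each one is precomputed into a table, and at inference time a nearest-pattern search (which a TCAM would resolve in a single match operation) replaces the multiply-accumulate work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline phase (hypothetical setup): pick K representative patches and
# precompute one filter's response for each, building the lookup table.
K, patch = 8, 9                       # 8 table entries, flattened 3x3 patches
weights = rng.standard_normal(patch)  # one convolutional filter, flattened
centroids = rng.standard_normal((K, patch))
table = centroids @ weights           # precomputed responses, one per entry

def lookup_conv(x):
    """Online phase: instead of computing x @ weights, search for the
    closest stored pattern and return its precomputed response."""
    idx = np.argmin(np.linalg.norm(centroids - x, axis=1))
    return table[idx]

# A patch close to a stored pattern gets an approximate response via
# a single search plus one table read, with no multiplications online.
x = centroids[3] + 0.01 * rng.standard_normal(patch)
approx = lookup_conv(x)
exact = x @ weights
```

The accuracy of the approximation depends on how well the stored patterns cover the input distribution, which is exactly what the clustering step is meant to optimize.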
Data and Computation Reuse in CNNs Using Memristor TCAMs