Abstract
The high accuracy of deep neural networks (NNs) has led to the development of NN accelerators that improve performance by two orders of magnitude. However, scaling these accelerators for higher performance with increasingly larger NNs exacerbates the cost and energy overheads of their memory systems, including the on-chip SRAM buffers and the off-chip DRAM channels.
This paper presents the hardware architecture and the software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory. First, we show that the high throughput and low energy characteristics of 3D memory allow us to rebalance the NN accelerator design, using more area for processing elements and less area for SRAM buffers. Second, we move portions of the NN computations close to the DRAM banks to decrease bandwidth pressure and increase performance and energy efficiency. Third, we show that despite the use of small SRAM buffers, the presence of 3D memory simplifies dataflow scheduling for NN computations. We present an analytical scheduling scheme that matches the efficiency of schedules derived through exhaustive search. Finally, we develop a hybrid partitioning scheme that parallelizes the NN computations over multiple accelerators. Overall, we show that TETRIS improves performance by 4.1x and reduces energy by 1.5x over NN accelerators with conventional, low-power DRAM memory systems.
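To make the analytical-scheduling claim concrete, the sketch below contrasts an exhaustive tile-size search with a closed-form choice for a single convolutional layer under a simple output-channel tiling. This is an illustrative toy model, not the TETRIS scheduler: the cost model, the tiling scheme, and all parameter names are assumptions for exposition. The point it demonstrates is that when off-chip traffic is monotonic in the tile size, the optimal tile can be solved directly from the buffer-capacity constraint, matching the exhaustive search without enumerating the schedule space.

```python
# Illustrative sketch (assumed cost model, not the TETRIS code):
# pick how many output channels t_out to buffer on-chip so that
# off-chip DRAM traffic for one conv layer is minimized.
import math

def dram_traffic(N_in, N_out, H, W, K, t_out):
    """Words moved to/from DRAM when t_out output channels are buffered.
    Input feature maps are re-streamed once per group of output channels."""
    groups = math.ceil(N_out / t_out)
    ifmap = groups * N_in * H * W      # inputs re-read once per group
    ofmap = N_out * H * W              # outputs written exactly once
    weights = N_in * N_out * K * K     # filters read exactly once
    return ifmap + ofmap + weights

def fits(H, W, K, buf_words, t_out):
    # one input plane plus t_out partial output planes and their
    # filter slices must fit in the on-chip buffer
    return H * W * (1 + t_out) + t_out * K * K <= buf_words

def exhaustive(N_in, N_out, H, W, K, buf_words):
    """Brute-force search over all feasible tile sizes."""
    best = None
    for t in range(1, N_out + 1):
        if fits(H, W, K, buf_words, t):
            c = dram_traffic(N_in, N_out, H, W, K, t)
            if best is None or c < best[1]:
                best = (t, c)
    return best

def analytical(N_in, N_out, H, W, K, buf_words):
    """Traffic is non-increasing in t_out (fewer input re-reads), so the
    largest tile satisfying the capacity constraint is optimal; solve
    H*W*(1+t) + t*K*K <= buf_words for t in closed form."""
    t = (buf_words - H * W) // (H * W + K * K)
    t = max(1, min(N_out, t))
    return t, dram_traffic(N_in, N_out, H, W, K, t)
```

Under this model, `analytical` reaches the same minimum traffic as `exhaustive` in O(1) instead of O(N_out) evaluations; the real scheduling problem in the paper has more loop dimensions, but the same monotonicity reasoning applies.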
TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory. In ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems.