TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

Published: 04 April 2017

Abstract

The high accuracy of deep neural networks (NNs) has led to the development of NN accelerators that improve performance by two orders of magnitude. However, scaling these accelerators for higher performance with increasingly larger NNs exacerbates the cost and energy overheads of their memory systems, including the on-chip SRAM buffers and the off-chip DRAM channels.

This paper presents the hardware architecture and software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory. First, we show that the high throughput and low energy characteristics of 3D memory allow us to rebalance the NN accelerator design, using more area for processing elements and less area for SRAM buffers. Second, we move portions of the NN computations close to the DRAM banks to decrease bandwidth pressure and increase performance and energy efficiency. Third, we show that despite the use of small SRAM buffers, the presence of 3D memory simplifies dataflow scheduling for NN computations. We present an analytical scheduling scheme that matches the efficiency of schedules derived through exhaustive search. Finally, we develop a hybrid partitioning scheme that parallelizes the NN computations over multiple accelerators. Overall, we show that TETRIS improves performance by 4.1x and reduces energy by 1.5x over NN accelerators with conventional, low-power DRAM memory systems.
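To make the scheduling discussion concrete, the kind of trade-off an NN-accelerator scheduler navigates can be illustrated with a toy cost model. The sketch below is not the TETRIS scheduler itself: the tiling dimensions, buffer model, and DRAM-traffic formula are all simplifying assumptions made for illustration. It enumerates tilings of a convolutional layer's batch and output channels, keeps only those whose working set fits a hypothetical on-chip buffer, and picks the one with the least estimated off-chip traffic — the exhaustive search that an analytical scheme would aim to match in closed form.

```python
from itertools import product


def dram_traffic(n_img, n_in, n_out, fmap, filt, t_img, t_out):
    """Estimate off-chip DRAM accesses (in words) for one conv layer
    under a (t_img, t_out) tiling.  Hypothetical cost model: ifmaps are
    re-read once per output-channel tile, weights once per batch tile,
    and ofmaps are written once."""
    ifmap_words = n_img * n_in * fmap * fmap
    weight_words = n_in * n_out * filt * filt
    ofmap_words = n_img * n_out * fmap * fmap
    n_out_tiles = -(-n_out // t_out)   # ceiling division
    n_img_tiles = -(-n_img // t_img)
    return (ifmap_words * n_out_tiles
            + weight_words * n_img_tiles
            + ofmap_words)


def best_tiling(n_img, n_in, n_out, fmap, filt, buf_words):
    """Exhaustively search (t_img, t_out) tilings whose working set
    fits in a buffer of buf_words; return (cost, t_img, t_out)."""
    best = None
    for t_img, t_out in product(range(1, n_img + 1), range(1, n_out + 1)):
        # Working set: one ifmap tile + one weight tile + one ofmap tile.
        ws = (t_img * n_in * fmap * fmap
              + n_in * t_out * filt * filt
              + t_img * t_out * fmap * fmap)
        if ws > buf_words:
            continue
        cost = dram_traffic(n_img, n_in, n_out, fmap, filt, t_img, t_out)
        if best is None or cost < best[0]:
            best = (cost, t_img, t_out)
    return best
```

For example, `best_tiling(4, 8, 16, 8, 3, 4096)` searches a small layer (batch 4, 8 input channels, 16 filters, 8x8 feature maps, 3x3 kernels) under a 4K-word buffer. A real scheduler searches far more dimensions (feature-map tiling, loop ordering, buffer bypassing), which is what makes a closed-form analytical scheme valuable.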

References

  1. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265--283, 2016.Google ScholarGoogle Scholar
  2. J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. A Scalable Processing-in-memory Accelerator for Parallel Graph Processing. In 42nd International Symposium on Computer Architecture (ISCA), pages 105--117, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. In 43rd Annual International Symposium on Computer Architecture (ISCA), pages 1--13, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Balasubramonian, J. Chang, T. Manning, J. H. Moreno, R. Murphy, R. Nair, and S. Swanson. Near-Data Processing: Insights from a MICRO-46 Workshop. IEEE Micro, 34(4): 36--42, 2014. Google ScholarGoogle ScholarCross RefCross Ref
  5. T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. In 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 269--284, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam. DaDianNao: A Machine-Learning Supercomputer. In 47th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 609--622, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Y.-H. Chen, J. Emer, and V. Sze. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In 43rd Annual International Symposium on Computer Architecture (ISCA), pages 367--379, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Y.-H. Chen, T. Krishna, J. Emer, and V. Sze. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. In IEEE International Solid-State Circuits Conference (ISSCC), pages 262--263, 2016. Google ScholarGoogle ScholarCross RefCross Ref
  9. P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. In 43rd International Symposium on Computer Architecture (ISCA), pages 27--39, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. K. Choi. Coarse-Grained Reconfigurable Array: Architecture and Application Mapping. IPSJ Transactions on System LSI Design Methodology, 4:31--46, 2011. Google ScholarGoogle ScholarCross RefCross Ref
  11. J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large Scale Distributed Deep Networks. In 25th International Conference on Neural Information Processing Systems (NIPS), pages 1223--1231, 2012.Google ScholarGoogle Scholar
  12. Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam. ShiDianNao: Shifting Vision Processing Closer to the Sensor. In 42nd Annual International Symposium on Computer Architecture (ISCA), pages 92--104, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Dundar, J. Jin, V. Gokhale, B. Martini, and E. Culurciello. Memory Access Optimized Routing Scheme for Deep Networks on a Mobile Coprocessor. In 2014 IEEE High Performance Extreme Computing Conference (HPEC), pages 1--6, 2014. Google ScholarGoogle ScholarCross RefCross Ref
  14. Y. Eckert, N. Jayasena, and G. H. Loh. Thermal Feasibility of Die-Stacked Processing in Memory. In 2nd Workshop on Near-Data Processing (WoNDP), 2014.Google ScholarGoogle Scholar
  15. C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun. Neuflow: A Runtime Reconfigurable Dataflow Processor for Vision. In 2011 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 109--116, 2011. Google ScholarGoogle ScholarCross RefCross Ref
  16. M. Gao and C. Kozyrakis. HRL: Efficient and Flexible Re- configurable Logic for Near-Data Processing. In 22nd IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 126--137, 2016. Google ScholarGoogle ScholarCross RefCross Ref
  17. M. Gao, G. Ayers, and C. Kozyrakis. Practical Near-Data Processing for In-Memory Analytics Frameworks. In 2015 International Conference on Parallel Architecture and Compilation (PACT), pages 113--124, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dall. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In 43rd Annual International Symposium on Computer Architecture (ISCA), pages 243--254, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385, 2015.Google ScholarGoogle Scholar
  20. Hybrid Memory Cube Consortium. Hybrid Memory Cube Specification 2.1, 2014.Google ScholarGoogle Scholar
  21. J. Jeddeloh and B. Keeth. Hybrid Memory Cube New DRAM Architecture Increases Density and Performance. In 2012 Symposium on VLSI Technology (VLSIT), pages 87--88, 2012. Google ScholarGoogle ScholarCross RefCross Ref
  22. JEDEC Standard. High Bandwidth Memory (HBM) DRAM. JESD235A, 2015.Google ScholarGoogle Scholar
  23. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093, 2014.Google ScholarGoogle Scholar
  24. A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi. ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early- stage Design Space Exploration. In Conference on Design, Automation and Test in Europe (DATE), pages 423--428, 2009. Google ScholarGoogle ScholarCross RefCross Ref
  25. D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay. Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. In 43rd Annual International Symposium on Computer Architecture (ISCA), pages 380--392, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In 25th International Conference on Neural Information Processing Systems (NIPS), pages 1097--1105, 2012.Google ScholarGoogle Scholar
  27. Y. LeCun, Y. Bengio, and G. Hinton. Deep Learning. Nature, 521(7553):436--444, 2015. Google ScholarGoogle Scholar
  28. D. U. Lee, K. W. Kim, K. W. Kim, H. Kim, J. Y. Kim, Y. J. Park, J. H. Kim, D. S. Kim, H. B. Park, J. W. Shin, J. H. Cho, K. H. Kwon, M. J. Kim, J. Lee, K. W. Park, B. Chung, and S. Hong. 25.2 A 1.2V 8Gb 8-channel 128GB/s High-Bandwidth Memory (HBM) Stacked DRAM with Effective Microbump I/O Test Methods Using 29nm Process and TSV. In IEEE International Solid-State Circuits Conference (ISSCC), pages 432--433, 2014. Google ScholarGoogle ScholarCross RefCross Ref
  29. S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi. CACTI-P: Architecture-Level Modeling for SRAM-based Structures with Advanced Leakage Reduction Techniques. In 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 694--701, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins. ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix. In 13th International Conference on Field Programmable Logic and Application (FPL), pages 61--70, 2003.Google ScholarGoogle Scholar
  31. Micron Technology Inc. TN-41-01: Calculating Memory System Power for DDR3 . https://www.micron.com/support/tools-and-utilities/power-calc, 2007.Google ScholarGoogle Scholar
  32. Micron Technology Inc. Mobile LPDDR3 SDRAM: 178-Ball, Single-Channel Mobile LPDDR3 SDRAM Features. https://www.micron.com/products/dram/lpdram/16Gb, 2014.Google ScholarGoogle Scholar
  33. S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo. A 1.93TOPS/W Scalable Deep Learning/Inference Processor with Tetra-Parallel MIMD Architecture for Big-Data Applications. In IEEE International Solid-State Circuits Conference (ISSCC), pages 1--3, 2015.Google ScholarGoogle Scholar
  34. M. Peemen, A. A. Setio, B. Mesman, and H. Corporaal. Memory-Centric Accelerator Design for Convolutional Neural Networks. In 31st International Conference on Computer Design (ICCD), pages 13--19, 2013. Google ScholarGoogle ScholarCross RefCross Ref
  35. S. H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, and F. Li. NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 190--200, 2014. Google ScholarGoogle ScholarCross RefCross Ref
  36. B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators. In 43rd Annual International Symposium on Computer Architecture (ISCA), pages 267--278,2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. D. Sanchez and C. Kozyrakis. ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-core Systems. In 40th International Symposium on Computer Architecture(ISCA), pages 475--486, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry. RowClone: Fast and Energy-efficient in-DRAM Bulk Data Copy and Initialization. In 46th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 185--197, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry. Fast Bulk Bitwise AND and OR in DRAM. Computer Architecture Letters, 14 :127--131, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. ISAAC: A Convolutional Neural Network Accelerator with In-situ Analog Arithmetic in Crossbars. In 43rd International Symposium on Computer Architecture (ISCA), pages 14--26, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.Google ScholarGoogle Scholar
  42. H. Singh, M.-H. Lee, G. Lu, N. Bagherzadeh, F. J. Kurdahi, and E. M. C. Filho. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Transactions Computers, 49(5):465--481, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. T. Vogelsang. Understanding the Energy Consumption of Dynamic Random Access Memories. In 43rd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), pages 363--374, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. C. Weis, N. Wehn, L. Igor, and L. Benini. Design Space Exploration for 3D-stacked DRAMs. In Design, Automation Test in Europe Conference Exhibition (DATE), pages 1--6, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  45. X. Yang, J. Pu, B. B. Rister, N. Bhagdikar, S. Richardson, S. Kvatinsky, J. Ragan-Kelley, A. Pedram, and M. Horowitz. A Systematic Approach to Blocking Convolutional Neural Networks. arXiv preprint arXiv:1606.04209, 2016.Google ScholarGoogle Scholar
  46. M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. In 13th European Conference on Computer Vision (ECCV), pages 818--833, 2014. Google ScholarGoogle ScholarCross RefCross Ref
  47. C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pages 161--170, 2015.Google ScholarGoogle Scholar

• Published in

  ACM SIGPLAN Notices, Volume 52, Issue 4
  ASPLOS '17
  April 2017
  811 pages
  ISSN: 0362-1340
  EISSN: 1558-1160
  DOI: 10.1145/3093336

  • ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
    April 2017
    856 pages
    ISBN: 9781450344654
    DOI: 10.1145/3037697

      Copyright © 2017 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
