Best Paper

Towards memory-efficient processing-in-memory architecture for convolutional neural networks

Published: 21 June 2017

Abstract

Convolutional neural networks (CNNs) are widely adopted in artificial intelligence systems. In contrast to conventional computing-centric applications, the computational and memory demands of CNN applications are intertwined in the network weights. This incurs a significant amount of data movement, especially for high-dimensional convolutions. Although the recent embedded 3D-stacked Processing-in-Memory (PIM) architecture alleviates this memory bottleneck to provide fast near-data processing, memory remains a limiting factor of the entire system. An unsolved key challenge is how to efficiently allocate convolutions to 3D-stacked PIM to combine the advantages of both neural and computational processing.
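To make the data-movement concern above concrete, the following is a back-of-the-envelope sketch (not from the paper) of the memory footprint of a single convolutional layer; the layer dimensions are illustrative assumptions chosen to resemble a mid-network CNN layer:

```python
# Hypothetical illustration: weight vs. activation footprint of one
# k x k convolutional layer, to show why weight traffic can dominate
# for high-dimensional convolutions. All sizes are assumed, not taken
# from the paper's benchmarks.

def conv_footprint_bytes(c_in, c_out, k, h, w, dtype_bytes=4):
    """Return (weight_bytes, activation_bytes) for a k x k convolution
    over a c_in x h x w input producing c_out output channels."""
    weights = c_out * c_in * k * k * dtype_bytes   # filter parameters
    activations = c_in * h * w * dtype_bytes       # one input feature map
    return weights, activations

# Example: a 3x3 layer with 256 input and 512 output channels on a
# 28x28 feature map (assumed dimensions).
w_bytes, a_bytes = conv_footprint_bytes(c_in=256, c_out=512, k=3, h=28, w=28)
# Weights (~4.5 MB) outweigh the input activations (~0.8 MB), so moving
# weights to and from off-chip memory dominates traffic unless the
# computation is placed near the data, as PIM architectures attempt.
```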

This paper presents Memolution, a compiler-based, memory-efficient data allocation strategy for convolutional neural networks on PIM architecture. Memolution offers thread-level parallelism that fully exploits the computational power of the PIM architecture. The objective is to capture the characteristics of neural network applications and present a hardware-independent design that transparently allocates CNN applications onto the underlying hardware resources provided by PIM. We demonstrate the viability of the proposed technique using a variety of realistic convolutional neural network applications. Our extensive evaluations show that Memolution significantly improves performance and cache utilization compared to the baseline scheme.
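As a rough intuition for the kind of allocation such a compiler pass could perform, here is a minimal sketch: each convolution's output is partitioned into tiles, and tiles are assigned round-robin to PIM vaults so that per-vault threads operate on near-data tiles. The vault count, tile granularity, and function names are illustrative assumptions, not Memolution's actual algorithm:

```python
# Hypothetical sketch of tile-to-vault allocation for thread-level
# parallelism on a 3D-stacked PIM device. This is NOT the paper's
# allocation strategy; it only illustrates the idea of spreading
# convolution work across vaults for near-data processing.

from collections import defaultdict

def allocate_tiles(layers, num_vaults=16):
    """Map (layer_name, num_tiles) pairs onto vaults round-robin.

    Returns a dict: vault_id -> list of (layer_name, tile_index),
    so each vault's worker thread knows which tiles it owns.
    """
    mapping = defaultdict(list)
    tile_id = 0
    for layer, num_tiles in layers:
        for t in range(num_tiles):
            mapping[tile_id % num_vaults].append((layer, t))
            tile_id += 1
    return mapping

# Two assumed layers: 8 tiles for conv1, 32 tiles for conv2, on a
# 4-vault device; every vault receives an equal 10-tile share.
plan = allocate_tiles([("conv1", 8), ("conv2", 32)], num_vaults=4)
```

A real allocator would additionally weigh per-tile memory footprint against each vault's local DRAM capacity rather than balancing tile counts alone, which is precisely the memory-efficiency question the paper targets.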

