Abstract
Convolutional neural networks (CNNs) are widely adopted in artificial intelligence systems. In contrast to conventional computing-centric applications, the computation and memory accesses of CNN applications are intertwined in the network weights. This incurs a significant amount of data movement, especially for high-dimensional convolutions. Although recent embedded 3D-stacked Processing-in-Memory (PIM) architectures alleviate this memory bottleneck by providing fast near-data processing, memory remains a limiting factor of the entire system. A key unsolved challenge is how to efficiently allocate convolutions to 3D-stacked PIM to combine the advantages of both neural and computational processing.
This paper presents Memolution, a compiler-based, memory-efficient data allocation strategy for convolutional neural networks on a PIM architecture. Memolution offers thread-level parallelism that can fully exploit the computational power of the PIM architecture. The objective is to capture the characteristics of neural network applications and present a hardware-independent design that transparently allocates CNN applications onto the underlying hardware resources provided by PIM. We demonstrate the viability of the proposed technique using a variety of realistic convolutional neural network applications. Our extensive evaluations show that Memolution significantly improves performance and cache utilization compared to the baseline scheme.
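The abstract does not spell out Memolution's actual allocation algorithm, but the core idea it names, splitting a convolution's work across PIM compute units so each thread operates near its own slice of the weights, can be illustrated with a minimal sketch. Everything below (the `ConvLayer` fields, the round-robin `allocate` helper, the vault count) is an assumption chosen for illustration, not the paper's method.

```python
# Hypothetical sketch: partition a convolution's output channels across
# PIM vaults so each vault computes its channels near the weights it holds.
from dataclasses import dataclass


@dataclass
class ConvLayer:
    in_ch: int   # input channels
    out_ch: int  # output channels
    k: int       # square kernel size (k x k)


def allocate(layer: ConvLayer, num_vaults: int):
    """Round-robin output channels over vaults.

    Returns a {vault_id: [channel indices]} plan plus each vault's
    weight footprint in bytes (assuming fp32 weights).
    """
    plan = {v: [] for v in range(num_vaults)}
    for c in range(layer.out_ch):
        plan[c % num_vaults].append(c)
    weight_bytes = {
        v: len(chs) * layer.in_ch * layer.k * layer.k * 4
        for v, chs in plan.items()
    }
    return plan, weight_bytes


# Example: a 3x3 convolution with 64 input / 128 output channels,
# spread over a (hypothetical) 16-vault 3D-stacked memory cube.
plan, footprint = allocate(ConvLayer(in_ch=64, out_ch=128, k=3), 16)
```

With these numbers each vault receives 8 output channels and an 18 KB weight slice, so all 16 vaults can run their partial convolutions concurrently; a real allocator would also weigh input-feature-map reuse and inter-vault traffic, which this sketch ignores.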
Towards memory-efficient processing-in-memory architecture for convolutional neural networks. In LCTES 2017: Proceedings of the 18th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems.