
MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects

Published: 19 March 2018

Abstract

Deep neural networks (DNNs) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and the need for high energy efficiency have led to a surge in research on hardware accelerators. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PEs) operating in parallel and communicating with each other directly. DNNs are evolving at a rapid rate, and it is common for recent topologies to mix convolutional, recurrent, pooling, and fully-connected layers with varying input and filter sizes. These layers may be dense or sparse, and they can be partitioned in myriad ways (within and across layers) to exploit data reuse of weights and intermediate outputs. All of the above lead to different dataflow patterns within the accelerator substrate. Unfortunately, most DNN accelerators support only fixed dataflow patterns internally, because they perform a careful co-design of the PEs and the network-on-chip (NoC); in fact, the majority are optimized only for traffic within a convolutional layer. This makes it challenging to map arbitrary dataflows onto the fabric efficiently, and it can lead to underutilization of the available compute resources. DNN accelerators need to be programmable to enable mass deployment. For them to be programmable, they need to be configurable internally to support the various dataflow patterns that could be mapped over them. To address this need, we present MAERI, a DNN accelerator built from a set of modular and configurable building blocks that can easily support myriad DNN partitions and mappings by appropriately configuring tiny switches. MAERI provides 8-459% better utilization across multiple dataflow mappings over baselines with rigid NoC fabrics.
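The utilization argument in the abstract can be illustrated with a toy model (this is an illustrative sketch, not the paper's design or RTL; the function names and the 64-multiplier/16-wide-tile parameters are hypothetical). A fabric whose reduction network is reconfigurable can group its multipliers into "virtual neurons" of any size, so only the tail multipliers idle; a rigid fabric must round each neuron up to a whole number of fixed-size tiles, stranding compute:

```python
import math

def flexible_utilization(num_mults: int, neuron_size: int) -> float:
    """Configurable fabric: pack as many virtual neurons of the given
    size as fit; only the leftover tail multipliers sit idle."""
    neurons = num_mults // neuron_size
    return neurons * neuron_size / num_mults

def rigid_utilization(num_mults: int, neuron_size: int, tile_size: int) -> float:
    """Rigid fabric: multipliers come in fixed tiles, and each neuron
    occupies a whole number of tiles regardless of its true size."""
    tiles = num_mults // tile_size
    tiles_per_neuron = math.ceil(neuron_size / tile_size)
    neurons = tiles // tiles_per_neuron
    return neurons * neuron_size / num_mults

# A 64-multiplier fabric running size-9 neurons (e.g., 3x3 convolution):
print(flexible_utilization(64, 9))             # 63/64 ~ 0.98
print(rigid_utilization(64, 9, tile_size=16))  # 36/64 ~ 0.56
```

In this toy setting the flexible mapping keeps 63 of 64 multipliers busy, while the rigid 16-wide tiling wastes 7 multipliers per neuron, which is the kind of gap the reconfigurable interconnect in MAERI is designed to close.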

