A Small-Footprint Accelerator for Large-Scale Neural Networks

Published: 22 May 2015

Abstract

Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multicores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope.

Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy.

We show that it is possible to design an accelerator with high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87× faster and reduces total energy by 21.08×. The accelerator characteristics are obtained after layout at 65 nm. Such high throughput in a small footprint can open up the use of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
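The headline figures above imply the accelerator's energy and area efficiency. A quick back-of-the-envelope check (the derived GOP/J and GOP/s/mm² values and the rounding are ours, computed directly from the abstract's reported numbers, not stated in the paper itself):

```python
# Efficiency metrics derived from the figures reported in the abstract:
# 452 GOP/s throughput, 3.02 mm^2 layout area at 65 nm, 485 mW power.

throughput_gops = 452.0   # giga-operations per second
area_mm2 = 3.02           # post-layout area at 65 nm
power_w = 0.485           # 485 mW expressed in watts

# GOP/s per watt is dimensionally GOP per joule.
energy_eff_gop_per_j = throughput_gops / power_w
# Throughput delivered per unit of silicon area.
area_eff_gops_per_mm2 = throughput_gops / area_mm2

print(f"Energy efficiency: {energy_eff_gop_per_j:.0f} GOP/J")
print(f"Area efficiency:   {area_eff_gops_per_mm2:.0f} GOP/s/mm^2")
```

Roughly 930 GOP/J and 150 GOP/s/mm², which is the "rare combination of efficiency and broad application scope" the abstract argues such an accelerator can reach.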
