Abstract
Machine-learning tasks are becoming pervasive in a broad range of domains and in a broad range of systems, from embedded systems to data centers. At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) is proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multicores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope.
Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy.
We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (counting key NN operations such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm² at 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87× faster and reduces total energy by 21.08×. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the use of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
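To make the throughput figure concrete, the sketch below counts the key NN operations of a single fully connected (classifier) layer, the kind of kernel such an accelerator targets. This is a minimal illustration, not the paper's design or benchmark configuration: the layer dimensions and function name are assumptions chosen for the example, while the 452 GOP/s figure is the one reported above.

```python
import numpy as np

def classifier_layer(x, W):
    """Fully connected layer: each output neuron multiplies every input by a
    synaptic weight and accumulates the products -- the multiplications and
    additions counted as "key NN operations" in the abstract."""
    return W @ x

# Hypothetical layer dimensions, chosen only for illustration.
n_in, n_out = 4096, 4096
x = np.random.rand(n_in).astype(np.float32)
W = np.random.rand(n_out, n_in).astype(np.float32)
y = classifier_layer(x, W)

ops = 2 * n_in * n_out                             # one multiply + one add per synapse
print(f"{ops / 1e9:.3f} GOP per layer")            # ~0.034 GOP
print(f"{ops / 452e9 * 1e6:.1f} us at 452 GOP/s")  # ~74 us for this layer
```

At the reported 452 GOP/s, this hypothetical 4096×4096 layer completes in roughly 74 microseconds; the operation count, not the arithmetic itself, is why memory traffic for inputs and synaptic weights dominates the design at large scale.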