Abstract
In recent years, Deep Neural Networks (DNNs) have been deployed in a diverse set of applications, from voice recognition to scene generation, largely due to their high accuracy. DNNs are computationally intensive and require a significant power budget. While there has been a large body of work on the energy efficiency of DNNs, most of it focuses on inference; training has received comparatively little attention.
This work proposes an adaptive technique to identify and avoid redundant computations during the training of DNNs. Elements of activation tensors exhibit a high degree of similarity, so the layers of a neural network repeatedly perform computations on near-identical input and output operands. Based on this observation, we propose Adaptive Computation Reuse for Tensor Cores (ACRTC), in which the results of previous arithmetic operations are reused to avoid redundant computations. ACRTC is an architectural technique that enables accelerators to exploit similarity in input operands, speeding up training while also improving energy efficiency. ACRTC dynamically adjusts the strength of computation reuse based on the tolerance for precision relaxation in different training phases. Over a wide range of neural network topologies, ACRTC accelerates training by 33% and reduces energy consumption by 32% with negligible impact on accuracy.
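To make the reuse idea concrete, below is a minimal NumPy sketch, not the paper's Tensor Core hardware design: activations are quantized, and the per-element products with the weight matrix are cached and reused whenever the same quantized value reappears at the same position in a later input. The `bits` parameter is a hypothetical knob standing in for ACRTC's adaptive reuse strength: fewer bits mean coarser matching, more reuse, and more precision relaxation.

```python
import numpy as np


def layer_with_reuse(weights, batch, bits=8):
    """Compute W @ x for each input x in the batch, reusing the product
    W[:, j] * x[j] whenever the same (position, quantized level) pair
    reappears in a later, similar input."""
    batch = np.asarray(batch, dtype=np.float64)
    # One shared quantization scale for the batch; coarser (fewer bits)
    # quantization makes repeated levels, and thus reuse, more likely.
    scale = (np.abs(batch).max() / (2 ** (bits - 1) - 1)) or 1.0
    levels = np.round(batch / scale).astype(np.int64)

    cache = {}                      # (column j, level) -> W[:, j] * value
    outputs, hits, total = [], 0, 0
    for q in levels:
        acc = np.zeros(weights.shape[0])
        for j, level in enumerate(q):
            key = (j, int(level))
            if key not in cache:    # compute once, reuse for later inputs
                cache[key] = weights[:, j] * (level * scale)
            else:
                hits += 1
            acc += cache[key]
            total += 1
        outputs.append(acc)
    return np.stack(outputs), hits / total


# Example: consecutive, similar inputs yield a high reuse (cache-hit) ratio.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
base = rng.standard_normal(128)
batch = [base + 0.01 * rng.standard_normal(128) for _ in range(16)]
outputs, reuse_ratio = layer_with_reuse(W, batch, bits=4)
print(f"reuse ratio: {reuse_ratio:.2f}")
```

In this sketch the reuse strength is simply the quantization width; ACRTC's contribution, as described above, is adapting that strength dynamically to how much precision relaxation each training phase can tolerate.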