
Adaptive Computation Reuse for Energy-Efficient Training of Deep Neural Networks

Published: 18 October 2021

Abstract

In recent years, Deep Neural Networks (DNNs) have been deployed in a diverse set of applications, from voice recognition to scene generation, largely due to their high accuracy. DNNs are computationally intensive and require a significant power budget. The energy efficiency of DNNs has been investigated extensively, but most of this work focuses on inference, while the training of DNNs has received comparatively little attention.

This work proposes an adaptive technique to identify and avoid redundant computations during the training of DNNs. Elements of activation tensors exhibit a high degree of similarity, which causes the layers of a neural network to repeatedly perform computations on near-identical operands. Based on this observation, we propose Adaptive Computation Reuse for Tensor Cores (ACRTC), in which the results of previous arithmetic operations are reused to avoid redundant computations. ACRTC is an architectural technique that enables accelerators to exploit similarity in input operands, speeding up training while also increasing energy efficiency. ACRTC dynamically adjusts the strength of computation reuse based on how much precision relaxation each training phase can tolerate. Over a wide range of neural network topologies, ACRTC accelerates training by 33% and saves 32% of energy with negligible impact on accuracy.
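ACRTC itself is a hardware mechanism inside tensor-core-style accelerators; the abstract gives no software interface. The sketch below is only a minimal software analogy of the underlying idea: memoizing arithmetic results keyed by reduced-precision operands, with an adjustable precision knob standing in for the adaptive reuse strength. The class name, the quantization scheme, and the per-phase schedule are illustrative assumptions, not the paper's design.

```python
# Illustrative sketch only: a software analogue of operand-similarity
# computation reuse. Names and quantization scheme are assumptions,
# not the ACRTC hardware design.

class ReuseMultiplier:
    """Memoizes multiply results keyed by quantized operands.

    Fewer fractional bits => coarser keys => more reuse (and more
    precision loss), mimicking an adjustable reuse strength.
    """

    def __init__(self, frac_bits: int):
        self.frac_bits = frac_bits   # reuse-strength knob
        self.table = {}              # quantized operand pair -> cached product
        self.hits = 0
        self.total = 0

    def _key(self, x: float) -> int:
        # Quantize to `frac_bits` fractional bits; nearby values collide.
        return round(x * (1 << self.frac_bits))

    def mul(self, a: float, b: float) -> float:
        self.total += 1
        key = (self._key(a), self._key(b))
        if key in self.table:        # similar operands seen before: reuse
            self.hits += 1
            return self.table[key]
        result = a * b               # miss: compute and remember
        self.table[key] = result
        return result


if __name__ == "__main__":
    import random

    # Early training phases tolerate coarse precision (strong reuse);
    # the knob could be tightened in later phases, as the abstract's
    # "adaptive" adjustment suggests. The schedule here is hypothetical.
    for frac_bits in (2, 6):
        random.seed(0)
        m = ReuseMultiplier(frac_bits)
        acts = [random.gauss(0.0, 1.0) for _ in range(10_000)]
        wts = [random.gauss(0.0, 1.0) for _ in range(10_000)]
        for a, w in zip(acts, wts):
            m.mul(a, w)
        print(f"frac_bits={frac_bits}: reuse rate = {m.hits / m.total:.2%}")
```

In hardware, the lookup would be a small associative table next to each multiplier rather than a dictionary; the sketch only illustrates why similarity among activation elements translates into arithmetic work that can be skipped.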

Published in

ACM Transactions on Embedded Computing Systems, Volume 20, Issue 6
November 2021, 256 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3485150
Editor: Tulika Mitra

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 June 2021
• Revised: 1 August 2021
• Accepted: 1 August 2021
• Published: 18 October 2021
