Research Article

OnSRAM: Efficient Inter-Node On-Chip Scratchpad Management in Deep Learning Accelerators

Published: 18 October 2022

Abstract

Hardware acceleration of Artificial Intelligence (AI) workloads has gained widespread popularity with its potential to deliver unprecedented performance and efficiency. An important remaining challenge is programming AI accelerators to sustain high utilization without impacting end-user productivity. Prior software optimizations start with an input graph and focus on node-level optimizations, viz. dataflows and hierarchical tiling, and graph-level optimizations such as operation fusion. However, little effort has been devoted to inter-node on-chip scratchpad memory (SPM) management in Deep Learning (DL) accelerators, whose significance is bolstered by the recent trends toward complex network topologies and the emergence of eager execution in DL frameworks.

We characterize and show that there exists up to a 5.2× performance gap in DL inference to be bridged through SPM management, and propose OnSRAM, a novel SPM management framework integrated with the compiler runtime of a DL accelerator. We develop two variants: OnSRAM-Static, which works on static graphs to identify data structures that can be profitably held on-chip based on their size, liveness, and significance, and OnSRAM-Eager, which targets an eager execution model (no graph) and uses a history-based speculative scheme to hold or discard data structures. We integrate OnSRAM with TensorFlow and analyze it on multiple accelerator configurations. Across a suite of 12 image, object, and language networks, on a 3 TFLOP system with a 2 MB SPM and 32 GBps external memory bandwidth, OnSRAM-Static and OnSRAM-Eager achieve 1.02–4.8× and 1.02–3.1× reductions in inference latency (batch size of 1) over a baseline with no SPM management. In terms of energy savings, we observe average reductions of 1.51× (up to 4.1×) and 1.23× (up to 2.9×) for the static and eager execution scenarios, respectively.
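To make the two policies concrete, the sketch below illustrates the kind of decisions the abstract describes: a static planner that pins tensors on-chip based on size and liveness across a known graph, and an eager-mode cache that speculates from per-operation reuse history. All class names, scoring heuristics, and data structures here are illustrative assumptions for exposition, not the paper's actual algorithm or API.

```python
# Hypothetical sketch of the two SPM-management policies described in the
# abstract. The reuse-distance-per-byte score and the per-op history table
# are assumptions chosen for illustration.
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size: int       # bytes occupied in the scratchpad
    def_step: int   # graph node index that produces the tensor
    last_use: int   # last node index that consumes it (liveness end)


def static_plan(tensors, spm_capacity):
    """OnSRAM-Static-like greedy planner: with the full graph known,
    pin tensors whose liveness window and size make on-chip residency
    profitable, subject to SPM capacity at every execution step."""
    # Rank by reuse span per byte: long-lived, small tensors first
    # (an illustrative stand-in for the paper's significance metric).
    ranked = sorted(tensors,
                    key=lambda t: (t.last_use - t.def_step) / t.size,
                    reverse=True)
    pinned, usage = [], {}  # usage: step -> bytes already reserved
    for t in ranked:
        steps = range(t.def_step, t.last_use + 1)
        if all(usage.get(s, 0) + t.size <= spm_capacity for s in steps):
            pinned.append(t.name)
            for s in steps:
                usage[s] = usage.get(s, 0) + t.size
    return pinned


class EagerCache:
    """OnSRAM-Eager-like speculation: with no graph available, hold an
    op's output on-chip only if outputs of the same op were actually
    reused in earlier iterations; otherwise discard to DRAM."""

    def __init__(self):
        self.history = {}  # op name -> was its output reused last time?

    def should_hold(self, op_name):
        return self.history.get(op_name, False)

    def record(self, op_name, was_reused):
        self.history[op_name] = was_reused
```

For example, with a 150-byte SPM, a 100-byte tensor live across ten steps is pinned while a second 100-byte tensor overlapping it is spilled; in eager mode, an op whose output went unreused is discarded until history says otherwise.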

  94. [94] Venkataramani S., Ranjan A., Banerjee S., Das D., Avancha S., Jagannathan A., Durg A., Nagaraj D., Kaul B., Dubey P., and Raghunathan A.. 2017. SCALEDEEP: A scalable compute architecture for learning and evaluating deep networks. In Proceedings of the 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture.1326. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  95. [95] Venkataramani Swagath, Ranjan Ashish, Roy Kaushik, and Raghunathan Anand. 2014. AxNN: Energy-efficient neuromorphic systems using approximate computing. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design.ACM, New York, NY, 2732. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  96. [96] Venkataramani Swagath, Srinivasan Vijayalakshmi, Choi Jungwook, Heidelberger Philip, Chang Leland, and Gopalakrishnan Kailash. 2019. Memory and interconnect optimizations for peta-scale deep learning systems. In Proceedings of the 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics. 225234. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  97. [97] Swagath Venkataramani, Vijayalakshmi Srinivasan, Wei Wang, Sanchari Sen, Jintao Zhang, Ankur Agrawal, Monodeep Kar, Shubham Jain, Alberto Mannari, Hoang Tran, Yulong Li, Eri Ogawa, Kazuaki Ishizaki, Hiroshi Inoue, Marcel Schaal, Mauricio Serrano, Jungwook Choi, Xiao Sun, Naigang Wang, Chia-Yu Chen, Allison Allain, James Bonano, Nianzheng Cao, Robert Casatuta, Matthew Cohen, Bruce Fleischer, Michael Guillorn, Howard Haynie, Jinwook Jung, Mingu Kang, Kyu-hyoun Kim, Siyu Koswatta, Saekyu Lee, Martin Lutz, Silvia Mueller, Jinwook Oh, Ashish Ranjan, Zhibin Ren, Scot Rider, Kerstin Schelm, Michael Scheuermann, Joel Silberman, Jie Yang, Vidhi Zalani, Xin Zhang, Ching Zhou, Matt Ziegler, Vinay Shah, Moriyoshi Ohara, Pong-Fei Lu, Brian Curran, Sunil Shukla, Leland Chang, and Kailash Gopalakrishnan. 2021. RaPiD: AI accelerator for ultra-low precision training and inference. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture. IEEE, 153166.Google ScholarGoogle Scholar
  98. [98] Wang Kai and Franklin Manoj. 1997. Highly accurate data value prediction using hybrid predictors. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society, 281290.Google ScholarGoogle ScholarDigital LibraryDigital Library
  99. [99] Wang Kai and Franklin Manoj. 1997. Highly accurate data value prediction using hybrid predictors. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture.IEEE Computer Society, Washington, DC, 281290. Retrieved from http://dl.acm.org/citation.cfm?id=266800.266827.Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. [100] Wechsler O., Behar M., and Daga B.. 2019. Spring Hill (NNP-I 1000) Intel’s data center inference chip. In Proceedings of the 2019 IEEE Hot Chips 31 Symposium. 112. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  101. [101] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1492–1500.Google ScholarGoogle Scholar
  102. [102] Xiong Yan, Zhou Jian, Pal Subhankar, Blaauw David, Kim Hun-Seok, Mudge Trevor, Dreslinski Ronald, and Chakrabarti Chaitali. 2020. Accelerating deep neural network computation on a low power reconfigurable architecture. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems. IEEE, 15.Google ScholarGoogle ScholarCross RefCross Ref
  103. [103] Zhang Minjia, Hu Zehua, and Li Mingqin. 2021. DUET: A compiler-runtime subgraph scheduling approach for tensor programs on a coupled CPU-GPU architecture. In Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium. IEEE, 151161.Google ScholarGoogle ScholarCross RefCross Ref
  104. [104] Zoph Barret and Le Quoc V.. 2017. Neural architecture search with reinforcement learning. arXiv:1611.01578. Retrieved from https://arxiv.org/abs/1611.01578Google ScholarGoogle Scholar

Published in: ACM Transactions on Embedded Computing Systems, Volume 21, Issue 6 (November 2022), 498 pages. ISSN: 1539-9087. EISSN: 1558-3465. DOI: 10.1145/3561948. Editor: Tulika Mitra.

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History:

• Received: 15 July 2021
• Revised: 6 April 2022
• Accepted: 8 April 2022
• Online AM: 27 April 2022
• Published: 18 October 2022

Qualifiers: research-article, refereed