Abstract
Hardware acceleration of Artificial Intelligence (AI) workloads has gained widespread popularity owing to its potential to deliver unprecedented performance and efficiency. An important challenge remains in how AI accelerators are programmed to sustain high utilization without impacting end-user productivity. Prior software optimizations start with an input graph and focus on node-level optimizations, viz. dataflows and hierarchical tiling, and graph-level optimizations such as operation fusion. However, little effort has been devoted to inter-node on-chip scratchpad memory (SPM) management in Deep Learning (DL) accelerators, whose importance is heightened by recent trends toward complex network topologies and by the emergence of eager execution in DL frameworks.
We characterize this opportunity and show that there is up to a 5.2× performance gap in DL inference that can be bridged through SPM management, and we propose OnSRAM, a novel SPM management framework integrated with the compiler runtime of a DL accelerator. We develop two variants: OnSRAM-Static, which works on static graphs and identifies data structures that can be profitably held on-chip based on their size, liveness, and significance, and OnSRAM-Eager, which targets an eager execution model (no graph) and uses a history-based speculative scheme to hold or discard data structures. We integrate OnSRAM with TensorFlow and analyze it on multiple accelerator configurations. Across a suite of 12 image, object, and language networks, on a 3 TFLOPS system with a 2 MB SPM and 32 GB/s external memory bandwidth, OnSRAM-Static and OnSRAM-Eager achieve 1.02–4.8× and 1.02–3.1× reductions in inference latency (batch size of 1), respectively, over a baseline with no SPM management. In terms of energy, we observe average savings of 1.51× (up to 4.1×) and 1.23× (up to 2.9×) for the static and eager execution scenarios, respectively.
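The abstract only names the signals these policies rely on (size, liveness, significance, and reuse history). The minimal Python sketch below illustrates what such policies could look like; every name, field, and scoring heuristic in it is a hypothetical stand-in rather than the paper's implementation: a greedy, liveness-aware pinning pass in the spirit of OnSRAM-Static, and a per-operation reuse-history test in the spirit of OnSRAM-Eager.

```python
"""Illustrative sketch (not the paper's code) of inter-node scratchpad
management.  Tensor fields, the scoring heuristic, and ReuseHistory are
hypothetical stand-ins for the size/liveness/significance analysis and the
history-based speculation described in the abstract."""
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str
    size_bytes: int       # footprint if pinned in the scratchpad
    def_step: int         # topological step of the producing node
    last_use_step: int    # last step that consumes it
    uses: int             # number of consumers (a proxy for "significance")

    @property
    def dram_traffic_saved(self) -> int:
        # One spill after the producer plus one fill per consumer is avoided
        # if the tensor stays on chip for its entire live range.
        return (1 + self.uses) * self.size_bytes

    @property
    def occupancy(self) -> int:
        # Scratchpad byte-steps consumed while the tensor is live.
        return self.size_bytes * (self.last_use_step - self.def_step + 1)


def plan_static(tensors: list[Tensor], spm_capacity: int, num_steps: int) -> set[str]:
    """Greedy, liveness-aware tensor selection for a static graph."""
    usage = [0] * num_steps          # bytes pinned at each execution step
    pinned: set[str] = set()
    # Prefer tensors that save the most DRAM traffic per byte-step of SPM used.
    for t in sorted(tensors, key=lambda x: x.dram_traffic_saved / x.occupancy,
                    reverse=True):
        live_steps = range(t.def_step, t.last_use_step + 1)
        if all(usage[s] + t.size_bytes <= spm_capacity for s in live_steps):
            for s in live_steps:
                usage[s] += t.size_bytes
            pinned.add(t.name)
    return pinned


class ReuseHistory:
    """History-based speculation for eager execution: pin an op's output
    only if past iterations showed it was reused while still on chip."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.score: dict[str, int] = {}

    def record(self, op: str, reused_on_chip: bool) -> None:
        self.score[op] = self.score.get(op, 0) + (1 if reused_on_chip else -1)

    def should_pin(self, op: str) -> bool:
        return self.score.get(op, 0) >= self.threshold
```

A compiler runtime would run something like `plan_static` once per compiled graph, while `ReuseHistory` would be consulted per operation as an eager trace unfolds; both omit the cost modeling and accelerator details the paper evaluates.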