Abstract
Dataflow accelerators offer simplicity, programmability, and energy efficiency, and are viewed as a promising architecture for accelerating the perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs have been proposed, discovering the most efficient way to execute an application's perfectly nested loop on the computational and memory resources of a given dataflow accelerator (the execution method) remains an essential and as yet unsolved challenge. In this paper, we propose dMazeRunner, a framework to efficiently and accurately explore the vast space of execution methods, i.e., the different ways to spatiotemporally execute a perfectly nested loop on a dataflow accelerator. The novelty of the dMazeRunner framework lies in: (i) a holistic representation of loop nests that succinctly captures the various execution methods, (ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and (iii) drastic pruning of the vast search space by discarding invalid solutions and solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay-Product (EDP) and 5.83× better in execution time than those of prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP relative to the optimal solution.
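To make the validity pruning idea concrete, the sketch below enumerates candidate per-level tiling factors for a single loop dimension and keeps only combinations whose product equals the loop's trip count. This is a minimal illustration of the abstract's point (iii), not the paper's actual algorithm; the trip count of 12 and the three buffering levels are hypothetical values chosen for the example.

```python
from itertools import product

def valid_tilings(trip_count, levels):
    """Yield per-level tiling factor tuples whose product equals trip_count.

    A candidate that tiles a loop of `trip_count` iterations across `levels`
    (e.g., spatial PEs, register file, scratchpad) is only a valid execution
    method if the factors multiply back to the full trip count; everything
    else can be discarded without ever evaluating its cost.
    """
    divisors = [d for d in range(1, trip_count + 1) if trip_count % d == 0]
    for combo in product(divisors, repeat=levels):
        prod = 1
        for factor in combo:
            prod *= factor
        if prod == trip_count:
            yield combo

# Hypothetical example: a loop of 12 iterations tiled across 3 levels.
valid = list(valid_tilings(12, 3))
divisor_count = len([d for d in range(1, 13) if 12 % d == 0])
total = divisor_count ** 3  # all per-level divisor combinations, valid or not

print(len(valid), "valid of", total, "candidates")  # 18 valid of 216 candidates
```

Even for this single small dimension, validity pruning discards over 90% of the candidates before any cost model is invoked; across all loop dimensions of a convolution layer, the reductions multiply.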
dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators