Abstract
Next-generation high-performance computing platforms will handle extreme data- and compute-intensive problems that are intractable with today’s technology. A promising path to the next leap in high-performance computing is to embrace heterogeneity and specialised computing in the form of reconfigurable accelerators such as FPGAs, which have been shown to speed up compute-intensive tasks while reducing power consumption. However, assessing the feasibility of large-scale heterogeneous systems requires fast and accurate performance prediction. This article proposes Performance Estimation for Reconfigurable Kernels and Systems (PERKS), a novel performance estimation framework for reconfigurable dataflow platforms. PERKS uses an analytical model with machine and application parameters to predict the performance of multi-accelerator systems and to detect their bottlenecks. Model calibration is automatic, which makes the model flexible and usable for different machine configurations and applications, including hypothetical ones. Our experimental results show that PERKS can predict the performance of current workloads on reconfigurable dataflow platforms with an accuracy above 91%. The results also illustrate how the modelling scales to large workloads, and how the performance impact of architectural features can be estimated in seconds.
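To make the idea of an analytical model with machine and application parameters concrete, the following is a minimal sketch of a generic bottleneck-style estimate. It is not the PERKS model: the parameter names (`compute_rate`, `memory_bandwidth`, `link_bandwidth`), the even split of work across accelerators, and the assumption that compute, memory traffic, and interconnect transfers overlap perfectly are all illustrative assumptions, not details taken from the article.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    # Hypothetical machine parameters (illustrative, not from the article).
    num_accelerators: int
    compute_rate: float      # operations per second, per accelerator
    memory_bandwidth: float  # bytes per second to on-board memory, per accelerator
    link_bandwidth: float    # bytes per second over the interconnect, per accelerator


@dataclass
class Application:
    # Hypothetical application parameters (illustrative, not from the article).
    operations: float      # total operations to execute
    memory_bytes: float    # total bytes moved to/from on-board memory
    transfer_bytes: float  # total bytes moved over the interconnect


def estimate(machine: Machine, app: Application) -> tuple[float, str]:
    """Return (estimated runtime in seconds, name of the limiting resource).

    Assumes work is split evenly across accelerators and that the three
    resources are used concurrently, so the slowest one sets the runtime.
    """
    n = machine.num_accelerators
    times = {
        "compute": app.operations / (n * machine.compute_rate),
        "memory": app.memory_bytes / (n * machine.memory_bandwidth),
        "interconnect": app.transfer_bytes / (n * machine.link_bandwidth),
    }
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck


if __name__ == "__main__":
    machine = Machine(num_accelerators=4, compute_rate=1e9,
                      memory_bandwidth=2e9, link_bandwidth=2.5e8)
    app = Application(operations=8e9, memory_bytes=4e9, transfer_bytes=4e9)
    runtime, limit = estimate(machine, app)
    print(f"estimated runtime: {runtime:.2f} s, limited by {limit}")
```

Because the estimate is a closed-form expression, a hypothetical configuration can be explored by changing a parameter and re-evaluating, which is the kind of what-if question such models are suited to.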
Index Terms
Analytical Performance Estimation for Large-Scale Reconfigurable Dataflow Platforms