
Analytical Performance Estimation for Large-Scale Reconfigurable Dataflow Platforms

Published: 12 August 2021

Abstract

Next-generation high-performance computing platforms will handle extreme data- and compute-intensive problems that are intractable with today’s technology. A promising path to the next leap in high-performance computing is to embrace heterogeneity and specialised computing in the form of reconfigurable accelerators such as FPGAs, which have been shown to speed up compute-intensive tasks with reduced power consumption. However, assessing the feasibility of large-scale heterogeneous systems requires fast and accurate performance prediction. This article proposes Performance Estimation for Reconfigurable Kernels and Systems (PERKS), a novel performance estimation framework for reconfigurable dataflow platforms. PERKS uses an analytical model with machine and application parameters to predict the performance of multi-accelerator systems and detect their bottlenecks. Model calibration is automatic, making the model flexible and usable across different machine configurations and applications, including hypothetical ones. Our experimental results show that PERKS predicts the performance of current workloads on reconfigurable dataflow platforms with an accuracy above 91%. The results also illustrate how the modelling scales to large workloads, and how the performance impact of architectural features can be estimated in seconds.
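The general idea of an analytical, parameter-driven performance estimate can be illustrated with a minimal bottleneck-style sketch. This is a generic illustration only, not the PERKS model itself; all function names, parameters, and values below are hypothetical:

```python
# Minimal sketch of a bottleneck-style analytical performance estimate for
# a pipelined dataflow accelerator. Generic illustration, NOT the PERKS
# model; every parameter name and value here is hypothetical.

def estimate_runtime_s(data_bytes, ops, mem_bw_gbs, clock_ghz, pipes):
    """Return (runtime_seconds, bottleneck), taking the slower of the
    memory-bound and compute-bound times for a fully pipelined kernel."""
    t_mem = data_bytes / (mem_bw_gbs * 1e9)        # time to stream the data
    t_comp = ops / (pipes * clock_ghz * 1e9)       # one op per pipe per cycle
    bottleneck = "memory" if t_mem >= t_comp else "compute"
    return max(t_mem, t_comp), bottleneck

# Example: 8 GB of data, 1e10 operations, 38 GB/s memory, 200 MHz, 64 pipes
runtime, limit = estimate_runtime_s(8e9, 1e10, 38, 0.2, 64)
```

In this toy setting the compute time dominates, so adding pipes (or raising the clock) would pay off until the memory-transfer time becomes the limit; detecting which side dominates is the essence of analytical bottleneck detection.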


• Published in

  ACM Transactions on Reconfigurable Technology and Systems, Volume 14, Issue 3 (September 2021), 137 pages
  ISSN: 1936-7406
  EISSN: 1936-7414
  DOI: 10.1145/3472296
  Editor: Deming Chen

  Copyright © 2021 Association for Computing Machinery.

• Publisher

  Association for Computing Machinery, New York, NY, United States

• Publication History

  • Received: 1 June 2020
  • Revised: 1 December 2020
  • Accepted: 1 February 2021
  • Published: 12 August 2021
