research-article

Platform-aware bottleneck detection for reconfigurable computing applications

Published: 22 August 2011

Abstract

Reconfigurable Computing (RC) has the potential to provide substantial performance benefits while consuming less power than traditional microprocessors or GPUs. Although experimental performance analysis of RC applications has previously been shown to be crucial for achieving this potential, existing methods still require application designers to manually locate bottlenecks and determine appropriate optimizations, which typically demands significant designer expertise and effort. Worse, the diversity of platforms employed by RC applications further complicates the process of detecting bottlenecks and formulating optimizations. To address these shortcomings, we first discuss our platform-template system, which enables a performance analysis tool to perform more accurate bottleneck detection and achieve a higher degree of portability across diverse FPGA systems. We then provide details of our implementation of these concepts and techniques in the Reconfigurable Computing Application Performance (ReCAP) tool. Next, we present a taxonomy of common RC bottlenecks, with associated detection and optimization strategies for each bottleneck, which we use to populate ReCAP's knowledge base for bottleneck detection. Finally, we demonstrate the utility of our approach via two application case studies across a total of three platforms.

Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 4, Issue 3
August 2011
204 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/2000832

Copyright © 2011 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 22 August 2011
• Accepted: 1 January 2011
• Received: 1 August 2010

Qualifiers

• research-article
• Research
• Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader
        About Cookies On This Site

        We use cookies to ensure that we give you the best experience on our website.

        Learn more

        Got it!