Abstract
Reconfigurable Computing (RC) has the potential to provide substantial performance benefits while consuming less power than traditional microprocessors or GPUs. While experimental performance analysis of RC applications has previously been shown to be crucial for achieving this potential, existing methods still require application designers to manually locate bottlenecks and determine appropriate optimizations, which typically demands significant designer expertise and effort. Worse, the diversity of platforms employed by RC applications further complicates the process of detecting bottlenecks and formulating optimizations. To address these shortcomings, we first discuss our platform-template system, which enables a performance analysis tool to perform more accurate bottleneck detection and achieve a higher degree of portability across diverse FPGA systems. We then provide details of our implementation of these concepts and techniques in the Reconfigurable Computing Application Performance (ReCAP) tool. Next, we present a taxonomy of common RC bottlenecks, with associated detection and optimization strategies for each, which we use to populate ReCAP's knowledge base for bottleneck detection. Finally, we demonstrate the utility of our approach via two application case studies across a total of three platforms.
Index Terms
Platform-aware bottleneck detection for reconfigurable computing applications