Abstract
The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
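The source-to-source step described above, turning SIMT code into task-level parallel C, can be illustrated with a small hand-written sketch. This is not FCUDA's actual output. The kernel `vec_add`, the functions `vec_add_block` and `vec_add_grid`, and the sizes `BLOCK_DIM` and `GRID_DIM` are all hypothetical names chosen for illustration. The sketch only shows the general idea of making the implicit CUDA threads explicit as a loop, so that each thread block becomes an independent C task that a high-level synthesis tool could, in principle, instantiate as a separate core.

```c
#define BLOCK_DIM 4                    /* hypothetical threads per block */
#define GRID_DIM  2                    /* hypothetical number of thread blocks */
#define N (BLOCK_DIM * GRID_DIM)       /* total number of elements */

/* Original CUDA kernel, shown for reference only (assumed example):
 *
 *   __global__ void vec_add(const float *a, const float *b, float *c) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       c[i] = a[i] + b[i];
 *   }
 */

/* One thread block becomes one C task: the implicit SIMT threads are
 * serialized into an explicit loop over thread_idx. */
void vec_add_block(int block_idx, const float *a, const float *b, float *c)
{
    for (int thread_idx = 0; thread_idx < BLOCK_DIM; ++thread_idx) {
        int i = block_idx * BLOCK_DIM + thread_idx;
        c[i] = a[i] + b[i];
    }
}

/* The grid becomes a loop over block tasks. The iterations are
 * independent, so they expose coarse-grained parallelism, while the
 * inner thread loop exposes fine-grained parallelism. */
void vec_add_grid(const float *a, const float *b, float *c)
{
    for (int block_idx = 0; block_idx < GRID_DIM; ++block_idx)
        vec_add_block(block_idx, a, b, c);
}
```

Calling `vec_add_grid(a, b, c)` on three arrays of `N` floats computes the same element-wise sum as the CUDA kernel would. On a CPU both loops run sequentially; the point of the restructuring is that the loop bounds and task boundaries are now visible in plain C.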