
Efficient compilation of CUDA kernels for high-performance computing on FPGAs

Published: 30 September 2013

Abstract

The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
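
The source-to-source step described in the abstract, turning SIMT CUDA code into task-level parallel C, can be illustrated with a minimal sketch. This is not FCUDA's actual output: the kernel, the names `vec_add_block` and `vec_add_grid`, and the block size `BLOCK_DIM` are hypothetical, and no AutoPilot-specific annotations are shown. It only illustrates the general idea of making the implicit CUDA threads explicit as a loop, so that each thread block becomes an ordinary C function that a high-level synthesis tool could plausibly treat as one task.

```c
/* Hypothetical CUDA kernel being modelled (illustrative, not from the paper):
 *
 *   __global__ void vec_add(const float *a, const float *b, float *c, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) c[i] = a[i] + b[i];
 *   }
 */

enum { BLOCK_DIM = 4 };  /* assumed number of threads per block */

/* One thread block expressed as a plain C task: the implicit SIMT
 * threads become an explicit loop over thread_idx. */
void vec_add_block(int block_idx, const float *a, const float *b,
                   float *c, int n)
{
    for (int thread_idx = 0; thread_idx < BLOCK_DIM; ++thread_idx) {
        int i = block_idx * BLOCK_DIM + thread_idx;
        if (i < n)  /* same bounds guard as the kernel */
            c[i] = a[i] + b[i];
    }
}

/* The grid becomes a loop over block tasks. The blocks are independent,
 * so in principle they could be mapped to parallel hardware cores. */
void vec_add_grid(const float *a, const float *b, float *c, int n)
{
    int grid_dim = (n + BLOCK_DIM - 1) / BLOCK_DIM;  /* round up */
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx)
        vec_add_block(block_idx, a, b, c, n);
}
```

Calling `vec_add_grid(a, b, c, n)` on three arrays of length `n` gives the same result as launching the commented kernel with `BLOCK_DIM` threads per block. The inner loop carries the fine-grained (thread-level) parallelism and the outer loop the coarse-grained (block-level) parallelism that the abstract refers to.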

