Exploiting Task- and Data-Level Parallelism in Streaming Applications Implemented in FPGAs

Published: 01 December 2013
Abstract

This article describes the design and implementation of a novel compilation flow that implements circuits in FPGAs from a streaming programming language. The streaming language supported is called FPGA Brook and is based on the existing Brook language. It allows system designers to express applications in a way that exposes parallelism, which can be exploited through hardware implementation. FPGA Brook supports replication, allowing parts of an application to be implemented as multiple hardware units operating in parallel. Hardware units are interconnected through FIFO buffers which use the small memory modules available in FPGAs. The FPGA Brook automated design flow uses a source-to-source compiler, developed as a part of this work, and combines it with a commercial behavioral synthesis tool to generate the hardware implementation. A suite of benchmark applications was developed in FPGA Brook and implemented using our design flow. Experimental results indicate that performance of many applications scales well with replication. Our benchmark applications also achieve significantly better results than corresponding implementations using a commercial behavioral synthesis tool. We conclude that using an automated design flow for implementation of streaming applications in FPGAs is a promising methodology.
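The replication idea in the abstract can be illustrated with a small software model. This is a minimal sketch in Python, not FPGA Brook syntax and not the authors' compiler output. The names `scale_kernel` and `run_replicated` are hypothetical, and the round-robin distribution and collection scheme is an assumption made for illustration. It models one kernel replicated several times, with each replica reading from and writing to its own FIFO.

```python
from collections import deque


def scale_kernel(x):
    """Hypothetical per-element kernel; stands in for any streaming kernel."""
    return 2 * x + 1


def run_replicated(stream, kernel, replicas):
    """Software model of one kernel replicated `replicas` times.

    Assumed scheme (illustrative only): a distributor deals elements
    round-robin into per-replica input FIFOs, each replica applies the
    kernel to its own FIFO, and a collector drains the output FIFOs
    round-robin so the original element order is preserved.
    """
    in_fifos = [deque() for _ in range(replicas)]
    out_fifos = [deque() for _ in range(replicas)]

    # Distributor: element i goes to replica i mod replicas.
    for i, element in enumerate(stream):
        in_fifos[i % replicas].append(element)

    # Replicas: in hardware these would run in parallel; this model
    # simply runs them one after another.
    for fifo_in, fifo_out in zip(in_fifos, out_fifos):
        while fifo_in:
            fifo_out.append(kernel(fifo_in.popleft()))

    # Collector: read the output FIFOs in the same round-robin order.
    total = sum(len(fifo) for fifo in out_fifos)
    return [out_fifos[i % replicas].popleft() for i in range(total)]


if __name__ == "__main__":
    data = list(range(10))
    print(run_replicated(data, scale_kernel, replicas=4))
```

Because the kernel here processes each element independently, the replicated result matches the unreplicated one for any replica count. The model says nothing about hardware cost or achievable speedup.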

