Abstract
This article describes the design and implementation of a novel compilation flow that implements circuits in FPGAs from a streaming programming language. The supported language, FPGA Brook, is based on the existing Brook language. It allows system designers to express applications in a way that exposes parallelism, which can then be exploited through hardware implementation. FPGA Brook supports replication, allowing parts of an application to be implemented as multiple hardware units operating in parallel. Hardware units are interconnected through FIFO buffers, which use the small memory modules available in FPGAs. The FPGA Brook automated design flow uses a source-to-source compiler, developed as part of this work, and combines it with a commercial behavioral synthesis tool to generate the hardware implementation. A suite of benchmark applications was developed in FPGA Brook and implemented using our design flow. Experimental results indicate that the performance of many applications scales well with replication. Our benchmark applications also achieve significantly better results than corresponding implementations that use a commercial behavioral synthesis tool alone. We conclude that using an automated design flow to implement streaming applications in FPGAs is a promising methodology.
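The replication idea can be illustrated with a small software model. The sketch below is not FPGA Brook code and does not reproduce the compiler's output. It is a minimal Python simulation, assuming a round-robin distributor and collector, in which a hypothetical per-element `kernel` is replicated and the replicas exchange data only through FIFO queues. The names `kernel` and `run_replicated` are invented for this illustration.

```python
from collections import deque


def kernel(x):
    """Hypothetical per-element kernel (an assumption): squares its input."""
    return x * x


def run_replicated(stream, replicas):
    """Model one kernel replicated `replicas` times and linked by FIFOs.

    Assumes replicas >= 1 and a kernel that yields one output per input.
    """
    # One input FIFO and one output FIFO per replica.
    in_fifos = [deque() for _ in range(replicas)]
    out_fifos = [deque() for _ in range(replicas)]

    # Distributor: deal stream elements to the replicas in round-robin order.
    for i, elem in enumerate(stream):
        in_fifos[i % replicas].append(elem)

    # Each replica drains only its own input FIFO. In hardware the replicas
    # would run concurrently; this model runs them one after another.
    for fin, fout in zip(in_fifos, out_fifos):
        while fin:
            fout.append(kernel(fin.popleft()))

    # Collector: read the output FIFOs in the same round-robin order,
    # which restores the original element order.
    result = []
    i = 0
    while out_fifos[i % replicas]:
        result.append(out_fifos[i % replicas].popleft())
        i += 1
    return result


if __name__ == "__main__":
    print(run_replicated(range(8), replicas=3))
```

Because every element is processed independently, the output is the same for any replica count. That independence is the property that lets a data-parallel kernel be split across parallel hardware units.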
Exploiting Task- and Data-Level Parallelism in Streaming Applications Implemented in FPGAs