Abstract
Specialized FPGA implementations can deliver higher performance and greater power efficiency than embedded CPU or GPU implementations for real-time image processing. Programming challenges limit their wider use, because the implementation of FPGA architectures at the register transfer level is time consuming and error prone. Existing software languages supported by high-level synthesis (HLS), although providing a productivity improvement, are too general purpose to generate efficient hardware without the use of hardware-specific code optimizations. Such optimizations leak hardware details into the abstractions that software languages are there to provide, and they require knowledge of FPGAs to generate efficient hardware, such as by using language pragmas to partition data structures across memory blocks.
This article presents a thorough account of the Rathlin image processing language (RIPL), a high-level image processing domain-specific language for FPGAs. We motivate its design, based on higher-order algorithmic skeletons, with requirements from the image processing domain. RIPL’s skeletons suffice to elegantly describe image processing stencils, as well as recursive algorithms with nonlocal random access patterns. At its core, RIPL employs a dataflow intermediate representation. We give a formal account of the compilation scheme from RIPL skeletons to static and cyclostatic dataflow models to describe their data rates and static scheduling on FPGAs.
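To make the idea of higher-order skeletons concrete, the following is a minimal software sketch in Python. It is not RIPL syntax and does not reflect how RIPL compiles to dataflow; the names `map_pixels`, `stencil3x3`, and `box_blur`, the list-of-lists image representation, and the clamped border handling are all assumptions chosen for illustration. It shows only the general pattern: the skeleton fixes the access pattern, and the user supplies the per-pixel or per-window function.

```python
from typing import Callable, List

# Hypothetical representation: a row-major grayscale image as nested lists.
Image = List[List[int]]


def map_pixels(f: Callable[[int], int], img: Image) -> Image:
    """Pointwise skeleton: apply f to every pixel independently."""
    return [[f(p) for p in row] for row in img]


def stencil3x3(f: Callable[[List[int]], int], img: Image) -> Image:
    """Stencil skeleton: f receives the 3x3 neighborhood of each pixel.

    Borders are handled by clamping coordinates to the image edge,
    which is an assumption of this sketch.
    """
    h, w = len(img), len(img[0])

    def at(y: int, x: int) -> int:
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [
        [f([at(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)])
         for x in range(w)]
        for y in range(h)
    ]


def box_blur(window: List[int]) -> int:
    """User-supplied window function: integer mean of the neighborhood."""
    return sum(window) // len(window)


img = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]

# Compose two skeletons: a 3x3 blur followed by a saturating brighten.
blurred = stencil3x3(box_blur, img)
brightened = map_pixels(lambda p: min(p + 10, 255), blurred)
```

Because each skeleton exposes a fixed, statically known access pattern, a compiler can in principle derive data rates for it, which is the property the dataflow compilation scheme described above relies on.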
RIPL compares favorably to the Vivado HLS OpenCV library and C++ compiled with Vivado HLS. RIPL achieves between 54 and 191 frames per second (FPS) at 100 MHz for four synthetic benchmarks, faster than HLS OpenCV in three cases. Two real-world algorithms are implemented in RIPL: visual saliency and mean shift segmentation. For the visual saliency algorithm, RIPL achieves 71 FPS compared to optimized C++ at 28 FPS. RIPL is also concise, being 5x shorter than C++ and 111x shorter than an equivalent direct dataflow implementation. For mean shift segmentation, RIPL achieves 7 FPS compared to 1.1 FPS for optimized C++ on 64 CPU cores, and RIPL is 10x shorter than the direct dataflow FPGA implementation.
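The reported figures can be put in perspective with some simple arithmetic. The sketch below uses only numbers stated in the abstract; the helper name `cycles_per_frame` is hypothetical. The cycle budget is computed only for the synthetic benchmarks, because the 100 MHz clock is stated for those and not for the two real-world algorithms.

```python
CLOCK_HZ = 100_000_000  # 100 MHz, as stated for the synthetic benchmarks


def cycles_per_frame(fps: float) -> float:
    """Clock cycles available per frame at the stated clock rate."""
    return CLOCK_HZ / fps


# Cycle budget per frame at the slowest and fastest reported frame rates.
slowest_budget = cycles_per_frame(54)
fastest_budget = cycles_per_frame(191)

# Throughput ratios of RIPL over the optimized C++ baselines.
saliency_speedup = 71 / 28    # visual saliency
meanshift_speedup = 7 / 1.1   # mean shift segmentation (C++ on 64 CPU cores)

print(f"cycles/frame at 54 FPS:  {slowest_budget:,.0f}")
print(f"cycles/frame at 191 FPS: {fastest_budget:,.0f}")
print(f"visual saliency speedup: {saliency_speedup:.2f}x")
print(f"mean shift speedup:      {meanshift_speedup:.2f}x")
```

In other words, the synthetic benchmarks spend roughly 0.5 to 1.9 million cycles per frame, and the two real-world algorithms run about 2.5x and 6.4x faster than their optimized C++ counterparts.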
RIPL: A Parallel Image Processing Language for FPGAs