research-article

A Tradeoff Analysis of FPGAs, GPUs, and Multicores for Sliding-Window Applications

Published: 06 March 2015

Abstract

The growing use of hardware accelerators such as Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) has significantly increased application design complexity. This complexity results from a larger design space created by the many possible combinations of accelerators, algorithms, and hardware/software partitions. Exploring this larger design space is critical because the performance and energy consumption of each accelerator vary widely across application domains and use cases. To address this problem, numerous studies have evaluated specific applications across different architectures. In this article, we analyze an important application domain, sliding-window applications, implemented on FPGAs, GPUs, and multicore CPUs. For each device, we present optimization strategies and identify the use cases where it is most effective. The results show that, for large input sizes, FPGAs can achieve speedups of up to 5.6× over GPUs and up to 58× over multicore CPUs, while also using up to an order of magnitude less energy. For small input sizes and for applications with frequency-domain algorithms, GPUs generally provide the best performance and energy efficiency.


Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 8, Issue 1
February 2015
127 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/2744082
Editor: Steve Wilton

Copyright © 2015 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 6 March 2015
• Accepted: 1 July 2014
• Revised: 1 May 2014
• Received: 1 December 2013


Qualifiers

• research-article
• Research
• Refereed
