Abstract
The increasing use of hardware accelerators such as Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) has significantly increased application design complexity. This complexity results from a larger design space created by the many possible combinations of accelerators, algorithms, and hardware/software partitions. Exploring this larger design space is critical because the performance and energy consumption of each accelerator vary widely across application domains and use cases. To address this problem, numerous studies have evaluated specific applications across different architectures. In this article, we analyze an important application domain, referred to as sliding-window applications, implemented on FPGAs, GPUs, and multicore CPUs. For each device, we present optimization strategies and analyze the use cases in which it is most effective. The results show that, for large input sizes, FPGAs can achieve speedups of up to 5.6× and 58× compared to GPUs and multicore CPUs, respectively, while also using up to an order of magnitude less energy. For small input sizes and applications with frequency-domain algorithms, GPUs generally provide the best performance and the lowest energy use.
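The abstract does not define a sliding-window application. As a rough illustration only, the sketch below assumes the common pattern in which a small kernel is compared against every fully overlapping window of a larger input, here using a sum of absolute differences (SAD). The function name `sliding_window_sad` and the sample data are hypothetical and are not taken from the article.

```python
def sliding_window_sad(image, kernel):
    """Compute the sum of absolute differences (SAD) between `kernel`
    and every fully overlapping window of `image`.

    Both arguments are lists of equal-length rows of numbers.
    This is an illustrative, unoptimized reference, not the article's code.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    # Slide the window over every position where it fits entirely.
    for y in range(img_h - k_h + 1):
        row = []
        for x in range(img_w - k_w + 1):
            total = 0
            # Each window position is independent of the others, which is
            # the parallelism that accelerators can exploit.
            for i in range(k_h):
                for j in range(k_w):
                    total += abs(image[y + i][x + j] - kernel[i][j])
            row.append(total)
        out.append(row)
    return out


# Hypothetical sample data: a 3x3 input and a 2x2 kernel.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 2],
          [4, 5]]
result = sliding_window_sad(image, kernel)
print(result)  # [[0, 4], [12, 16]]
```

The nested loops make the cost visible: the work grows with the product of the number of window positions and the kernel size, so the sizes of the input and the kernel determine how much computation is available to spread across a device.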
A Tradeoff Analysis of FPGAs, GPUs, and Multicores for Sliding-Window Applications