Abstract
Using field-programmable gate arrays (FPGAs) as a substrate for soft graphics processing units (GPUs) would make FPGA compute power available through a flexible, GPU-like tool flow. Application-specific adaptations, such as selective hardening of floating-point operations and instruction set subsetting, would mitigate the high area and power demands of soft GPUs. This work explores the capabilities and limitations of soft general-purpose computing on GPUs (GPGPU) for both fixed- and floating-point arithmetic. For this purpose, we have developed FGPU: a configurable, scalable, and portable GPU architecture designed specifically for FPGAs. FGPU is open source and implemented entirely in RTL. It can be programmed in OpenCL and controlled through a Python API. This article introduces its hardware architecture as well as its tool flow. We evaluated the proposed GPGPU approach against multiple other solutions. In comparison to homogeneous multi-processor systems-on-chip (MPSoCs), we found that a soft GPU is a Pareto-optimal solution with respect to throughput per area and energy consumption. On average, FGPU has 2.9× better compute density and 11.2× lower energy consumption than a single MicroBlaze processor when computing in the IEEE-754 floating-point format. An average speedup of about 4× over the ARM Cortex-A9 with the NEON vector coprocessor has been measured for fixed- or floating-point benchmarks. In addition, the biggest FGPU cores we could implement on a Xilinx Zynq-7000 system-on-chip (SoC) deliver performance similar to that of equivalent implementations with high-level synthesis (HLS).
Index Terms
General-Purpose Computing with Soft GPUs on FPGAs