Abstract
The design cycle for complex special-purpose computing systems is extremely costly and time-consuming. It involves a multiparametric design space exploration for optimization, followed by design verification. Designers of special-purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error-correcting systems, such as the Low-Density Parity-Check (LDPC) codes adopted by modern communication standards, which involves thousands of Monte Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). Exploiting diverse target architectures typically requires developing multiple code versions, often using distinct programming paradigms. In this context, we evaluate the concept of retargeting a single OpenCL program to multiple platforms, thereby significantly reducing design time. A single OpenCL-based parallel kernel is used without modifications or code tuning on multicore CPUs, GPUs, and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL, to introduce FPGAs as a potential platform for efficiently executing simulations coded in OpenCL. We use LDPC decoding simulations as a case study. Experimental results were obtained by testing a variety of regular and irregular LDPC codes that range from short/medium (e.g., 8,000-bit) to long (e.g., 64,800-bit) DVB-S2 codes. We observe that, depending on the design parameters to be simulated and on the dimension and phase of the design, the GPU or the FPGA may suit different purposes more conveniently, thus providing different acceleration factors over conventional multicore CPUs.
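The kind of Monte Carlo bitwidth exploration described above can be illustrated with a minimal sketch. It is not the paper's OpenCL kernel or its LDPC decoder: a 3-fold repetition code over a BPSK/AWGN channel stands in for LDPC decoding so the example stays short, and the names `quantize` and `estimate_ber`, the clipping value, and the SNR definition are all assumptions made for illustration.

```python
import math
import random


def quantize(value, bits, clip=8.0):
    """Saturating uniform quantizer with 2**(bits-1)-1 levels on each side of zero."""
    if bits < 2:
        raise ValueError("bits must be >= 2 (one sign bit plus magnitude)")
    levels = 2 ** (bits - 1) - 1
    step = clip / levels
    index = max(-levels, min(levels, round(value / step)))
    return index * step


def estimate_ber(bits, snr_db, n_bits=20000, repeat=3, seed=0):
    """Monte Carlo bit error rate estimate for one design point (bitwidth, SNR)."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    # Noise standard deviation for unit-energy BPSK symbols (assumed SNR definition).
    sigma = math.sqrt(1.0 / (2.0 * 10.0 ** (snr_db / 10.0)))
    errors = 0
    for _ in range(n_bits):
        bit = rng.getrandbits(1)
        symbol = 1.0 - 2.0 * bit  # BPSK mapping: 0 -> +1, 1 -> -1
        total = 0.0
        for _ in range(repeat):
            received = symbol + rng.gauss(0.0, sigma)
            llr = 2.0 * received / sigma ** 2  # channel log-likelihood ratio
            total += quantize(llr, bits)       # fixed-point effect under study
        decoded = 1 if total < 0.0 else 0
        errors += int(decoded != bit)
    return errors / n_bits


if __name__ == "__main__":
    # Sweep the bitwidth at a fixed SNR; each value is one design point.
    for bits in (2, 3, 4, 6, 8):
        print(bits, estimate_ber(bits, snr_db=1.0))
```

Each call to `estimate_ber` is one design point. A realistic LDPC exploration repeats such runs across many codes, SNR values, and data representations, which is the cost that motivates running the same simulation on CPUs, GPUs, and FPGAs.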
Index Terms
Enhancing Design Space Exploration by Extending CPU/GPU Specifications onto FPGAs