Abstract
FPGAs have the potential to accelerate many computations, including scientific applications. However, the large development cost and short life span of FPGA designs have limited their adoption by the scientific computing community. FPGA-based scientific computing, and many kinds of embedded computing, could become more practical if there were hardware libraries portable to any FPGA-based system, with performance that scaled with the size of the FPGA. To illustrate this idea, we have implemented one common supercomputing library function: LU factorization for solving systems of linear equations. This paper describes a method for making the design both portable and scalable that should serve as a guide if such libraries are to be built in the future. The design is a software-based generator that leverages both the flexibility of a software programming language and the parameterization inherent in a hardware description language. The generator accepts parameters that describe the FPGA capacity and the capabilities of the external memory. We compare the performance of our engine, running on the largest FPGA available at the time of this work (an Altera Stratix III 3S340), to a highly optimized software implementation from the processor vendor running on a single processor core fabricated in the same 65nm IC process. For single-precision matrices on the order of 10,000 × 10,000 elements, the FPGA implementation is 2.2 times faster, and the energy dissipated per useful GFLOP is a factor of 5 lower. For double precision, the FPGA implementation is 1.7 times faster and 3.5 times more energy efficient.
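The library function named above, LU factorization, decomposes a matrix A into a lower-triangular factor L and an upper-triangular factor U so that triangular substitution can then solve Ax = b cheaply. As a point of reference for readers, here is a minimal software sketch of the textbook algorithm (Doolittle LU with partial pivoting) in NumPy; the FPGA engine described in the paper implements a blocked, pipelined variant, so this sketch is illustrative only, not the hardware design:

```python
# Textbook LU factorization with partial pivoting (illustrative sketch only;
# the paper's FPGA engine uses a blocked, pipelined variant of this algorithm).
import numpy as np

def lu_factor(A):
    """Return P, L, U such that P @ A = L @ U."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        # Partial pivoting: bring the largest-magnitude entry in
        # column k (on or below the diagonal) up to row k.
        p = k + np.argmax(np.abs(U[k:, k]))
        if p != k:
            U[[k, p], :] = U[[p, k], :]
            P[[k, p], :] = P[[p, k], :]
            L[[k, p], :k] = L[[p, k], :k]
        # Eliminate the entries below the diagonal in column k,
        # recording the multipliers in L.
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return P, L, U
```

Once L and U are known, solving Ax = b reduces to one forward and one backward triangular substitution, which is why LU factorization is the workhorse of direct linear solvers such as the one accelerated here.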
Portable and scalable FPGA-based acceleration of a direct linear system solver