Abstract
For scientific numerical simulations that require a relatively high ratio of data access to computation, scalable memory bandwidth is the key to improving performance. Custom-computing machines (CCMs) are therefore a promising approach, because they can provide bandwidth-aware structures tailored to individual applications. In this article, we propose a scalable FPGA-array with a bandwidth-reduction mechanism (BRM) to implement high-performance and power-efficient CCMs for scientific simulations based on finite difference methods. With the FPGA-array, we construct a systolic computational-memory array (SCMA), which is given a minimum of programmability to provide flexibility and high productivity for various computing kernels and boundary computations. Since the systolic computational-memory architecture of the SCMA scales both memory bandwidth and arithmetic performance with the array size, we introduce a homogeneous partitioning approach so that the SCMA is extensible over a 1D or 2D array of FPGAs connected by a mesh network. To satisfy the bandwidth requirement of inter-FPGA communication, we propose a BRM based on time-division multiplexing. The BRM decreases the number of communication channels required between adjacent FPGAs at the cost of additional delay cycles. We formulate the trade-off between bandwidth and delay for inter-FPGA data transfer with the BRM. To demonstrate feasibility and evaluate performance quantitatively, we design and implement an SCMA of 192 processing elements over two ALTERA Stratix II FPGAs. The implemented SCMA, running at 106 MHz, has a peak performance of 40.7 GFlops in single precision. We demonstrate that the SCMA achieves sustained performance of 32.8 to 35.7 GFlops for three benchmark computations, with high utilization of the computing units. Because computation and communication are highly localized, the SCMA scales completely with an increasing number of FPGAs.
In addition, we demonstrate that the FPGA-based SCMA is power-efficient: it consumes 69% to 87% of the power and requires only 2.8% to 7.0% of the energy that a 3.4-GHz Pentium 4 processor needs for the same computations. Using software simulation, we show that the BRM works effectively for the benchmark computations, and therefore that commercially available low-end FPGAs with relatively narrow I/O bandwidth can be used to construct a scalable FPGA-array.
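The figures quoted in the abstract can be checked with a short calculation, and the bandwidth/delay trade-off of time-division multiplexing can be illustrated with a toy model. The Python sketch below is illustrative only and rests on stated assumptions: the value of 2 floating-point operations per processing element per cycle is inferred from the reported numbers (40.7 GFlops / (192 × 106 MHz) ≈ 2) and is not stated in the abstract, and `brm_tradeoff` is a hypothetical simplification rather than the formulation given in the article. All function and parameter names are invented for this example.

```python
import math


def peak_gflops(num_pes, clock_mhz, flops_per_pe_per_cycle=2):
    """Peak performance in GFlops.

    flops_per_pe_per_cycle=2 is an assumption inferred from the reported
    figures; the abstract does not state it.
    """
    return num_pes * clock_mhz * 1e6 * flops_per_pe_per_cycle / 1e9


def utilization(sustained_gflops, peak):
    """Fraction of peak performance that is actually sustained."""
    return sustained_gflops / peak


def brm_tradeoff(logical_channels, mux_factor):
    """Toy time-division multiplexing model (hypothetical, not the paper's).

    Each physical channel is shared by `mux_factor` logical channels in
    turn, so fewer physical channels are needed, but a value may wait up
    to `mux_factor - 1` extra cycles for its time slot.
    """
    physical_channels = math.ceil(logical_channels / mux_factor)
    extra_delay_cycles = mux_factor - 1
    return physical_channels, extra_delay_cycles


if __name__ == "__main__":
    peak = peak_gflops(num_pes=192, clock_mhz=106)
    print(f"peak: {peak:.1f} GFlops")
    for sustained in (32.8, 35.7):
        print(f"{sustained} GFlops -> {utilization(sustained, peak):.1%} of peak")
    for k in (1, 2, 4):
        channels, delay = brm_tradeoff(logical_channels=8, mux_factor=k)
        print(f"mux factor {k}: {channels} channels, {delay} extra cycles")
```

Under these assumptions the sustained figures correspond to roughly 81% to 88% of peak, which is consistent with the high utilization of computing units reported above.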
Index Terms
FPGA-Array with Bandwidth-Reduction Mechanism for Scalable and Power-Efficient Numerical Simulations Based on Finite Difference Methods