
FPGA-Array with Bandwidth-Reduction Mechanism for Scalable and Power-Efficient Numerical Simulations Based on Finite Difference Methods

Published: 01 November 2010

Abstract

For scientific numerical simulations that require a relatively high ratio of data access to computation, the scalability of memory bandwidth is the key to performance improvement, and custom-computing machines (CCMs) are therefore a promising approach to providing bandwidth-aware structures tailored to individual applications. In this article, we propose a scalable FPGA-array with a bandwidth-reduction mechanism (BRM) to implement high-performance and power-efficient CCMs for scientific simulations based on finite difference methods. With the FPGA-array, we construct a systolic computational-memory array (SCMA), which is given minimal programmability to provide flexibility and high productivity for various computing kernels and boundary computations. Since the systolic computational-memory architecture of the SCMA scales both memory bandwidth and arithmetic performance with the array size, we introduce a homogeneous partitioning approach so that the SCMA is extensible over a 1D or 2D array of FPGAs connected by a mesh network. To satisfy the bandwidth requirement of inter-FPGA communication, we propose BRM, which is based on time-division multiplexing. BRM decreases the required number of communication channels between adjacent FPGAs at the cost of delay cycles, and we formulate the resulting trade-off between the bandwidth and the delay of inter-FPGA data transfer. To demonstrate feasibility and evaluate performance quantitatively, we design and implement an SCMA of 192 processing elements over two Altera Stratix II FPGAs. The implemented SCMA, running at 106 MHz, has a peak performance of 40.7 GFlops in single precision, and achieves sustained performances of 32.8 to 35.7 GFlops for three benchmark computations with high utilization of the computing units. Owing to its highly localized computation and communication, the SCMA scales completely with an increasing number of FPGAs. We also demonstrate that the FPGA-based SCMA is power-efficient: it consumes only 69% to 87% of the power, and only 2.8% to 7.0% of the energy, of the same computations performed by a 3.4-GHz Pentium 4 processor. With software simulation, we show that BRM works effectively for the benchmark computations, so that commercially available low-end FPGAs with relatively narrow I/O bandwidth can be used to construct a scalable FPGA-array.
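The bandwidth-delay trade-off behind BRM can be sketched in a few lines. The following is a minimal illustrative model, not the paper's implementation: the `tdm_delay_cycles` formula (ceil of words over channels) is an assumption based on the time-division-multiplexing description above, and `jacobi_step` is a generic finite-difference kernel of the kind such an SCMA would run, not the paper's benchmark code.

```python
import math

def tdm_delay_cycles(boundary_words: int, channels: int) -> int:
    """Cycles assumed to move one sub-grid boundary across an
    inter-FPGA link when `channels` physical channels are shared
    among `boundary_words` logical words by time-division
    multiplexing: ceil(words / channels)."""
    return math.ceil(boundary_words / channels)

def jacobi_step(u):
    """One Jacobi relaxation step for the 2D Laplace equation, a
    typical finite-difference kernel (interior points only;
    boundary rows and columns are held fixed)."""
    n, m = len(u), len(u[0])
    v = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            v[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                              u[i][j - 1] + u[i][j + 1])
    return v

# Under this model, halving the channel count doubles the delay:
for c in (16, 8, 4, 2, 1):
    print(c, tdm_delay_cycles(16, c))
```

In this simplified model, shrinking the number of physical inter-FPGA channels trades directly for extra transfer cycles per stencil step, which is the trade-off the article formulates and evaluates.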



Published in

ACM Transactions on Reconfigurable Technology and Systems, Volume 3, Issue 4 (November 2010), 240 pages
ISSN: 1936-7406
EISSN: 1936-7414
DOI: 10.1145/1862648

Copyright © 2010 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 March 2009
• Revised: 1 June 2009
• Accepted: 1 August 2009
• Published: 1 November 2010

Qualifiers

• research-article
• Refereed
