Abstract
The energy in FPGA computations is dominated by data communication energy, either in the form of memory references or data movement on interconnect. In this article, we explore how to use data placement and parallelism to reduce communication energy. We show that parallelism can reduce energy and that the optimal level of parallelism increases with the problem size. We further explore how FPGA memory architecture (memory block sizes, memory banking, and spacing between memory banks) can impact communication energy, and determine how to organize the memory architecture to guarantee that the energy overhead compared to the optimally matched architecture for the design is never more than 60%. We specifically show that an architecture with 32-bit-wide, 16Kb internally banked memories placed every 8 columns of 10 4-LUT logic blocks is within 61% of the optimally matched architecture across the VTR 7 benchmark set and a set of parallelism-tunable benchmarks. Without internal banking, the worst-case overhead is 98%, achieved with an architecture with 32-bit-wide, 8Kb memories placed every 9 columns, roughly comparable to the memory organization on the Cyclone V (where memories are placed about every 10 columns). Monolithic 32-bit-wide, 16Kb memories placed every 10 columns (comparable to the 18Kb and 20Kb memories used in Virtex 4 and Stratix V FPGAs) have a 180% worst-case energy overhead. Furthermore, we show practical cases where designs mapped for optimal parallelism use 4.7× less energy than designs using a single processing element.
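The claim that the energy-optimal level of parallelism grows with problem size can be illustrated with a toy model. The sketch below is not the article's energy model: it assumes, purely for illustration, that a memory access costs energy proportional to the square root of the local memory capacity, and that each processing element adds an interconnect cost proportional to the side length of the array. The function names and the constants `c_mem` and `c_wire` are hypothetical.

```python
import math


def energy(n_words, n_pes, c_mem=1.0, c_wire=4.0):
    """Toy communication energy for one pass over n_words data words.

    Assumptions (illustrative only, not taken from the article):
      * data is split evenly, so each PE holds n_words / n_pes words;
      * one access costs c_mem * sqrt(local capacity);
      * each PE adds c_wire * sqrt(n_words) of interconnect energy.
    """
    local_words = n_words / n_pes
    memory = n_words * c_mem * math.sqrt(local_words)
    interconnect = n_pes * c_wire * math.sqrt(n_words)
    return memory + interconnect


def best_parallelism(n_words):
    """Return the power-of-two PE count that minimizes the toy energy."""
    candidates = []
    p = 1
    while p <= n_words:
        candidates.append(p)
        p *= 2
    return min(candidates, key=lambda pes: energy(n_words, pes))


if __name__ == "__main__":
    for n in (4096, 65536):
        p = best_parallelism(n)
        saving = energy(n, 1) / energy(n, p)
        print(f"N={n}: best P={p}, {saving:.1f}x less energy than P=1")
```

Under these assumptions, smaller local memories make each access cheaper while more processing elements add interconnect cost, so the minimum shifts toward more processing elements as the data set grows. The specific numbers it produces have no relation to the measured results reported above.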
Impact of Parallelism and Memory Architecture on FPGA Communication Energy