research-article
Open Access

Impact of Parallelism and Memory Architecture on FPGA Communication Energy

Published: 22 August 2016

Abstract

The energy in FPGA computations is dominated by data communication energy, either in the form of memory references or data movement on interconnect. In this article, we explore how to use data placement and parallelism to reduce communication energy. We show that parallelism can reduce energy and that the optimal level of parallelism increases with the problem size. We further explore how FPGA memory architecture (memory block size(s), memory banking, and spacing between memory banks) can impact communication energy, and determine how to organize the memory architecture to guarantee that the energy overhead compared to the optimally matched architecture for the design is never more than 60%. We specifically show that an architecture with 32-bit-wide, 16Kb internally banked memories placed every 8 columns of 10 4-LUT logic blocks is within 61% of the optimally matched architecture across the VTR 7 benchmark set and a set of parallelism-tunable benchmarks. Without internal banking, the worst-case overhead is 98%, achieved with an architecture with 32-bit-wide, 8Kb memories placed every 9 columns, roughly comparable to the memory organization on the Cyclone V (where memories are placed about every 10 columns). Monolithic 32-bit-wide, 16Kb memories placed every 10 columns (comparable to the 18Kb and 20Kb memories used in Virtex 4 and Stratix V FPGAs) have a 180% worst-case energy overhead. Furthermore, we show practical cases where designs mapped for optimal parallelism use 4.7× less energy than designs using a single processing element.
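The notion of worst-case energy overhead used in the abstract can be illustrated with a small sketch. It is not the authors' tool flow. It assumes that each design's "optimally matched" energy can be approximated by the lowest energy among the candidate architectures under consideration, and every architecture name, benchmark name, and energy value below is made up purely for illustration.

```python
# Illustrative sketch only: all names and numbers are hypothetical,
# not results from the article.

def worst_case_overheads(energy):
    """Map each architecture to its worst-case relative energy overhead.

    energy[arch][bench] is the energy (arbitrary units) of running
    benchmark `bench` on architecture `arch`. The overhead of an
    architecture on a benchmark is measured against the lowest energy
    any candidate achieves on that benchmark (an assumed stand-in for
    the optimally matched architecture).
    """
    benches = list(next(iter(energy.values())).keys())
    best = {b: min(energy[a][b] for a in energy) for b in benches}
    return {
        a: max(energy[a][b] / best[b] - 1.0 for b in benches)
        for a in energy
    }

def pick_minimax(energy):
    """Return (architecture, overhead) minimizing the worst-case overhead."""
    worst = worst_case_overheads(energy)
    arch = min(worst, key=worst.get)
    return arch, worst[arch]

# Hypothetical energies for three candidate memory organizations.
ENERGY = {
    "arch_small": {"bench_a": 10.0, "bench_b": 30.0},
    "arch_mid":   {"bench_a": 12.0, "bench_b": 18.0},
    "arch_large": {"bench_a": 20.0, "bench_b": 15.0},
}

if __name__ == "__main__":
    arch, overhead = pick_minimax(ENERGY)
    print(f"{arch}: worst-case overhead {overhead:.0%}")
```

In this toy data, the middle organization is never the best for any single benchmark, yet it has the smallest worst-case overhead. A figure such as "within 61% of the optimally matched architecture" expresses the same kind of trade-off.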

    • Published in

      ACM Transactions on Reconfigurable Technology and Systems, Volume 9, Issue 4
      Regular Papers and Special Section on Field Programmable Gate Arrays (FPGA) 2015
      September 2016
      161 pages
      ISSN: 1936-7406
      EISSN: 1936-7414
      DOI: 10.1145/2984740
      • Editor: Steve Wilton

      Copyright © 2016 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 August 2016
      • Accepted: 1 December 2015
      • Received: 1 July 2015

      Qualifiers

      • research-article
      • Research
      • Refereed
