
A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses

Published: 29 December 2014

Abstract

This article presents a reconfigurable hardware/software architecture for binary acceleration of embedded applications. A Reconfigurable Processing Unit (RPU) serves as a coprocessor of the General Purpose Processor (GPP) and accelerates the execution of repetitive instruction sequences called Megablocks. A toolchain detects Megablocks from instruction traces and generates customized RPU implementations. Megablocks with memory accesses are implemented using a memory-sharing mechanism that supports concurrent accesses to the entire address space of the GPP’s data memory. The scheduling of load/store operations and the handling of memory accesses have been optimized to minimize the latency that memory accesses introduce. The system dynamically switches execution between the GPP and the RPU while running the original binaries of the input application. Our proof-of-concept prototype achieved geometric mean speedups of 1.60× for a set of 37 benchmarks and 1.18× for the subset of the 9 most complex benchmarks. For the 10 benchmarks used to evaluate a previous version of our approach, the geometric mean speedup improved from 1.22× to 1.53×.
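
To make the idea of a repetitive instruction sequence in a trace concrete, the following is a minimal illustrative sketch, not the toolchain's detection algorithm. It assumes a trace is simply a list of instruction addresses and searches by brute force for a window that repeats back-to-back. The function name `find_repeating_pattern` and the parameters `max_period` and `min_repeats` are hypothetical and do not come from the article.

```python
from typing import List, Optional, Tuple


def find_repeating_pattern(
    trace: List[int], max_period: int = 8, min_repeats: int = 3
) -> Optional[Tuple[int, int, int]]:
    """Return (start, period, repeats) for the first repeating pattern found.

    A pattern is a window of `period` addresses that occurs back-to-back
    at least `min_repeats` times. Brute force, for illustration only;
    this is an assumed stand-in, not the article's Megablock detector.
    """
    n = len(trace)
    for start in range(n):
        for period in range(1, max_period + 1):
            window = trace[start:start + period]
            if len(window) < period:
                break  # not enough trace left for a window of this size
            repeats = 1
            pos = start + period
            # Count consecutive copies of the window that follow it.
            while trace[pos:pos + period] == window:
                repeats += 1
                pos += period
            if repeats >= min_repeats:
                return start, period, repeats
    return None


# Hypothetical trace: two setup instructions, a 3-instruction loop body
# executed three times, then one exit instruction.
example_trace = [0x100, 0x104,
                 0x200, 0x204, 0x208,
                 0x200, 0x204, 0x208,
                 0x200, 0x204, 0x208,
                 0x300]
result = find_repeating_pattern(example_trace)
```

In this hypothetical trace, the search reports the three-address loop body beginning at index 2. A real detector would also need to handle nested loops, varying iteration paths, and online operation, none of which this sketch attempts.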

