Abstract
This article presents a reconfigurable hardware/software architecture for binary acceleration of embedded applications. A Reconfigurable Processing Unit (RPU) serves as a coprocessor to the General Purpose Processor (GPP) and accelerates the execution of repetitive instruction sequences called Megablocks. A toolchain detects Megablocks in instruction traces and generates customized RPU implementations. Megablocks that contain memory accesses are implemented with a memory-sharing mechanism that supports concurrent accesses to the entire address space of the GPP’s data memory. The scheduling of load/store operations and the handling of memory accesses have been optimized to minimize the latency that memory accesses introduce. The system dynamically switches execution between the GPP and the RPU while running the original binaries of the input application. Our proof-of-concept prototype achieved geometric mean speedups of 1.60× for a set of 37 benchmarks and 1.18× for a subset comprising the 9 most complex benchmarks. Compared with a previous version of our approach, the geometric mean speedup for the 10 benchmarks used previously improved from 1.22× to 1.53×.
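To make the idea of a repetitive instruction sequence concrete, the following is a minimal sketch of how a back-to-back repeating pattern might be located in a trace of instruction addresses. It is an illustration only and not the toolchain's actual detection algorithm, which the abstract does not describe. The function name `find_repeating_pattern`, its parameters `max_len` and `min_reps`, and the example addresses are all hypothetical.

```python
from typing import List, Optional, Tuple


def find_repeating_pattern(
    trace: List[int], max_len: int = 8, min_reps: int = 3
) -> Optional[Tuple[int, int, int]]:
    """Find the first pattern that repeats back-to-back in an address trace.

    Hypothetical helper: returns (start_index, pattern_length, repetitions),
    or None when no pattern of at most max_len entries repeats at least
    min_reps times in a row.
    """
    n = len(trace)
    for start in range(n):
        for length in range(1, max_len + 1):
            # Not enough trace left for min_reps copies of this length.
            if start + length * min_reps > n:
                break
            pattern = trace[start:start + length]
            reps = 1
            pos = start + length
            # Count consecutive copies of the candidate pattern.
            while trace[pos:pos + length] == pattern:
                reps += 1
                pos += length
            if reps >= min_reps:
                return start, length, reps
    return None


# Illustrative trace: two setup instructions, a three-instruction loop body
# executed three times, then an exit instruction (addresses are made up).
example_trace = [
    0x100, 0x104,
    0x200, 0x204, 0x208,
    0x200, 0x204, 0x208,
    0x200, 0x204, 0x208,
    0x300,
]
print(find_repeating_pattern(example_trace))
```

In this sketch the reported pattern plays the role of a candidate for acceleration: a sequence that repeats often enough to be worth moving off the GPP. A real detector would also need to handle branches that leave the pattern and would work on a streaming trace rather than a complete list.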
A Reconfigurable Architecture for Binary Acceleration of Loops with Memory Accesses