Abstract
By using resource-sharing field-programmable gate array (FPGA) compute engines, we can reduce the performance gap between soft scalar CPUs and resource-intensive custom datapath designs. This article demonstrates that the Thread- and Instruction-Level parallel Template architecture (TILT), a programmable, FPGA-based, horizontally microcoded compute engine designed to keep floating-point (FP) functional units (FUs) highly utilized, can significantly improve the average throughput of eight FP-intensive applications compared to a soft scalar CPU (similar to an FP-extended Nios). For the eight benchmark applications, we show that: (i) a base TILT configuration with a single instance of each FU type improves performance over a soft scalar CPU by 15.8×, while requiring on average 26% of the custom datapaths' area; (ii) selectively increasing the number of FUs can more than double TILT's average throughput, reducing the custom-datapath throughput gap from 576× to 14×; and (iii) replicated instances of the most computationally dense TILT configuration that fit within the area of each custom datapath design can reduce the gap to 8.27×, while replicated instances of application-tuned TILT configurations can reduce the gap to an average of 5.22×, and to as low as 3.41× for the Matrix Multiply benchmark. Last, we present methods for design space reduction, and we correctly predict the computationally densest design for seven out of eight benchmarks.
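The figures in the abstract can be related to one another with a little arithmetic. The sketch below is a back-of-the-envelope check, not the authors' methodology. It assumes that the 576× figure is the average throughput gap between the soft scalar CPU and the custom datapaths, and that the averaged ratios compose multiplicatively (exact for a single benchmark, only approximate for averages). The function and constant names are hypothetical.

```python
# Figures quoted in the abstract (averages over the eight benchmarks).
CPU_GAP = 576.0            # ASSUMED: custom datapath vs. soft scalar CPU throughput
BASE_TILT_SPEEDUP = 15.8   # base TILT vs. soft scalar CPU throughput

# Remaining custom-datapath throughput gap reported for each TILT setup.
REPORTED_GAPS = {
    "tuned FU mix (single instance)": 14.0,
    "replicated densest configuration": 8.27,
    "replicated application-tuned configurations": 5.22,
}


def implied_speedup_over_cpu(gap_to_custom: float, cpu_gap: float = CPU_GAP) -> float:
    """Speedup over the CPU implied by a remaining gap to the custom datapath.

    ASSUMPTION: throughput ratios compose multiplicatively.
    """
    return cpu_gap / gap_to_custom


def implied_gap(speedup_over_cpu: float, cpu_gap: float = CPU_GAP) -> float:
    """Remaining gap to the custom datapath implied by a speedup over the CPU."""
    return cpu_gap / speedup_over_cpu


if __name__ == "__main__":
    print(f"base TILT: implied gap ~{implied_gap(BASE_TILT_SPEEDUP):.1f}x")
    for name, gap in REPORTED_GAPS.items():
        speedup = implied_speedup_over_cpu(gap)
        relative = speedup / BASE_TILT_SPEEDUP
        print(f"{name}: ~{speedup:.1f}x over CPU, ~{relative:.2f}x over base TILT")
```

Under these assumptions, the 14× gap corresponds to roughly 2.6 times the base TILT throughput, which is consistent with the claim that selectively adding FUs can more than double it.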
Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT