Abstract
This paper proposes a new computation model called CACHE (Cache Architecture for Configurable Hardware Engine). The model does not require a dedicated host processor, or host software, to manage reconfiguration; instead, reconfiguration is performed autonomously within a working set of application datapaths. Using only the processing array itself and a special register file called the working-set register file, the CACHE model provides several functions as side effects: caching, resource allocation and assignment, placement and routing, and defragmentation. The model aims to reduce three major workloads: (1) processor and application design, (2) runtime resource management and scheduling, and (3) reconfiguration. To reduce these workloads, the processor architecture differs substantially from the traditional computing model and its microprocessor architecture. The computing system is built on three major ideas: (1) an on-chip working-set model, mainly to control the loading and storing of streams, that is, to control the traffic that introduces overhead; (2) an on-chip deadlock-properties model, mainly to manage resources and to continuously configure datapaths corresponding to a working-set window; and (3) a cache memory technique that serves both models: the cache mechanism is equivalent to the working-set window, and the cache procedure is equivalent to the resource request, acquisition, and release of the deadlock properties. The first model targets streaming applications, such as vector and matrix operations and filters, which use coarser-grained operations such as the integer operations of the C language. Relative to DSPs, the model's performance comes from constant throughput across applications of different scales. In addition, an extended model, which we call the Instant model and which automatically generates an instance of a datapath, outperforms the DSPs.
This paper presents the computation model, its architecture and low-level design, and an analysis of the basic characteristics of its execution.
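The correspondence the abstract draws between a cache, a working-set window, and resource request, acquisition, and release can be illustrated with a small software sketch. The following is not the CACHE hardware mechanism: the class `DatapathCache`, the function `working_set`, the LRU replacement policy, and the operation names in `stream` are all assumptions made for illustration only.

```python
from collections import OrderedDict


def working_set(references, t, window):
    """Distinct items among the last `window` references ending at time t.

    This follows the usual working-set definition W(t, window); the
    abstract does not specify the window used by the CACHE model.
    """
    start = max(0, t - window)
    return set(references[start:t])


class DatapathCache:
    """Hypothetical LRU cache of configured datapaths over fixed slots."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # datapath name -> state, oldest first
        self.hits = 0
        self.misses = 0

    def request(self, datapath):
        """Request a datapath; return True on a hit (already configured)."""
        if datapath in self.slots:
            self.slots.move_to_end(datapath)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.slots) >= self.capacity:
            self.release()  # free a slot before acquiring a new one
        self.acquire(datapath)
        return False

    def acquire(self, datapath):
        """Occupy a free slot, standing in for configuring the datapath."""
        self.slots[datapath] = "configured"

    def release(self):
        """Evict the least recently used datapath and return its name."""
        victim, _ = self.slots.popitem(last=False)
        return victim


# Illustrative stream of datapath requests (names are made up).
stream = ["mul", "add", "mul", "add", "shift", "mul", "add"]
cache = DatapathCache(capacity=2)
for op in stream:
    cache.request(op)
```

In this sketch a hit means the requested datapath is already inside the window and needs no reconfiguration, while a miss triggers a release followed by an acquisition, which is the software analogue of the request, acquisition, and release sequence named in the abstract.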
Index Terms
Design and analysis of adaptive processor