Abstract
VLIW processors typically deliver high performance on a limited budget, making them ideal for a variety of communication and signal processing solutions. These processors typically need large multi-ported register files, which can increase cycle time and power consumption. The access delay and energy of these register files can also become prohibitive as the register count or the number of access ports grows, thus limiting the overall performance of the processor. Most prior work circumvents this problem by using multiple clusters with private register files to lower the access delay and reduce energy consumption. However, clustering artifacts, such as additional inter-cluster communication operations and spill-recovery code, result in a performance penalty.
This paper proposes CURE, a novel technique to considerably reduce the negative effects of clustering. CURE augments the ISA to expose the communication registers to the compiler, increasing the availability of architectural register state to all functional units. The inter-cluster communication operations are integrated into regular ALU and memory operations to improve instruction encoding efficiency. We also propose a new code scheduling heuristic to handle the ISA changes and to realize the improvements in the processor's performance and energy consumption. Our quantitative analysis estimates that CURE, when compared to the baseline 8-issue uni-cluster processor, boosts average performance by 61% while reducing the average register dynamic energy by 77%.
Index Terms
The CURE: Cluster Communication Using Registers