
The CURE: Cluster Communication Using Registers

Published: 27 September 2017

Abstract

VLIW processors typically deliver high performance on a limited budget, making them ideal for a variety of communication and signal-processing solutions. These processors typically need large multi-ported register files, which can increase cycle time and power consumption. The access delay and energy of these register files can also become prohibitive as the register count or the number of access ports grows, limiting the overall performance of the processor. Most prior work circumvents this problem by using multiple clusters with private register files to lower the access delay and reduce energy consumption. However, clustering artifacts, such as additional inter-cluster communication operations and spill-recovery code, result in a performance penalty.

This paper proposes CURE, a novel technique to considerably reduce the negative effects of clustering. CURE augments the ISA to expose the communication registers to the compiler, increasing the availability of architectural register state to all functional units. The inter-cluster communication operations are integrated into regular ALU and memory operations to improve instruction encoding efficiency. We also propose a new code scheduling heuristic to handle the ISA changes and to realize the improvements in the processor's performance and energy consumption. Our quantitative analysis estimates that CURE, when compared to the baseline 8-issue uni-cluster processor, boosts average performance by 61% while reducing the average register dynamic energy by 77%.
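
To make the clustering penalty described in the abstract concrete, the following is a toy sketch, not the paper's ISA, scheduler, or evaluation model. It assumes a hypothetical clustered machine in which a value read by a cluster other than its producer costs one explicit copy operation per consuming cluster, and in which a producer that writes a globally readable communication register needs no copy at all. The names `Op`, `remote_consumers`, and `explicit_copies`, the greedy register assignment, and the absence of liveness tracking are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str          # name of the value this operation produces
    cluster: int       # cluster whose functional unit executes it
    srcs: tuple = ()   # names of the values it reads


def remote_consumers(ops):
    """Map each value to the set of foreign clusters that read it."""
    home = {op.name: op.cluster for op in ops}
    remote = {}
    for op in ops:
        for src in op.srcs:
            if src in home and home[src] != op.cluster:
                remote.setdefault(src, set()).add(op.cluster)
    return remote


def explicit_copies(ops, comm_regs=0):
    """Count explicit inter-cluster copy operations.

    Assumption: each of the `comm_regs` communication registers holds one
    value for the whole block (no reuse), and values read by the most
    foreign clusters are assigned first. Every other cross-cluster value
    costs one copy per consuming cluster.
    """
    remote = remote_consumers(ops)
    ranked = sorted(remote, key=lambda v: (-len(remote[v]), v))
    return sum(len(remote[v]) for v in ranked[comm_regs:])


# Hypothetical five-operation block split across two clusters.
ops = [
    Op("a", 0),
    Op("b", 1),
    Op("c", 0, ("a", "b")),
    Op("d", 1, ("a", "c")),
    Op("e", 1, ("d", "b")),
]

print(explicit_copies(ops))               # private register files only
print(explicit_copies(ops, comm_regs=2))  # two communication registers
```

In this toy block, three values cross a cluster boundary, so the count falls as communication registers are added. A real compiler would also have to account for register lifetimes, port limits, and schedule length, which this sketch ignores.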

