Abstract
A new instruction fetch method, forward semantic, is presented to enable deeply pipelined processors to fetch one useful instruction every cycle. Forward semantic is an improved alternative to delayed branching (with or without squashing), with five major advantages. First, no restriction is imposed on the type of instructions that fill the branch slots, which allows a large number of slots to be filled. Second, no modification to offsets and displacements is necessary when an instruction is copied to fill a branch slot, which simplifies the linker implementation. Third, an interrupted program can resume execution with a single program counter, eliminating the need to reload the instruction pipeline before resuming execution. Fourth, programs compiled with N slots can execute on pipelines requiring K (K ≤ N) slots, which makes compatibility across an architecture family possible. Lastly, the filling of branch slots is totally transparent to code compaction and software interlocking schemes. These advantages combine to provide an efficient instruction fetch mechanism and to eliminate artificial penalties on branch cost. At the cost of 11% static code expansion, forward semantic achieves an instruction fetch cost of 1.2 cycles for pipelines requiring 10 slots for each taken branch. This level of instruction fetch efficiency has never been achieved before with conventional instruction fetch methods. The branch cost is dictated by the accuracy of the compile-time branch prediction rather than by artificial limitations, such as data dependencies, that prevent the slots from being filled. These results are measured from the execution of real UNIX and CAD programs with complex control structures.
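To give a feel for what a fetch cost of 1.2 cycles means on a pipeline with 10 slots, the following is a back-of-the-envelope sketch, not the measurement methodology of the paper. It assumes a simple penalty model in which every mispredicted branch wastes exactly `slots` fetch cycles and everything else costs one cycle. The function names and the sample branch fraction and misprediction rate are hypothetical and chosen only for illustration.

```python
def fetch_cost(slots: int, branch_fraction: float, mispredict_rate: float) -> float:
    """Average fetch cycles per useful instruction under a simple penalty model.

    Assumption (not taken from the paper): each mispredicted branch wastes
    exactly `slots` fetch cycles; correctly predicted branches cost nothing extra.
    """
    if slots < 0:
        raise ValueError("slots must be non-negative")
    for name, value in (("branch_fraction", branch_fraction),
                        ("mispredict_rate", mispredict_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return 1.0 + branch_fraction * mispredict_rate * slots


def implied_mispredicts_per_instruction(cost: float, slots: int) -> float:
    """Invert the same model: mispredicted branches per useful instruction."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    return (cost - 1.0) / slots


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, 10% of them mispredicted, 10 slots.
    print(fetch_cost(slots=10, branch_fraction=0.2, mispredict_rate=0.1))
    # Misprediction density that the reported 1.2 cycles would imply
    # under this simplified model.
    print(implied_mispredicts_per_instruction(cost=1.2, slots=10))
```

Under this model the cost depends only on the product of branch frequency and misprediction rate, which matches the abstract's point that branch cost is governed by prediction accuracy rather than by how many slots could be filled.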
Forward semantic: a compiler-assisted instruction fetch method for heavily pipelined processors
MICRO 22: Proceedings of the 22nd Annual Workshop on Microprogramming and Microarchitecture