Abstract
In this paper, we study the performance advantages of an out-of-order (OOO) processor relative to in-order processors with similar execution resources. In particular, we try to tease apart the performance contributions from two sources: the improved schedules enabled by OOO hardware speculation support, and the ability of OOO to generate different schedules on different occurrences of the same instructions based on operand and functional unit availability. We find that the ability to express good static schedules achieves the bulk of the speedup resulting from OOO. Specifically, of the 53% speedup achieved by OOO relative to a similarly provisioned in-order machine, 88% can be achieved by using a single "best" static schedule, as suggested by observing an OOO schedule of the code. We discuss the ISA mechanisms that would be required to express these static schedules. Furthermore, we find that the benefits of dynamism largely come from two kinds of events that influence the application's critical path: load instructions that miss in the cache only part of the time, and branch mispredictions. We find that much of the benefit of OOO dynamism can be achieved by the potentially simpler task of addressing these two behaviors directly.
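The headline figures can be unpacked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes that "88% of that speedup" means 88% of the 53-percentage-point gain over the in-order baseline, and the function name `decompose_speedup` and its result keys are hypothetical.

```python
def decompose_speedup(ooo_gain: float, static_fraction: float) -> dict:
    """Split an OOO gain over in-order into static and dynamic parts.

    Assumption (not confirmed by the abstract): static_fraction applies to
    the gain itself, e.g. 0.88 * 0.53, rather than to the speedup ratio.
    """
    static_gain = ooo_gain * static_fraction   # gain from one "best" static schedule
    dynamic_gain = ooo_gain - static_gain      # gain left to per-occurrence dynamism
    # Speedup of full OOO measured against the static-schedule machine.
    residual_ratio = (1.0 + ooo_gain) / (1.0 + static_gain)
    return {
        "static_gain": static_gain,
        "dynamic_gain": dynamic_gain,
        "residual_ratio": residual_ratio,
    }


result = decompose_speedup(ooo_gain=0.53, static_fraction=0.88)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Under this reading, the static schedule alone gives a gain of roughly 47% over the in-order machine, and dynamism accounts for the remaining 6 or so percentage points.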
Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?
ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems