research-article

Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?

Published: 16 March 2013

Abstract

In this paper, we set out to study the performance advantages of an Out-of-Order (OOO) processor relative to in-order processors with similar execution resources. In particular, we try to tease apart the performance contributions from two sources: the improved schedules enabled by OOO hardware speculation support and its ability to generate different schedules on different occurrences of the same instructions based on operand and functional unit availability. We find that the ability to express good static schedules achieves the bulk of the speedup resulting from OOO. Specifically, of the 53% speedup achieved by OOO relative to a similarly provisioned in-order machine, we find that 88% of that speedup can be achieved by using a single "best" static schedule as suggested by observing an OOO schedule of the code. We discuss the ISA mechanisms that would be required to express these static schedules. Furthermore, we find that the benefits of dynamism largely come from two kinds of events that influence the application's critical path: load instructions that miss in the cache only part of the time and branch mispredictions. We find that much of the benefit of OOO dynamism can be achieved by the potentially simpler task of addressing these two behaviors directly.
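
To make the headline figures concrete, the short sketch below works through the arithmetic. It uses only the two numbers quoted in the abstract (53% and 88%). The variable names are illustrative, and the sketch assumes that the 88% applies to the speedup itself rather than to execution time; the abstract does not spell this out.

```python
# Figures taken from the abstract; the interpretation below is an assumption.
ooo_speedup = 0.53        # OOO is 53% faster than the in-order baseline
static_fraction = 0.88    # share of that speedup captured by one static schedule

# Assumed reading: the fraction applies to the speedup itself.
static_speedup = static_fraction * ooo_speedup
dynamism_gap = ooo_speedup - static_speedup

# Execution times relative to the in-order baseline (baseline = 1.0).
ooo_time = 1.0 / (1.0 + ooo_speedup)
static_time = 1.0 / (1.0 + static_speedup)

print(f"static-schedule speedup over in-order: {static_speedup:.1%}")
print(f"speedup left to dynamism: {dynamism_gap:.1%}")
print(f"relative time: OOO {ooo_time:.3f}, static {static_time:.3f}")
```

Under this reading, a single static schedule would give roughly a 46.6% speedup over the in-order machine, leaving about 6.4 percentage points attributable to dynamism.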

      • Published in

        ACM SIGPLAN Notices, Volume 48, Issue 4
        ASPLOS '13
        April 2013
        540 pages
        ISSN: 0362-1340
        EISSN: 1558-1160
        DOI: 10.1145/2499368

      • ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
        March 2013
        574 pages
        ISBN: 9781450318709
        DOI: 10.1145/2451116

        Copyright © 2013 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States
