DOI: 10.1145/3297858.3304062
ASPLOS conference proceedings, research article

uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures

Published: 04 April 2019

ABSTRACT

Modern microarchitectures are some of the world's most complex man-made systems. As a consequence, it is increasingly difficult to predict or explain, let alone optimize, the performance of software running on such microarchitectures. Performance predictions and optimizations require faithful models of microarchitectural behavior, which are, unfortunately, seldom available.

In this paper, we present the design and implementation of a tool to construct faithful models of the latency, throughput, and port usage of x86 instructions. To this end, we first discuss common notions of instruction throughput and port usage, and introduce a more precise definition of latency that, in contrast to previous definitions, considers dependencies between different pairs of input and output operands. We then develop novel algorithms, based on automatically generated microbenchmarks, that infer latency, throughput, and port usage more accurately and precisely than existing work.
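To illustrate why a per-operand-pair definition of latency can matter, the following is a minimal sketch and not the paper's tool or data format. The class `LatencyModel`, the instruction name `EXAMPLE_OP`, the operand names `op1` and `op2`, and all cycle counts are hypothetical and chosen only for illustration. It shows how a single latency number can misstate the critical path of a dependency chain when different input operands reach the output after different delays.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LatencyModel:
    """Latency of one instruction, keyed by (input operand, output operand)."""
    instruction: str
    cycles: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def single_value(self) -> int:
        # A single-number latency, taken here as the worst case over all
        # operand pairs (an assumption made for this sketch).
        return max(self.cycles.values())

    def chain_cycles(self, src: str, dst: str, length: int) -> int:
        # Critical-path length of `length` dependent instances in which
        # output `dst` of one instance feeds input `src` of the next.
        return self.cycles[(src, dst)] * length


# Hypothetical instruction with made-up numbers; these are not measurements.
model = LatencyModel("EXAMPLE_OP", {("op1", "op1"): 1, ("op2", "op1"): 3})

print(model.single_value())                    # single-number view
print(model.chain_cycles("op1", "op1", 100))   # chain through op1
print(model.chain_cycles("op2", "op1", 100))   # chain through op2
```

In this sketch, a chain that passes through `op1` is three times shorter than the single-number latency would suggest, which is the kind of difference a per-pair definition makes visible.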

To facilitate the rapid construction of optimizing compilers and tools for performance prediction, the output of our tool is provided in a machine-readable format. We provide experimental results for processors of all generations of Intel's Core architecture, i.e., from Nehalem to Coffee Lake, and discuss various cases where the output of our tool differs considerably from prior work.
