ABSTRACT
Modern microarchitectures are some of the world's most complex man-made systems. As a consequence, it is increasingly difficult to predict or explain, let alone optimize, the performance of software running on such microarchitectures. As a basis for performance predictions and optimizations, we would need faithful models of their behavior, which are, unfortunately, seldom available.
In this paper, we present the design and implementation of a tool to construct faithful models of the latency, throughput, and port usage of x86 instructions. To this end, we first discuss common notions of instruction throughput and port usage, and introduce a more precise definition of latency that, in contrast to previous definitions, considers dependencies between different pairs of input and output operands. We then develop novel algorithms that infer the latency, throughput, and port usage from automatically generated microbenchmarks and that are more accurate and precise than existing work.
To facilitate the rapid construction of optimizing compilers and tools for performance prediction, the output of our tool is provided in a machine-readable format. We provide experimental results for processors of all generations of Intel's Core architecture, i.e., from Nehalem to Coffee Lake, and discuss various cases where the output of our tool differs considerably from prior work.
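To illustrate why a latency defined per pair of input and output operands carries more information than a single number, the following is a minimal sketch of how such a model could be represented. It is not the tool's actual output format: the class `InstructionModel`, the function `chain_cycles`, the instruction name `EXAMPLE_OP`, the operand and port labels, and all numeric values are hypothetical placeholders chosen only for illustration, and throughput is assumed here to be expressed in cycles per instruction.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class InstructionModel:
    """Hypothetical per-instruction record; not the tool's actual schema."""
    name: str
    # Latency in cycles, keyed by (input operand, output operand) pair.
    latency: Dict[Tuple[str, str], int] = field(default_factory=dict)
    # Assumed unit: average cycles per instruction for independent instances.
    throughput: float = 0.0
    # Port usage: port-combination label -> number of micro-ops using it.
    port_usage: Dict[str, int] = field(default_factory=dict)

    def total_uops(self) -> int:
        """Total number of micro-ops across all port combinations."""
        return sum(self.port_usage.values())


def chain_cycles(model: InstructionModel, in_op: str, out_op: str,
                 length: int) -> int:
    """Cycles for a chain of `length` instances in which each instance's
    `in_op` depends on the previous instance's `out_op`."""
    return length * model.latency[(in_op, out_op)]


# Placeholder values, NOT measurements of any real instruction.
EXAMPLE = InstructionModel(
    name="EXAMPLE_OP dst, src",
    latency={("dst", "dst"): 1, ("src", "dst"): 3},
    throughput=1.0,
    port_usage={"p0": 1, "p15": 1},
)

if __name__ == "__main__":
    # The same instruction yields different chain lengths depending on
    # which operand pair carries the dependency.
    print(chain_cycles(EXAMPLE, "dst", "dst", 4))
    print(chain_cycles(EXAMPLE, "src", "dst", 4))
```

With a single latency value per instruction, the two dependency chains above could not be distinguished, which is the imprecision that a per-operand-pair definition avoids.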
Index Terms
- uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures